Grover lightning-kokkos CPU

Hi all,

I am trying to run Grover's algorithm (see below for an implementation) multithreaded on an ARM node with 32 GB of RAM and 48 cores. I was expecting to be able to run it for 30 qubits. However, I am noticing that at some point memory use goes beyond 32 GB and the job crashes. Can you help me figure it out?

Btw, qml.probs() does not work analytically. However, with a finite number of shots, qml.probs() is able to run, at least with a small shot budget.

Thank you :folded_hands:

import os

# Set OpenMP variables before importing PennyLane so that Kokkos
# picks them up when it initializes.
os.environ["OMP_NUM_THREADS"] = "48"
os.environ["OMP_PROC_BIND"] = "true"
os.environ["OMP_PLACES"] = "cores"

import pennylane as qml
import numpy as np

from argparse import ArgumentParser
# ---- Args ----
parser = ArgumentParser()
parser.add_argument("--n_qubits", type=int, default=1, help="Number of qubits")
args = parser.parse_args()

# ---- Parameters ----
NUM_QUBITS = args.n_qubits

# Marked state(s) for the oracle: here a single all-ones bitstring
omega = np.array([np.ones(NUM_QUBITS)])

M = len(omega)
N = 2**NUM_QUBITS
wires = list(range(NUM_QUBITS))

dev = qml.device("lightning.kokkos", wires=NUM_QUBITS)#, shots = 10**9)

@qml.qnode(dev)
def circuit():
    #iterations = int(np.round(np.sqrt(N / M) * np.pi / 4))
    iterations = 10
    # Initial state preparation
    for w in wires:
        qml.Hadamard(wires=w)

    # Grover's iterator
    for _ in range(iterations):
        for omg in omega:
            qml.FlipSign(omg, wires=wires)

        qml.templates.GroverOperator(wires)

    return qml.probs(wires=wires)

if __name__ == "__main__":
    probs = circuit()
    print(f"Probabilities for {NUM_QUBITS} qubits: {probs}")
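As an aside on the commented-out iteration-count formula above: the optimal number of Grover iterations and the resulting success probability can be checked in closed form with plain NumPy, without running any simulator. This is a sketch assuming a single marked state (M = 1):

```python
import numpy as np

def grover_success_probability(n_qubits, iterations, n_marked=1):
    """Closed-form success probability of Grover search after k iterations.

    With N = 2**n_qubits basis states and M marked states, the amplitude of
    the marked subspace after k iterations is sin((2k + 1) * theta), where
    sin(theta) = sqrt(M / N).
    """
    N = 2 ** n_qubits
    theta = np.arcsin(np.sqrt(n_marked / N))
    return np.sin((2 * iterations + 1) * theta) ** 2

n = 10  # small example
k_opt = int(np.round(np.sqrt(2 ** n) * np.pi / 4))  # optimal k for M = 1
print(k_opt, grover_success_probability(n, k_opt))
```

Note that with a hardcoded `iterations = 10` at 30 qubits, the circuit is far below the optimum (roughly pi/4 * sqrt(2**30), on the order of 25,000 iterations), so the marked state's probability stays very small.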

Hi @Andre_Sequeira , thanks for posting your question here!

I had a look at your issue, and I can confirm that when analytically computing qml.probs(), the step that copies the array from our Lightning-Kokkos C++ backend to Python creates (somewhat unnecessary) memory copies. From my testing, returning qml.probs() with 29 qubits already requires 36 GB of memory, which exceeds the memory on your system and will likely cause a crash.
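As a rough back-of-the-envelope check (a sketch; the exact peak depends on backend internals and how many temporary copies are made), the dense state vector and probability arrays scale as follows:

```python
def statevector_bytes(n_qubits):
    # complex128 amplitudes: 16 bytes each
    return (2 ** n_qubits) * 16

def probs_bytes(n_qubits):
    # float64 probabilities: 8 bytes each
    return (2 ** n_qubits) * 8

GiB = 2 ** 30
for n in (29, 30):
    print(n, statevector_bytes(n) / GiB, probs_bytes(n) / GiB)
```

At 30 qubits the state vector alone is 16 GiB, so the state vector plus even one or two full-size probability copies already approaches a 32 GB limit.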

This is definitely not ideal, and our team is currently working on this issue for the next release, along with an updated Python/C++ binding interface, which should significantly reduce the memory overhead of returning qml.probs(). In the meantime, as you discovered, returning qml.probs() with a finite number of shots is much more memory-efficient, and you will be able to return the result.
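To illustrate why finite shots are so much lighter (a toy NumPy sketch, not the Lightning internals): a shot-based estimate only stores the outcomes actually observed, so memory scales with the number of shots rather than with 2**n.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy Grover-like distribution concentrated on one "marked" bitstring.
n_qubits = 20
marked = 0
p_marked = 0.95

shots = 10_000
# Sample outcomes: the marked state with high probability, otherwise uniform.
hits = rng.random(shots) < p_marked
outcomes = np.where(hits, marked, rng.integers(0, 2 ** n_qubits, size=shots))

counts = Counter(outcomes.tolist())
est_p_marked = counts[marked] / shots

# Only the observed outcomes are stored -- far fewer than 2**20 entries.
print(len(counts), est_p_marked)
```

The estimate of the marked-state probability is accurate to a few per mille with 10^4 shots, while the bookkeeping holds at most `shots` distinct entries instead of a dense 2**n array.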

Please let us know if you have any other questions!


Hi @joseph.lee, thank you for your help. Is this problem solved in version 0.42.0?
We also noticed that multi-node runs do not work with shots, regardless of whether we use MPI gathering.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

import pennylane as qml
import numpy as np

from argparse import ArgumentParser
# ---- Args ----
parser = ArgumentParser()
parser.add_argument("--n_qubits", type=int, default=1, help="Number of qubits")
args = parser.parse_args()

# ---- Parameters ----
NUM_QUBITS = args.n_qubits

omega = np.array([np.ones(NUM_QUBITS)])

M = len(omega)
N = 2**NUM_QUBITS
wires = list(range(NUM_QUBITS))

dev = qml.device("lightning.kokkos", wires=NUM_QUBITS, shots=10**6, mpi=True)

@qml.qnode(dev)
def circuit():
    #iterations = int(np.round(np.sqrt(N / M) * np.pi / 4))
    iterations = 10
    # Initial state preparation
    for w in wires:
        qml.Hadamard(wires=w)

    # Grover's iterator
    for _ in range(iterations):
        for omg in omega:
            qml.FlipSign(omg, wires=wires)

        qml.templates.GroverOperator(wires)

    return qml.probs(wires=wires)

local_probs = circuit()

# Gather the per-rank probability chunks on rank 0.
recv_counts = comm.gather(len(local_probs), root=0)
if rank == 0:
    probs = np.zeros(2**NUM_QUBITS)
else:
    probs = None

comm.Gatherv(local_probs, [probs, recv_counts], root=0)
if rank == 0:
    print(probs)

We are using the following job script:

#!/bin/bash
#SBATCH --job-name=PL33
#SBATCH --account=i20240010x
#SBATCH --partition=normal-x86
#SBATCH --nodes=2
# SBATCH --ntasks-per-node=1
## SBATCH --ntasks=2
#SBATCH --cpus-per-task=128
#SBATCH --time=48:00:00
#SBATCH --array=33                         # single array task; index = n_qubits
#SBATCH -o grover_10its_%a_%j.out          # %a = array index (= n_qubits here)
#SBATCH -e grover_10its_%a_%j.err


# Load environment
ml OpenMPI/
ml Python/
source /projects/I20240010/venv_kokkos_mpi_x86/bin/activate

# Set OpenMP environment variables
export OMP_NUM_THREADS=128
export OMP_PLACES=cores
export OMP_PROC_BIND=spread

# ---- EXECUTE ----------------------------------------------------------
# SLURM_ARRAY_TASK_ID supplies the qubit count for each array task
/usr/bin/time -f "elapsed=%E cpu=%P maxrss=%MKB" \
              -o time_10its_flexiblas_${SLURM_ARRAY_TASK_ID}_${SLURM_JOB_ID}.txt \
    srun python grover_pennylane.py --n_qubits ${SLURM_ARRAY_TASK_ID}

and we get the error:

MPI_Sendrecv( sendBuf.data(), sizeInt, datatype, destInt, sendtag, recvBuf.data(), sizeInt, datatype, sourceInt, recvtag, this->getComm(), &status): MPI_ERR_COUNT: invalid count argument
MPI_Sendrecv( sendBuf.data(), sizeInt, datatype, destInt, sendtag, recvBuf.data(), sizeInt, datatype, sourceInt, recvtag, this->getComm(), &status): MPI_ERR_COUNT: invalid count argument
slurmstepd: error: *** STEP 495186.0 ON cnx009 CANCELLED AT 2025-07-22T10:51:46 ***
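One plausible cause of the MPI_ERR_COUNT failure (an assumption on our side, not a confirmed diagnosis): classic MPI calls such as MPI_Sendrecv take the element count as a 32-bit int, and the per-rank buffers at 33 qubits can exceed INT_MAX elements. A quick sanity check of the sizes involved:

```python
INT32_MAX = 2 ** 31 - 1  # largest count a 32-bit int can hold

n_qubits = 33
n_ranks = 2

total_amplitudes = 2 ** n_qubits                 # 2**33 elements overall
local_amplitudes = total_amplitudes // n_ranks   # 2**32 elements per rank

# A per-rank send/recv of this many elements overflows a 32-bit count.
print(local_amplitudes, local_amplitudes > INT32_MAX)
```

If this is indeed the cause, the error would appear regardless of shots or gathering, since the overflow happens inside the backend's own communication step.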

Thank you!

Hi @Andre_Sequeira, the fix is on the horizon for version 0.43. But you don’t need to wait for the official release to benefit from it.

Let me provide more context: the memory peak you’re seeing close to the end of the execution for qml.probs is somewhat expected! For reference, here is the heap usage of your script with 29 qubits: [heap-profile plot from the original post]

This occurs because the probability data transferred from our C++ backend to Python is explicitly cast to a NumPy array; you wouldn’t see this behaviour for the other supported measurements. We are currently addressing this and a few other problems by rewriting our Python bindings with a lightweight, highly efficient library. You can track the progress of this work in [nanobind] Nanobind Feature Branch PR [Base] by AmintorDusko · Pull Request #1176 · PennyLaneAI/pennylane-lightning · GitHub.

Once the work merges to our master branch in a few weeks, you’ll be able to access this new, optimized feature by installing the dev version of Lightning from TestPyPI or by following our from source installation guidelines.

Regarding your experience with Lightning-Kokkos + MPI, I’m sorry to hear it was difficult. We’ve just introduced MPI for this device, and we highly encourage you to provide any feedback you might have. This helps us immensely in planning for missing features or addressing bugs in future updates.

Currently, shots are fully supported and tested with qml.sample and qml.state measurements. Unfortunately, qml.probs with shots is not yet maintained for our multi-GPU devices (Lightning-GPU and Lightning-Kokkos). You can, however, still test qml.probs analytically. We would greatly appreciate learning more about your specific needs for qml.probs with shots, as this insight will help us consider maintaining it in upcoming Lightning releases.
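In the meantime, one possible workaround (a sketch, assuming qml.sample works on your setup as described above) is to post-process the raw samples into probability estimates yourself; the classical part is just a bit of NumPy:

```python
import numpy as np

def probs_from_samples(samples):
    """Estimate outcome probabilities from a (shots, n_wires) array of bits."""
    samples = np.asarray(samples)
    shots, n_wires = samples.shape
    # Convert each bitstring row to an integer index (MSB first).
    weights = 2 ** np.arange(n_wires - 1, -1, -1)
    indices = samples @ weights
    counts = np.bincount(indices, minlength=2 ** n_wires)
    return counts / shots

# Toy input standing in for qml.sample() output on 2 wires.
samples = np.array([[0, 0], [0, 0], [1, 1], [0, 1]])
print(probs_from_samples(samples))  # [0.5  0.25 0.   0.25]
```

For large wire counts, the dense `bincount` output is itself 2**n entries, so in that regime a `collections.Counter` over the integer indices is the more memory-friendly choice.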

Here are a few links you might find useful:
