Lightning.gpu failing on multi-node/multi-GPU

Hello! If applicable, put your complete code example down below. Make sure that your code:

  • is 100% self-contained — someone can copy-paste exactly what is here and run it to
    reproduce the behaviour you are observing
  • includes comments

I am trying to run the scripts described in the blog post Distributing quantum simulations using lightning.gpu with NVIDIA cuQuantum | PennyLane Blog on NERSC machines, but I am facing problems running them. I followed the blog's instructions to install pennylane-lightning-gpu from source.

```
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = 32
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device
# mpi=True to switch on distributed simulation
# batch_obs=True to reduce the device memory demand for adjoint backpropagation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)

# Create QNode of device and circuit
@qml.qnode(dev, diff_method="adjoint")
def circuit_adj(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)])

# Set trainable parameters for calculating circuit Jacobian at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the trainable parameters across MPI processes from rank=0 process
params = comm.bcast(params, root=0)

# Run, calculate the quantum circuit Jacobian and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    jac = qml.jacobian(circuit_adj)(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    print("num_gpus: ", size, " wires: ", n_wires, " layers ", n_layers, " time: ", qml.numpy.mean(timing)) 

If you want help with diagnosing an error, please put the full error message below:

mpirun -np 4 python test.py
*** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[nid008340:1176249] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[nid008340:1176248] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[nid008340:1176246] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[nid008340:1176247] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [prterun-nid008340-1176242@1,0]
  Exit code:    14
--------------------------------------------------------------------------

And, finally, make sure to include the versions of your packages. Specifically, show us the output of qml.about().

>>> qml.about()
Name: PennyLane
Version: 0.32.0
Summary: PennyLane is a Python quantum machine learning library by Xanadu Inc.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /global/u1/p/prmantha/.local/perlmutter/python-3.10/lib/python3.10/site-packages
Requires: appdirs, autograd, autoray, cachetools, networkx, numpy, pennylane-lightning, requests, rustworkx, scipy, semantic-version, toml, typing-extensions
Required-by: PennyLane-Lightning, PennyLane-Lightning-GPU

Platform info:           Linux-5.14.21-150400.24.81_12.0.86-cray_shasta_c-x86_64-with-glibc2.31
Python version:          3.10.12
Numpy version:           1.23.5
Scipy version:           1.11.3
Installed devices:
- default.gaussian (PennyLane-0.32.0)
- default.mixed (PennyLane-0.32.0)
- default.qubit (PennyLane-0.32.0)
- default.qubit.autograd (PennyLane-0.32.0)
- default.qubit.jax (PennyLane-0.32.0)
- default.qubit.tf (PennyLane-0.32.0)
- default.qubit.torch (PennyLane-0.32.0)
- default.qutrit (PennyLane-0.32.0)
- null.qubit (PennyLane-0.32.0)
- lightning.qubit (PennyLane-Lightning-0.32.0)
- lightning.gpu (PennyLane-Lightning-GPU-0.33.0.dev0)

CUDA details

```
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```

Hi @QuantumMan,

Thank you for your question.
Unfortunately each HPC system needs its own MPI implementation. I see that your error mentions MPI_INIT so I would think this is either an initialization error or an installation error.

@mlxd might have more insights on this.
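In the meantime, one quick thing to check is which MPI library your mpi4py build is linked against, since a mismatch between that and the MPI used to build the plugin can produce aborts like the one above. A small sketch (assuming mpi4py is importable in your environment):

```
# check_mpi.py -- print how mpi4py was built and which MPI library it loads at runtime
import mpi4py
print("mpi4py build config:", mpi4py.get_config())      # compiler/MPI used at build time

from mpi4py import MPI                                   # importing this calls MPI_Init
print("MPI library:", MPI.Get_library_version().splitlines()[0])
print("rank/size:", MPI.COMM_WORLD.Get_rank(), "/", MPI.COMM_WORLD.Get_size())
```

Running it with mpirun -np 2 python check_mpi.py should show the Cray MPICH library on Perlmutter; if it shows a different MPI than the one lightning.gpu was compiled against, that mismatch is worth fixing first.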


Thanks @CatalinaAlbornoz. As per the blog, the NERSC Perlmutter machine was used, and I am trying the same machine, so I am probably missing something. @mlxd, any help would be appreciated. Thanks in advance.


Hi @QuantumMan thank you for the clarification. We’re looking into this. We may need several days to come back to you but we will be back.

I resolved this. You are right, this is mostly dependent on the HPC software. One of the things I made sure of is to keep the pip-installed PennyLane version and the lightning-gpu plugin build at the same version. I am not sure the documentation/error messages make that clear. Thanks for your help.
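For anyone hitting the same thing, a quick way to confirm the versions line up (assuming both packages are installed in the active environment):

```
from importlib.metadata import version
import pennylane as qml

# The core library and the plugin should come from matching release lines
print("PennyLane:              ", version("PennyLane"))
print("PennyLane-Lightning-GPU:", version("PennyLane-Lightning-GPU"))

qml.about()  # also lists every installed device and its plugin version
```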

I’m glad you managed to solve the issue @QuantumMan !

Hi @CatalinaAlbornoz - when I try to run the same script on the same HPC machine with 4 nodes, it fails beyond 24 qubits (20 and 22 qubits work fine) with [/global/u1/p/prmantha/pennylane-lightning-gpu/pennylane_lightning_gpu/src/util/DataBuffer.hpp][Line:48][Method:DataBuffer]: Error in PennyLane Lightning: out of memory.

I am trying to run the script on 16 A100 GPUs with multi-node/multi-GPU support. Below is the command:

mpirun -np 16 python testmpi.py 24 (the qubit count is passed as an argument)

Any thoughts on why this could happen? Also, how do I verify that multiple GPUs are being utilized? Is there a way to capture the GPU metrics?

Here is the full error message

  File "/global/homes/p/prmantha/pilot-streaming/examples/scripts/task-parallelism/testmpi.py", line 22, in <module>
    dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)
  File "/global/homes/p/prmantha/.local/perlmutter/python-3.10/lib/python3.10/site-packages/pennylane/__init__.py", line 345, in device
    dev = plugin_device_class(*args, **options)
  File "/global/homes/p/prmantha/.local/perlmutter/python-3.10/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 287, in __init__
    self._gpu_state = _gpu_dtype(c_dtype, mpi)(
pennylane_lightning_gpu.lightning_gpu_qubit_ops.PLException: [/global/u1/p/prmantha/pennylane-lightning-gpu/pennylane_lightning_gpu/src/util/DataBuffer.hpp][Line:48][Method:DataBuffer]: Error in PennyLane Lightning: out of memory```

Full script:

```
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np
from timeit import default_timer as timer
import datetime
import sys

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = int(sys.argv[1])
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device
# mpi=True to switch on distributed simulation
# batch_obs=True to reduce the device memory demand for adjoint backpropagation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)

# Create QNode of device and circuit
@qml.qnode(dev, diff_method="adjoint")
def circuit_adj(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)])

# Set trainable parameters for calculating circuit Jacobian at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the trainable parameters across MPI processes from rank=0 process
params = comm.bcast(params, root=0)

# Run, calculate the quantum circuit Jacobian and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    jac = qml.jacobian(circuit_adj)(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    run_timestamp=datetime.datetime.now()
    RESULT_FILE= "mpi-script-" + run_timestamp.strftime("%Y%m%d-%H%M%S") +  ".csv"
    result_str="num_gpus: {}, wires: {}, layers: {}, time: {}s".format(size, n_wires, n_layers, qml.numpy.mean(timing))
    print(result_str)
    with open(RESULT_FILE, "w") as f:
        f.write(result_str)
        f.flush()
```



The script fails when initializing the lightning.gpu device. Why does it fail with an out-of-memory error before anything has been loaded into memory?

Hi @QuantumMan,

PennyLane Lightning starts by initializing a state vector with the full size of the system you specified. Even if you don’t perform any computation on the 24 qubits, this error is telling you that your computer doesn’t have enough memory to store the state generated by that many qubits. Normally, if you have 80GB of RAM you should be able to reach about 30 qubits, but this can change depending on your specific settings.


I am curious: how did you come up with 80GB? Also, I have 160GB across 4 GPUs (40GB/GPU), so my understanding is that with multi-GPU/multi-node support I should easily be able to scale the number of qubits.

Hi @QuantumMan, I mentioned 80GB only as an example. In your case, with 40GB GPUs you should be able to reach 28 or 29 qubits. Depending on the number of observables you have, you can split the computation across the different GPUs, but here the problem happens at initialization, so I don’t think there’s much that helps. For you to be able to run your problem, you need to be able to fit the state vector in a single GPU at least once.


Thanks. I am a newbie; I still don’t understand why I can fit only 28 qubits in a 40GB GPU. What is the math behind it? So I can’t leverage multiple GPUs to scale the number of qubits with one device? I think what you are referring to is creating multiple devices and dividing the circuit across them?

Hi @QuantumMan

I had some time to test this out today, and was able to run your example code across single GPUs, and multiple GPUs, depending on the workload. I wasn’t able to hit the memory wall you reported on 24 qubits, so I figured we can try to make a similar build, and start from there.

First, to make sure you have a working build with the CUDA-aware MPI environment, I’d suggest ensuring these steps are followed:


```
# Ensure all necessary modules are loaded up-front
module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 evp-patch gcc/11.2.0

# Required due to a potentially missing lib for the CUDA-aware MPICH library
export LD_LIBRARY_PATH=${CRAY_LD_LIBRARY_PATH}:/opt/cray/pe/mpich/8.1.25/ofi/gnu/9.1/lib/:$LD_LIBRARY_PATH

# Create a Python virtualenv
python -m venv pyenv && source ./pyenv/bin/activate

# Install the following dependencies
python -m pip install cmake ninja custatevec-cu11 wheel pennylane~=0.32.0 pennylane-lightning~=0.32.0

# Build mpi4py against the system's CUDA-aware MPICH
MPICC="cc -shared" python -m pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

# Clone and check out lightning.gpu at version 0.32.0
git clone https://github.com/PennyLaneAI/pennylane-lightning-gpu
cd pennylane-lightning-gpu && git checkout v0.32.0

# Build the extension module with MPI support, and package it all into a wheel
python setup.py build_ext --define="PLLGPU_ENABLE_MPI=ON;-DCMAKE_CXX_COMPILER=$(which mpicxx);-DCMAKE_C_COMPILER=$(which mpicc)"
python setup.py bdist_wheel

# Assuming the above steps are reproduced, your wheel can be installed as needed
python -m pip install ./dist/*.whl

# Grab an allocation for however many GPUs you want; I'd start with a single interactive node/4 GPUs
salloc ...

# ...and launch over multiple processes
srun -n 4 python script.py
```
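Once the wheel is installed, a tiny smoke test can confirm the MPI-enabled device imports and initializes across ranks before you commit to the full workload. This is just an illustrative sketch (the file name and the 8-qubit count are arbitrary):

```
# smoke_test.py -- minimal check that the MPI-enabled lightning.gpu device initializes
from mpi4py import MPI
import pennylane as qml

comm = MPI.COMM_WORLD

# A small distributed state vector; every rank constructs the device and runs the circuit
dev = qml.device('lightning.gpu', wires=8, mpi=True)

@qml.qnode(dev)
def bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

res = bell()
if comm.Get_rank() == 0:
    print("lightning.gpu with MPI is working across", comm.Get_size(), "ranks; <Z0> =", res)
```

Launch it the same way as the main script, e.g. srun -n 4 python smoke_test.py.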

With the above, I was able to run your adjoint-diff script up to about 28-29 qubits. The reason for this limit is how the script is defined: it evaluates the Jacobian of the circuit relative to one observable per qubit. When calculating the Jacobian with the adjoint-diff method, several copies of the state vector are required, and the number of copies scales with the number of observables you request. We aim to linearise the required observables with the batch_obs keyword, but that only saves memory up to a point, as multiple copies are still needed for the method to work.

If you take a single node (4 GPUs) with an interactive allocation, you can launch tmux and split the session, with your srun command on the left and watch -n1 nvidia-smi on the right, to see what is happening to memory usage. You should get images similar to the following:

Note, the above example should also work without MPI (for a single node with 4+ GPUs), as the batching support allows the adjoint to use std::thread to concurrently evaluate the Jacobian terms. I’d expect 28 qubits to be faster without MPI. However, if your problem grows (29 qubits+, with 29+ observables) you’ll hit the memory wall quickly.
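If you want to capture those GPU memory numbers programmatically while you experiment with these limits, rather than watching nvidia-smi by hand, here is a rough sketch (it assumes nvidia-smi is available on the compute nodes; the log_gpu_memory helper name is just illustrative):

```
import subprocess
from mpi4py import MPI

def log_gpu_memory(tag=""):
    """Print per-GPU memory usage on this node, prefixed with the calling MPI rank."""
    rank = MPI.COMM_WORLD.Get_rank()
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    for line in out.strip().splitlines():
        idx, used, total = (s.strip() for s in line.split(","))
        print(f"[rank {rank}] {tag} GPU {idx}: {used} / {total} MiB used")

# For example, call log_gpu_memory("after device init") right after qml.device(...)
# and log_gpu_memory("after jacobian") right after qml.jacobian(circuit_adj)(params).
```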

If you want to simulate a larger number of qubits, and don’t care about gradients, you can simply evaluate the circuit in the forward execution, disabling the gradient pipeline by setting diff_method=None, and removing the qml.jacobian call. In this situation, you will have a single state-vector, which is reused to evaluate the observables, and should allow up to 33 qubits (16 bytes per complex coefficient x 2^33 coefficients), coming in around 128GB of GPU memory required, which will be split to around 32GB per card (plus some overheads), as in the following image:

I’d suggest setting up your env again with the above steps, and trying to repeat running scripts with the following steps:

  • Set diff_method=None, and remove qml.jacobian to see how the forward pass scales; a minimal sketch of this change is included after this list. You should be able to hit the same as I did (33 qubits)

  • Try adding more nodes/GPUs and increase the count further (assume every qubit you add will require multiplying your required GPU count by 2)

  • Lower your qubit count by 2-8 (depends on your problem/number of observables/etc), and try to evaluate the Jacobians again. For a single node, you should be able to hit 29 qubits with batch_obs=True, at the expense of a slightly longer runtime. If your problem works without this, it’ll be much faster (I hit 28 qubits without this).
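Here is the minimal sketch of that forward-only change mentioned in the first bullet (same circuit structure as your script; treat it as an illustrative starting point rather than a tuned example):

```
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np

comm = MPI.COMM_WORLD
n_wires, n_layers = 33, 2

dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)

# diff_method=None disables the gradient pipeline entirely
@qml.qnode(dev, diff_method=None)
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires)
params = comm.bcast(np.random.random(shape) if comm.Get_rank() == 0 else None, root=0)

res = circuit(params)  # forward execution only; no qml.jacobian call
if comm.Get_rank() == 0:
    print(res)
```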

Hope this helps to understand what’s happening under the hood — in essence, gradients require a lot more memory than a general forward execution due to the additional needs of multiple state-vector copies per observable and overheads needed in the evaluations. Let us know if we can help any more here.
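To put very rough numbers on the forward-vs-gradient difference, here is a back-of-the-envelope sketch (my own simplification: it assumes complex128 amplitudes and roughly one additional state-vector copy per concurrently held observable, which only approximates what the adjoint pipeline actually allocates):

```
def statevector_gib(n_qubits, copies=1):
    """Approximate memory for `copies` complex128 state vectors on n_qubits."""
    return copies * 16 * 2**n_qubits / 2**30

n = 28
print(f"forward pass, {n} qubits:             {statevector_gib(n):6.1f} GiB")
print(f"adjoint, {n} qubits, {n} observables: {statevector_gib(n, copies=n):6.1f} GiB (rough upper bound)")
```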


Thanks @mlxd for the detailed instructions. It finally worked following them. I could run 33 qubits on one node with 4 GPUs / 160GB of memory, without the Jacobian call and with diff_method=None. But when I try to scale beyond 33 qubits to 40 qubits on 4 nodes, I see OOM errors. Is it possible to scale the number of qubits linearly, or am I missing something? Thanks.

Hi @QuantumMan

Unfortunately linear scaling is not possible. For statevector simulators, the memory requirements grow exponentially with the qubit count.

For 40 qubits, you’ll need at a minimum 16 (bytes per complex double) x 2^40 (coefficients for 40 qubits), roughly 16 TiB, requiring on the order of a hundred nodes just to represent the state. You can think of it as a doubling in the required GPU count for every qubit increase.
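Putting numbers on that (a quick calculation under the assumptions used in this thread: complex128 amplitudes, 40GB A100s, 4 GPUs per node, and no overheads whatsoever):

```
n_qubits = 40
state_bytes = 16 * 2**n_qubits          # 16 bytes per complex double amplitude

gpu_mem = 40 * 2**30                    # 40 GiB per A100
gpus_per_node = 4

gpus_needed = -(-state_bytes // gpu_mem)         # ceiling division
nodes_needed = -(-gpus_needed // gpus_per_node)

print(f"state vector: {state_bytes / 2**40:.0f} TiB")
print(f"GPUs needed: {gpus_needed}, nodes needed: {nodes_needed}")
# -> 16 TiB of amplitudes, ~410 GPUs, ~103 nodes before any overhead
```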

That's right. So with 4 nodes I have 4 (nodes) * 4 (GPUs) * 40GB (mem/GPU) = 640GB of memory available, and based on the above math, for 35 qubits we need 16 * (2^35) = 549 GB. But when I try to scale beyond 33 qubits, i.e. to 34 or 35, I see OOM errors.

```
srun -n 16 python penny-no-jacob-mpi.py
srun: Job 18174449 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for StepId=18174449.1
slurmstepd: error: Detected 1 oom_kill event in StepId=18174449.1. Some of the step tasks have been OOM Killed.
srun: error: nid004013: task 8: Out Of Memory
srun: Terminating StepId=18174449.1
slurmstepd: error: Detected 1 oom_kill event in StepId=18174449.1. Some of the step tasks have been OOM Killed.
slurmstepd: error: Detected 1 oom_kill event in StepId=18174449.1. Some of the step tasks have been OOM Killed.
slurmstepd: error: Detected 1 oom_kill event in StepId=18174449.1. Some of the step tasks have been OOM Killed.
```

Is it failing because of some memory overhead?


@QuantumMan I am joining in a little late here, but just because your circuit has N qubits doesn’t mean that the memory used will be exactly 2^N scaled by the data type (e.g., 16 bytes per coefficient). 2^N scaled by the data type is really the lower bound on your memory usage.

It would help to see what else you’re trying to calculate in penny-no-jacob-mpi.py :slight_smile:

Hi,

Here is the script

```
from mpi4py import MPI
import pennylane as qml
import numpy as np
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = 35
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device.
# mpi=True to switch on distributed simulation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)

# Set target wires for probability calculation
prob_wires = range(n_wires)

# Create QNode of device and circuit
@qml.qnode(dev)
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.probs(wires=prob_wires)

# Set circuit parameters at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the parameters across MPI processes from the rank=0 process
params = comm.bcast(params, root=0)

# Run the forward circuit evaluation and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    local_probs = circuit(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    print("num_gpus: ", size, " wires: ", n_wires, " layers ", n_layers, " time: ", qml.numpy.mean(timing))
```

Your code will definitely use more memory than what’s required to store a complex state of size 2^N :slight_smile:. If you want a good idea for how your code is going to scale up as you increase the number of qubits (i.e., what your RAM requirements are as a function of the number of qubits) then you can use one of the many profilers available for Python programs. Here’s a decent resource!
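One lightweight option along those lines, using only the standard library (note it reports host RAM per process, not GPU memory, which is what slurmstepd's oom_kill messages above typically refer to):

```
import resource
from mpi4py import MPI

def report_peak_rss(tag=""):
    """Print this process's peak resident set size (host RAM), labelled by MPI rank."""
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    rank = MPI.COMM_WORLD.Get_rank()
    print(f"[rank {rank}] {tag} peak RSS: {peak_kib / 2**20:.2f} GiB")

# For example, call report_peak_rss("after circuit") just after local_probs = circuit(params)
# at a few different qubit counts to see how host memory grows per rank.
```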

Let me know if this helps :slight_smile: