PennyLane multi-GPU script fails with an error even though there are enough GPUs

Hello! If applicable, put your complete code example down below. Make sure that your code:

  • is 100% self-contained — someone can copy-paste exactly what is here and run it to
    reproduce the behaviour you are observing
  • includes comments
from mpi4py import MPI
import pennylane as qml
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
dev = qml.device('lightning.gpu', wires=8, mpi=True)
@qml.qnode(dev)
def circuit_mpi():
    qml.PauliX(wires=[0])
    return qml.state()
local_state_vector = circuit_mpi()
# Rank 0 gathers the local state vector held by every MPI process
state_vector = comm.gather(local_state_vector, root=0)
if rank == 0:
    print(state_vector)

If you want help with diagnosing an error, please put the full error message below:

> nvidia-smi --list-gpus | wc -l
4



srun -n 4 python testmpi2.py 
Traceback (most recent call last):
  File "/global/u1/p/prmantha/PilotQuantumBenchmarks/dist_mem_pennylane_mpi/testmpi2.py", line 5, in <module>
Traceback (most recent call last):
  File "/global/u1/p/prmantha/PilotQuantumBenchmarks/dist_mem_pennylane_mpi/testmpi2.py", line 5, in <module>
Traceback (most recent call last):
  File "/global/u1/p/prmantha/PilotQuantumBenchmarks/dist_mem_pennylane_mpi/testmpi2.py", line 5, in <module>
Traceback (most recent call last):
  File "/global/u1/p/prmantha/PilotQuantumBenchmarks/dist_mem_pennylane_mpi/testmpi2.py", line 5, in <module>
    dev = qml.device('lightning.gpu', wires=8, mpi=True)
    dev = qml.device('lightning.gpu', wires=8, mpi=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane/__init__.py", line 370, in device
    dev = qml.device('lightning.gpu', wires=8, mpi=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane/__init__.py", line 370, in device
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane/__init__.py", line 370, in device
    dev = qml.device('lightning.gpu', wires=8, mpi=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane/__init__.py", line 370, in device
    dev = plugin_device_class(*args, **options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 266, in __init__
    dev = plugin_device_class(*args, **options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 266, in __init__
    dev = plugin_device_class(*args, **options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 266, in __init__
    dev = plugin_device_class(*args, **options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 266, in __init__
    self._mpi_init_helper(self.num_wires)
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 308, in _mpi_init_helper
    self._mpi_init_helper(self.num_wires)
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 308, in _mpi_init_helper
    self._mpi_init_helper(self.num_wires)
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 308, in _mpi_init_helper
    self._mpi_init_helper(self.num_wires)
  File "/global/homes/p/prmantha/.local/lib/python3.11/site-packages/pennylane_lightning_gpu/lightning_gpu.py", line 308, in _mpi_init_helper
    raise ValueError(
ValueError: Number of devices should be larger than or equal to the number of processes on each node.
    raise ValueError(
ValueError: Number of devices should be larger than or equal to the number of processes on each node.
    raise ValueError(
ValueError: Number of devices should be larger than or equal to the number of processes on each node.
    raise ValueError(
ValueError: Number of devices should be larger than or equal to the number of processes on each node.
srun: error: nid200325: tasks 1-2: Exited with exit code 1
srun: Terminating StepId=20928774.12
slurmstepd: error: *** STEP 20928774.12 ON nid200325 CANCELLED AT 2024-01-28T09:11:54 ***
srun: error: nid200325: tasks 0,3: Exited with exit code 1

And, finally, make sure to include the versions of your packages. Specifically, show us the output of qml.about().

Name: PennyLane
Version: 0.32.0
Summary: PennyLane is a Python quantum machine learning library by Xanadu Inc.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /global/homes/p/prmantha/.local/lib/python3.11/site-packages
Requires: appdirs, autograd, autoray, cachetools, networkx, numpy, pennylane-lightning, requests, rustworkx, scipy, semantic-version, toml, typing-extensions
Required-by: PennyLane-Lightning, PennyLane-Lightning-GPU

Platform info:           Linux-5.14.21-150400.24.81_12.0.87-cray_shasta_c-x86_64-with-glibc2.31
Python version:          3.11.7
Numpy version:           1.23.5
Scipy version:           1.12.0
Installed devices:
- default.gaussian (PennyLane-0.32.0)
- default.mixed (PennyLane-0.32.0)
- default.qubit (PennyLane-0.32.0)
- default.qubit.autograd (PennyLane-0.32.0)
- default.qubit.jax (PennyLane-0.32.0)
- default.qubit.tf (PennyLane-0.32.0)
- default.qubit.torch (PennyLane-0.32.0)
- default.qutrit (PennyLane-0.32.0)
- null.qubit (PennyLane-0.32.0)
- lightning.qubit (PennyLane-Lightning-0.32.0)
- lightning.gpu (PennyLane-Lightning-GPU-0.32.0)

Hey @QuantumMan,

Thank you for your question! We have forwarded this question to members of our technical team who will be getting back to you within a week. Feel free to post any updates to your question here in the thread in the meantime!

Hi @QuantumMan

Just some follow-up questions that may help us identify the issue:

  • I assume this is running on Perlmutter (or an equivalent system). If so, can you confirm that your mpi4py installation is correctly built against, and running atop, the Cray-MPICH library? You may need to consult the system-specific documentation to validate this.
  • Does your build of lightning.gpu succeed when using the system-provided MPI compiler toolchains?
  • For a given allocation, are you requesting as many GPUs as the number of MPI processes you spawn? In essence, there should be 1 GPU per spawned MPI process.
  • Is the problem you are composing large enough to use the MPI pipeline? You can try scaling the number of wires to something like 20.
  • Assuming everything is built and set as above, are you launching the processes directly with srun, and not mpirun/mpiexec?
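
To put the "large enough" question in perspective, here is a minimal sketch that estimates the memory needed for a full state vector. It assumes 16 bytes per amplitude (complex128 precision), and `statevector_bytes` is a hypothetical helper written for this illustration, not part of PennyLane.

```python
# Rough memory estimate for a full state vector (illustrative only).
# Assumption: 16 bytes per amplitude, i.e. complex128 precision.


def statevector_bytes(num_wires: int, bytes_per_amplitude: int = 16) -> int:
    """Return the bytes needed to store all 2**num_wires amplitudes."""
    if num_wires < 0:
        raise ValueError("num_wires must be non-negative")
    return (2 ** num_wires) * bytes_per_amplitude


if __name__ == "__main__":
    for wires in (8, 20, 30):
        size = statevector_bytes(wires)
        print(f"{wires} wires: {size} bytes ({size / 2**20:.4f} MiB)")
```

Under this assumption, 8 wires need only 4 KiB in total, 20 wires need 16 MiB, and 30 wires need 16 GiB, which is where splitting the state across several GPUs starts to matter.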

If setting things up fresh, I’d also recommend using the PL 0.34 version for better gradient performance.

Let us know the above, and we should be able to help identify the cause.

Thank you for the comments. I followed the instructions at https://github.com/pradeepmantha/PilotQuantumBenchmarks/blob/main/dist_mem_pennylane_mpi/DIST_MEM_PENNYLANE_NERSC_SETUP_README.md multiple times, and I still see the same error message. I was able to get it working before with your instructions, but now it has stopped working, so either the Perlmutter base infrastructure changed or something else happened. I tried Qiskit Aer GPU on the same machine without any issue.

If you don’t mind, could you retry the steps above on NERSC Perlmutter and let me know the result?

“I’d also recommend using the PL 0.34 version for better gradient performance.” How do I get the latest version? The PL GPU GitHub repository seems to have only a 0.32.0 tag.

Hey @QuantumMan! Sorry for the delay, but we’ll get back to you as soon as we can.

Hi @QuantumMan, just to be sure, did you request a node and 4 GPUs with SLURM? For example:

salloc --nodes 1 -c 32 --gpus 4 --qos interactive --time 01:00:00 --constraint gpu --account=xxxxx

Without --gpus 4, you might see the GPUs with nvidia-smi but won’t be able to use them.

Hi,

Yes, it’s on Slurm. I used torch to check the number of GPUs available on the compute node, and 4 GPUs appear to be available. I am using the following Slurm command:

salloc -N 1 --qos interactive --time 02:00:00 --gpus=4 --ntasks-per-node=4 --gpus-per-task=1 --constraint gpu --account=

~/pennylane-lightning> python
Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> num_of_gpus = torch.cuda.device_count()
>>> print(num_of_gpus)
4
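
A GPU count taken in an interactive interpreter may not match what each task launched by srun sees. As a rough sketch, the script below prints, for each task, its Slurm task IDs and the value of CUDA_VISIBLE_DEVICES. `describe_task` is a hypothetical helper, and the sketch assumes that Slurm exports SLURM_PROCID and SLURM_LOCALID to each task and that GPU visibility is controlled through CUDA_VISIBLE_DEVICES.

```python
import os


def describe_task(env=None) -> str:
    """Summarise the Slurm task IDs and GPU visibility found in `env`."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    rank = env.get("SLURM_PROCID", "unset")
    local_id = env.get("SLURM_LOCALID", "unset")
    visible = env.get("CUDA_VISIBLE_DEVICES", "unset")
    return f"task {rank} (local id {local_id}): CUDA_VISIBLE_DEVICES={visible}"


if __name__ == "__main__":
    # Each task started by srun prints its own view of the environment.
    print(describe_task())
```

Launching it with the same srun options as the failing script (for example `srun -n 4 python show_gpus.py`, where the file name is arbitrary) shows whether every task sees all GPUs or only a subset.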

I finally figured out the problem.

I need to run salloc first with:

salloc -N 1 --qos interactive --time 02:00:00 --gpus=4 --constraint gpu --account=

Compared with my earlier command, this drops --ntasks-per-node=4 and --gpus-per-task=1, so all 4 GPUs are visible. I then run the script once the node is allocated:

srun -n 4 python dist.py
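
The condition named in the error message can be illustrated with a small sketch. This is not the lightning.gpu source code, only a hypothetical `check_devices` helper that mirrors the wording of the message. The two scenarios are assumptions inferred from this thread: that the earlier allocation with --gpus-per-task=1 left each task with a single visible GPU, and that the working allocation lets each task see all 4.

```python
def check_devices(visible_devices: int, procs_per_node: int) -> None:
    """Raise if a process sees fewer GPUs than there are processes on its node."""
    if visible_devices < procs_per_node:
        raise ValueError(
            "Number of devices should be larger than or equal to "
            "the number of processes on each node."
        )


if __name__ == "__main__":
    # Assumed scenario 1: 4 tasks on the node, each restricted to 1 visible GPU.
    try:
        check_devices(visible_devices=1, procs_per_node=4)
    except ValueError as err:
        print("fails:", err)

    # Assumed scenario 2: 4 tasks on the node, each seeing all 4 GPUs.
    check_devices(visible_devices=4, procs_per_node=4)
    print("passes")
```

If the assumption holds, it would explain why nvidia-smi and torch reported 4 GPUs on the node while every rank still raised the ValueError.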

Great that you worked it out @QuantumMan! Let us know if we can help with anything else.

Hi @QuantumMan, glad this was solved. Just to let you know, with the next release of lightning.gpu we are migrating from CUDA 11 to CUDA 12, as this brings additional benefits for the stack. The newest CrayPE (23.12) defaults to CUDA 12.2, and Cray-MPICH 8.1.28 natively expects this. The instructions below will help you get started with the latest Cray environment and libraries once the release is public. In the meantime, the same snippet lets you explore this using the current pre-release package, and we will be making this more widely available for use on Perlmutter in the coming weeks:

module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80 python

# Set up your python environment, preferably on the SCRATCH partition
python -m venv lgpu_env && source lgpu_env/bin/activate

# Build mpi4py 
MPICC="cc -shared" pip install --force --no-cache-dir --no-binary=mpi4py mpi4py

# Clone PennyLane Lightning and set-up your Python virtualenv
git clone https://github.com/PennyLaneAI/pennylane-lightning
cd pennylane-lightning && git checkout latest_release
python -m pip install -r requirements-dev.txt

# Build and install Lightning-Qubit
CXX=$(which CC) python -m pip install -e . --verbose

# Build and install Lightning-GPU+MPI using the CrayPE toolchain, Cray-MPICH and CUDA 12
CXX=$(which CC) PL_BACKEND="lightning_gpu" CMAKE_ARGS="-DENABLE_MPI=on -DCMAKE_CXX_COMPILER=$(which CC)" python -m pip install -e . --verbose

# Ensure the Cray library paths are used so that the MPI environment is visible to the NVIDIA cuQuantum libraries
# https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00113984en_us&page=Modify_Linking_Behavior_to_Use_Non-default_Libraries.html
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

# Follow NERSC's recommendations for using GPU-aware MPI
# https://docs.nersc.gov/development/languages/python/using-python-perlmutter/#using-mpi4py-with-gpu-aware-cray-mpich
export MPICH_GPU_SUPPORT_ENABLED=1

# Allocate your nodes/GPUs, and ensure each process can see all GPUs on its local node (--gpu-bind=none).
salloc -N 2 -c 32 --qos interactive --time 0:30:00 --constraint gpu --ntasks-per-node=4 --gpus-per-task=1 --gpu-bind=none --account=<Account ID>

# Run your workload, using a power-of-2 number of processes (and GPUs)
srun -n 8 python ./my_workload.py
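
To illustrate the power-of-2 requirement: a state vector of 2**n amplitudes can be split into equal chunks across p processes only if p is a power of 2 no larger than 2**n. Below is a minimal sketch of that arithmetic. `split_wires` and the terms "global" and "local" wires are names chosen for this illustration, not a lightning.gpu API.

```python
def split_wires(num_wires: int, num_procs: int) -> tuple:
    """Return (global_wires, local_wires) for an even split across processes.

    global_wires index which process holds a chunk; local_wires index the
    amplitudes inside each chunk, so every process stores 2**local_wires values.
    """
    # A positive power of 2 has exactly one bit set.
    if num_procs < 1 or (num_procs & (num_procs - 1)) != 0:
        raise ValueError("number of processes must be a power of 2")
    global_wires = num_procs.bit_length() - 1  # log2(num_procs)
    if global_wires > num_wires:
        raise ValueError("more processes than amplitudes to distribute")
    return global_wires, num_wires - global_wires


if __name__ == "__main__":
    global_wires, local_wires = split_wires(num_wires=8, num_procs=4)
    print(f"global wires: {global_wires}, local wires: {local_wires}")
    print(f"amplitudes per process: {2 ** local_wires}")
```

For the 8-wire example on 4 processes, this gives 2 global wires and 6 local wires, that is, 2**6 = 64 amplitudes per process.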

Feel free to reach out if there are any questions.