PennyLane multi-GPU script fails with an error even though there are enough GPUs

QuantumMan · April 8, 2025, 5:58am

@CatalinaAlbornoz here is the whole Dockerfile for Perlmutter:


USER root

ENV DEBIAN_FRONTEND=noninteractive
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Install build tools
RUN apt-get update && \
    apt-get install -y build-essential wget m4 autoconf automake libtool pkg-config git cmake python3-dev python3-pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install MPICH
WORKDIR /tmp
ARG MPICH_VERSION=4.2.2
ARG MPICH_PREFIX=mpich-$MPICH_VERSION

RUN wget https://www.mpich.org/static/downloads/$MPICH_VERSION/$MPICH_PREFIX.tar.gz
RUN tar xvzf $MPICH_PREFIX.tar.gz
RUN cd $MPICH_PREFIX && \
    ./configure && \
    make -j 8 && \
    make install && \
    make clean && \
    cd .. && \
    rm -rf $MPICH_PREFIX

RUN /sbin/ldconfig
# Switch back to default user
USER cuquantum

ENV CONDA_DEFAULT_ENV=cuquantum-24.08
ENV PATH=/opt/conda/envs/cuquantum-24.08/bin:$PATH

RUN git clone https://github.com/PennyLaneAI/pennylane-lightning.git
RUN cd pennylane-lightning && git checkout latest_release
RUN cd pennylane-lightning && pip install -r requirements.txt
RUN cd pennylane-lightning && python -m pip install .
RUN cd pennylane-lightning && PL_BACKEND="lightning_gpu" python scripts/configure_pyproject_toml.py
RUN cd pennylane-lightning && CMAKE_ARGS="-DENABLE_MPI=ON" python -m pip install . -vv

# Environment variables
ENV MPICH_GPU_SUPPORT_ENABLED=1
ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

# Copy dist_mem_jacobian.py into the container and make it executable
COPY --chown=cuquantum:cuquantum dist_mem_jacobian.py /home/cuquantum/dist_mem_jacobian.py
RUN chmod +x /home/cuquantum/dist_mem_jacobian.py

CatalinaAlbornoz · April 8, 2025, 10:34pm

Hi @QuantumMan ,

From your error message it seems that you’re missing the path to libmpi.so, which should be found in LD_LIBRARY_PATH.

Please make sure to add the directory containing libmpi.so (path/to/libmpi.so) to LD_LIBRARY_PATH, and let us know if this resolves your issue.

If this doesn’t work then the issue might be with the installation of MPI on your system.
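
One quick way to check this is sketched below. It is a minimal, unofficial helper (the name find_libmpi is hypothetical, not part of PennyLane or MPICH) that scans every directory listed in LD_LIBRARY_PATH for files named libmpi.so*. It assumes a Linux-style, colon-separated path list.

```python
import os
from pathlib import Path


def find_libmpi(ld_library_path: str) -> list[str]:
    """Return every libmpi.so* file found in a colon-separated directory list."""
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing ':'
        directory = Path(entry)
        if directory.is_dir():
            # Match libmpi.so as well as versioned names like libmpi.so.12
            hits.extend(str(p) for p in sorted(directory.glob("libmpi.so*")))
    return hits


if __name__ == "__main__":
    found = find_libmpi(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        print("libmpi candidates:")
        for path in found:
            print("  " + path)
    else:
        print("No libmpi.so found in any LD_LIBRARY_PATH directory")
```

If nothing is printed as a candidate inside the container or job, the directory that holds libmpi.so (for the Dockerfile above, most likely /usr/local/lib, where MPICH installs by default) still needs to be added to LD_LIBRARY_PATH.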

I hope this helps!

QuantumMan · April 9, 2025, 10:16pm

Hi, Perlmutter still uses an MPICH 3.4-compatible MPI together with mpi4py 3.1.3. Is there a cuquantum_appliance version compatible with these versions? Otherwise a lot of ad hoc hacks are needed.

CatalinaAlbornoz · April 10, 2025, 9:37pm

Hi @QuantumMan ,

Just to confirm, did you add the path/to/libmpi.so to LD_LIBRARY_PATH, and are you still seeing the same error after doing so?

I’ve asked our team here to take a look in case the error persists after adding this step.

QuantumMan · April 12, 2025, 5:23am

Hi, I was able to resolve this by setting LD_LIBRARY_PATH, but I am now stuck at srun -n .., which fails with different errors on Perlmutter. I think the current version of PennyLane fails on HPC environments like Perlmutter.

I had a discussion with NERSC staff, and they mentioned that there is no Cray PE version that supports CUDA 12.2.

Below is the discussion. It would be good for @mlxd to try out the latest version and provide support for HPC environments.

ME:

Ok, just one last question: is “/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/targets/x86_64-linux/lib/libcudart.so.11.0:/usr/lib64/libcudart.so.11.0” always constant? Since I use cuda/12.2, should this be pointing to the 12.2 library?

HPC:

Right now yes because the cuda mpi library is linked against that.

ME:

Is there any version of CUDA MPI that uses 12.2?

NERSC:

Not from HPE as far as I know.

CatalinaAlbornoz · April 14, 2025, 3:57pm

Hi @QuantumMan

My colleague Lee has confirmed that the issue seems to be in the environment setup. There is nothing wrong with Cray MPICH and CUDA 12 on Perlmutter with Lightning GPU.

These steps produce a working Lightning GPU install (as of now, 0.41.0-rc):

1. Ensure Python 3.10+ is used to create the venv (module load python/3.11).
2. Ensure the appropriate modules are loaded (module load PrgEnv-gnu cray-mpich cudatoolkit craype-accel-nvidia80).
3. Clone pennylane-lightning, install the requirements-dev.txt file, and then install Lightning Qubit (python -m pip install -r requirements-dev.txt && CC=$(which cc) CXX=$(which CC) python -m pip install . --verbose), with the CC and CXX env-vars used to set the CrayPE compilers, which default to the GNU env.
4. Change the package to Lightning GPU and install that using the CrayPE compiler for MPI support (CMAKE_ARGS="-DENABLE_MPI=ON" CC=$(which mpicc) CXX=$(which mpicxx) python -m pip install . --verbose).
5. Install mpi4py using CrayPE MPICH (MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py).
6. Ensure the Cray MPICH libraries are available for dynamic loading by NVIDIA custatevec (export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH).
7. Allocate a debug session with 4 GPUs (salloc -N 1 -c 32 --qos interactive --time 0:30:00 --constraint gpu --ntasks-per-node=4 --gpus-per-task=1 --gpu-bind=none --account=XYZ).
8. Create an MPI-friendly script and run it with the above allocation (srun -n 4 python myscript.py).
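
Since the fix turned out to be in the environment setup, a small pre-flight check run before srun may help catch a stale environment early. The sketch below is not part of PennyLane or of the NERSC tooling. The function name preflight and its messages are hypothetical, and it only checks three things taken from the steps above: the Python version, whether the compiler wrappers (cc, CC, mpicc, mpicxx) are on PATH, and whether every CRAY_LD_LIBRARY_PATH entry also appears in LD_LIBRARY_PATH.

```python
import os
import shutil
import sys


def preflight(env, python_version=sys.version_info[:2]):
    """Return a list of detected problems; an empty list means none were found."""
    problems = []

    # The steps above ask for Python 3.10 or newer
    if python_version < (3, 10):
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than 3.10"
        )

    # Compiler wrappers used by the install commands in the steps above
    search_path = env.get("PATH", "")
    for tool in ("cc", "CC", "mpicc", "mpicxx"):
        if shutil.which(tool, path=search_path) is None:
            problems.append(f"{tool} not found on PATH (are the modules loaded?)")

    # Every Cray library directory should also be in LD_LIBRARY_PATH
    cray_dirs = [d for d in env.get("CRAY_LD_LIBRARY_PATH", "").split(":") if d]
    ld_dirs = set(env.get("LD_LIBRARY_PATH", "").split(":"))
    if not cray_dirs:
        problems.append("CRAY_LD_LIBRARY_PATH is empty or unset")
    missing = [d for d in cray_dirs if d not in ld_dirs]
    if missing:
        problems.append("not in LD_LIBRARY_PATH: " + ", ".join(missing))

    return problems


if __name__ == "__main__":
    issues = preflight(dict(os.environ))
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks consistent with the steps above")
```

Passing the environment in as a dictionary keeps the check easy to run against a saved environment dump as well as the live shell. A clean result does not guarantee a working install; it only rules out the specific problems listed.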

Note: we will release a new version of PennyLane in the next couple of days, so if you want to wait until Wednesday to do this you can use PennyLane v0.41.

QuantumMan · April 14, 2025, 10:49pm

Wow, this works. Thank you so much. I'm not sure what changed between the earlier instructions and these, but this is super helpful.

The previous instructions I tried were also from Lee; they worked before and recently broke. I'm not sure how we can avoid this in the future. Any thoughts? For now I am unblocked. Thank you @CatalinaAlbornoz for all the help.

QuantumMan · April 14, 2025, 10:59pm

Sharing in case anyone is interested: here is the link to where we use the distributed state vector to run some quantum mini-applications: quantum-mini-apps/src/mini_apps/quantum_simulation/distributed_state_vector/README.md at main · radical-cybertools/quantum-mini-apps · GitHub

CatalinaAlbornoz · April 17, 2025, 2:57pm

I’m glad to hear this worked, @QuantumMan! And thank you for sharing the link to your project.

Regarding your question about avoiding this in the future, unfortunately these things are part of the process of working with HPC. As you get more experience using HPC systems it will get easier to debug issues! When that day comes, hopefully you can help others here too. In the meantime you can also reach out to the system administrators if you encounter issues that are specifically about the system, or let us know here if you have questions about PennyLane.
