PennyLane multi-GPU support

Hi, I want to understand how multi-GPU support works in PennyLane v0.31. I read the blog post about the v0.31 release and its multi-GPU support. Can anyone tell me which additional libraries I need to install and import to use it?
I am currently using an NVIDIA DGX A100 system with CUDA Toolkit 11.7.0.

The circuit I want to use is:

import numpy as np
import pennylane as qml

wires = 4
dev4 = qml.device('lightning.gpu', wires=wires)
@qml.qnode(dev4)
def CONVCircuit(phi, wires, i=0):
    """
    quantum convolution Node
    """
    # parameter
    theta = np.pi / 2
    qml.Rot(phi[0]*2*np.pi/255,phi[1]*2*np.pi/255,phi[2]*2*np.pi/255, wires=0)
    qml.Rot(phi[3]*2*np.pi/255,phi[4]*2*np.pi/255,phi[5]*2*np.pi/255, wires=1)
    qml.Rot(phi[6]*2*np.pi/255,phi[7]*2*np.pi/255,phi[8]*2*np.pi/255, wires=2)
    qml.Rot(phi[9]*2*np.pi/255,phi[10]*2*np.pi/255,phi[11]*2*np.pi/255, wires=3)

    qml.RX(np.pi, wires=0)
    qml.RX(np.pi, wires=1)
    qml.RX(np.pi, wires=2)
    qml.RX(np.pi, wires=3)

    qml.CRZ(theta, wires=[1, 0])
    qml.CRZ(theta, wires=[3, 2])
    qml.CRX(theta, wires=[1, 0])
    qml.CRX(theta, wires=[3, 2])
    qml.CRZ(theta, wires=[2, 0])
    qml.CRX(theta, wires=[2, 0])

    # Expectation value
    measurement = qml.expval(qml.PauliZ(wires=0))

    return measurement

Please tell me what extra lines I need to add to the code above to enable multi-GPU support. I use Jupyter Notebook, so my code is in .ipynb files.

The output of qml.about():

Name: PennyLane
Version: 0.31.0
Summary: PennyLane is a Python quantum machine learning library by Xanadu Inc.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /dgxb_home/se21pphy004/miniconda3/envs/myenv/lib/python3.8/site-packages
Requires: appdirs, autograd, autoray, cachetools, networkx, numpy, pennylane-lightning, requests, rustworkx, scipy, semantic-version, toml
Required-by: PennyLane-Lightning, PennyLane-Lightning-GPU

Platform info:           Linux-5.4.0-144-generic-x86_64-with-glibc2.17
Python version:          3.8.17
Numpy version:           1.24.3
Scipy version:           1.10.0
Installed devices:
- default.gaussian (PennyLane-0.31.0)
- default.mixed (PennyLane-0.31.0)
- default.qubit (PennyLane-0.31.0)
- default.qubit.autograd (PennyLane-0.31.0)
- default.qubit.jax (PennyLane-0.31.0)
- default.qubit.tf (PennyLane-0.31.0)
- default.qubit.torch (PennyLane-0.31.0)
- default.qutrit (PennyLane-0.31.0)
- null.qubit (PennyLane-0.31.0)
- lightning.qubit (PennyLane-Lightning-0.31.0)
- lightning.gpu (PennyLane-Lightning-GPU-0.31.0)

Hello @mass_of_15 !

Would you mind giving me a bit more context?

Meanwhile, I strongly suggest taking a look at the Lightning documentation. You will probably find your answers there. :slight_smile:

I also recommend taking a look at this forum discussion.

Does it help? :slight_smile:

The code above is used for quantum image processing. Can this circuit run on multiple GPUs, with different images processed on different GPUs by the same circuit, so that the total runtime goes down?

@mass_of_15,

Here is a recent NVIDIA post on LinkedIn that might help. Best regards.

Thank you for the link, but it is about a different library, “CUDA Quantum 0.4”.
Can I use PennyLane v0.31 to run the same circuit on multiple GPUs, with different images processed on different GPUs, so that the total runtime goes down?

Hi @mass_of_15

The new multi-GPU support in lightning.gpu is for distributed execution, which lets you explore larger system sizes. When a circuit's statevector does not fit on one GPU, we can use MPI to spread it across more than one GPU. This design will not improve performance for problems like the one you are investigating, primarily because there is no advantage to distributing such a small problem with the cuQuantum MPI distribution mechanism.
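To make the size argument concrete, here is a rough back-of-the-envelope sketch. It assumes a dense statevector stored in double precision (complex128, 16 bytes per amplitude); the helper name statevector_bytes is made up for this illustration and is not part of PennyLane.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory needed for a dense statevector of 2**n_qubits amplitudes.

    Assumes complex128 (16 bytes per amplitude) unless told otherwise.
    """
    return (2 ** n_qubits) * bytes_per_amplitude


# Compare the 4-qubit circuit from this thread with larger registers.
for n in (4, 20, 30, 34):
    size = statevector_bytes(n)
    print(f"{n:2d} qubits: {size:>15,d} bytes ({size / 2**30:.6f} GiB)")
```

Under these assumptions the 4-qubit state takes only a few hundred bytes, while the memory doubles with every added qubit, so it is only for much larger registers that splitting the state across GPUs becomes necessary.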

For your problem, given the small number of qubits, I’d suggest using jax.jit with default.qubit and its support for CUDA backends. LightningGPU is built primarily as an HPC-focused simulator, and works best for deep circuits with more than 20 qubits in the register.

Hope this helps.
