I have been able to run quantum circuits with 27 qubits using PennyLane's lightning.gpu device. My circuit returns only one observable's expectation value. I checked the GPU usage and found that only one GPU and its memory (40 GiB for an A100) were used, even though I set '#SBATCH --gpus-per-task=2' (or even set it to 3, 4, or more).
The reason for needing multiple GPUs' memory is that running circuits with 29 qubits or more requires more than the 40 GiB available on one GPU, which causes an out-of-memory error.
Assume we have 8 GPUs on one compute node, each with 40 GiB of memory; that gives 320 GiB of VRAM in total. If their memory could be pooled, lightning.gpu should be able to run 31-qubit quantum circuits.
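As a rough sanity check on those numbers: the bare state vector for n qubits holds 2^n complex128 amplitudes at 16 bytes each, and the simulator needs extra workspace on top of that, which is consistent with 29 qubits already overflowing a single 40 GiB card. A minimal sketch of the arithmetic:

```python
# Raw memory for a dense state vector: 2**n amplitudes of complex128 (16 bytes).
# The simulator needs additional workspace on top of this figure, which is why
# a 29-qubit circuit can exceed a single 40 GiB A100 in practice.
def statevector_gib(n_qubits: int) -> float:
    """GiB needed to hold the bare state vector of an n-qubit system."""
    return 2 ** n_qubits * 16 / 2 ** 30

for n in (27, 29, 31):
    print(f"{n} qubits -> {statevector_gib(n):g} GiB state vector")
```

So pooling 8 × 40 GiB would comfortably fit the 32 GiB state vector of a 31-qubit circuit, as you say, provided the simulator could actually distribute it.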
I was wondering: is there any setting that allows PennyLane's lightning.gpu to run 31-qubit quantum circuits?
Unfortunately, we do not support splitting a single observable across multiple GPUs. If you had more observables, you could split them across multiple GPUs by following the instructions here.
Something else you could try is circuit cutting. I'm not sure whether it works with GPUs, but it could potentially help you reduce the memory needs (although with a time overhead). You can learn more about it in this demo.
Please let me know if this helps!
Thank you for your response! I hope PennyLane can support splitting one observable across multiple GPUs in the future, so that it can simulate more qubits.
Thank you for the information on circuit cutting, which looks interesting. I will take a closer look.
I appreciate your help!
Just to jump in as well: we have decided not to add single-node, multi-GPU support to our current roadmap. While I understand this can be beneficial for increasing the qubit count by 1 or 2, we see it as of limited utility compared to our experience with circuit cutting. As Catalina suggested, the demo is a good place to start, and I will also add the repo and paper where this was demonstrated over 128 GPUs in a distributed manner.
We are, however, exploring a better approach to distributed multi-node, multi-GPU state-vector simulation, which generalizes the single-node approach you describe. We see this as more useful overall, since one can request a large number of HPC nodes and push qubit counts into the 30s-40s, depending on the available resources.
The single-node approach will likely be developed as part of the multi-node solution, though we do not have an estimated release window for this on our current roadmap.
Also, as suggested in the other thread, the circuit-cutting code and paper can be found here and here, and my talk on the numerics at NERSC here.
Feel free to let us know if you need any more information on the above.
Thanks a lot for the useful information. I agree that multi-node, multi-GPU state-vector simulation is great, since a single node is just a special case of it. I am very much looking forward to a PennyLane version that can run quantum circuits with 30s-40s qubits on a large number of HPC nodes.
Also, circuit cutting is very interesting. I will definitely take a closer look at it.