I have been able to run quantum circuits with 27 qubits using PennyLane’s lightning.gpu. My circuit returns only a single expectation value. I checked the GPU usage and found that only one GPU and its memory (40 GiB for an A100) was used, although I set ‘#SBATCH --gpus-per-task=2’ (or even set it to 3, 4, or more).
The reason for needing the memory of multiple GPUs is that if we want to run circuits with 29 qubits or more, the required GPU memory exceeds 40 GiB, which causes an out-of-memory error.
Assume we have 8 GPUs on one compute node, each with 40 GiB of memory, so that we have 320 GiB of VRAM in total. If we could use their memory together, lightning.gpu should be able to run 31-qubit quantum circuits.
I was wondering whether there are any settings that allow PennyLane’s lightning.gpu to run 31-qubit quantum circuits?
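For reference, the memory arithmetic behind the question can be sketched in a few lines of Python. This is only a rough lower bound and assumes a dense statevector stored as complex128 (16 bytes per amplitude). The helper names statevector_bytes and max_qubits are made up for this illustration and are not part of PennyLane. Actual usage is typically higher than this estimate, since a simulator also needs workspace buffers and gradient methods can keep extra copies of the state, which may be why an out-of-memory error appears earlier than the raw numbers suggest.

```python
GIB = 2 ** 30  # bytes in one GiB


def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Raw size of a dense statevector: 2**n amplitudes (complex128 assumed)."""
    return (2 ** num_qubits) * bytes_per_amplitude


def max_qubits(memory_bytes: int, bytes_per_amplitude: int = 16) -> int:
    """Largest n whose raw statevector fits in memory_bytes (no overhead counted)."""
    amplitudes = memory_bytes // bytes_per_amplitude
    return amplitudes.bit_length() - 1


# Raw statevector sizes for a few qubit counts.
for n in (27, 29, 31, 34):
    print(f"{n} qubits: {statevector_bytes(n) / GIB:g} GiB")

# Upper bounds for one 40 GiB GPU and for 8 of them pooled together.
print("single 40 GiB GPU:", max_qubits(40 * GIB), "qubits at most")
print("8 x 40 GiB pooled:", max_qubits(320 * GIB), "qubits at most")
```

Each additional qubit doubles the memory, so pooling 8 GPUs buys at most 3 extra qubits over a single GPU, and fewer once overheads are included.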
Unfortunately, we do not support splitting one observable across multiple GPUs. If you had more observables, you could split them across multiple GPUs by following the instructions here.
Something else you could try is using circuit cutting. I’m not sure whether it will work with GPUs but it could potentially help you reduce the memory needs (although with a time overhead). You can learn more about it in this demo.
Just to jump in also. We have decided not to add single-node, multi-GPU support to our current roadmap. While I understand this can be beneficial to increase the qubit count by 1 or 2, we see this as limited utility compared to the experiences we have with circuit cutting. As Catalina suggested, the demo is a good place to start, and I will also add the repo and paper where this was demonstrated over 128 GPUs in a distributed manner.
We are, however, exploring a better approach to distributed multi-node, multi-GPU statevector simulation, which generalizes the single-node approach you discuss. We see this as more useful overall, as one can request a large number of HPC nodes and run qubit counts into the 30s-40s, depending on the available resources.
Likely the single-node approach may be developed as part of the multi-node solution, though we do not have an estimated release window for this in our current roadmap.
Also, as suggested in the other thread, the circuit-cutting code and paper can be found here and here, and my talk on the numerics at NERSC here.
Feel free to let us know if you need any more information on the above.
Thanks a lot for the useful information. I agree that multi-node multi-GPU statevector simulation is great, as a single node is just a special case of it. I am very much looking forward to a PennyLane version that can run quantum circuits with qubit counts in the 30s-40s using a large number of HPC nodes.
Also, the circuit cutting is very interesting. I will definitely take a closer look at it.
Hi @Jim, yes - you only need to pip install bluequbit after which you can submit your cirq or qiskit circuits to our CPU (up to 34 qubits) or GPU (up to 35 qubits) simulators.
More info in our docs.
You can also email me at firstname.lastname@example.org your questions / feedback!
Hi @CatalinaAlbornoz ,
That is correct - the circuits will be run on our CPUs or GPUs. Locally you would need very large machines (128GB+ memory) to run 32+ qubit circuits. That’s the value that bluequbit brings actually - you code on your laptop but use our large machines to simulate large circuits.
Hope this helps and let me know if you have more questions!
Thank you so much for your question, that is exactly what I was asking.
Thank you very much for your answer; it is very clear now. I agree that running 32+ qubit circuits needs a large amount of memory and is not suitable for a laptop or PC. I also noted that bluequbit can run 35-qubit circuits, which is excellent! However, researchers who need to run 32+ qubit circuits in most cases have access to HPC systems, locally or remotely, with enough CPU memory and multiple GPUs. So if bluequbit could run on users’ own computers, that would be greatly appreciated.
@CatalinaAlbornoz @mlxd In my opinion, one great strength of PennyLane is that it can run on users’ own computers and the code can be run on most simulators and quantum devices, which has greatly helped my research. I greatly appreciate the work from PennyLane! Right now, I am very much looking forward to PennyLane being able to run 30+ qubits. Again, from my point of view, dividing a large matrix into several smaller ones so that they can be processed by different GPUs in parallel seems straightforward, which means we could use more GPUs to handle more qubits; hopefully, PennyLane will have this feature in the near future.
Right now you can immediately make use of our lightning.qubit or our newly released lightning.kokkos (pip install pennylane-lightning-kokkos) backends. If an HPC node has a large pool of memory and/or CPU cores, you should be able to hit 30-32 qubits in the forward pass, and maybe a few fewer if you add gradients into the mix, which are natively supported by PennyLane’s C++-backed simulators. We have observed excellent scaling with OpenMP in both devices, depending on the workload. We are also natively supported under Amazon Braket, so if you are inclined to run PennyLane on cloud platforms, I’d suggest taking a look here.
For distributed GPUs on HPC systems, there is no widely available, general distribution strategy that works for all problems while still making the best use of the network and avoiding the performance loss for qubits with non-local memory access (because of the strided memory layouts, we need to ensure coalesced access for the best performance). Right now you can use the NVIDIA Docker containers for cuQuantum, but those still require a careful choice of your qubit indices and matchings to avoid a loss in performance. As mentioned previously, we do not support these yet with PennyLane.
The issue becomes how best to enable transmission of the indices across the network, such that the operations between qubits are localised and can make the best use of the local hardware resources (CPU, GPU, TPU, etc.). Right now, tensor-network methods scale incredibly well, due to the inherent parallelization of the paradigm, but statevector data-couplings require some care to ensure good scalability: an under-fed GPU is an unhappy GPU.
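To make the locality point concrete, here is a minimal Python sketch. It assumes the statevector is split into equal contiguous chunks across a power-of-two number of devices, and that qubit k corresponds to bit k of the amplitude index. Both are assumptions for illustration only, and real libraries may use a different ordering or layout. The function names are hypothetical and do not come from PennyLane or cuQuantum. A gate on qubit k pairs each amplitude with the one 2**k entries away, so low-index qubits stay within one device's chunk while the highest ones pair amplitudes held by different devices.

```python
def partner_index(index: int, target_qubit: int) -> int:
    """Index of the amplitude paired with `index` by a gate on `target_qubit`."""
    return index ^ (1 << target_qubit)


def owning_device(index: int, num_qubits: int, num_devices: int) -> int:
    """Device holding `index` when the state is split into contiguous chunks."""
    chunk_size = (1 << num_qubits) // num_devices
    return index // chunk_size


def needs_communication(target_qubit: int, num_qubits: int, num_devices: int) -> bool:
    """True if a gate on `target_qubit` pairs amplitudes on different devices.

    Assumes num_devices is a power of two.
    """
    device_bits = num_devices.bit_length() - 1  # log2(num_devices)
    local_qubits = num_qubits - device_bits
    return target_qubit >= local_qubits


# 31 qubits spread over 8 devices: 28 local qubits, 3 cross-device qubits.
for qubit in (0, 27, 28, 30):
    kind = "cross-device" if needs_communication(qubit, 31, 8) else "local"
    print(f"qubit {qubit}: stride {1 << qubit}, {kind}")
```

Under these assumptions, every gate touching one of the top three qubits would require exchanging amplitudes between devices, which is where the choice of qubit indices starts to matter for performance.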
We expect multi-node multi-GPU support to be available sometime in the coming months and will be happy to share this when it is available. We don’t want our users to simply throw more nodes at a problem and not get better performance too: both strong and weak scaling are our priorities.
Feel free to reach out if we can offer any further suggestions on scaling your PennyLane workloads, as we will be more than happy to help.
Hello, we use “lightning.qubit” on the GPU of our server, but there is a problem that has troubled us for a long time. The program only occupies one GPU, and its GPU utilization is very low while the CPU utilization is high. If other programs are running, the GPU utilization of this program is only 1%; it gets squeezed out by the other programs and runs very slowly. How can we solve this problem?
Hi @zj-lucky, lightning.qubit is a CPU-only simulator. For GPU simulation on NVIDIA devices we use lightning.gpu. To avoid jumping into the existing thread, I think it may be best to post a separate forum issue. If you have an example of your code, we’d be happy to assist there!
Thank you so much for sharing the progress and the plan. I am really excited to see this. I will explore the options and greatly look forward to the multi-node multi-GPU support! Right now, I am looking into the circuit cutting that you mentioned several months ago. Thanks!
Sorry, we made a typo: we use “lightning.gpu”.
As long as other programs are running, even if they do not occupy the GPU I am using, our program is still very slow.
Is it because of the code?
I would suggest you post the result of qml.about() in a separate forum issue and describe as many details as possible. Then @CatalinaAlbornoz, @mlxd, and others will be better able to identify where the possible problems are. Hope this helps.
lightning.gpu will indeed be very slow if you use fewer than 20 qubits. For over 20 qubits you will start seeing significant improvements compared to lightning.qubit. However, most current machines will still be unable to run more than 28 qubits.
Also, note that if you’re only measuring a single expectation value you can only run it on a single GPU, so this can affect your performance too.
If you want to share more details about your problem, it’s better to open a new topic here in the Forum. Please let me know in case you have trouble opening a new topic.