Performance of lightning.gpu vs lightning.qubit

Hello, I have been working on QAOA using PennyLane, so I was excited to see that version 0.22 adds support for NVIDIA cuQuantum.

The release blog post includes a bar chart showing the superior performance of the lightning.gpu device.

Unfortunately, I could not reproduce the same performance behavior with my simulations. I even saw the opposite behavior. For my QAOA simulations using 16 qubits, the lightning.qubit device performed one order of magnitude better than the lightning.gpu. I am confused about these results and cannot explain them. Consequently, it would be great to know which reference problem and which machine (CPU and GPU) was used to create the plot in the blog post. Then I could run the same problem to see if my CPU is “too good” or maybe cuQuantum is not installed correctly.

I ran all experiments on an NVIDIA DGX A100 machine with:

  • eight A100 GPUs (40 GB VRAM each)
  • dual AMD Rome 7742 CPUs (128 cores total) and 2 TB of RAM
  • PennyLane version 0.22.1

Thank you in advance!


Hi @leolettuce, thanks for the question.

For LightningGPU, anything under roughly 20 qubits will be faster on the CPU; GPU execution pays off in the 21–30 qubit range, especially when calculating multiple expectation values. The 21–27 qubit range plotted in the blog post should hold for the given GPUs when comparing against the same number of CPU threads. This is largely due to overheads in setting up the GPU device, as well as overheads internal to the cuQuantum library. It is a known issue, and we expect improvements with future NVIDIA cuQuantum library releases.

The data was collected on a DGX A100 box, so you should be able to reproduce it over the range the plot shows. For QAOA, depending on how you have set up the circuit, the depth may also not be sufficient to take advantage of the GPU.

You should see the best performance for circuits with multiple expectation values, beyond 20 qubits, and with several layers of depth. The benchmark problem is listed below the plot in the blog post: it evaluates a Jacobian over multiple parameters for a strongly entangling layered circuit.
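That kind of workload can be sketched as follows. This is my own minimal reconstruction under those assumptions, not the exact blog-post script, and the sizes are kept small so it runs quickly:

```python
# Sketch of a strongly-entangling layered circuit with multiple expectation
# values, differentiated via the adjoint method (sizes are illustrative;
# the regime discussed above is 20+ qubits).
import pennylane as qml
from pennylane import numpy as pnp

n_wires, n_layers = 4, 2
dev = qml.device("lightning.qubit", wires=n_wires)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    qml.StronglyEntanglingLayers(params, wires=range(n_wires))
    # multiple expectation values, one per wire
    return [qml.expval(qml.PauliZ(w)) for w in range(n_wires)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires)
params = pnp.array(pnp.random.random(shape), requires_grad=True)

# stack the outputs so qml.jacobian sees a single array-valued function
jac = qml.jacobian(lambda p: pnp.stack(circuit(p)))(params)
print(jac.shape)  # (n_wires,) + params.shape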

Feel free to let me know if there are any follow-up questions.


Thank you for your comprehensive answer @mlxd! Unfortunately, we could not reproduce the plot with the code snippet below the blog post.

We tried to reproduce the plot with the attached script (1.8 KB).

However, with our system and for 28 qubits, we get this output:

lightning.gpu:
3418.212 ms
lightning.qubit with 1 thread:
1.438 ms
lightning.qubit with 32 threads:
1.072 ms

As additional info, we use CUDA version 11.0. Maybe we are missing something fundamental. Do you get a different output from our code?

Hi @leolettuce. I just gave your script a try on a p3.2xlarge node and noticed that the Jacobians are not calculated correctly! In fact, the circuit is differentiated with the wrong number of trainable parameters. To fix this, you only need to update the params variable as follows:

from pennylane import numpy as pnp
params = pnp.array(pnp.random.random(param_shape), requires_grad=True)

Here are some results from running your script after this fix:

  • n_wires=24 and n_layers=1:
   lightning.gpu:
   17285.844 ms
   lightning.qubit with 1 thread:
   65557.119 ms
   lightning.qubit with 4 threads:
   65269.742 ms
  • n_wires=24 and n_layers=2:
   lightning.gpu:
   20855.491 ms
   lightning.qubit with 1 thread:
   122505.711 ms
   lightning.qubit with 4 threads:
   120864.459 ms

One more suggestion: compute qml.jacobian(circuit)(params) several times and print the average time instead, to get a more accurate and stable result for each case.
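That averaging can be done with a small helper like the following (the helper name and structure are my own, not from the script). Averaging over repeats smooths out one-off effects such as first-call setup costs:

```python
# Run a callable several times and report the mean and spread of the
# wall-clock times in milliseconds.
import time
from statistics import mean, stdev

def timed_average(fn, repeats=5):
    """Return (mean_ms, stdev_ms) over `repeats` calls of fn()."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1e3)
    return mean(times), stdev(times)

# usage with any zero-argument callable, e.g. lambda: qml.jacobian(circuit)(params)
avg_ms, sd_ms = timed_average(lambda: sum(range(100_000)), repeats=5)
print(f"{avg_ms:.3f} ms +/- {sd_ms:.3f} ms")
```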

I hope you find this helpful and feel free to let us know in case of any further issues or concerns.


Wow, thank you @maliasadi! Your answer is super helpful, and it makes total sense. Now I can reproduce the results 🙂
