Performance of lightning.gpu vs lightning.qubit

Hello, I have been working on QAOA with PennyLane, so I was excited to see that version 0.22 supports NVIDIA cuQuantum.

The release blog post includes a bar chart showing the superior performance of the lightning.gpu device, as one can see here:

Unfortunately, I could not reproduce this performance with my simulations. I even saw the opposite behavior: for my QAOA simulations using 16 qubits, the lightning.qubit device was one order of magnitude faster than lightning.gpu. I am confused by these results and cannot explain them. Consequently, it would be great to know which reference problem and which machine (CPU and GPU) were used to create the plot in the blog post. Then I could run the same problem to see whether my CPU is “too good” or whether cuQuantum is not installed correctly.

I ran all experiments on an NVIDIA DGX A100 machine

  • with eight A100 GPUs with 40 GB VRAM
  • with a Dual AMD Rome 7742 CPU with 128 cores and 2 TB RAM
  • with PennyLane version 0.22.1

Thank you in advance!

Hi @leolettuce, thanks for the question.

For the LightningGPU runs, anything under 20 qubits will be faster on the CPU. GPU execution is best suited to workloads in the 21-30 qubit range, especially when calculating multiple expectation values. The 21-27 qubit range plotted in the blog post should hold for the given GPUs when compared against the same number of CPU threads. The slowdown for smaller circuits is largely due to overheads in setting up the GPU device, as well as overheads internal to the cuQuantum library. It is a known issue, and we should see improvements with future releases of the NVIDIA cuQuantum library.
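As a rough feel for how the amount of work grows across these qubit ranges, the sketch below computes the memory taken by a dense state vector. It assumes 16 bytes per amplitude (complex128), and `statevector_bytes` is a hypothetical helper written for this illustration, not part of PennyLane. Under that assumption a 16-qubit state occupies only 1 MiB, while a 28-qubit state occupies 4 GiB, so a fixed setup overhead weighs far more heavily on the small case.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector of 2**n_qubits amplitudes.

    bytes_per_amplitude=16 assumes complex128 storage (an assumption).
    """
    return (2 ** n_qubits) * bytes_per_amplitude


if __name__ == "__main__":
    # Qubit counts mentioned in this thread.
    for n in (16, 21, 27, 28):
        mib = statevector_bytes(n) / 2 ** 20
        print(f"{n} qubits: {mib:,.0f} MiB")
```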

The data was collected on a DGX A100 box, so you should be able to reproduce it over the range the plot shows. For QAOA, depending on how you have set up the circuit, the depth may also not be sufficient to take advantage of the GPU.

We should see the best performance for circuits with multiple expectation values, beyond 20 qubits, and with several layers of depth. The benchmark problem should be listed below the plot in the blog post; it evaluates a Jacobian over multiple parameters for a strongly-entangling layered circuit.

Feel free to let me know if there are any follow-up questions.

Thank you for your comprehensive answer @mlxd! Unfortunately, we could not reproduce the plot with the code snippet below the blog post.

We tried to reproduce the plot with the attached script (1.8 KB).

However, with our system and for 28 qubits, we get this output:

3418.212 ms
lighting.qubit with 1 thread:
1.438 ms
lighting.qubit 32 threads:
1.072 ms

As additional info, we use CUDA version 11.0. Maybe we are missing something fundamental. Do you get a different output from our code?

Hi @leolettuce. I just gave your script a try on a p3.2xlarge node and noticed that the Jacobians are not calculated correctly. The circuit is being differentiated with the wrong number of trainable parameters. To fix this, you only need to mark the params variable as trainable, as follows:

from pennylane import numpy as pnp
params = pnp.array(pnp.random.random(param_shape), requires_grad=True)

After this fix, I collected some results by running your script:

  • n_wires=24 and n_layers=1:
   17285.844 ms
   lighting.qubit with 1 thread: 
   65557.119 ms
   lighting.qubit 4 threads:
   65269.742 ms
  • n_wires=24 and n_layers=2:
   20855.491 ms
   lighting.qubit with 1 thread:
   122505.711 ms
   lighting.qubit 4 threads:
   120864.459 ms

One more suggestion: compute qml.jacobian(circuit)(params) several times and print the average time, to get a more accurate and stable result for each case.
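The averaging suggested above could look roughly like the following. This is a minimal sketch using only the Python standard library; `average_runtime_ms` and its `repeats` and `warmup` parameters are hypothetical names chosen for this example, and the workload shown is a stand-in rather than the actual circuit.

```python
import statistics
import time


def average_runtime_ms(fn, repeats=5, warmup=1):
    """Return (mean, stdev) of the wall-clock time of fn() in milliseconds."""
    # Untimed warm-up calls absorb one-off costs such as device setup.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # The standard deviation is undefined for a single sample.
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.mean(samples), spread


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this would instead be
    # lambda: qml.jacobian(circuit)(params)
    mean_ms, std_ms = average_runtime_ms(lambda: sum(i * i for i in range(100_000)))
    print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms")
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two devices is larger than the run-to-run noise.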

I hope you find this helpful and feel free to let us know in case of any further issues or concerns.