Performance of lightning.gpu vs lightning.qubit

Hello, I have been working on QAOA using PennyLane, so I was excited to see that version 0.22 adds support for NVIDIA cuQuantum.

The release blog post includes a bar chart showing the superior performance of the lightning.gpu device.

Unfortunately, I could not reproduce the same performance behavior with my simulations. I even saw the opposite behavior. For my QAOA simulations using 16 qubits, the lightning.qubit device performed one order of magnitude better than the lightning.gpu. I am confused about these results and cannot explain them. Consequently, it would be great to know which reference problem and which machine (CPU and GPU) was used to create the plot in the blog post. Then I could run the same problem to see if my CPU is “too good” or maybe cuQuantum is not installed correctly.

I ran all experiments on an NVIDIA DGX A100 machine with:

  • eight A100 GPUs (40 GB VRAM each)
  • dual AMD Rome 7742 CPUs (128 cores total) and 2 TB of RAM
  • PennyLane version 0.22.1

Thank you in advance!


Hi @leolettuce, thanks for the question.

For LightningGPU, anything under roughly 20 qubits will be faster on the CPU; GPU execution pays off in the 21–30 qubit range, especially when calculating multiple expectation values. The 21–27 qubit range plotted in the blog post should hold for the given GPUs when comparing against the same number of CPU threads. This is largely due to overheads in setting up the GPU device, as well as overheads internal to the cuQuantum library. It is a known issue, and we expect improvements with future NVIDIA cuQuantum library releases.

The data was collected on a DGX A100 box, so you should be able to reproduce it over the range the plot shows. For QAOA, depending on how you have set up the circuit, the depth may also not be sufficient to take advantage of the GPU.

You should see the best performance for circuits with multiple expectation values, beyond 20 qubits, and with several layers of depth. The benchmark problem is listed below the plot in the blog post: it evaluates a Jacobian over multiple parameters for a strongly entangling layered circuit.
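That kind of workload can be sketched as follows. This is my own minimal reconstruction under those assumptions, not the exact blog-post script, and the sizes are kept small so it runs quickly:

```python
# Sketch of a strongly-entangling layered circuit with multiple expectation
# values, differentiated via the adjoint method (sizes are illustrative;
# the regime discussed above is 20+ qubits).
import pennylane as qml
from pennylane import numpy as pnp

n_wires, n_layers = 4, 2
dev = qml.device("lightning.qubit", wires=n_wires)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    qml.StronglyEntanglingLayers(params, wires=range(n_wires))
    # multiple expectation values, one per wire
    return [qml.expval(qml.PauliZ(w)) for w in range(n_wires)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires)
params = pnp.array(pnp.random.random(shape), requires_grad=True)

# stack the outputs so qml.jacobian sees a single array-valued function
jac = qml.jacobian(lambda p: pnp.stack(circuit(p)))(params)
print(jac.shape)  # (n_wires,) + params.shape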

Feel free to let me know if there are any follow-up questions.


Thank you for your comprehensive answer @mlxd! Unfortunately, we could not reproduce the plot with the code snippet below the blog post.

We tried to reproduce the plot with the attached script (1.8 KB).

However, with our system and for 28 qubits, we get this output:

lightning.gpu:
3418.212 ms
lightning.qubit with 1 thread:
1.438 ms
lightning.qubit with 32 threads:
1.072 ms

As additional info, we use CUDA version 11.0. Maybe we are missing something fundamental. Do you get a different output from our code?

Hi @leolettuce. I just gave your script a try on a p3.2xlarge node and noticed that the Jacobians are not calculated correctly! In fact, the circuit is differentiated with the wrong number of trainable parameters. To fix this, you only need to update the params variable as follows:

from pennylane import numpy as pnp
params = pnp.array(pnp.random.random(param_shape), requires_grad=True)

Here are some results from running your script after this fix:

  • n_wires=24 and n_layers=1:
   lightning.gpu:
   17285.844 ms
   lightning.qubit with 1 thread:
   65557.119 ms
   lightning.qubit with 4 threads:
   65269.742 ms
  • n_wires=24 and n_layers=2:
   lightning.gpu:
   20855.491 ms
   lightning.qubit with 1 thread:
   122505.711 ms
   lightning.qubit with 4 threads:
   120864.459 ms

One more suggestion: compute qml.jacobian(circuit)(params) several times and print the average time instead, to get a more accurate and stable result for each case.
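That averaging can be done with a small helper like the following (the helper name and structure are my own, not from the script). Averaging over repeats smooths out one-off effects such as first-call setup costs:

```python
# Run a callable several times and report the mean and spread of the
# wall-clock times in milliseconds.
import time
from statistics import mean, stdev

def timed_average(fn, repeats=5):
    """Return (mean_ms, stdev_ms) over `repeats` calls of fn()."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1e3)
    return mean(times), stdev(times)

# usage with any zero-argument callable, e.g. lambda: qml.jacobian(circuit)(params)
avg_ms, sd_ms = timed_average(lambda: sum(range(100_000)), repeats=5)
print(f"{avg_ms:.3f} ms +/- {sd_ms:.3f} ms")
```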

I hope you find this helpful and feel free to let us know in case of any further issues or concerns.


Wow, thank you @maliasadi! Your answer is super helpful, and it makes total sense. Now I can reproduce the results 🙂
