Hi Team, I’m running a 19-qubit circuit that implements a QRAM circuit. It applies 2^14 multi-controlled NOT gates, each preceded by X gates. I’m using lightning.gpu to execute the circuit, but after 5-6 hours it still hasn’t produced any output. I have tried the same device on a smaller circuit and it works fine, but when the circuit depth is this high, what steps can be followed to reduce the runtime? Can you help with this?
The following is the device I have been working with and the circuit, which returns counts.
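Roughly, the structure is as follows (a simplified sketch rather than my exact code; the wire layout and shot count here are illustrative):

```python
import pennylane as qml

n_address = 14                  # 2**14 multi-controlled NOT gates in total
n_wires = 19                    # total qubits in the circuit

dev = qml.device("lightning.gpu", wires=n_wires, shots=1000)

@qml.qnode(dev)
def qram_circuit():
    target = n_address          # wire flipped by each multi-controlled NOT
    for address in range(2**n_address):
        # X gates select the control pattern for this address...
        for bit in range(n_address):
            if not (address >> bit) & 1:
                qml.PauliX(wires=bit)
        # ...then the multi-controlled NOT fires on that pattern
        qml.MultiControlledX(wires=list(range(n_address)) + [target])
        # undo the X gates before moving to the next address
        for bit in range(n_address):
            if not (address >> bit) & 1:
                qml.PauliX(wires=bit)
    return qml.counts()
```

This is what produces the very large depth: 2^14 MultiControlledX gates, each wrapped in X gates.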
Hi @roysuman088
Unfortunately lightning.gpu does not support the NVIDIA M40 Tesla cards, as these are not compatible with NVIDIA cuQuantum, which we rely on for the GPU calls. The minimum supported GPU generation for cuQuantum is SM7.0, i.e. the V100s and newer; the M40 series is SM5.2. See WARNING: INSUFFICIENT SUPPORT DETECTED FOR GPU DEVICE WITH `lightning.gpu` - #7 by mlxd for another case where this is discussed.
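If you want to double-check which SM generation your card reports, one option is to query it through the NVIDIA management library bindings (using the pynvml package here is a suggestion of mine, not something lightning.gpu requires):

```python
# pip install nvidia-ml-py   (provides the pynvml module)
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first GPU on the node
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
print(f"Compute capability: SM{major}.{minor}")        # lightning.gpu needs SM7.0+
pynvml.nvmlShutdown()
```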
I suspect this is partly why the runtime is so long: the simulation is likely falling back to CPU-only operation.
Do you see a warning message when instantiating a lightning.gpu device on that machine? It should let you know that a compatible GPU device was not found, and that it will fall back to CPU-only operation.
A potentially good workaround is our lightning.kokkos research device, which may support older GPU generations and give you better performance. To try it out, you can follow the guide below, though you may need to ensure the CUDA libraries and binaries are on your PATH and LD_LIBRARY_PATH first. I will assume you are using Ubuntu, with CUDA installed where it tends to be found on that OS.
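A sketch of the steps (the repository URL, CMake flag, and CUDA paths here are my best guess; check the Lightning Kokkos README for the exact instructions on your system):

```bash
# Make the CUDA binaries and libraries visible (default Ubuntu install paths)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Create and activate a local virtual environment, then install PennyLane
python -m venv pyenv
source ./pyenv/bin/activate
python -m pip install pennylane

# Build Lightning Kokkos with the CUDA backend for your GPU architecture
git clone https://github.com/PennyLaneAI/pennylane-lightning-kokkos
cd pennylane-lightning-kokkos
CMAKE_ARGS="-DKokkos_ENABLE_CUDA=ON" python -m pip install .
```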
The above process creates a local Python virtual environment, installs PennyLane, and tries to build Lightning Kokkos for your given GPU architecture. If the generation is too old, it may not be supported.
Hi @mlxd, while running I didn’t get any warning about a GPU-to-CPU transition, i.e. falling back to CPU because no GPU was found, but thanks for your clarification on the GPU device. Also, can you confirm how much time the circuit would take if I get an up-to-date GPU for running lightning.gpu? Even an approximate figure from your end would be helpful.
If you have an interactive session on that server, can you try:
```python
import pennylane as qml

dev = qml.device("lightning.gpu", wires=2)
```
and it should give you a warning such as:
```
/pyenv/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:93: UserWarning: CUDA device is an unsupported version: (6, 1)
  warn(str(e), UserWarning)
/pyenv/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:828: RuntimeWarning:
!!!#####################################################################################
!!!
!!! WARNING: INSUFFICIENT SUPPORT DETECTED FOR GPU DEVICE WITH `lightning.gpu`
!!! DEFAULTING TO CPU DEVICE `lightning.qubit`
!!!
!!!#####################################################################################
  warn(
```
The above warning was generated on a Pascal (SM6.1) device, and shows what the output should look like.
Unfortunately we cannot predict how long a circuit will take to run from problem to problem, as there are many factors that can affect execution time. The general rule of thumb we see is that for qubit counts above 20, GPU runtimes are roughly 10x faster than runtimes on large server CPUs.
Though that is just a generalization, and may not reflect real-world examples. The only real way to know is to run your code on a given GPU at different scales and extrapolate from that. For your problem size of 19 qubits, the ratio of overheads (CPU-GPU transfers, synchronization points, etc.) to computation may be larger than for a 26-qubit workload, where the GPU is fully utilized.
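If you do get access to a supported card, a quick way to build that extrapolation is to time the same workload at increasing qubit counts on both devices (a minimal sketch, assuming both lightning.gpu and lightning.qubit are installed; substitute your QRAM construction for the stand-in circuit):

```python
import time
import pennylane as qml

def run_workload(n_wires, device_name):
    """Time one execution of a fixed-depth stand-in circuit."""
    dev = qml.device(device_name, wires=n_wires)

    @qml.qnode(dev)
    def circuit():
        for layer in range(10):              # fixed depth, so only width varies
            for w in range(n_wires):
                qml.RX(0.1 * (layer + 1), wires=w)
            for w in range(n_wires - 1):
                qml.CNOT(wires=[w, w + 1])
        return qml.expval(qml.PauliZ(0))

    start = time.perf_counter()
    circuit()
    return time.perf_counter() - start

for n in range(16, 27, 2):
    t_gpu = run_workload(n, "lightning.gpu")
    t_cpu = run_workload(n, "lightning.qubit")
    print(f"{n} qubits: GPU {t_gpu:.3f}s, CPU {t_cpu:.3f}s, ratio {t_cpu / t_gpu:.1f}x")
```

Plotting those timings against qubit count should make it clear where the GPU starts to pay off for your workload.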