Lightning.gpu taking a long time w.r.t. circuit depth

Hi Team, I’m running a 19-qubit circuit that implements a Q-RAM. It contains 2^14 multi-controlled NOT gates, each preceded by X gates. I’m using lightning.gpu to execute the circuit, but after 5-6 hours it still hasn’t produced any output. I have tried the same device with a smaller circuit and it works fine, but if the circuit depth is higher, what steps can I follow to reduce the time? Can you help with this?
The following is the device I have been working with; the circuit returns counts.

dev= qml.device("lightning.gpu", wires=19, shots=10000000)

The following are the circuit details.

{'gate_sizes': defaultdict(int, {1: 32790, 15: 16384, 3: 4}),
 'gate_types': defaultdict(int,
             {'Hadamard': 18, 'PauliX': 32772, 'C(RY)': 16387, 'CSWAP': 1}),
 'num_operations': 49178,
 'num_observables': 1,
 'num_diagonalizing_gates': 0,
 'num_used_wires': 19,
 'depth': 32771,
 'num_trainable_params': 0,
 'num_device_wires': 19,
 'device_name': 'lightning.gpu',
 'expansion_strategy': 'gradient',
 'gradient_options': {},
 'interface': 'auto',
 'diff_method': 'best',
 'gradient_fn': 'pennylane.gradients.parameter_shift.param_shift',
 'num_gradient_executions': 0}
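
To give a rough picture of the structure, here is a heavily scaled-down, illustrative sketch of the per-address gate pattern (with made-up wire counts and data values, not my actual Q-RAM circuit): each address is selected by flipping the address wires with X gates, a multi-controlled rotation then writes into the target wire, and the X gates are undone.

import pennylane as qml
import numpy as np

n_address = 3            # the real circuit uses 14 address wires
address_wires = list(range(n_address))
target_wire = n_address

dev = qml.device("lightning.gpu", wires=n_address + 1, shots=1000)

@qml.qnode(dev)
def qram_sketch(data):
    # put the address register into superposition
    for w in address_wires:
        qml.Hadamard(wires=w)
    # for each address: flip the controls with X gates, apply a
    # multi-controlled RY that encodes the stored value, then un-flip
    for addr, value in enumerate(data):
        bits = [int(b) for b in format(addr, f"0{n_address}b")]
        flipped = [w for w, b in zip(address_wires, bits) if b == 0]
        for w in flipped:
            qml.PauliX(wires=w)
        qml.ctrl(qml.RY, control=address_wires)(value, wires=target_wire)
        for w in flipped:
            qml.PauliX(wires=w)
    return qml.counts()

print(qram_sketch(np.linspace(0, np.pi, 2 ** n_address)))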

Thanks,
Suman

Hi @roysuman088
Can you post the details from qml.about() so we can see which versions of PennyLane and lightning.gpu you are using?

If you can also post a smaller test circuit that we can look into, that would help us identify whether there is anything we can suggest.

Thanks.

[Screenshot of qml.about() output]

Please find attached the output of qml.about(). I’ll try to put together a smaller version of the circuit so that the problem can be identified, but it’ll take some time.

Thanks


Thanks for the above information.

Also, can you share what generation of GPU you are running the example on?

Thanks.

Is it fine? I’m running the circuit on a server which has this configuration for the GPU.

Thanks.

Hi @roysuman088
Unfortunately lightning.gpu does not support the NVIDIA M40 Tesla cards, as these are not compatible with NVIDIA cuQuantum, which we rely on for the GPU calls. The minimum supported GPU generation for cuQuantum is the SM7.0 cards, which are V100s and newer. The M40 series are SM5.2. See WARNING: INSUFFICIENT SUPPORT DETECTED FOR GPU DEVICE WITH `lightning.gpu` - #7 by mlxd for another case where this is discussed.
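
If you want to double-check what the machine reports, here is a small illustrative snippet (not part of lightning.gpu itself) that queries the compute capability via nvidia-smi. Note that the compute_cap query field assumes a reasonably recent NVIDIA driver, so it may not be available on older installations:

import subprocess

# Query the GPU name and SM (compute capability) version via nvidia-smi.
# Assumes nvidia-smi is on the PATH and the driver supports the "compute_cap" field.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.strip().splitlines():
    name, cap = (s.strip() for s in line.split(","))
    supported = float(cap) >= 7.0  # cuQuantum requires SM 7.0 or newer
    print(f"{name}: SM {cap} -> {'supported' if supported else 'unsupported'} by lightning.gpu")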

I suspect this may be partly the reason that the runtime is taking a long time, as it is likely falling back to CPU-only operation.

Do you see a warning message when instantiating a lightning.gpu device on that machine? It should let you know that a compatible GPU device is not found, and so will fall-back to CPU-only operation.

A potentially good work-around is to try our lightning.kokkos research device. This backend may support the older GPU generation and give you better performance. To try it out, feel free to follow the guide below, though you may need to ensure the CUDA binaries and libraries are on your PATH and LD_LIBRARY_PATH first. I will assume you are using Ubuntu, and add the paths where they tend to be found on that OS:

git clone https://github.com/PennyLaneAI/pennylane-lightning-kokkos
cd pennylane-lightning-kokkos && git checkout v0.29.1
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64

python -m venv pyenv && source ./pyenv/bin/activate
python -m pip install pennylane
BACKEND="CUDA" python -m pip install -e .

The above process creates a local Python virtual environment, installs PennyLane and will try to build Lightning Kokkos for your given GPU architecture. If the generation is too old, it may not be supported.
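
Once the build finishes, a minimal sanity check (just a sketch) is to construct the device and run a trivial circuit inside the same virtual environment:

import pennylane as qml

# If the CUDA backend built correctly, this should construct and execute without error
dev = qml.device("lightning.kokkos", wires=2)

@qml.qnode(dev)
def bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

print(bell())  # expect 1.0 for the Bell state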

Hi @mlxd, while running I didn’t get any warning about a GPU-to-CPU transition, i.e. falling back to the CPU because no compatible GPU was found. But thanks for your clarification on the GPU device. Also, can you confirm: if I get an up-to-date GPU to run lightning.gpu on, roughly how much time would it take? Even an approximate estimate from your end would be helpful.

Thanks again for your clarification.

Hi @roysuman088
Thanks for letting us know.

If you have an interactive session on that server, can you try:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=2)

and it should give you a warning such as:

/pyenv/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:93: UserWarning: CUDA device is an unsupported version: (6, 1)
  warn(str(e), UserWarning)
/pyenv/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:828: RuntimeWarning: 
            !!!#####################################################################################
            !!!
            !!! WARNING: INSUFFICIENT SUPPORT DETECTED FOR GPU DEVICE WITH `lightning.gpu`
            !!!          DEFAULTING TO CPU DEVICE `lightning.qubit`
            !!!
            !!!#####################################################################################
            
  warn(

The above warning was run on a Pascal (SM6.1) device which shows what the warning should look like.

Unfortunately we cannot predict how long a circuit will take to run from problem to problem as there are so many factors that can affect execution time. The general rule-of-thumb we see is that for qubit counts > 20, the GPU runtimes should be about 10x faster than runtimes on large server CPUs.

Though, that is just a generalization, and may not reflect real-world examples. The only real way to know is to run your code on a given GPU at different scales and aim to extrapolate from that. For your problem size of 19 qubits, the ratio of overheads (CPU-GPU transfers, synchronization points, etc) to computation may be larger than for a 26 qubit workload, where the GPU is being fully used.
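
As a rough illustration of that approach (a sketch only, using a toy layered circuit as a stand-in for your workload), you could time the same circuit at increasing qubit counts on both lightning.qubit and lightning.gpu and extrapolate from the trend:

import time
import pennylane as qml

def make_circuit(dev, n_wires, n_layers=10):
    # toy workload: layers of single-qubit rotations followed by a CNOT ladder
    @qml.qnode(dev)
    def circuit():
        for _ in range(n_layers):
            for w in range(n_wires):
                qml.RY(0.1, wires=w)
            for w in range(n_wires - 1):
                qml.CNOT(wires=[w, w + 1])
        return qml.expval(qml.PauliZ(0))
    return circuit

for n_wires in range(18, 27, 2):
    for dev_name in ("lightning.qubit", "lightning.gpu"):
        dev = qml.device(dev_name, wires=n_wires)
        circuit = make_circuit(dev, n_wires)
        start = time.perf_counter()
        circuit()
        print(f"{dev_name:16s} {n_wires:2d} wires: {time.perf_counter() - start:.2f} s")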
