Lightning.gpu with multiple GPUs

Hey PennyLane community,

Does lightning.gpu support scaling to multiple GPUs?
I noticed that cuQuantum supports multi-node and multi-GPU execution. Is this also implemented in lightning.gpu?

If so, could someone point me to a simple example of setting this up?


Hi @ToMago
Do you have a specific use case in mind? lightning.gpu natively supports multi-GPU execution across observables in the adjoint differentiation method. If your circuits return multiple expectation values, the gradient computation is split across the available GPUs, one batch of observables per device. Feel free to see here for how to enable this.

For an example of the improvements we get when using this functionality, you can see the release notes of lightning.gpu 0.25.

If you are looking to split a single state-vector over multiple GPUs, we do not currently support offloading this to cuQuantum.

Let me know if you have any follow-on questions.

Hey @mlxd, and thanks for your answer!

I'm looking to train e.g. a PQC with a single expectation value measurement using adjoint differentiation.
So let's say I have a QNode circuit with a single measurement and train it with Adam on some cost function.
Does batch_obs=True batch the different circuit evaluations even when I only have a single measurement?
So I assume I use batch_input and the different circuit evaluations are distributed automatically?

Or does batch_obs only work for multiple measurements?


Hi @ToMago,

Unfortunately, batch_obs only works for multiple measurements. If you have a single measurement, you cannot split it across more than one GPU.

Something else you could try is circuit cutting. I'm not sure whether it will work with GPUs, but it could potentially help you reduce the memory requirements (although with a time overhead). You can learn more about it in this demo.

Please let me know if this helps!

Hi @mlxd and @CatalinaAlbornoz

Are there any plans for PennyLane to support splitting a single state-vector or a single measurement over multiple GPUs?

For this case, the splitting sounds straightforward:
for qml.expval(qml.PauliZ(0)), the expectation value is the dot product of two vectors: sum over i of (a_i * b_i). This can easily be split across multiple GPUs: sum over k of (a_k * b_k) + sum over j of (a_j * b_j) + …
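To make the idea concrete, here is a small NumPy sketch of that chunked sum. It runs on the CPU and only mimics the partitioning: each chunk stands in for the slice of the vectors that one GPU would hold. The random state, the qubit count, and the number of chunks are arbitrary choices for illustration. Note that this only covers the final reduction; applying gates to a state that is distributed this way also requires exchanging amplitudes between devices.

```python
import numpy as np

n_qubits = 10
dim = 2**n_qubits

# Random normalized state-vector (illustrative data only).
rng = np.random.default_rng(0)
state = rng.normal(size=dim) + 1j * rng.normal(size=dim)
state /= np.linalg.norm(state)

# Diagonal of PauliZ on wire 0. With wire 0 as the most significant bit,
# the first half of the basis states has eigenvalue +1, the second half -1.
z_diag = np.concatenate([np.ones(dim // 2), -np.ones(dim // 2)])

# <Z_0> as a single dot product: sum_i z_i * |psi_i|^2
probs = np.abs(state) ** 2
full = np.dot(z_diag, probs)

# The same sum, split into chunks (each chunk stands in for one GPU).
n_chunks = 4
partial_sums = [
    np.dot(z_part, p_part)
    for z_part, p_part in zip(
        np.array_split(z_diag, n_chunks), np.array_split(probs, n_chunks)
    )
]
chunked = sum(partial_sums)

print(np.isclose(full, chunked))
```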

Here are two links about matrix multiplication for multiple GPUs:

Hi @ToMago

Just to add to what @CatalinaAlbornoz has suggested: if your return value is a Hamiltonian composed of a sum of terms, this allows multi-GPU batching as of v0.26.2. For the upcoming release, we are also adding support for qml.SparseHamiltonian as part of the single-GPU adjoint pipeline. We have found that for many Hamiltonians this beats the multi-GPU distributed approach in runtime, at the expense of generating the sparse Hamiltonian up front.

On the circuit cutting side, we have successfully cut a 62-qubit QAOA problem into many 20-30 qubit circuits, allowing us to distribute the work over hundreds of GPUs. Some additional details can be found in this repo and this paper. I recently gave a talk on this too here if you are interested.

Feel free to let us know if we can be of further assistance.


Hi everyone,

thanks for all your input, this is very helpful!
I will also look further into circuit cutting.