Backprop for Lightning.gpu

Hello,

Are there plans to support the backprop differentiation method with lightning.gpu?
I can use lightning.gpu with adjoint diff and all is good.
In my use case I have lots of memory and care about speed, so it would be nice to have lightning.gpu with the backprop diff method.

P.S.
In some of my tests I see default.qubit + backprop being much faster than lightning.gpu + adjoint up until 18 qubits. Knowing how much faster lightning.gpu is than default.qubit on regular simulations, I would expect lightning.gpu + backprop to be much faster than lightning.gpu + adjoint.

P.P.S
Actually, it seems like adjoint doesn’t scale well with the number of output observables in the circuit:

  • when I only have 1 observable expectation as the output of my quantum layer, lightning.gpu + adjoint is ~10x faster than default.qubit + backprop.
  • However, when I have 18 observables as the output, default.qubit + backprop only slows down by ~3x, while lightning.gpu + adjoint slows down by ~30x, and they end up being roughly the same.

Hey @Hayk_Tepanyan!

In some of my tests I see default.qubit + backprop being much faster than lightning.gpu + adjoint up until 18 qubits.

Typically you’ll see improvements for >20 qubits with lightning.gpu. There are overheads that make it slower for smaller systems. If it isn’t faster for you in those regimes, it would be great if we could see your code and package versions to figure out what’s going on.

Currently we don’t support backprop here, as you mentioned. I’m not sure that this is on the horizon for us, but you can open a feature request on our GitHub repository if you want to put it on our radar more formally.

Hope this helps!

Thanks @isaacdevlugt,

We do see faster runtimes with lightning.gpu for >20 as you suggested.

On a related note, is there a way to configure lightning.gpu such that it can “treat” thousands of small circuits as 1 large circuit?
In other words, if instead of a single 20-qubit circuit I have 1024 10-qubit circuits, all based on the same parametrized QNode but with different params, is there a way to use lightning.gpu here and achieve a runtime similar to that of a single 20-qubit circuit?

This scenario is very common in QML use cases, where the 1024 above is the batch_size. Currently we see a clear linear dependency on batch_size when using GPUs, as opposed to classical ML cases.

Nice!

On a related note, is there a way to configure lightning.gpu such that it can “treat” thousands of small circuits as 1 large circuit?

Yep! Given how our GPU-distributed support works, I think it makes more sense to think of it as one large circuit parallelized over many GPUs. Check out our blog post for more info on how to do that: PennyLane v0.31 released | PennyLane Blog
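For concreteness, the distributed mode referred to above is enabled by a device flag. This is a configuration sketch only, based on the v0.31 release notes: it requires the MPI-enabled lightning.gpu build, mpi4py, and one GPU per MPI rank, so it cannot be run on an ordinary single-GPU or CPU machine.

```python
# Configuration sketch -- requires the MPI-enabled lightning.gpu build,
# mpi4py, and one GPU per MPI rank (per the v0.31 release notes).
from mpi4py import MPI  # must be imported so MPI is initialized first
import pennylane as qml

# mpi=True distributes the state vector of one large circuit across ranks
dev = qml.device("lightning.gpu", wires=30, mpi=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    qml.RY(params[0], wires=0)
    return qml.expval(qml.PauliZ(0))

# Launched with one process per GPU, e.g.:
#   mpirun -np 4 python script.py
```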

Let me know if that helps!