Backprop for Lightning.gpu

Hello,

Are there plans to support the backprop differentiation method with lightning.gpu?
I can use lightning.gpu with adjoint diff and all is good.
In my use case I have lots of memory and care about speed, so it would be nice to have lightning.gpu with the backprop diff method.

P.S.
In some of my tests I see default.qubit + backprop being much faster than lightning.gpu + adjoint up until 18 qubits. Knowing how much faster lightning.gpu is than default.qubit on regular simulations, I would expect lightning.gpu + backprop to be much faster than lightning.gpu + adjoint.

P.P.S
Actually, it seems like adjoint doesn’t scale well with the number of output observables in the circuit:

  • when I only have 1 observable expectation as the output of my quantum layer, lightning.gpu + adjoint is ~10x faster than default.qubit + backprop.
  • However, when I have 18 observables as the output, default.qubit + backprop only slows down by ~3x, while lightning.gpu + adjoint slows down by ~30x, and they end up being roughly the same.

Hey @Hayk_Tepanyan!

In some of my tests I see default.qubit + backprop being much faster than lightning.gpu + adjoint up until 18 qubits.

Typically you’ll see improvements for >20 qubits with lightning.gpu. There are overheads that make it slower for smaller systems. If it isn’t faster for you in those regimes, it would be great if we could see your code and package versions to figure out what’s going on.

Currently we don’t support backprop here, as you mentioned. I’m not sure that this is on the horizon for us, but you can open a feature request on our GitHub repository if you want to put it on our radar more formally.

Hope this helps!

Thanks @isaacdevlugt,

We do see faster runtimes with lightning.gpu for >20 as you suggested.

On a related note, is there a way to configure lightning.gpu such that it can “treat” thousands of small circuits as 1 large circuit?
In other words, if instead of a single 20-qubit circuit I have 1024 10-qubit circuits, all based on the same parametrized QNode but with different params, is there a way to use lightning.gpu here and achieve a runtime similar to that of a single 20-qubit circuit?

This scenario is very common in QML use cases, where the 1024 above is the batch_size. Currently we see a clear linear dependency on batch_size when using GPUs, as opposed to classical ML cases.

Nice!

On a related note, is there a way to configure lightning.gpu such that it can “treat” thousands of small circuits as 1 large circuit?

Yep! Given how our GPU-distributed support works, I think it makes more sense to think of it as one large circuit parallelized over many GPUs. Check out our blog post for more info on how to do that: PennyLane v0.31 released | PennyLane Blog
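For concreteness, the distributed mode referred to above is enabled by a device flag. This is a configuration sketch only, based on the v0.31 release notes: it requires the MPI-enabled lightning.gpu build, mpi4py, and one GPU per MPI rank, so it cannot be run on an ordinary single-GPU or CPU machine.

```python
# Configuration sketch -- requires the MPI-enabled lightning.gpu build,
# mpi4py, and one GPU per MPI rank (per the v0.31 release notes).
from mpi4py import MPI  # must be imported so MPI is initialized first
import pennylane as qml

# mpi=True distributes the state vector of one large circuit across ranks
dev = qml.device("lightning.gpu", wires=30, mpi=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    qml.RY(params[0], wires=0)
    return qml.expval(qml.PauliZ(0))

# Launched with one process per GPU, e.g.:
#   mpirun -np 4 python script.py
```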

Let me know if that helps!