GPU does not help qml.AngleEmbedding for batched training data

I use "qml.AngleEmbedding" to embed my training data before "qml.BasicEntanglerLayers".
The whole circuit is converted with "qml.qnn.TorchLayer" and trained together with other classical torch layers. When I double the batch size, the training time also doubles.

My question is: although I understand that "lightning.gpu" outperforms "lightning.qubit" only when the number of qubits is greater than about 24, should I use "lightning.gpu" to enable GPU support for "qml.AngleEmbedding"?
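
For context, my setup looks roughly like this (a minimal sketch with placeholder sizes, not my exact code):

import pennylane as qml
import torch

n_qubits = 4          # placeholder number of qubits
n_layers = 2          # placeholder number of entangling layers

dev = qml.device("lightning.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(inputs, weights):
    # encode one feature per wire as a rotation angle
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (n_layers, n_qubits)}
qlayer = qml.qnn.TorchLayer(circuit, weight_shapes)

# hybrid model: quantum layer followed by a classical head
model = torch.nn.Sequential(qlayer, torch.nn.Linear(n_qubits, 1))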

Hi @weiyinchiang, welcome to the forum!

As you mention, "lightning.gpu" outperforms "lightning.qubit" at around 20 qubits and more. The device that you use will be used for the entire circuit, so both qml.AngleEmbedding and qml.BasicEntanglerLayers would run on this device. If your question is about compatibility, then yes, lightning.gpu is compatible with qml.AngleEmbedding. If your question is about the relative performance of lightning.gpu being lower for this embedding than for others, it is possible that this is the case. GPU use optimizes some processes more than others, and it is possible that for your specific case you're not seeing a big performance increase.

Does this answer your question?

Thanks for the answer.
I guess I am trying to understand the unreasonable GPU usage, with the aim of accelerating the training process.
In classical (PyTorch) training, increasing the batch size normally increases the memory usage on the GPU, while the training time doesn't increase (or not by much). Here I am experiencing something different from classical training: although my memory usage on the GPU is less than 10%, increasing the batch size seems to double the training time, and this comes from the circuit layer. The engineer from NVIDIA suspected that the angle embedding function is not supported by cuQuantum, which is why I asked the question. However, according to my test, both "lightning.qubit" and "lightning.gpu" show the same issue.
When I use "lightning.qubit", I explicitly call .to('cuda') on the model composed of the TorchLayer and a classical MLP (a minimal sketch is below the questions). So I think my question is better split into:

  1. Will the GPU support "qml.qnn.TorchLayer" if I explicitly call .to('cuda') on the model (when the QNode is assigned to "lightning.qubit")?
  2. To identify the issue with GPU usage when the batch size increases, could you suggest which function or test I should look at first?
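
For reference, what I mean by assigning .to('cuda') looks roughly like this (a minimal sketch reusing the placeholder names from the sketch in my first post, not my exact code):

# hybrid model: TorchLayer (lightning.qubit QNode) + classical MLP, moved to the GPU
model = torch.nn.Sequential(qlayer, torch.nn.Linear(n_qubits, 1)).to("cuda")

x = torch.rand(32, n_qubits, device="cuda")   # one batch of training data on the GPU
out = model(x)   # the torch parameters live on the GPU, but lightning.qubit still simulates the circuit on the CPU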

Thanks for your help!

Hi @weiyinchiang ,

It's great that you asked NVIDIA! They should be the best people to know whether cuQuantum supports angle embedding or not. If cuQuantum doesn't support it, this just means that this part of the computation will happen on the CPU instead of the GPU.

Regarding your specific questions, we may be able to provide a better answer if you share a minimal working example of your code that reproduces the problem. If you can also share some information about your system and the output of qml.about(), we can use this to try to uncover what's happening here. 🙂

Hi @weiyinchiang ,

In PennyLane, two possible things can happen with batched parameters internally:

  1. One circuit is simulated with multiple parameters simultaneously. This is what default.qubit does.

  2. The single circuit is split up into multiple executions, such that each execution only has one set of parameters. This is what most other devices, including lightning.qubit, tend to do.

Simulations with large numbers of qubits tend to take up a lot of memory. In this situation, simulating with more than one set of parameters at the same time will slow things down overall, not speed it up: too much time is spent moving large chunks of memory around. This is why the high-performance devices do not support native parameter broadcasting; they optimize for 20+ qubit simulations instead.
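
As a rough illustration (a minimal sketch with made-up sizes, not a benchmark), passing a batched input to a QNode on default.qubit produces one broadcast execution, while the same call on lightning.qubit is expanded into one execution per sample behind the scenes:

import pennylane as qml
import numpy as np

batched_x = np.random.rand(8, 2)   # batch of 8 samples, 2 features each

dev_broadcast = qml.device("default.qubit", wires=2)   # broadcasts natively
dev_split = qml.device("lightning.qubit", wires=2)     # splits the batch into separate executions

def circuit(x):
    qml.AngleEmbedding(x, wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

qnode_broadcast = qml.QNode(circuit, dev_broadcast)
qnode_split = qml.QNode(circuit, dev_split)

print(qnode_broadcast(batched_x))   # 8 results from a single broadcast execution
print(qnode_split(batched_x))       # the same 8 results, built from 8 separate executions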

If you have a circuit with fewer wires but many parameters, you can try using default.qubit.torch with cuda. This device supports execution on the GPU through torch and supports native parameter broadcasting.

dev = qml.device("default.qubit.torch", wires=1, torch_device='cuda')
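
For example, extending the hypothetical TorchLayer setup sketched earlier in the thread (placeholder names and sizes, not tested against your exact model), the usage could look something like:

import pennylane as qml
import torch

n_qubits, n_layers = 4, 2   # placeholder sizes

dev = qml.device("default.qubit.torch", wires=n_qubits, torch_device="cuda")

@qml.qnode(dev, interface="torch", diff_method="backprop")
def circuit(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(circuit, {"weights": (n_layers, n_qubits)})
model = torch.nn.Sequential(qlayer, torch.nn.Linear(n_qubits, 1)).to("cuda")

x = torch.rand(64, n_qubits, device="cuda")   # the whole batch is broadcast in one device execution
out = model(x)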

Thanks… I will try after the long holiday in Taiwan!

Ya, I got what you mean, and here is the benchmark result for the time cost per training iteration. But I still have no idea why the time doubles when the batch size increases.

Hey @weiyinchiang! The behaviour for lightning.qubit and lightning.gpu is expected, but the torch behaviour is a little strange. Could you run a tracker over your code with the three different devices? This will tell us whether or not the torch device is properly broadcasting.

We have a built-in tracker that you can access here:

https://docs.pennylane.ai/en/stable/code/api/pennylane.Tracker.html
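
Roughly, it could look like this (a sketch assuming dev is the device your QNode/TorchLayer was built with, and model and x_batch are your hybrid model and one batch of inputs):

with qml.Tracker(dev) as tracker:
    model(x_batch)   # one forward pass with a batch of inputs

print(tracker.totals["executions"])
# roughly equal to the batch size if the device splits the batch,
# close to 1 per forward pass if the device broadcasts natively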

Let us know what this gives!