GPU does not help qml.AngleEmbedding for batched training data

I use "qml.AngleEmbedding" to embed my training data before "qml.BasicEntanglerLayers".
The whole circuit is converted with "qml.qnn.TorchLayer" and trained together with other classical torch layers. When I double the batch size, the training time also doubles.

My question is: although I understand that "lightning.gpu" outperforms "lightning.qubit" only when the number of qubits is greater than about 24, should I use "lightning.gpu" to enable GPU support for "qml.AngleEmbedding"?
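
For context, my setup looks roughly like this (a minimal sketch with placeholder sizes, not my exact code):

import pennylane as qml
import torch

n_qubits = 4          # placeholder number of qubits
n_layers = 2          # placeholder number of entangling layers

dev = qml.device("lightning.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(inputs, weights):
    # encode one feature per wire as a rotation angle
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (n_layers, n_qubits)}
qlayer = qml.qnn.TorchLayer(circuit, weight_shapes)

# hybrid model: quantum layer followed by a classical head
model = torch.nn.Sequential(qlayer, torch.nn.Linear(n_qubits, 1))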

Hi @weiyinchiang, welcome to the forum!

As you mention, "lightning.gpu" outperforms "lightning.qubit" at around 20 qubits and more. The device that you use will be used for the entire circuit, so both qml.AngleEmbedding and qml.BasicEntanglerLayers would run on this device. If your question is about compatibility, then yes, lightning.gpu is compatible with qml.AngleEmbedding. If your question is about the relative performance of lightning.gpu being lower for this embedding than for others, it is possible that this is the case. GPU use optimizes some processes more than others, and it is possible that for your specific case you're not seeing a big performance increase.

Does this answer your question?

Thanks for the answer.
I guess I am trying to understand the unreasonable GPU usage, with the aim of accelerating the training process.
In classical (PyTorch) training, increasing the batch size normally increases the memory usage on the GPU, while the training time doesn't increase (or not by much). Here I am experiencing something different from classical training: although my memory usage on the GPU is less than 10%, increasing the batch size seems to double the training time, and this comes from the circuit layer. The engineer from NVIDIA suspected that the angle embedding function is not supported by cuQuantum, which is why I asked the question. However, according to my test, both "lightning.qubit" and "lightning.gpu" show the same issue.
When I use "lightning.qubit", I explicitly call .to('cuda') on the model composed of the TorchLayer and a classical MLP (a minimal sketch is below the questions). So I think my question is better split into:

  1. Will the GPU support "qml.qnn.TorchLayer" if I explicitly call .to('cuda') on the model (when the QNode is assigned to "lightning.qubit")?
  2. To identify the issue with GPU usage when the batch size increases, could you suggest which function or test I should look at first?
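
For reference, what I mean by assigning .to('cuda') looks roughly like this (a minimal sketch reusing the placeholder names from the sketch in my first post, not my exact code):

# hybrid model: TorchLayer (lightning.qubit QNode) + classical MLP, moved to the GPU
model = torch.nn.Sequential(qlayer, torch.nn.Linear(n_qubits, 1)).to("cuda")

x = torch.rand(32, n_qubits, device="cuda")   # one batch of training data on the GPU
out = model(x)   # the torch parameters live on the GPU, but lightning.qubit still simulates the circuit on the CPU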

Thanks for your help!

Hi @weiyinchiang ,

It's great that you asked NVIDIA! They should be the best people to know whether cuQuantum supports angle embedding or not. If cuQuantum doesn't support it, this just means that this part of the computation will happen on the CPU instead of the GPU.

Regarding your specific questions, we may be able to provide a better answer if you share a minimal working example of your code that reproduces the problem. If you can also share some information about your system and the output of qml.about(), we can use this to try to uncover what's happening here. 🙂

Hi @weiyinchiang ,

In PennyLane, two possible things can happen with batched parameters internally:

  1. One circuit is simulated with multiple parameters simultaneously. This is what default.qubit does.

  2. The single circuit is split up into multiple executions, such that each execution only has one set of parameters. This is what most other devices, including lightning.qubit, tend to do.

Simulations with large numbers of qubits tend to take up a lot of memory. In this situation, simulating with more than one set of parameters at the same time will slow things down overall, not speed it up: too much time is spent moving large chunks of memory around. This is why the high-performance devices do not support native parameter broadcasting; they optimize for 20+ qubit simulations instead.
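
As a rough illustration (a minimal sketch with made-up sizes, not a benchmark), passing a batched input to a QNode on default.qubit produces one broadcast execution, while the same call on lightning.qubit is expanded into one execution per sample behind the scenes:

import pennylane as qml
import numpy as np

batched_x = np.random.rand(8, 2)   # batch of 8 samples, 2 features each

dev_broadcast = qml.device("default.qubit", wires=2)   # broadcasts natively
dev_split = qml.device("lightning.qubit", wires=2)     # splits the batch into separate executions

def circuit(x):
    qml.AngleEmbedding(x, wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

qnode_broadcast = qml.QNode(circuit, dev_broadcast)
qnode_split = qml.QNode(circuit, dev_split)

print(qnode_broadcast(batched_x))   # 8 results from a single broadcast execution
print(qnode_split(batched_x))       # the same 8 results, built from 8 separate executions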

If you have a circuit with fewer wires but many parameters, you can try using default.qubit.torch with cuda. This device supports execution on the GPU through torch and supports native parameter broadcasting.

dev = qml.device("default.qubit.torch", wires=1, torch_device='cuda')
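
For example, extending the hypothetical TorchLayer setup sketched earlier in the thread (placeholder names and sizes, not tested against your exact model), the usage could look something like:

import pennylane as qml
import torch

n_qubits, n_layers = 4, 2   # placeholder sizes

dev = qml.device("default.qubit.torch", wires=n_qubits, torch_device="cuda")

@qml.qnode(dev, interface="torch", diff_method="backprop")
def circuit(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(circuit, {"weights": (n_layers, n_qubits)})
model = torch.nn.Sequential(qlayer, torch.nn.Linear(n_qubits, 1)).to("cuda")

x = torch.rand(64, n_qubits, device="cuda")   # the whole batch is broadcast in one device execution
out = model(x)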

Thanks… I will try after the long holiday in Taiwan!

Ya, I got what you mean, and here is the benchmark result for the time cost per training iteration. But I still have no idea why the time doubles when the batch size increases.

Hey @weiyinchiang! The behaviour for lightning.qubit and lightning.gpu is expected, but the torch behaviour is a little strange. Could you run a tracker over your code with the three different devices? This will tell us whether or not the torch device is properly broadcasting.

We have a built-in tracker that you can access here:

https://docs.pennylane.ai/en/stable/code/api/pennylane.Tracker.html
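
Roughly, it could look like this (a sketch assuming dev is the device your QNode/TorchLayer was built with, and model and x_batch are your hybrid model and one batch of inputs):

with qml.Tracker(dev) as tracker:
    model(x_batch)   # one forward pass with a batch of inputs

print(tracker.totals["executions"])
# roughly equal to the batch size if the device splits the batch,
# close to 1 per forward pass if the device broadcasts natively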

Let us know what this gives!