Backward function takes a long time, and batching

Hello,

I’m working on a project where I use the pennylane/PyTorch interface.
To do so I have built a model inheriting from nn.Module, in which I have defined my circuit inside my forward function.

I’m working on an 8-qubit simulation with 3 simple layers. Each layer has around 20 rotations and 8 CNOTs. I optimize some of the rotation parameters. My data is quite simple (8 features).
However, when it comes to the backward function, it takes a long time: a few minutes for each sample of my data.

  1. Is this normal, or did I do something wrong somewhere?
  2. Can I use a DataLoader with batches for a simulation?
  3. Is the gradient calculated (for simulation) as a classical object using autograd, or the quantum way (meaning I can only access measurements, with all the problems that come along)?

Best regards, and thank you for your work on this great library,

Barthélémy

Hi @barthelemymp,

To answer your questions (as best I can):

  1. If you have m parameters, it takes O(2m) circuit evaluations to compute the gradients (needed for backpropagation through the quantum part using PyTorch), since the parameter-shift rule uses two evaluations per parameter. We have noticed that the timing usually reflects this pretty well. How does your backward pass compare in run-time with the forward evaluation? (See the timing sketch after this list.)

  2. Batching is currently not supported (since none of the simulators support it), but it is a planned feature for the future.

  3. All gradients are calculated the quantum way, but we recognize that this can be sped up if i) you are using a simulator which supports (classical) automatic differentiation, and ii) we add this awareness to PL (i.e., it recognizes when the simulator can do gradient calculations classically). This is also on our roadmap, but first requires better simulator support.
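
To make the comparison in point 1 concrete, here is a minimal timing sketch (an illustrative toy circuit, not your exact model) for a Torch-interfaced QNode with parameter-shift gradients:

import time
import pennylane as qml
import torch

n_wires = 8
dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev, interface="torch", diff_method="parameter-shift")
def circuit(weights):
    for i in range(n_wires):
        qml.RY(weights[i], wires=i)
    for i in range(n_wires - 1):
        qml.CNOT(wires=[i, i + 1])
    return qml.expval(qml.PauliZ(0))

weights = torch.randn(n_wires, requires_grad=True)

start = time.time()
out = circuit(weights)
t_fwd = time.time() - start

start = time.time()
out.backward()
t_bwd = time.time() - start

# With m trainable parameters, parameter-shift needs ~2m extra circuit
# evaluations, so the backward/forward ratio should be roughly 2m
print(f"forward: {t_fwd:.3f}s, backward: {t_bwd:.3f}s, ratio: {t_bwd / t_fwd:.1f}")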


Hi @nathan, thanks for your answer. I am wondering: is there any update on batched calculations instead of using for loops (second bullet point above), or should one still use some sort of looping over the batch size? To give you an example, please look at the tutorial named “Quantum transfer learning”:

# Initialize an empty output tensor, then evaluate the quantum circuit on
# each element of the batch in turn and concatenate the results
q_out = torch.Tensor(0, n_qubits)
q_out = q_out.to(device)
for elem in q_in:
    q_out_elem = quantum_net(elem, self.q_params).float().unsqueeze(0)
    q_out = torch.cat((q_out, q_out_elem))

This is the bottleneck of the code: a for loop over the batch size for the quantum circuit.

Hi @mamadpierre,

Your question is well-timed. Support for batching circuits is our next major priority to implement in PennyLane. You should expect to see some progress on that pretty soon.

The interesting thing about batching in the case of quantum computations is that there are so many possible things you could “batch over”, e.g., parameter values (the most common use of batching in ML), measurement settings, sizes of circuits (as in your example), hyperparameter choices, ansatz choices, etc. One reason we have held off on implementing batching is that we want to make sure all these cases are covered as naturally as possible.

Thanks for your answer! :wink:

@nathan batching circuits is supported for QNodes, but it seems to be missing from the TorchLayer and the KerasLayer classes. Is that correct?

I tested it, and I can see that running the QNode on a batch X of inputs submits one job to the device with a number of circuits equal to the number of inputs in the batch. However, running the TorchLayer or the KerasLayer loops over the elements of the batch and submits one job per circuit. This slows down the computation on actual QPUs because of the overhead of queuing etc.

Is batching in these layers something that can be supported?

See the following code:

# %%
import pennylane as qml
from pennylane import numpy as np

# %%
import os 

n_wires = 4

# Set up your device
b = "ibm_perth"
dev = qml.device(
    "qiskit.ibmq",
    wires=n_wires,
    backend=b,
    ibmqx_token=os.getenv("PYTKET_QA_QISKIT_TOKEN"),
)

# %%
dev.capabilities()

# %%
# Create your qnode
@qml.qnode(dev)
def circuit(feature_vector, parameters):
    qml.AngleEmbedding(features=feature_vector, wires=range(n_wires), rotation="Z")
    qml.StronglyEntanglingLayers(weights=parameters, wires=range(n_wires))
    # Return the expectation value on the computational basis for every qubit
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]


# %%
# Create your feature vectors. Here X has 10 sets of 4 features. Parameter broadcasting happens under the hood.
X = np.random.random(size=(10, n_wires), requires_grad=False)

# Create your trainable parameters
shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=4)
weights = np.random.random(size=shape, requires_grad=True)



# %%
import matplotlib.pyplot as plt 
fig, ax = qml.draw_mpl(circuit, expansion_strategy="device")(X, weights)
plt.show()

# %% [markdown]
# Run on a batch of inputs with fixed weights:

# %%
circuit(X, weights)

# %% [markdown]
# Time for this `circuit` is ~35 seconds. The circuits for each element of `X` are batched into a single job and submitted to the device queue.

# %% [markdown]
# Rewrite the function to have the right signature for a `TorchLayer` and a `KerasLayer`

# %%
# Create your qnode
@qml.qnode(dev)
def circuit2(inputs, weights):
    qml.AngleEmbedding(features=inputs, wires=range(n_wires), rotation="Z")
    qml.StronglyEntanglingLayers(weights=weights, wires=range(n_wires))
    # Return the expectation value on the computational basis for every qubit
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]

# %% [markdown]
# ### `TorchLayer` run

# %%
import torch

# %%
weight_shapes = {"weights": shape}
init_method = {
    "weights": torch.tensor(weights),
}
qlayer = qml.qnn.TorchLayer(
    circuit2, weight_shapes=weight_shapes, init_method=init_method
)


# %%
qlayer(torch.tensor(X))

# %% [markdown]
# Time for this `qlayer` is ~2 minutes. The circuits for each element of `X` are individually sent to the device queue as different jobs.

# %% [markdown]
# ### `KerasLayer` run

# %%
import tensorflow as tf 


# %%
def my_init(shape, dtype=None):
    # Return the pre-generated `weights` as the initial value for the layer
    return tf.convert_to_tensor(weights, dtype=dtype)

# %%
weight_shapes = {"weights": shape}
weight_specs = {"weights": {"initializer": my_init}}

qlayer = qml.qnn.KerasLayer(
    circuit2, weight_shapes=weight_shapes, output_dim=n_wires, weight_specs=weight_specs
)


# %%
qlayer(tf.constant(X))


# %% [markdown]
# Time for this `qlayer` is ~2 minutes. The circuits for each element of `X` are individually sent to the device queue as different jobs.




Hi @erinaldiq,

batching circuits is supported for QNodes, but it seems to be missing from the TorchLayer and the KerasLayer classes. Is that correct?

Yes that’s right, we haven’t added batching support for TorchLayer and KerasLayer yet. This shouldn’t be too challenging to do though, and it’s on our radar.

To help you with some quick prototyping in the meantime, the code below gives you a hacked-together solution that overrides the forward() method of TorchLayer:

import pennylane as qml
import torch
import numpy as np

qml.enable_return()

n_wires = 3
batch_size = 5
dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev)
def circuit(inputs, weights):
    qml.AngleEmbedding(features=inputs, wires=range(n_wires), rotation="X")
    qml.StronglyEntanglingLayers(weights=weights, wires=range(n_wires))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]

np.random.seed(0)
shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_wires)
input_shape = (batch_size, n_wires)
inputs = torch.tensor(np.random.random(input_shape))

weight_shapes = {"weights": shape}
weights = np.random.random(size=shape)
init_method = {
    "weights": torch.tensor(weights),
}
qlayer = qml.qnn.TorchLayer(
    circuit, weight_shapes=weight_shapes, init_method=init_method
)

def forward(inputs):
    # Build the QNode arguments: the full batch of inputs plus the layer's
    # trainable weights, moved to the same device/dtype as the inputs
    kwargs = {
        **{qlayer.input_arg: inputs},
        **{arg: weight.to(inputs)
           for arg, weight in qlayer.qnode_weights.items()},
    }
    # Evaluate the QNode once on the whole batch, then stack and transpose
    # the per-observable results
    return torch.stack(qlayer.qnode(**kwargs)).T.type(inputs.dtype)

qlayer.forward = forward
qlayer(inputs)

assert dev.num_executions == 1

Though this may not work more generally.

Thanks for the reply @Tom_Bromley

I am using pennylane-0.28.0 and for a circuit with batched inputs I still get dev.num_executions==10 if the batch size is 10 when using the qiskit.ibmq device. Your example gives dev.num_executions==1 with the default.qubit device, and I can reproduce that.

The transpose in the return of the forward function rearranges the output, changing the first dimension from the batch size to the number of wires. Removing the .T seems to be correct, but I may be wrong.

I see that the TorchLayer forward pass is unstacking the first dimension:

        if len(inputs.shape) > 1:
            # If the input size is not 1-dimensional, unstack the input along its first dimension,
            # recursively call the forward pass on each of the yielded tensors, and then stack the
            # outputs back into the correct shape
            reconstructor = [self.forward(x) for x in torch.unbind(inputs)]
            return torch.stack(reconstructor)

while in your example you call qlayer.qnode on the entire batch.
This seems to do what I expect, and a single job is submitted to the IBMQ backend with a number of circuits equal to the batch size.

Can you explain (some of) the reason(s) why this is not the default behavior?

Thank you again for the quick reply!

Hi @erinaldiq!

Sorry for the late response.

I am using pennylane-0.28.0 and for a circuit with batched inputs I still get dev.num_executions==10 if the batch size is 10 when using the qiskit.ibmq device. Your example gives dev.num_executions==1 with the default.qubit device, and I can reproduce that.

Thanks, this is good to know. We have taken a closer look at the plugin and it looks like there is a bug. Although executions are being dispatched as a batch to Qiskit backends, the num_executions counter is being incorrectly incremented because it sits inside a post-processing for loop. @Romain_Moyard is looking to fix this shortly.

Can you explain (some of) the reason(s) why this is not the default behavior?

This is legacy behaviour due to TorchLayer not being updated. Currently, TorchLayer ensures that one input is passed to a QNode at a time (hence the unstacking of the first dimension). However, PennyLane now supports an outer batch dimension in parameters passed to a QNode, so we just need to update TorchLayer by removing the for loop.


@erinaldiq - to follow up here: with PR #4131 merged, TorchLayer and KerasLayer will have batching/broadcasting support in version 0.31 of PennyLane, due to come out at the end of June. If you do try this out, please let us know if you have any issues!
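
As a quick illustration, here is a minimal sketch (assuming PennyLane >= 0.31 and the default.qubit simulator) where a batched input is forwarded to the QNode in a single broadcasted execution:

import pennylane as qml
import torch

n_wires = 4
dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev)
def circuit(inputs, weights):
    qml.AngleEmbedding(features=inputs, wires=range(n_wires))
    qml.StronglyEntanglingLayers(weights=weights, wires=range(n_wires))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_wires)
qlayer = qml.qnn.TorchLayer(circuit, weight_shapes={"weights": shape})

x = torch.rand(10, n_wires)  # batch of 10 inputs
print(qlayer(x).shape)       # expected: torch.Size([10, 4]) from one batched execution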

I would like to speed up the computation of the gradients by using, for example, multiple CPU cores in parallel. Is this possible through batching?

Hey @Alexandru_Paler! Welcome to the forum :smiley:

Yep! This is possible with the pennylane-lightning plugin. Here’s some info:
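
As a rough sketch of the kind of setup involved (assuming the pennylane-lightning plugin is installed; lightning.qubit computes adjoint-method gradients in C++, and its CPU threading is typically controlled through the OMP_NUM_THREADS environment variable):

import pennylane as qml
from pennylane import numpy as np

n_wires = 8
dev = qml.device("lightning.qubit", wires=n_wires)

# Adjoint differentiation runs inside the C++ simulator rather than via
# repeated parameter-shift circuit evaluations
@qml.qnode(dev, diff_method="adjoint")
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=range(n_wires))
    return qml.expval(qml.PauliZ(0))

shape = qml.StronglyEntanglingLayers.shape(n_layers=3, n_wires=n_wires)
weights = np.random.random(size=shape, requires_grad=True)
print(qml.grad(circuit)(weights))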

If you have more questions, it might be best to move this conversation to an entirely new forum post :sweat_smile:. But, let me know if this information helps!

@Alexandru_Paler I think I may have misunderstood your question — if you’re asking for parallelization with parameter broadcasting, pennylane-lightning currently doesn’t support that. It’s something we may work on if it turns out to provide a good enough performance improvement.

Hope this helps! Please let me know if you’re still confused.

Thank you, @isaacdevlugt.

I was asking in order to parallelize the O(2m) circuit evaluations on 64 cores, for example, and get an approx. 64x speed-up.

I see! An alternative you can use is native parameter broadcasting. Not every device, function, etc., has native broadcasting support, but, where applicable, it can lead to a nice speedup :slight_smile:.
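
For instance, here is a minimal sketch using default.qubit, which supports native broadcasting; a leading batch dimension in the parameter yields one result per batch entry from a single call:

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev)
def circuit(theta):
    qml.RX(theta, wires=0)
    return qml.expval(qml.PauliZ(0))

# A batch of 5 parameter values is broadcast natively, producing
# 5 expectation values from a single call
thetas = np.linspace(0, np.pi, 5)
print(circuit(thetas))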

Check out these bits of documentation for more information: