PyTorch benchmarks, different devices, and computing resources

Hi everyone. I have run some tests with the following simple hybrid network, using PyTorch and treating the qnode as a TorchLayer. The code is the following (as a new user I cannot upload a file):

import numpy as np
from sklearn.datasets import make_moons
import torch
import matplotlib.pyplot as plt
import torch.nn as nn
import pennylane as qml
import sys
from time import perf_counter

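class Model(nn.Module):
    def __init__(self, dev, diff_method="backprop"):
        super().__init__()  # required before registering sub-modules
        self.cnet_in = self.cnet()
        # Bind the circuit to the device with the requested differentiation method
        self.qcircuit = qml.QNode(self.qnode, dev, interface="torch", diff_method=diff_method)
        weight_shape = {"weights": (2,)}
        self.qlayer = qml.qnn.TorchLayer(self.qcircuit, weight_shape)
        self.cnet_out = self.cnet()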

    def cnet(self):
        layers = [nn.Linear(2,10), nn.ReLU(True), nn.Linear(10,2), nn.Tanh()]
        return nn.Sequential(*layers)   

    def qnode(self, inputs, weights):
        # Data encoding:
        for x in range(len(inputs)):
            qml.RZ(2.0 * inputs[x], wires=x)
        # Trainable part:
        qml.RY(weights[0], wires=0)
        qml.RY(weights[1], wires=1)
        return [qml.expval(qml.PauliZ(wires=0)), qml.expval(qml.PauliZ(wires=1))]

    def forward(self, x):
        x1 = self.cnet_in(x)
        x2 = self.qlayer(x1)
        x_output = self.cnet_out(x2)
        return x_output

def train(X, y_hot, dev_name, diff_method):
    dev = qml.device(dev_name, wires=2, shots=None)
    model  = Model(dev, diff_method)
    # Train the model
    opt = torch.optim.SGD(model.parameters(), lr=0.2)
    loss = torch.nn.L1Loss()

    X = torch.tensor(X, requires_grad=False).float()
    y_hot = y_hot.float()

    batch_size = 5
    batches = 200 // batch_size

    data_loader = torch.utils.data.DataLoader(
        list(zip(X, y_hot)), batch_size=batch_size, shuffle=True, drop_last=True
    )

    epochs = 6

    for epoch in range(epochs):

        running_loss = 0

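        for xs, ys in data_loader:
            opt.zero_grad()  # clear gradients from the previous batch
            loss_evaluated = loss(model(xs), ys)
            loss_evaluated.backward()  # differentiate through classical and quantum layers
            opt.step()
            running_loss += loss_evaluated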

        avg_loss = running_loss / batches
        print("Average loss over epoch {}: {:.4f}".format(epoch + 1, avg_loss))

    y_pred = model(X)
    predictions = torch.argmax(y_pred, axis=1).detach().numpy()

    correct = [1 if p == p_true else 0 for p, p_true in zip(predictions, y)]
    accuracy = sum(correct) / len(correct)
    print(f"Accuracy: {accuracy * 100}%")

if __name__ == "__main__":
    X, y = make_moons(n_samples=200, noise=0.1)
    y_ = torch.unsqueeze(torch.tensor(y), 1)  # used for one-hot encoded labels
    y_hot = torch.scatter(torch.zeros((200, 2)), 1, y_, 1)
    begin_time = perf_counter()
    train(X, y_hot, str(sys.argv[1]), str(sys.argv[2]))
    end_time = perf_counter()
    runtime = end_time-begin_time
    print(f'Runtime: {runtime:.2e} s or {(runtime/60):.2e} min.')

The conda environment contains:

  • pytorch=1.10.2
  • pennylane=0.21.0
  • numpy=1.22.2


I ran tests with different combinations of devices and differentiation methods. Not all combinations are possible, of course; e.g., lightning.qubit does not support backprop at the time of writing this post. For each run I measured the runtime and monitored the memory usage and the number of cores utilised. The results can be reproduced with the code above, and they have been consistent with other tests on larger and more complex hybrid networks.

  1. The adjoint differentiation method has always been faster than backprop in my tests, with both lightning.qubit and default.qubit.
  2. The adjoint method consumed significantly less memory than backprop.
  3. For lightning.qubit + adjoint, I observe that many sub-processes are created and run in parallel. For default.qubit + adjoint, however, only one core is utilised.
  4. When using backprop + default.qubit I get the following warning:
/work/vabelis/miniconda3/envs/ae_qml_pnl/lib/python3.8/site-packages/torch/autograd/ UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811757556/work/aten/src/ATen/native/Copy.cpp:244.)

As also mentioned in this post.
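
For anyone wanting to reproduce the runtime and memory numbers, the measurements could be collected along the following lines. This is only a minimal sketch using the standard library: `measure` is a hypothetical helper, the `sum` call is a stand-in for `train(...)`, and the `resource` module is available on Unix-like systems only.

```python
import resource
import sys
from time import perf_counter


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, peak_rss_mib)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    elapsed = perf_counter() - start
    # Peak resident set size of this process so far.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return result, elapsed, peak / divisor


# Stand-in workload; in the script above this would wrap the call to train(...).
result, seconds, peak_mib = measure(sum, range(1_000_000))
print(f"Runtime: {seconds:.2e} s, peak RSS: {peak_mib:.1f} MiB")
```

Since the peak RSS never decreases during the lifetime of a process, each device/diff_method combination should be run in a fresh process (as the script above does via `sys.argv`) for the memory figures to be comparable.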


a. Do you think that the above observations are universal for hybrid networks? For example, will the adjoint method always be faster than backprop for larger networks and different measurement operators (e.g. single-qubit measurements)? Is there some threshold beyond which backprop is better?

b. When one uses the adjoint method, is the classical part of the network still trained with backprop in PyTorch? If so, I find observations 2 and 3 counterintuitive: with backprop, PyTorch utilises more cores via multithreading, whereas default.qubit + adjoint used only one core (and less memory) and still performed better than backprop + default.qubit, which consumed significantly more memory and more cores. I would be grateful for any insights on this matter :slight_smile:

c. If we are interested in the best possible balance between training time and resources required, is the recommended option always lightning.qubit + adjoint?

Hi @vabelis, great questions!

a. Adjoint differentiation should always be better. Its limitation is that it can only be used on simulators (same as backprop) and not on real hardware. I recommend that you take a look at the graphs at the end of this demo to get an idea of how the different methods compare in time performance.

b. The same demo on adjoint differentiation will give you some insight into how it works and why it’s so efficient. For your specific question, note that the differentiation method is set per QNode, so adjoint differentiation applies only to that particular QNode. Anything outside of the QNode is not differentiated with the adjoint method, so the classical layers are still handled by PyTorch's autograd (backprop).
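
To make the idea behind the adjoint method a bit more concrete, here is a toy single-qubit NumPy sketch. It is an illustration only, not PennyLane's actual implementation, and all helper names are hypothetical. It takes one qubit of the circuit from the question, RZ(2x) followed by RY(w) with a PauliZ measurement, runs the circuit forward once, and then walks backwards through the gates, un-applying each gate to get one gradient entry per gate angle.

```python
import numpy as np

PAULI_Y = np.array([[0, -1j], [1j, 0]])
PAULI_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def rz(theta):
    """RZ(theta) = exp(-i * theta * Z / 2)."""
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])


def ry(theta):
    """RY(theta) = exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def adjoint_gradients(ops, observable):
    """ops is a list of (gate_fn, generator, angle) applied in order to |0>.

    Returns the expectation value and its gradient w.r.t. each gate angle.
    """
    psi = np.array([1, 0], dtype=complex)
    for gate, _, theta in ops:  # single forward pass
        psi = gate(theta) @ psi
    bra = observable @ psi  # |b> = M|psi>
    expval = np.vdot(psi, bra).real

    grads = []
    for gate, generator, theta in reversed(ops):
        u = gate(theta)
        psi = u.conj().T @ psi  # state just before this gate
        du = (-0.5j * generator) @ u  # dU/dtheta for U = exp(-i * theta * G / 2)
        grads.append(2 * np.real(np.vdot(bra, du @ psi)))
        bra = u.conj().T @ bra  # move |b> one gate back as well
    return expval, grads[::-1]


x, w = 0.3, 0.7
expval, grads = adjoint_gradients([(rz, PAULI_Z, 2 * x), (ry, PAULI_Y, w)], PAULI_Z)
print(expval, grads)
```

Only two state vectors (`psi` and `bra`) are kept at any time, and no intermediate results from the forward pass are stored, which is the usual explanation for the lower memory footprint compared to backprop. Note that `grads[0]` is the derivative with respect to the RZ angle 2x; the derivative with respect to x itself picks up a factor of 2 from the chain rule.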

c. Yes, lightning.qubit + adjoint is the recommended combination if you need better performance.
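
If the number of cores matters as much as the runtime, one option is to cap the thread pools before the numerical libraries are loaded. The snippet below is a hedged sketch: `limit_threads` is a hypothetical helper, and it assumes that the installed builds of PyTorch and lightning.qubit honour the standard `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables, which may depend on the version and platform.

```python
import os


def limit_threads(n_threads):
    """Cap OpenMP/MKL thread pools; call this before importing torch or pennylane."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n_threads)


limit_threads(4)
# import torch, pennylane, ... only after this point
```

On the PyTorch side, `torch.set_num_threads` can also be used after import to limit intra-op parallelism. Running the benchmark with the same thread cap for every device/diff_method combination makes the runtime comparison fairer.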

I hope this helps!