Hi everyone. I have run some tests with the following simple hybrid network, using PyTorch and treating the qnode as a `TorchLayer`. The code is the following (as a new user I cannot upload a file):

```
import sys
from time import perf_counter

import numpy as np
import pennylane as qml
import torch
import torch.nn as nn
from sklearn.datasets import make_moons


class Model(nn.Module):
    def __init__(self, dev, diff_method="backprop"):
        super().__init__()
        self.cnet_in = self.cnet()
        self.qcircuit = qml.qnode(dev, interface="torch", diff_method=diff_method)(
            self.qnode
        )
        weight_shape = {"weights": (2,)}
        self.qlayer = qml.qnn.TorchLayer(self.qcircuit, weight_shape)
        self.cnet_out = self.cnet()

    def cnet(self):
        layers = [nn.Linear(2, 10), nn.ReLU(True), nn.Linear(10, 2), nn.Tanh()]
        return nn.Sequential(*layers)

    def qnode(self, inputs, weights):
        # Data encoding:
        for x in range(len(inputs)):
            qml.Hadamard(wires=x)
            qml.RZ(2.0 * inputs[x], wires=x)
        # Trainable part:
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[0], wires=0)
        qml.RY(weights[1], wires=1)
        return [qml.expval(qml.PauliZ(wires=0)), qml.expval(qml.PauliZ(wires=1))]

    def forward(self, x):
        x1 = self.cnet_in(x)
        x2 = self.qlayer(x1)
        x_output = self.cnet_out(x2)
        return x_output


def train(X, y_hot, dev_name, diff_method):
    dev = qml.device(dev_name, wires=2, shots=None)
    model = Model(dev, diff_method)

    # Train the model
    opt = torch.optim.SGD(model.parameters(), lr=0.2)
    loss = torch.nn.L1Loss()

    X = torch.tensor(X, requires_grad=False).float()
    y_hot = y_hot.float()

    batch_size = 5
    batches = 200 // batch_size
    data_loader = torch.utils.data.DataLoader(
        list(zip(X, y_hot)), batch_size=batch_size, shuffle=True, drop_last=True
    )

    epochs = 6
    for epoch in range(epochs):
        running_loss = 0.0
        for xs, ys in data_loader:
            opt.zero_grad()
            loss_evaluated = loss(model(xs), ys)
            loss_evaluated.backward()
            opt.step()
            # .item() keeps the running total a plain float, so no autograd
            # graph is retained across batches
            running_loss += loss_evaluated.item()
        avg_loss = running_loss / batches
        print("Average loss over epoch {}: {:.4f}".format(epoch + 1, avg_loss))

    y_pred = model(X)
    predictions = torch.argmax(y_pred, axis=1).detach().numpy()
    # NOTE: `y` is the module-level label array defined in the __main__ block
    correct = [1 if p == p_true else 0 for p, p_true in zip(predictions, y)]
    accuracy = sum(correct) / len(correct)
    print(f"Accuracy: {accuracy * 100}%")


if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(42)

    X, y = make_moons(n_samples=200, noise=0.1)
    y_ = torch.unsqueeze(torch.tensor(y), 1)  # used for one-hot encoded labels
    y_hot = torch.scatter(torch.zeros((200, 2)), 1, y_, 1)

    # Usage: python <script> <device name> <diff method>
    begin_time = perf_counter()
    train(X, y_hot, sys.argv[1], sys.argv[2])
    end_time = perf_counter()
    runtime = end_time - begin_time
    print(f"Runtime: {runtime:.2e} s or {(runtime / 60):.2e} min.")
```

The conda environment contains:

- pytorch=1.10.2
- pennylane=0.21.0
- numpy=1.22.2

## Observations

I ran tests with different combinations of devices and differentiation methods. Of course, not all combinations are possible; for example, `lightning.qubit` does not support `backprop` at the time of writing this post. For each run I measured the runtime and monitored the memory usage and the number of cores utilised. The results can be reproduced with the code above, and they have been consistent with other tests on larger and more complex hybrid networks.
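For anyone who wants to reproduce these measurements, here is a minimal sketch of how runtime, CPU usage and peak memory could be collected around a single call. It is not what I used for the numbers in this post; `measure` and `RunStats` are hypothetical helper names, and the sketch assumes a Unix system (the standard-library `resource` module is not available on Windows). The ratio of CPU time to wall time gives a rough estimate of how many cores were busy on average.

```python
import resource  # Unix only
import time
from dataclasses import dataclass


@dataclass
class RunStats:
    wall_s: float  # elapsed wall-clock time in seconds
    cpu_s: float   # CPU time summed over all threads of this process
    peak_rss: int  # peak resident set size (kilobytes on Linux, bytes on macOS)

    @property
    def avg_cores(self) -> float:
        # CPU time / wall time is roughly the average number of busy cores.
        # Work done in child processes is NOT included in process_time().
        return self.cpu_s / self.wall_s if self.wall_s > 0 else 0.0


def measure(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, RunStats)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    # ru_maxrss is the peak over the whole lifetime of the process, so each
    # device/diff_method combination should be run in a fresh process.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, RunStats(wall_s, cpu_s, peak_rss)
```

With the script above, the call in the `__main__` block could be wrapped as `_, stats = measure(train, X, y_hot, sys.argv[1], sys.argv[2])`, running the script once per device and differentiation method so that the peak memory values are not mixed between configurations.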

- The adjoint differentiation method has always been faster than backprop in my tests, with both `lightning.qubit` and `default.qubit`.
- The adjoint method consumed significantly less memory than backprop.
- For `lightning.qubit` + `adjoint`, I observe that many sub-processes are created and run in parallel. However, for `default.qubit` + `adjoint`, only one core is utilised.
- When using `backprop` + `default.qubit` I get the following warning:

```
/work/vabelis/miniconda3/envs/ae_qml_pnl/lib/python3.8/site-packages/torch/autograd/__init__.py:154: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at /opt/conda/conda-bld/pytorch_1640811757556/work/aten/src/ATen/native/Copy.cpp:244.)
Variable._execution_engine.run_backward(
```

As also mentioned in this post.

## Questions

a. Do you think that the above observations are universal for hybrid networks? For example, will the adjoint method always be faster than backprop for larger networks and different measurement operators (e.g. single-qubit measurements)? Is there some threshold in circuit size above or below which backprop is better?

b. When one uses the adjoint method, is the classical part of the network still trained with backprop in PyTorch? If that is true, I find observations 2 and 3 counterintuitive. That is, when using `backprop`, PyTorch utilises more cores via multithreading. However, with `default.qubit` and `adjoint` only one core was used (and less memory too), and it still performed better than `backprop` + `default.qubit`, which consumed significantly more memory and used more cores. I would be grateful for any insights on this matter.

c. If we are interested in the best possible balance between training time and resources required, is the recommended option always `lightning.qubit` + `adjoint`?