Hi,

I have recently been working on two QML projects that use PennyLane’s KerasLayer and TorchLayer to integrate quantum circuits into larger networks. In both cases (see this one for example), even though I configure the circuits to use no more than 4 qubits, training takes on the order of days *for a single epoch*.

In all cases, the networks contain four circuits defined like this:

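```
# Fragment from inside the model class; n_qubits, shots and the self.* attributes are set elsewhere
self.dev = qml.device(self.backend, wires=self.wires, shots=shots, gpu=self.use_gpu)

def _circuit(inputs, weights):
    # Encode the classical inputs as rotation angles, then apply the trainable entangling layers
    qml.templates.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.templates.BasicEntanglerLayers(weights, wires=range(n_qubits))
    # One PauliZ expectation value per qubit
    return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]

self.qlayer = qml.QNode(_circuit, self.dev, interface="tf")
```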

What options do I have to speed this up? After switching between backends, Keras reported the following estimated times to complete one epoch:

- default.qubit: 100 hrs
- qulacs+GPU: 46 hrs
- qiskit.basicaer: 120 hrs
- JAX+GPU: 292 hrs
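
A rough way to put these estimates in context is to convert an epoch time into a cost per circuit execution. The sketch below assumes that gradients come from the parameter-shift rule (two extra executions per trainable weight) and that every sample is run through each circuit separately. The dataset size and layer count are hypothetical placeholders, since neither is stated above, and the function names are made up for illustration.

```python
def executions_per_epoch(n_samples, n_circuits, n_weights):
    """Count circuit executions in one epoch, assuming parameter-shift gradients."""
    # One forward run plus two shifted runs per trainable weight (assumption)
    per_sample = 1 + 2 * n_weights
    return n_samples * n_circuits * per_sample


def seconds_per_execution(epoch_hours, n_executions):
    """Convert an estimated epoch time into mean seconds per circuit execution."""
    return epoch_hours * 3600.0 / n_executions


# Hypothetical values, NOT taken from the projects described above
n_qubits = 4
n_layers = 2
n_samples = 10_000

# BasicEntanglerLayers takes weights of shape (n_layers, n_qubits)
n_weights = n_layers * n_qubits

total = executions_per_epoch(n_samples, n_circuits=4, n_weights=n_weights)

# Epoch estimates reported by Keras in the list above
reported_hours = {
    "default.qubit": 100,
    "qulacs+GPU": 46,
    "qiskit.basicaer": 120,
    "JAX+GPU": 292,
}
for backend, hours in reported_hours.items():
    cost = seconds_per_execution(hours, total)
    print(f"{backend}: {cost:.3f} s per execution")
```

If the per-execution cost that comes out is far larger than a 4-qubit simulation should need, the overhead is more likely in the interface or gradient computation than in the simulator itself.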

How can I track down where the bottleneck is? Would reducing the number of shots speed things up? (I tried both 10 and 100, but it didn’t change much.)
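
One way to start narrowing this down might be to time a single circuit evaluation outside of Keras and compare it with the time per training step. This is only a minimal sketch: `time_call` and `fake_circuit` are hypothetical names, and `fake_circuit` is a plain-Python stand-in that would be replaced by the real QNode.

```python
import time


def time_call(fn, *args, repeats=5):
    """Return the mean wall-clock seconds per call of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


def fake_circuit(inputs, weights):
    # Hypothetical stand-in for the QNode; swap in the real circuit here
    return [x * w for x, w in zip(inputs, weights)]


mean_s = time_call(fake_circuit, [0.1] * 4, [0.2] * 4)
print(f"{mean_s:.6f} s per call")
```

Timing the forward evaluation and the gradient computation separately in the same way would show which of the two dominates.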