I have been working lately on two QML projects that use PennyLane's KerasLayer and TorchLayer to integrate quantum circuits into larger networks. In both cases (see this one for example), although I set up the parameters so that there are no more than 4 qubits, the time needed to train the networks is on the order of days for a single epoch.
In all cases, the networks have four circuits defined like this:
self.dev = qml.device(self.backend, wires=self.wires, shots=shots, gpu=self.use_gpu)

def _circuit(inputs, weights):
    # encode the classical inputs as rotation angles, then apply entangling layers
    qml.templates.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.templates.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]

self.qlayer = qml.QNode(_circuit, self.dev, interface="tf")
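For context, the QNode then gets wrapped so it can sit inside the larger Keras model, roughly like this (a simplified sketch with illustrative weight shapes, not my exact project code):

n_layers = 2  # illustrative; sets the depth of BasicEntanglerLayers
weight_shapes = {"weights": (n_layers, n_qubits)}

# qml.qnn.KerasLayer turns the QNode into a trainable tf.keras layer
self.keras_layer = qml.qnn.KerasLayer(self.qlayer, weight_shapes, output_dim=n_qubits)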
I wonder what options I have to speed things up. Switching backends, Keras reported the following predicted times to complete one epoch:
default.qubit: 100 hrs
qulacs+GPU: 46 hrs
qiskit.basicaer: 120 hrs
JAX+GPU: 292 hrs
How do I trace back where the bottleneck is? Would it speed up if I reduced the number of shots? (I set it to 10 and 100 but it didn’t change much).
This is a great question and we’re certainly focusing on speeding up PennyLane at the moment.
One low-hanging fruit for achieving a speedup is to switch your differentiation method. Depending on your version of PennyLane, you may be using the parameter-shift rule to evaluate gradients. This method is a good option for hardware, but simulators can actually harness faster methods that leverage knowledge of the internal system state. In particular:
backprop: builds a computational graph of the simulation and differentiates in the same way as standard ML libraries
adjoint: works by reversing back through the circuit after a forward pass (needs PL version >= 0.14.0 and doesn’t yet support all gates and return types)
Note: if you are using PennyLane version >= 0.14.0, backprop will likely be the default-selected differentiation method (see here).
I’d be curious to see your current choice of differentiation method. If you are already using backprop or adjoint, we may have to think more carefully. Regarding the choice of shots, I’d typically just run without explicitly setting the value on a simulator (and hence run in analytic mode).
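Concretely, both points amount to small changes where the device and QNode are built. A minimal sketch (using default.qubit and illustrative parameters, not your exact setup):

import pennylane as qml

n_qubits = 4

# no shots argument: the simulator runs in analytic (exact) mode
dev = qml.device("default.qubit", wires=n_qubits)

def circuit(inputs, weights):
    qml.templates.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.templates.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]

# diff_method can be "backprop", "adjoint", "parameter-shift", ...
qnode = qml.QNode(circuit, dev, interface="tf", diff_method="adjoint")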
Thanks for the amazing reply. It helped to clarify a few things. I changed the parameters in the code to reflect your suggestions, i.e. I left shots at its default value and used default.qubit.tf as the backend. The result is that the ETA goes from 120 hrs with backprop to 50 hrs with adjoint. That’s more than a factor of 2, but not yet what I hoped to achieve (less than 1 hr per epoch). That is running on my MacBook Air laptop, which is not the best. I also tried a more powerful computer equipped with a GPU. The same code shows an ETA of 75 hrs with adjoint if the GPU is used, while it still reports something on the order of 60 hrs if I set CUDA_VISIBLE_DEVICES=-1.
Btw, in an unfair comparison, using the TF-2.4 LSTM implementation, one epoch takes 1 minute to train.
Thanks @rdisipio, this is really useful feedback for us too! Would you be able to share some of your code so that we can troubleshoot or identify the bottleneck more easily?
Thank you again for looking into this. I updated my code in this repo: https://github.com/rdisipio/hai-q
If you follow the README it shouldn’t be too hard to make it run. As a matter of fact, I noticed that if I switch to default.qubit.tf I get a rather weird TF error:
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype float64: <tf.Tensor: shape=(4,), dtype=float64, numpy=array([0.68470202, 0.94917192, 0.72625363, 0.65615942])>
I’ve probably been looking at the code for too long to spot the error…
Hey @rdisipio! I’ve encountered this error before; I believe it is because default.qubit.tf defaults to float64, so all gate arguments should also be float64 (unless the precision of the simulator is decreased).
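If that’s the cause here, one quick fix to try (a sketch, assuming the offending tensor comes from the Keras side) is to make TensorFlow work in float64 throughout, or to cast explicitly:

import tensorflow as tf

# option 1: have Keras create all floating-point tensors as float64,
# matching the default precision of default.qubit.tf
tf.keras.backend.set_floatx("float64")

# option 2: cast a specific tensor before it reaches the QNode
inputs = tf.constant([0.68, 0.95, 0.73, 0.66])  # toy stand-in for the real inputs
inputs = tf.cast(inputs, tf.float64)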
Coming back to this issue… I couldn’t really solve the problem yet, but I think I figured out that the main difference is in the for-loop over the elements of the input sequence (i.e. token embeddings in this case). If I try to use the TensorFlow backend, it won’t let me iterate over a tensor’s elements. The solution adopted in the TF-2.x implementation of LSTM really flies above my head. I then tried to code everything with Torch, so at least I don’t have the issue with tensors. This way I at least managed to try the backprop differentiation method with the default.qubit.autograd device, and there is indeed some small speedup, but not enough to make it viable for prototyping.
While I haven’t found a solution yet, I wonder if you have any insight concerning for-loops acting on tensors, and whether they are known in general to slow down this sort of calculation.
We encourage the best practice of trying not to iterate over tensor objects in PennyLane, as this won’t work in the TensorFlow interface since TensorFlow variables are not iterable (as you have found out).
Generally, vectorization is a good way to remove loop-based bottlenecks. Utilising operations borrowed from linear algebra, such as scalar products and element-wise multiplication, can lead to dramatic speed-ups!
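As a toy illustration in plain TensorFlow (nothing PennyLane-specific), here is the same scalar product computed element by element and then in a single vectorised call:

import tensorflow as tf

w = tf.random.uniform((1000,))
x = tf.random.uniform((1000,))

# loop-based: one tiny op per element, with Python overhead at every step
total = tf.constant(0.0)
for i in range(1000):
    total += w[i] * x[i]

# vectorised: one element-wise multiply followed by one reduction
total = tf.reduce_sum(w * x)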
Hope this helps and let us know if you have any more questions!
Thanks for your reply. I agree that in general one should avoid for-loops as much as possible. However, in my case the for-loop is necessary to account for the sequential nature of the calculation, i.e. the result at step t depends on the results at steps 0, …, t-1. If anyone knows how to solve this for recurrent networks… I’m listening!
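To make the structure concrete, the update looks schematically like this (a toy Torch sketch, not my actual model; cell stands in for the hybrid layer):

import torch

hidden_dim = 4
inputs = [torch.randn(hidden_dim) for _ in range(10)]  # toy token embeddings

def cell(x_t, h):
    # stand-in for the hybrid recurrent cell that would contain the quantum layer
    return torch.tanh(x_t + h)

h = torch.zeros(hidden_dim)
for x_t in inputs:
    # each step consumes the previous hidden state,
    # so the loop over t cannot simply be vectorised away
    h = cell(x_t, h)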
Thanks for the additional insight on your problem! While the trouble with loops in an RNN is difficult to resolve, there are other possible approaches to speeding up your experiment:
Use qulacs as a backend, which should give around a 10x speedup for executing the quantum circuits
Profile the code using a tool such as cProfile to identify other bottlenecks. Some key areas to monitor would be Torch, the Torch-PennyLane interaction, the PennyLane simulation itself, and how often circuits are being called in each epoch (see the sketch after this list)
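A short sketch of both suggestions (qulacs.simulator comes from the pennylane-qulacs plugin; train_one_epoch is a hypothetical placeholder for your own training step):

import cProfile
import pstats

import pennylane as qml

# swap in the qulacs backend; gpu=True selects the GPU simulator
# if qulacs was installed with GPU support
dev = qml.device("qulacs.simulator", wires=4, gpu=True)

# profile one epoch and print the 20 most expensive calls by cumulative time
cProfile.run("train_one_epoch()", "epoch.prof")
pstats.Stats("epoch.prof").sort_stats("cumulative").print_stats(20)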
We are currently working to speed up PennyLane, so this presents a great use case for us! Let us know how it goes and if you have any more questions!