Utilizing multi-core CPU clusters for QNN training


I am working on training an 8-layer QNN on the MNIST classification task using 10-15 qubits. I have access to the Compute Canada CPU cluster, which allows me to use up to 40 Intel Xeon Gold 6148 (Skylake) cores at 2.4 GHz, with a maximum allowed running time of 168 hours (1 week) per job.

Until now, I had been running tests on my PC using the lightning.qubit device, but because it does not use multiple CPU cores (as answered in https://discuss.pennylane.ai/t/lightning-qubit-cpu-utilisation/833), the model is currently taking too long to train.

I also considered using the pennylane-qulacs plugin to make use of parallelization. However, the benchmarks at https://docs.pennylane.ai/projects/qulacs/en/latest/ seem to show no improvement for the range of qubits I am working with.

In short: I am looking for a way to capitalize on the availability of multiple cores to run my program in a meaningful time frame.

Any help/suggestions regarding this would be greatly appreciated. I apologize in advance if anything is unclear from my query.


Hi @BlackFyre
From what I understand, you have only a single output observable, so the adjoint gradient differentiation method with lightning.qubit won't be able to use OpenMP-threaded observable calculations in your workload. Is that a correct statement? If you do use multiple observables, enabling diff_method="adjoint" with lightning.qubit will spin up one thread per observable in the gradient evaluation pipeline.

A lot has changed in the time since that post — lightning.qubit does support multithreading, but only in the gradient evaluation pipeline with the adjoint diff method.

If your workload doesn't benefit from this, you may want to look into our lightning.kokkos device (python -m pip install pennylane-lightning-kokkos). This supports OpenMP at the gate level, rather than the observable level, and, assuming you have few observables, should allow you to scale the performance by controlling the OpenMP threading environment.

I’d suggest trying this out with OMP_NUM_THREADS=<number of physical cores> OMP_PROC_BIND=false python yourscript.py and seeing how long it takes. Feel free to reach out if we can be of any further help.
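On the Compute Canada cluster, which is scheduled with SLURM, the same environment variables would go into the job script. A hypothetical submission script might look like the following (the account name, memory request, and script name are placeholders you would replace with your own):

```shell
#!/bin/bash
# Hypothetical SLURM job script for a Compute Canada CPU allocation.
#SBATCH --account=def-yourpi        # placeholder: your allocation account
#SBATCH --time=168:00:00            # the 1-week maximum mentioned above
#SBATCH --cpus-per-task=40          # the 40 cores available to you
#SBATCH --mem=64G                   # placeholder memory request

# Give OpenMP one thread per allocated core
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=false

python yourscript.py
```

It's worth benchmarking with a few different OMP_NUM_THREADS values first, since statevector simulation at 10-15 qubits may stop scaling before all 40 cores are busy.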


Hi @mlxd. Thanks a lot for your time in suggesting solutions. They helped me greatly!


Hello @BlackFyre,
A good paper by Kordzanganeh, M., et al. discusses how to optimize settings on different devices. Also, quantum transfer learning is a good model to try for image classification.
