Utilizing multi-core CPU clusters for QNN training

BlackFyre · July 4, 2023, 12:53pm

Hi!

I am working on training a QNN with 8 layers on the MNIST classification task using 10-15 qubits. I also have access to the Compute Canada CPU cluster that allows me to use up to 40 Intel Gold 6148 Skylake@2.4 GHz cores for running jobs with a maximum allowed running time of 168 hours (1 week).

Until this point, I had been running tests on my PC using the lightning.qubit device, but due to the lack of support for using multiple CPU cores, as answered in https://discuss.pennylane.ai/t/lightning-qubit-cpu-utilisation/833, the model is currently taking too long to train.

I also considered using the pennylane-qulacs plugin to take advantage of parallelization. However, the benchmarks shown in https://docs.pennylane.ai/projects/qulacs/en/latest/ suggest no improvement for the range of qubits I am working with.

In short: I am looking for a way to capitalize on the availability of multiple cores to run my program in a meaningful time frame.

Any help/suggestions regarding this would be greatly appreciated. I apologize in advance if anything is unclear from my query.

Thanks.

mlxd · July 4, 2023, 4:43pm

Hi @BlackFyre
From what I understand, you have only a single output observable, and so the adjoint gradient differentiation method with lightning.qubit won’t allow the use of OpenMP threaded observable calculations as part of your workload. Would that be a correct statement? If you do use multiple observables, making sure to enable diff_method="adjoint" with lightning.qubit will allow spinning up a thread per observable in the gradient evaluation pipeline.

A lot has changed since that post: lightning.qubit now supports multithreading, but only in the gradient evaluation pipeline with the adjoint diff method.

If your workload doesn’t benefit from this, you may want to look into our lightning.kokkos device (python -m pip install pennylane-lightning-kokkos). This supports OpenMP at the gate level rather than the observable level and, assuming you have few observables, should allow you to scale the performance by controlling the OpenMP threading environment.

I’d suggest trying this out with OMP_NUM_THREADS=<number of physical cores> OMP_PROC_BIND=false python yourscript.py and seeing how long it takes. Feel free to reach out if we can be any further help.
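If it is more convenient to set these variables from Python (for example inside a cluster job script), the command above can be sketched as a small launcher. This uses only the standard library; the function names `omp_env` and `run_with_threads` are hypothetical, and `yourscript.py` stands for your own training script.

```python
import os
import subprocess
import sys

def omp_env(num_threads, proc_bind="false"):
    """Return a copy of the current environment with the OpenMP settings applied."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)
    env["OMP_PROC_BIND"] = proc_bind
    return env

def run_with_threads(script, num_threads):
    """Run `script` in a child Python process that sees the OpenMP settings."""
    return subprocess.run(
        [sys.executable, script],
        env=omp_env(num_threads),
        check=True,  # raise if the script exits with an error
    )
```

Calling `run_with_threads("yourscript.py", 40)` would then correspond to the shell command above with 40 physical cores. Launching a child process means the variables are already set when the script starts, which avoids depending on when the OpenMP runtime reads them.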

BlackFyre · August 6, 2023, 6:23pm

Hi @mlxd. Thanks a lot for your time in suggesting solutions. They helped me greatly!

kevinkawchak · August 6, 2023, 8:25pm

Hello @BlackFyre,
There is a good paper by Kordzanganeh, M., et al. that discusses how to optimize settings on different devices. Also, quantum transfer learning is a good model to try for image classification.

BlackFyre · November 7, 2023, 4:41am

Hi @kevinkawchak,
Thanks a lot for the interesting transfer learning suggestion. The benchmarking paper was also very helpful.

Thanks again to everyone for contributing!
