Lightning parallel adjoint: more threads than measurements?

schance995 · February 8, 2024, 3:12am

Hi! I’m reading the lightning qubit documentation for parallel adjoint differentiation and found this snippet:

# Before importing packages
import os
os.environ["OMP_NUM_THREADS"] = "4"  # environment values must be strings
import pennylane as qml
dev = qml.device("lightning.qubit", wires=2, batch_obs=True)

I believe this says that at most 4 measurements are computed in parallel over 4 threads. Does that mean that if this circuit had only 2 measurements, then only 2 threads would be used and the other 2 threads would be unused?

More generally, does each measurement use a single thread? Can I use more threads than the number of measurements to parallelize my circuit?

Vincent_Michaud-Riou · February 8, 2024, 4:39pm

I believe this says that at most 4 measurements are computed in parallel over 4 threads. Does that mean that if this circuit had only 2 measurements, then only 2 threads would be used and the other 2 threads would be unused?

That’s correct.

More generally, does each measurement use a single thread?

Yes.

Can I use more threads than the number of measurements to parallelize my circuit?

By default, L-Qubit maps OpenMP threads to observables. This is most efficient because it requires a minimal amount of communication between threads. However, as you pointed out, this prevents accelerating calculations when the number of observables is lower than the number of threads/cores on your machine.
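To make the arithmetic concrete, here is a tiny sketch of the one-thread-per-observable mapping described above. It is only an illustrative model of that description, not Lightning's actual scheduler, and the function name thread_usage is made up for this example.

```python
def thread_usage(num_threads: int, num_observables: int) -> tuple[int, int]:
    """Return (busy, idle) thread counts when each observable gets one thread."""
    if num_threads < 1 or num_observables < 0:
        raise ValueError("need at least one thread and a non-negative observable count")
    # A thread only has work if there is an observable left to assign to it.
    busy = min(num_threads, num_observables)
    return busy, num_threads - busy


print(thread_usage(4, 2))  # (2, 2): two threads work, two sit idle
print(thread_usage(4, 6))  # (4, 0): all threads busy, observables processed in batches
```

With OMP_NUM_THREADS set to 4 and only 2 measurements, the model gives 2 busy and 2 idle threads, which matches the answer above.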

There is another way to accelerate L-Qubit execution which is to parallelize gate- and generator-statevector products. This parallelization scheme is not activated by default and needs to be baked in by building L-Qubit from source with the CMake option -DLQ_ENABLE_KERNEL_OMP=1. It is expected to be less efficient because of hardware limitations but can be more scalable in the case you are talking about because there are typically more statevector entries to parallelize over than observables.

If you do not want to compile L-Qubit yourself, or just want to quickly try a statevector-parallelized simulator, you can also install L-Kokkos with pip install pennylane-lightning-kokkos, which parallelizes over the statevector out of the box.

Which scheme is better really depends on hardware and workload in the end. If performance is truly a limiting factor, I would suggest running a quick benchmark for each build and seeing how the performance differs.
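A quick benchmark along those lines could look like the sketch below. The helper names best_of and make_jacobian_job are hypothetical, the circuit layout and wire count are arbitrary choices, and the lightning.kokkos run assumes that plugin is installed. PennyLane is imported inside the function so that OMP_NUM_THREADS can still be set beforehand.

```python
import time


def best_of(fn, repeats=3):
    """Return the shortest wall-clock time in seconds over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def make_jacobian_job(device_name, n_wires=12, n_obs=2, **device_kwargs):
    """Build a zero-argument callable that computes an adjoint Jacobian once."""
    # Imported lazily so OMP_NUM_THREADS can be set before PennyLane loads.
    import pennylane as qml
    from pennylane import numpy as pnp

    dev = qml.device(device_name, wires=n_wires, **device_kwargs)

    @qml.qnode(dev, diff_method="adjoint")
    def circuit(params):
        for w in range(n_wires):
            qml.RX(params[w], wires=w)
        for w in range(n_wires - 1):
            qml.CNOT(wires=[w, w + 1])
        # n_obs controls how many measurements there are to parallelize over.
        return [qml.expval(qml.PauliZ(w)) for w in range(n_obs)]

    def cost(params):
        # Stack the measurements so qml.jacobian sees a single array.
        return qml.math.stack(circuit(params))

    params = pnp.array([0.1 * (w + 1) for w in range(n_wires)], requires_grad=True)
    return lambda: qml.jacobian(cost)(params)
```

For example, best_of(make_jacobian_job("lightning.qubit", batch_obs=True)) and best_of(make_jacobian_job("lightning.kokkos")) give two timings to compare. Varying n_obs and the thread count shows where each scheme starts to pay off on a given machine.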

schance995 · February 9, 2024, 4:01am

Great, thanks! Didn’t know about the kokkos simulator, will try it out!
