Lightning parallel adjoint: more threads than measurements?

Hi! I’m reading the lightning qubit documentation for parallel adjoint differentiation and found this snippet:

# Set the thread count before importing packages
import os
os.environ["OMP_NUM_THREADS"] = "4"  # environment variable values must be strings
import pennylane as qml
dev = qml.device("lightning.qubit", wires=2, batch_obs=True)

I believe this says that at most 4 measurements are computed in parallel over 4 threads. Does that mean that if this circuit had only 2 measurements, then only 2 threads would be used and the other 2 threads would be unused?

More generally, does each measurement use a single thread? Can I use more threads than the number of measurements to parallelize my circuit?

I believe this says that at most 4 measurements are computed in parallel over 4 threads. Does that mean that if this circuit had only 2 measurements, then only 2 threads would be used and the other 2 threads would be unused?

That’s correct.

More generally, does each measurement use a single thread?

Yes.

Can I use more threads than the number of measurements to parallelize my circuit?

By default, L-Qubit maps OpenMP threads to observables. This is most efficient because it requires a minimal amount of communication between threads. However, as you pointed out, this prevents accelerating calculations when the number of observables is lower than the number of threads/cores on your machine.
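As a rough mental model of that default mapping, the arithmetic can be sketched as follows. This is only an illustration of the "one thread per observable" idea described above, not Lightning's actual scheduler, and the function name `thread_usage` is made up for this example.

```python
def thread_usage(num_threads, num_observables):
    """Toy model: each observable is handled by one thread at a time."""
    # At most one thread per observable can be busy at once.
    busy = min(num_threads, num_observables)
    return {"busy": busy, "idle": num_threads - busy}


# 4 threads, 2 measurements: two threads have nothing to do.
print(thread_usage(4, 2))
# 4 threads, 8 measurements: all threads are busy.
print(thread_usage(4, 8))
```

With `OMP_NUM_THREADS` set to 4 and only 2 observables, this model leaves 2 threads idle, which matches the situation in your question.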

There is another way to accelerate L-Qubit execution which is to parallelize gate- and generator-statevector products. This parallelization scheme is not activated by default and needs to be baked in by building L-Qubit from source with the CMake option -DLQ_ENABLE_KERNEL_OMP=1. It is expected to be less efficient because of hardware limitations but can be more scalable in the case you are talking about because there are typically more statevector entries to parallelize over than observables.

If you do not want to compile L-Qubit yourself, or just want to quickly try a statevector-parallelized simulator, you can also install L-Kokkos with pip install pennylane-lightning-kokkos, which does exactly that.

Which scheme is better really depends on hardware and workload in the end. If performance is truly a limiting factor, I would suggest running a quick benchmark for each build and seeing how the performance differs.
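A quick benchmark along those lines could look like the sketch below. Treat it as a starting point under a few assumptions: the helper names (`best_time`, `make_jacobian_fn`, `compare_devices`) are made up for this example, the circuit is an arbitrary placeholder for your own workload, and it assumes both `lightning.qubit` and `lightning.kokkos` are installed. Set `OMP_NUM_THREADS` in your shell before launching Python so that it is in place before any import.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def make_jacobian_fn(device_name, n_wires=16, n_obs=2, **device_kwargs):
    """Build a zero-argument function that computes an adjoint Jacobian."""
    # Imported lazily so the timing helper above works on its own.
    import pennylane as qml
    from pennylane import numpy as pnp

    dev = qml.device(device_name, wires=n_wires, **device_kwargs)

    @qml.qnode(dev, diff_method="adjoint")
    def circuit(params):
        # Placeholder workload: one rotation per wire plus a CNOT chain.
        for w in range(n_wires):
            qml.RX(params[w], wires=w)
        for w in range(n_wires - 1):
            qml.CNOT(wires=[w, w + 1])
        # n_obs controls how many observables there are to parallelize over.
        return [qml.expval(qml.PauliZ(w)) for w in range(n_obs)]

    def cost(params):
        # Stack the measurements into one array so the Jacobian is well defined.
        return qml.math.stack(circuit(params))

    params = pnp.array([0.1 * (w + 1) for w in range(n_wires)], requires_grad=True)
    return lambda: qml.jacobian(cost)(params)


def compare_devices(n_wires=16, n_obs=2):
    """Time the same Jacobian on both devices; returns {device_name: seconds}."""
    configs = {
        "lightning.qubit": {"batch_obs": True},  # threads over observables
        "lightning.kokkos": {},  # threads over statevector entries
    }
    return {
        name: best_time(make_jacobian_fn(name, n_wires, n_obs, **kwargs))
        for name, kwargs in configs.items()
    }
```

Calling `compare_devices(n_wires=20, n_obs=2)` and then repeating with a larger `n_obs` should show where each scheme starts to pay off on your machine. Taking the minimum over a few repeats keeps one-off warm-up costs out of the comparison.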


Great, thanks! Didn’t know about the kokkos simulator, will try it out!
