Parallel Circuit execution during optimisation

Hi,

I am trying to optimise a variational circuit by batching the inputs. A natural way I feel to speed this process up would be to execute each input in parallel during the optimisation. However trying to implement this using both multiprocessing and pathos seems to not work. Here’s a minimal example to illustrate this

import pennylane as qml
from pennylane import numpy as np
from pennylane.templates import AmplitudeEmbedding
import multiprocessing as mp
from functools import partial
from pathos.multiprocessing import ProcessingPool as Pool

dev = qml.device("default.qubit", wires=2)
var = np.random.randn(2, requires_grad=True)
opt = qml.AdamOptimizer()
xes = np.array(([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]]))
batch_size = 2

@qml.qnode(dev)
def circuit(var, x):
   AmplitudeEmbedding(features=x, wires=range(2), normalize=True)
   qml.RX(var[0], wires=0)
   qml.RX(var[1], wires=0)
   return qml.expval(qml.PauliZ(0))

# To avoid the pickler complaining even more than it would otherwise
def circuit_wrapper(var, x):
   return circuit(var, x)

def cost_serial(var, X):
   return np.sum([circuit(var,x) for x in X])

def cost_parallel(var, X, pool):
   mapper = partial(circuit_wrapper, var)
   return np.sum(pool.map(mapper, X))

# First step serially
for i in range(0, len(xes), batch_size):
   X = xes[i:i+batch_size]
   var = opt.step(lambda v : cost_serial(v, X), var)
# This will work without error

# Try and just do a standard evaluation in parallel using multiprocessing
pool = mp.Pool(4)
mapper = partial(circuit_wrapper, var)
result = np.sum(pool.map(mapper, X))
print("The result of our dummy circuit using multiprocessing is {:.3f}".format(result))
# This will work without error

# Now try and do the circuit evaluation in an optimiser
for i in range(0, len(xes), batch_size):
   X = xes[i:i+batch_size]
   try:
       var = opt.step(lambda v : cost_parallel(v, X, pool), var)
   except Exception as e:
       print(e)
# This will fail

# Now let's try and use pathos
pool = Pool(4)
mapper = partial(circuit_wrapper, var)
result = np.sum(pool.map(mapper, X))
print("The result of our dummy circuit using pathos is {:.3f}".format(result))
# This will work without error

# Now try and do the circuit evaluation in an optimiser
for i in range(0, len(xes), batch_size):
   X = xes[i:i+batch_size]
   try:
       var = opt.step(lambda v : cost_parallel(v, X, pool), var)
   except Exception as e:
       print(e)
# This will fail

multiprocessing will fail with Can't pickle local object 'VJPNode.initialize_root.<locals>.<lambda>' which is not that surprising given what I’ve seen online.

Using pathos gives a far stranger error though

Traceback (most recent call last):
  File "[redacted]/lib/python3.11/site-packages/autograd/tracer.py", line 118, in new_box
    return box_type_mappings[type(value)](value, trace, node)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: <class 'pennylane.numpy.tensor.tensor'>

I think on a technical level I know what’s going on here and why pathos won’t work. The optimiser registers the parameters with autograd but the subprocesses are spawning off their own copies of the parameters which autograd doesn’t know about…I could be wrong about this but that’s my guess after digging around for a bit.

That’s a bit of an aside though. fundamentally I would like to be able to run these circuits in parallel during the optimisation by whatever means works. What are people’s suggestions here on how best to achieve this?

Hey @acn! Welcome to the forum :rocket:

I’m not sure which version of PennyLane you’re using, but in v0.24 we introduced support for parameter broadcasting :slight_smile:. I recommend using v0.30 (the most current version), though.

Here’s an example!

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(features):
    qml.AngleEmbedding(features, wires=range(2))
    return [qml.expval(qml.PauliZ(i)) for i in range(2)]

batch1 = [0.1, 0.2]
batch2 = [0.3, 0.4]
batches = [batch1, batch2]

print(circuit(batches))
print(dev.num_executions)

'''
[tensor([0.99500417, 0.95533649], requires_grad=True), tensor([0.98006658, 0.92106099], requires_grad=True)]
1
'''

As you can see, the device only has one execution even though there are two batches of parameters. There are some gaps of support for parameter broadcasting that we are working on for future releases, but if you’re using default.qubit you should be fine :slight_smile:

1 Like

Hello @acn I’m going through exactly the same problem. Did you find any solution for parallelizing circuits while running in an optimizer?

Hey @erikrecio, welcome to the forum!

If parameter broadcasting isn’t what you need, there are some parallel computation packages you can use like Dask (although I don’t recommend using it). Our lightning qubit device also has parallel adjoint differentiation support (see here: Lightning Qubit device — Lightning 0.33.1 documentation). Our lightning gpu device also has parallel compute support: PennyLane v0.31 released | PennyLane Blog

Let me know if any of this helps!

Hello @isaacdevlugt , I’m encountering the similar problem. I got errors when trying to use parameter broadcasting. Here’s a minimal example circuit:

N=6
dev=qml.device('lightning.qubit',wires=N)

@qml.qnode(dev)
def circuit(ini_state_batch, tar_states):
    # An simple example that output the fidelity between the initial state and the target state
    qml.StatePrep(ini_state_batch, wires=range(N)) # The StatePrep broadcasts the states with no error
    if len(tar_states[0])>1: # more than one state
        tar_batch=[]
        for i in range(len(tar_states)):
            tar_batch.append(qml.math.dm_from_state_vector(tar_states[i])) 
    else: # just one state, do not need broadcasting
        tar_batch=qml.math.dm_from_state_vector(tar_states)
    return qml.expval(qml.Hermitian(tar_batch, wires=range(N)))

ini_state_batch=[]
tar_state_batch=[]
target = SU_N(2**N) # a function used to generate a random SU(N) matrix
for i in range(10):
    ini_state = getstates(N, True) # a function used to generate a random quantum state, the return is an np.array with 2**N dimensions. True means it will return a complex state.
    ini_state_batch.append(ini_state.tolist()) # the tolist() operation is to prevent pennylane from regarding the initial state as trainable parameters
    tar_state_batch.append((target @ ini_state).tolist())
print(circuit(ini_state_batch, tar_state_batch))

I constanty get the error ValueError: Observable must be a square matrix. at the last statement of the circuit() function. It looks like that qml.Hermitian always take tar_batch as a single density matrix, however, the tar_batch is a batch containing 10 density matrices. Since the SU_N() and getstates() has been used hundreds of times, they are probably not the cause of the error. Is my way of using parameter broadcasting wrong?
My qml.about():

Name: PennyLane
Version: 0.40.0
Summary: PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /home/timmy/.local/lib/python3.12/site-packages
Requires: appdirs, autograd, autoray, cachetools, diastatic-malt, networkx, numpy, packaging, pennylane-lightning, requests, rustworkx, scipy, tomlkit, typing-extensions
Required-by: PennyLane_Lightning, PennyLane_Lightning_GPU

Platform info:           Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python version:          3.12.3
Numpy version:           2.2.2
Scipy version:           1.14.1
Installed devices:
- lightning.gpu (PennyLane_Lightning_GPU-0.40.0)
- lightning.qubit (PennyLane_Lightning-0.40.0)
- default.clifford (PennyLane-0.40.0)
- default.gaussian (PennyLane-0.40.0)
- default.mixed (PennyLane-0.40.0)
- default.qubit (PennyLane-0.40.0)
- default.qutrit (PennyLane-0.40.0)
- default.qutrit.mixed (PennyLane-0.40.0)
- default.tensor (PennyLane-0.40.0)
- null.qubit (PennyLane-0.40.0)
- reference.qubit (PennyLane-0.40.0)
None

If you want to run my code, the SU_N and getstates() are below:

def SU_N(n):
    """
    Generate an n-order special unitary matrix (SU(n)).
    
    Args:
        n (int): The order of the matrix.
        
    Returns:
        np.ndarray: An n x n unitary matrix with determinant 1.
    """
    # Step 1: Generate a random complex matrix
    z = (np.random.randn(n, n) + 1j * np.random.randn(n, n)) / np.sqrt(2)
    
    # Step 2: Perform QR decomposition to get a unitary matrix
    q, r = np.linalg.qr(z)
    
    # Step 3: Adjust the diagonal of R to ensure Q is unitary
    d = np.diagonal(r)
    ph = d / np.abs(d)  # Normalize to unit modulus
    q = q @ np.diag(ph)
    
    # Step 4: Ensure determinant is 1
    det_q = np.linalg.det(q)
    q[0] = q[0] / det_q
    return q

def getstates(n, complex=False):
    state_r=np.random.rand(2**n)
    while np.linalg.norm(state_r)<10**(-8):
        state_r=np.random.rand(2**n)
    if complex:
        state_i=np.random.rand(2**n)
        while np.linalg.norm(state_i)<10**(-8):
            state_i=np.random.rand(2**n)
        state=state_r+1j*state_i
    else:
        state=state_r
    state=state/np.linalg.norm(state)
    return state

Hi @Tz_19 ,

You shouldn’t use batches within the circuit but instead outside of it. In many cases broadcasting is supported, but not always.

In this case my recommendation would be to create a simple test circuit which receives some batched inputs and parameters. You can try something similar to the code by @Kernel123 in topic #8124. You can even simplify it more if you want to test any changes.

If you’re still facing issues with a simple test circuit please post the full code to reproduce and the full error traceback.

I hope this helps!

Hi @CatalinaAlbornoz , thank you for your reply! I followed your reply and ensured that the batches are used outside the circuit. I also read the topic #8124 and changed the type of my batches to np.array as you suggested the poster of #8124 to change the type of his batch to jnp.array. However, that did not solve the problem. Here is a minimal full example:

import pennylane as qml
from pennylane import numpy as np
import cmath

N=6
dev=qml.device('lightning.qubit',wires=N)

def getstates(n, complex=False):
    """
    Generate a random state vector.

    Args:
        n (int): the qubit number.

    Returns:
        np.ndarray: An 2^n dimension vector.
    """
    state_r=np.random.rand(2**n)
    while np.linalg.norm(state_r)<10**(-8):
        state_r=np.random.rand(2**n)
    if complex:
        state_i=np.random.rand(2**n)
        while np.linalg.norm(state_i)<10**(-8):
            state_i=np.random.rand(2**n)
        state=state_r+1j*state_i
    else:
        state=state_r
    state=state/np.linalg.norm(state)
    return state

def SU_N(n):
    """
    Generate an n-order special unitary matrix (SU(n)).
   
    Args:
        n (int): The order of the matrix.
        
    Returns:
        np.ndarray: An n x n unitary matrix with determinant 1.
    """
    z = (np.random.randn(n, n) + 1j * np.random.randn(n, n)) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diagonal(r)
    ph = d / np.abs(d)  # Normalize to unit modulus
    q = q @ np.diag(ph)
    det_q = np.linalg.det(q)
    q[0] = q[0] / det_q
    return q

@qml.qnode(dev)
def circuit(ini_state_batch, tar_dm_batch):
    qml.StatePrep(ini_state_batch, wires=range(N))
    return qml.expval(qml.Hermitian(tar_dm_batch, wires=range(N)))

ini_state_batch=[]
tar_dm_batch=[]
target = SU_N(2**N)
for i in range(10):
    ini_state = getstates(N, True)
    ini_state_batch.append(ini_state.tolist())
    tar_state=target@ini_state
    tar_dm_batch.append(qml.math.dm_from_state_vector(tar_state))
for j in range(len(tar_dm_batch)):
    print(len(tar_dm_batch[j]), len(tar_dm_batch[j][0]))
print(circuit(np.array(ini_state_batch), np.array(tar_dm_batch)))

Here is the output of the code:

64 64
64 64
64 64
64 64
64 64
64 64
64 64
64 64
64 64
64 64
Traceback (most recent call last):
  File "/home/timmy/2025Spring/test.py", line 65, in <module>
    print(circuit(np.array(ini_state_batch), np.array(tar_dm_batch)))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/workflow/qnode.py", line 905, in __call__
    return self._impl_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/workflow/qnode.py", line 868, in _impl_call
    tape = self.construct(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/logging/decorators.py", line 61, in wrapper_entry
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/workflow/qnode.py", line 854, in construct
    self._qfunc_output = self.func(*args, **kwargs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/2025Spring/test.py", line 53, in circuit
    return qml.expval(qml.Hermitian(tar_dm_batch, wires=range(N)))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/capture/capture_meta.py", line 89, in __call__
    return type.__call__(cls, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/ops/qubit/observables.py", line 87, in __init__
    Hermitian._validate_input(A, expected_mx_shape)
  File "/home/timmy/.local/lib/python3.12/site-packages/pennylane/ops/qubit/observables.py", line 95, in _validate_input
    raise ValueError("Observable must be a square matrix.")
ValueError: Observable must be a square matrix.

The first ten 64 64 is the number of the rows and columns of the elements in tar_dm_batch, printed from line 63 and 64, which meets the qubit number N=6. However, the error ValueError: Observable must be a square matrix. is constantly thrown out at line 53, which is the return statement of circuit(). The only thing I did not follow is that the poster in #8124 uses jax, but I’m not familiar with it so I just used numpy. Is jax necessary in my case?
My qml.about():

Version: 0.40.0
Summary: PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
Home-page: https://github.com/PennyLaneAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /home/timmy/.local/lib/python3.12/site-packages
Requires: appdirs, autograd, autoray, cachetools, diastatic-malt, networkx, numpy, packaging, pennylane-lightning, requests, rustworkx, scipy, tomlkit, typing-extensions
Required-by: PennyLane_Lightning, PennyLane_Lightning_GPU

Platform info:           Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python version:          3.12.3
Numpy version:           2.2.2
Scipy version:           1.14.1
Installed devices:
- lightning.gpu (PennyLane_Lightning_GPU-0.40.0)
- lightning.qubit (PennyLane_Lightning-0.40.0)
- default.clifford (PennyLane-0.40.0)
- default.gaussian (PennyLane-0.40.0)
- default.mixed (PennyLane-0.40.0)
- default.qubit (PennyLane-0.40.0)
- default.qutrit (PennyLane-0.40.0)
- default.qutrit.mixed (PennyLane-0.40.0)
- default.tensor (PennyLane-0.40.0)
- null.qubit (PennyLane-0.40.0)
- reference.qubit (PennyLane-0.40.0)
None

Thanks for sharing this code @Tz_19 ! I can replicate the error. My colleague Daniela will take a look at you code next week to see what the issue might be. I think we might need to add a for loop to break the batches somewhere. We’ll let you know what we find.