Parallel Circuit execution during optimisation

Hi,

I am trying to optimise a variational circuit over batches of inputs. A natural way to speed this up, I feel, would be to execute the circuit for each input in parallel during the optimisation. However, my attempts to implement this using both multiprocessing and pathos have not worked. Here’s a minimal example to illustrate the problem:

import pennylane as qml
from pennylane import numpy as np
from pennylane.templates import AmplitudeEmbedding
import multiprocessing as mp
from functools import partial
from pathos.multiprocessing import ProcessingPool as Pool

dev = qml.device("default.qubit", wires=2)
var = np.random.randn(2, requires_grad=True)
opt = qml.AdamOptimizer()
xes = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
batch_size = 2

@qml.qnode(dev)
def circuit(var, x):
   AmplitudeEmbedding(features=x, wires=range(2), normalize=True)
   qml.RX(var[0], wires=0)
   qml.RX(var[1], wires=0)
   return qml.expval(qml.PauliZ(0))

# To avoid the pickler complaining even more than it would otherwise
def circuit_wrapper(var, x):
   return circuit(var, x)

def cost_serial(var, X):
   return np.sum([circuit(var,x) for x in X])

def cost_parallel(var, X, pool):
   mapper = partial(circuit_wrapper, var)
   return np.sum(pool.map(mapper, X))

# First step serially
for i in range(0, len(xes), batch_size):
   X = xes[i:i+batch_size]
   var = opt.step(lambda v : cost_serial(v, X), var)
# This will work without error

# Try and just do a standard evaluation in parallel using multiprocessing
pool = mp.Pool(4)
mapper = partial(circuit_wrapper, var)
result = np.sum(pool.map(mapper, X))
print("The result of our dummy circuit using multiprocessing is {:.3f}".format(result))
# This will work without error

# Now try and do the circuit evaluation in an optimiser
for i in range(0, len(xes), batch_size):
   X = xes[i:i+batch_size]
   try:
       var = opt.step(lambda v : cost_parallel(v, X, pool), var)
   except Exception as e:
       print(e)
# This will fail

# Now let's try and use pathos
pool = Pool(4)
mapper = partial(circuit_wrapper, var)
result = np.sum(pool.map(mapper, X))
print("The result of our dummy circuit using pathos is {:.3f}".format(result))
# This will work without error

# Now try and do the circuit evaluation in an optimiser
for i in range(0, len(xes), batch_size):
   X = xes[i:i+batch_size]
   try:
       var = opt.step(lambda v : cost_parallel(v, X, pool), var)
   except Exception as e:
       print(e)
# This will fail

multiprocessing fails with Can't pickle local object 'VJPNode.initialize_root.<locals>.<lambda>', which is not that surprising given what I’ve seen online.
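The multiprocessing failure is easy to reproduce with the standard library alone: multiprocessing pickles whatever it sends to worker processes, and locally defined functions, like the lambda autograd creates inside VJPNode.initialize_root, can't be pickled. A minimal sketch (make_local is just a made-up stand-in for autograd's internals):

```python
import pickle

def make_local():
    # A function defined inside another function, analogous to the
    # 'VJPNode.initialize_root.<locals>.<lambda>' that autograd creates
    return lambda x: x + 1

f = make_local()
try:
    pickle.dumps(f)  # this is what multiprocessing does under the hood
    print("pickled OK")
except Exception as e:
    print(f"pickling failed: {e}")
```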

Using pathos gives a far stranger error, though:

Traceback (most recent call last):
  File "[redacted]/lib/python3.11/site-packages/autograd/tracer.py", line 118, in new_box
    return box_type_mappings[type(value)](value, trace, node)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: <class 'pennylane.numpy.tensor.tensor'>

I think I know on a technical level what’s going on here and why pathos won’t work: the optimiser registers the parameters with autograd, but the subprocesses spawn their own copies of the parameters, which autograd’s tracer doesn’t know about. I could be wrong about this, but that’s my guess after digging around for a bit.
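My mental model of the failure (a toy sketch only; nothing here is autograd's real data structure) is a registry keyed by object identity: a pickle round-trip into a worker produces a copy the registry has never seen, much like the KeyError above:

```python
import copy

# Toy stand-in for a traced parameter; NOT autograd's real machinery
class Param:
    def __init__(self, value):
        self.value = value

registry = {}  # maps id(param) -> trace record
p = Param(0.5)
registry[id(p)] = "trace-node"

# A worker process receives a copy of p (a pickle round-trip behaves
# like a deepcopy), so the parent's registry has never seen it
p_copy = copy.deepcopy(p)
print(id(p) in registry)       # True
print(id(p_copy) in registry)  # False
```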

That’s a bit of an aside, though. Fundamentally, I would like to be able to run these circuits in parallel during the optimisation, by whatever means works. What are people’s suggestions on how best to achieve this?

Hey @acn! Welcome to the forum :rocket:

I’m not sure which version of PennyLane you’re using, but in v0.24 we introduced support for parameter broadcasting :slight_smile:. I recommend upgrading to v0.30 (the most current version), though.

Here’s an example!

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(features):
    qml.AngleEmbedding(features, wires=range(2))
    return [qml.expval(qml.PauliZ(i)) for i in range(2)]

batch1 = [0.1, 0.2]
batch2 = [0.3, 0.4]
batches = [batch1, batch2]

print(circuit(batches))
print(dev.num_executions)

'''
[tensor([0.99500417, 0.95533649], requires_grad=True), tensor([0.98006658, 0.92106099], requires_grad=True)]
1
'''

As you can see, the device records only one execution even though two batches of parameters were evaluated. There are some gaps in support for parameter broadcasting that we are working on for future releases, but if you’re using default.qubit you should be fine :slight_smile:
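Conceptually this is like NumPy vectorization: one call processes the whole batch. In fact, since AngleEmbedding here applies RX rotations and the expectation of PauliZ after RX(θ) on |0⟩ is cos(θ), the numbers above can be reproduced with plain NumPy (a toy analogy, not how the device actually computes them):

```python
import numpy as np

# Toy analogy in plain NumPy: after AngleEmbedding with RX rotations,
# <Z> on each wire is cos(theta), so the batched result above is just
# an elementwise cos over the whole batch in a single call
batches = np.array([[0.1, 0.2], [0.3, 0.4]])
print(np.cos(batches).T)  # rows correspond to wires, columns to batch entries
```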


Hello @acn, I’m running into exactly the same problem. Did you find any solution for parallelizing circuits while running inside an optimizer?

Hey @erikrecio, welcome to the forum!

If parameter broadcasting isn’t what you need, there are parallel computation packages you can use, like Dask (although I don’t recommend using it). Our lightning qubit device also has parallel adjoint differentiation support (see here: Lightning Qubit device — Lightning 0.33.1 documentation), and our lightning gpu device has parallel compute support: PennyLane v0.31 released | PennyLane Blog
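One pattern that sidesteps the pickling problem entirely is to keep autograd out of the worker processes: ship only plain floats to the pool and compute the gradient yourself in the parent process, e.g. with finite differences or the parameter-shift rule. A sketch, where `fake_circuit` is a made-up stand-in for a real QNode:

```python
import math
from concurrent.futures import ProcessPoolExecutor

# `fake_circuit` stands in for a real QNode: it takes plain floats and
# returns a plain float, so nothing traced by autograd ever gets pickled
def fake_circuit(args):
    var, x = args
    return math.cos(var[0]) * x  # placeholder for an expectation value

def cost(var, X, executor):
    # Only tuples of plain floats cross the process boundary
    return sum(executor.map(fake_circuit, [(var, x) for x in X]))

def grad(var, X, executor, eps=1e-6):
    # Central finite differences computed in the parent process;
    # the parameter-shift rule would slot in the same way
    g = []
    for i in range(len(var)):
        up = list(var); up[i] += eps
        dn = list(var); dn[i] -= eps
        g.append((cost(up, X, executor) - cost(dn, X, executor)) / (2 * eps))
    return g

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as ex:
        var, X = [0.1, 0.2], [1.0, 2.0]
        print(cost(var, X, ex))
        print(grad(var, X, ex))
```

The resulting gradient can then be fed to any gradient-based update rule by hand, since the optimiser never needs to trace through the pool.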

Let me know if any of this helps!