# How to use Dask to parallelize QNode computations?

Hi everyone. I’m working on quantum neural networks for computer vision. I wonder if there is a better or more efficient way to implement a quantum version of convolution than using a number of for loops like the one demonstrated in this tutorial? Josh suggested I use Dask to parallelize QNode computations (thanks again @josh!). However, it seems I did not use it correctly, because when I executed the demo code provided in the Dask documentation, I found the computation got even much slower. Here is the code I ran:

import time
import dask
import numpy as np

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = np.random.rand(100)

# Dask (delayed) version
t0 = time.time()
output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)
    output.append(c)
total = dask.delayed(sum)(output)
total.compute()
t1 = time.time()
print("Time: ", (t1 - t0))

# Plain Python version
t0 = time.time()
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)
total = sum(output)
t1 = time.time()
print("Time: ", (t1 - t0))
Time:  0.034467220306396484
Time:  0.0003325939178466797


Could anyone help me with it? Many thanks!

Hi @gojiita_ku,

Welcome!

QNodeCollection objects can be helpful in parallelizing QNode evaluations. We have Dask support for a QNodeCollection via the parallel=True option (see the Asynchronous evaluation section in the related docs).

It is worth noting, however, that as the documentation mentions, this option will only be useful with some devices (e.g., can be beneficial with the QVM device from the Forest plugin, as used in this tutorial).
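As a side note on the timing numbers in the original post: Dask’s scheduler has a fixed per-task overhead, so for microsecond-scale functions like inc and double, scheduling costs more than the work itself; parallelism only pays off once each task (e.g., a QNode evaluation on a slow device) is substantially more expensive than that overhead. The effect can be sketched with the standard library’s concurrent.futures standing in for Dask (an illustrative stand-in, not the Dask scheduler itself): batching many tiny tasks into a few larger ones amortizes the scheduling cost.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # a tiny amount of work, comparable to inc/double in the Dask demo
    return (x + 1) + (x * 2)

data = list(range(10000))

with ThreadPoolExecutor(max_workers=4) as pool:
    # one scheduled task per element: scheduling overhead dominates
    fine = list(pool.map(task, data))

    # one scheduled task per chunk: same work, far fewer scheduling events
    chunks = [data[i::4] for i in range(4)]
    coarse_parts = pool.map(lambda chunk: [task(x) for x in chunk], chunks)
    coarse = sorted(y for part in coarse_parts for y in part)

assert sorted(fine) == coarse
```

The same batching idea applies to Dask: fewer, larger delayed tasks generally schedule more efficiently than many tiny ones.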

Let us know how this goes!

Hi @antalszava,

Thanks for your answer! I’ll try QNodeCollection for my case and see if it would work.

I’m very excited to see PennyLane has been updated so quickly! I’m currently only using PennyLane for QML.


Hi @antalszava,

I checked the documentation of QNodeCollection. It seems like all QNodes within a QNodeCollection must have the same input. But if we use a QNodeCollection in the case of quantum convolution, the QNodes would all have different inputs (e.g. different patches of a two-dimensional image). So does that mean QNodeCollection would not work for this case? I’m not sure if my interpretation of the QNodeCollection class is correct.

Hi @gojiita_ku,

Indeed, that’s right. A QNodeCollection assumes that the inputs are the same for all the QNodes.

1. Could all QNode inputs potentially be concatenated into a single object, which is then passed to the QNodeCollection? Indexing into the input could then be a way of distributing the input parameters. Something along the lines of:
import pennylane as qml
import numpy as np

dev = qml.device('default.qubit', wires=2)

@qml.qnode(dev)
def circ1(par):
    qml.RY(par[0], wires=0)  # <---- 1st parameter
    return qml.expval(qml.PauliZ(0))

@qml.qnode(dev)
def circ2(par):
    qml.RX(par[1], wires=0)  # <---- 2nd parameter
    return qml.expval(qml.PauliZ(0))

qnodes = qml.QNodeCollection([circ1, circ2])

par1 = np.array([0.1234])
par2 = np.array([0.4323])

pars = np.concatenate([par1, par2])  # <---- concatenating input parameters

qnodes(pars)


There is a historical reason why a QNodeCollection behaves like this: mainly circuits with similar structures were considered when creating it.

2. The Dask logic in QNodeCollection is roughly as follows (a sketch of the internals, not verbatim source):

results = []
for q in self.qnodes:
    results.append(dask.delayed(q)(*args, **kwargs))
return dask.compute(*results, scheduler=_scheduler)

Here, _scheduler is the Dask scheduler to use, which can be for example "threads".

Alternatively, this could also help with a custom solution for using Dask with multiple QNodes. The input parameters for each QNode could then be zipped with the QNode itself (just an idea for a potential approach):

results = []
for args, q in zip(arg_lists, qnodes):
    results.append(dask.delayed(q)(*args))
dask.compute(*results, scheduler="threads")

where arg_lists would be an ordered list of arguments for each QNode and qnodes are the QNodes to evaluate.
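The zip pattern above can be run end-to-end with plain Python callables standing in for QNodes and the standard library’s concurrent.futures standing in for Dask (an illustrative sketch only, so the idea can be tried without a quantum device):

```python
from concurrent.futures import ThreadPoolExecutor

# stand-ins for QNodes: each "circuit" is just a callable here
qnodes = [lambda a, b: a + b, lambda a, b: a * b, lambda a, b: a - b]

# one argument tuple per qnode, in matching order
arg_lists = [(1, 2), (3, 4), (5, 6)]

with ThreadPoolExecutor() as pool:
    # zip each qnode with its own arguments and submit them all for parallel evaluation
    futures = [pool.submit(q, *args) for args, q in zip(arg_lists, qnodes)]
    results = [f.result() for f in futures]

print(results)  # [3, 12, -1]
```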

Hope some of this is helpful, let us know how it goes!

Hi @antalszava,

Thank you so much! I’ll try both of your suggestions and get back to you soon.

Sounds good!


Hi @antalszava,

I’m so sorry for getting back to you so late… I was quite busy with my work for the past few weeks.

I tried the first approach you suggested, and here is the code:

def test(simulator, parallel):
    dev = qml.device(simulator, wires=2)
    t0 = time.time()
    pars = np.random.rand(128)
    qnodes = qml.QNodeCollection()
    for i in range(128):
        @qml.qnode(dev, interface='torch')
        def circ(inputs, weights, i=i):
            qml.RY(weights, wires=0)
            qml.RY(inputs[i], wires=0)
            return qml.expval(qml.PauliZ(0))
        qnodes.append(circ)

    res = qnodes(pars, 0.1, parallel=parallel)
    t1 = time.time()
    print("Running time", (t1 - t0))



I was a bit confused by the result, as the running time was even higher when I set parallel=True. Also, the documentation recommends using an external simulator for asynchronous mode. So I used a Qiskit simulator like qiskit.aer, but received the error shown below:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-76b5e0198504> in <module>
----> 1 test('qiskit.aer',False)

<ipython-input-8-be9b5cfd8a9d> in test(simulator, parallel)
1 def test(simulator,parallel):
----> 2     dev = qml.device(simulator, wires=2)
3     t0 = time.time()
4     pars = np.random.rand(128)
5     qnodes = qml.QNodeCollection()

~/anaconda3/lib/python3.7/site-packages/pennylane/__init__.py in device(name, *args, **kwargs)
246
247         # loads the device class
249
250         if Version(version()) not in Spec(plugin_device_class.pennylane_requires):

~/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py in load(self, require, *args, **kwargs)
2432         if require:
2433             self.require(*args, **kwargs)
-> 2434         return self.resolve()
2435
2436     def resolve(self):

~/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py in resolve(self)
2438         Resolve the entry point from its module and attrs.
2439         """
-> 2440         module = __import__(self.module_name, fromlist=['__name__'], level=0)
2441         try:
2442             return functools.reduce(getattr, self.attrs, module)

~/anaconda3/lib/python3.7/site-packages/pennylane_qiskit/__init__.py in <module>
17 from .aer import AerDevice
18 from .basic_aer import BasicAerDevice
20 from .ibmq import IBMQDevice

~/anaconda3/lib/python3.7/site-packages/pennylane_qiskit/converter.py in <module>
33
34
---> 35 def _check_parameter_bound(param: Parameter, var_ref_map: Dict[Parameter, qml.variable.Variable]):
36     """Utility function determining if a certain parameter in a QuantumCircuit has
37     been bound.

AttributeError: module 'pennylane' has no attribute 'variable'


Hi @gojiita_ku,

No worries, hope things are going well!

Nice! Indeed, for default.qubit, setting parallel=True will likely not have the desired effect. We’ve experienced better performance in particular with the QVM device, and welcome all insight on further observations!

As for the error, could you make sure that you have the latest released version of both PennyLane and PennyLane-Qiskit? Both should be version 0.16.0. The version number can be checked by looking at the output of qml.about() and the packages can be upgraded by e.g., pip install pennylane --upgrade.

Hi @antalszava,

Following your advice, I upgraded PennyLane-Qiskit to version 0.16.0. Now my code works for the Qiskit simulator. Thanks for that! But the running time is still higher when I set parallel=True. I then ran the code on the QVM device, as I remember you mentioned you’ve experienced better performance with it in particular. But when I set parallel=True, I receive an error which I could not understand… Here is the code:

def test(simulator, parallel):
    dev = qml.device(simulator, device='4q-pyqvm')
    t0 = time.time()
    pars = np.random.rand(12)
    qnodes = qml.QNodeCollection()
    for i in range(12):
        @qml.qnode(dev, interface='torch')
        def circ(inputs, weights, i=i):
            qml.RY(weights, wires=0)
            qml.RY(inputs[i], wires=0)
            return qml.expval(qml.PauliZ(0))
        qnodes.append(circ)

    res = qnodes(pars, 0.1, parallel=parallel)
    t1 = time.time()
    print("Running time", (t1 - t0))

test('forest.qvm', True)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-996259bf4fe9> in <module>
----> 1 test('forest.qvm',True)

12         qnodes.append(circ)
13
---> 14     res = qnodes(pars,0.1,parallel=parallel)
15     t1 = time.time()
16     print("Running time", (t1 - t0) )

~/anaconda3/lib/python3.7/site-packages/pennylane/collections/qnode_collection.py in __call__(self, *args, **kwargs)
274
275     def __call__(self, *args, **kwargs):
--> 276         results = self.evaluate(args, kwargs)
277         return self.convert_results(results, self.interface)
278

~/anaconda3/lib/python3.7/site-packages/pennylane/collections/qnode_collection.py in evaluate(self, args, kwargs)
228
230
231         for q in self.qnodes:

395     keys = [x.__dask_keys__() for x in collections]
396     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 397     results = schedule(dsk, keys, **kwargs)
398     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
399

74     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
---> 76                         pack_exception=pack_exception, **kwargs)
77

~/anaconda3/lib/python3.7/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
500                     else:
--> 501                         raise_exception(exc, tb)
503                 state['cache'][key] = res

110         if exc.__traceback__ is not tb:
111             raise exc.with_traceback(tb)
--> 112         raise exc
113
114 else:

270     try:
273         id = get_id()
274         result = dumps((result, id))

251         func, args = arg[0], arg[1:]
252         args2 = [_execute_task(a, cache) for a in args]
--> 253         return func(*args2)
254     elif not ishashable(arg):
255         return arg

~/anaconda3/lib/python3.7/site-packages/pennylane/qnode.py in __call__(self, *args, **kwargs)
596
597         # execute the tape
--> 598         res = self.qtape.execute(device=self.device)
599
600         # if shots was changed

~/anaconda3/lib/python3.7/site-packages/pennylane/tape/tape.py in execute(self, device, params)
1321             params = self.get_parameters()
1322
-> 1323         return self._execute(params, device=device)
1324
1325     def execute_device(self, params, device):

~/anaconda3/lib/python3.7/site-packages/pennylane/interfaces/torch.py in _execute(self, params, **kwargs)
266     def _execute(self, params, **kwargs):
267         kwargs["tape"] = self
--> 268         res = _TorchInterface.apply(kwargs, *params)
269         return res
270

~/anaconda3/lib/python3.7/site-packages/pennylane/interfaces/torch.py in forward(ctx, input_kwargs, *input_)
71         # evaluate the tape
72         tape.set_parameters(ctx.all_params_unwrapped, trainable_only=False)
---> 73         res = tape.execute_device(ctx.args, device)
74         tape.set_parameters(ctx.all_params, trainable_only=False)
75

~/anaconda3/lib/python3.7/site-packages/pennylane/tape/tape.py in execute_device(self, params, device)
1352
1353         if isinstance(device, qml.QubitDevice):
-> 1354             res = device.execute(self)
1355         else:
1356             res = device.execute(self.operations, self.observables, {})

~/anaconda3/lib/python3.7/site-packages/pennylane_forest/qvm.py in execute(self, circuit, **kwargs)
156             self._circuit_hash = circuit.graph.hash
157
--> 158         return super().execute(circuit, **kwargs)
159
160     def apply(self, operations, **kwargs):

~/anaconda3/lib/python3.7/site-packages/pennylane/_qubit_device.py in execute(self, circuit, **kwargs)
194         # generate computational basis samples
195         if self.shots is not None or circuit.is_sampled:
--> 196             self._samples = self.generate_samples()
197
198         multiple_sampled_jobs = circuit.is_sampled and self._has_partitioned_shots()

~/anaconda3/lib/python3.7/site-packages/pennylane_forest/qvm.py in generate_samples(self)
230     def generate_samples(self):
231         if "pyqvm" in self.qc.name:
--> 232             return self.qc.run(self.prog, memory_map=self._parameter_map)
233
234         if self.circuit_hash is None:

~/anaconda3/lib/python3.7/site-packages/pyquil/api/_error_reporting.py in wrapper(*args, **kwargs)
249             global_error_context.log[key] = pre_entry
250
--> 251         val = func(*args, **kwargs)
252
253         # poke the return value of that call in

~/anaconda3/lib/python3.7/site-packages/pyquil/api/_quantum_computer.py in run(self, executable, memory_map)
136             for region_name, values_list in memory_map.items():
137                 self.qam.write_memory(region_name=region_name, value=values_list)
139
140     @_record_call

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in run(self)
265         for _ in range(self.program.num_shots):
266             self.wf_simulator.reset()
--> 267             self._execute_program()
268             for name in self.ram.keys():
269                 self._memory_results.setdefault(name, list())

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in _execute_program(self)
482         halted = len(self.program) == 0
483         while not halted:
--> 484             halted = self.transition()
485
486         return self

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in transition(self)
311         """
312         assert self.program is not None
--> 313         instruction = self.program[self.program_counter]
314
315         if isinstance(instruction, Gate):

~/anaconda3/lib/python3.7/site-packages/pyquil/quil.py in __getitem__(self, index)
893             Program(self.instructions[index])
894             if isinstance(index, slice)
--> 895             else self.instructions[index]
896         )
897

IndexError: list index out of range



Hi @gojiita_ku, from the following part of the error message,

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in transition(self)


it looks like you are using the pyQVM (that is, the Python implementation of the QVM). In the past, we saw the best parallel speedup with the standard QVM - this is a simulator that is written in Lisp, and runs in a separate process to the Python program.

Perhaps you could try running it again, this time using the QVM? The following demo shows how to do this:

Hi @josh,

Yes, you are right. I was using the 4q-pyqvm device. So I walked through the demo you recommended and switched to the 4q-qvm device. I also downloaded and installed the Forest SDK, and set up local servers for the QVM and quilc on my laptop following the documentation. But after I ran the following code in the terminal:

import pennylane as qml
import numpy as np
import time
import torch

def test(simulator, parallel):
    dev = qml.device(simulator, device='4q-qvm')
    t0 = time.time()
    pars = np.random.rand(12)
    qnodes = qml.QNodeCollection()
    for i in range(12):
        @qml.qnode(dev, interface='torch')
        def circ(inputs, weights, i=i):
            qml.RY(weights, wires=0)
            qml.RY(inputs[i], wires=0)
            return qml.expval(qml.PauliZ(0))
        qnodes.append(circ)

    res = qnodes(pars, 0.1, parallel=parallel)
    t1 = time.time()
    print("Running time", (t1 - t0))

test('forest.qvm',True)



Segmentation fault: 11


Was I missing anything important?

Also, I’m wondering whether, even if we can get a speedup on the QVM device in parallel mode, the running time would still be higher than on the default.qubit simulator. I noticed that running the same code in sequential mode takes only 0.039 s on the default.qubit simulator, while taking 1.479 s on the QVM device…

Actually, my goal is to train a hybrid quantum-classical neural network which involves executing B \times 128 \times 128 circuits, where B is the batch size (e.g. 4). Both the forward and backward pass need to be computed for this large number of circuits. So I really want to know whether we could reduce the running time to a level acceptable for production by employing the QNodeCollection method (particularly in parallel mode)?

Hi @gojiita_ku,

Was I missing anything important?

From what I can see, your code looks good! Which makes the segmentation fault all the more puzzling.

Unfortunately, the segmentation fault suggests it might be a memory issue in the QVM itself. The only thing I can suggest is perhaps opening an issue on the QVM GitHub repository.

I noticed that running the same code in the sequential mode takes only 0.039s on the default.qubit simulator while taking 1.479s on the QVM device…

One thing to keep in mind is that the supported methods of computing quantum gradients can differ between devices, which can affect total optimization time.

For example, default.qubit defaults to backpropagation, which adds some overhead to the forward pass, but the backward pass can be done in constant time.

In contrast, when using an external device like the QVM, only the parameter-shift rule is supported. This means that while the forward pass may be faster than default.qubit, the backward pass may be slower, since it scales with the number of parameters in the circuit!

You can check out the backprop demo for more details if you are interested!
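For intuition on why the parameter-shift cost grows with the number of parameters: each parameter requires two additional circuit evaluations per gradient entry. For a single RY rotation measured in PauliZ we have ⟨Z⟩ = cos(θ), so the rule can be checked with plain NumPy (a toy stand-in for a circuit, no quantum device involved):

```python
import numpy as np

def expval(theta):
    # <Z> after RY(theta) applied to |0> is cos(theta)
    return np.cos(theta)

def parameter_shift_grad(f, theta, s=np.pi / 2):
    # two extra circuit evaluations per parameter: f(theta + s) and f(theta - s)
    return (f(theta + s) - f(theta - s)) / (2 * np.sin(s))

theta = 0.37
grad = parameter_shift_grad(expval, theta)

# matches the analytic derivative d/dtheta cos(theta) = -sin(theta)
assert abs(grad - (-np.sin(theta))) < 1e-10
```

For a circuit with P trainable parameters, this means 2P forward evaluations per gradient, which is why the backward pass scales with parameter count on devices restricted to parameter-shift.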

Hi @josh,

Unfortunately, the segmentation fault suggests it might be a memory issue in the QVM itself. The only thing I can suggest is perhaps opening an issue on the QVM GitHub repository.

Ok. I will open an issue on the QVM GitHub repository.

For example, default.qubit defaults to backpropagation, which adds some overhead to the forward pass, but the backward pass can be done in constant time.
In contrast, when using an external device like the QVM, only the parameter-shift rule is supported. This means that while the forward pass may be faster than default.qubit, the backward pass may be slower, since it scales with the number of parameters in the circuit!

I got what you mean. I need to consider the running times for both the forward and backward pass. Thanks for your explanation! But what I want to show is that default.qubit has much better running-time performance than the QVM device even just for the forward pass. It takes only 0.039 s to run the above code (which only computes the forward pass) on the default.qubit simulator, while it takes 1.479 s on the QVM device. As you mentioned, the backward pass may be slower on the QVM device since it scales with the number of parameters in the circuit, as only parameter-shift is supported. So I wonder if default.qubit would be the optimal choice for minimizing the total computation time.

I have another question. Could QNodeCollection objects also be helpful in parallelizing QNode backward-pass evaluations (e.g. the calculation of parameterized-circuit gradients)?

It takes only 0.039s to run the above code (which only computes the forward pass) on the default.qubit simulator while taking 1.479s on the QVM device.

This is really interesting! Unfortunately I can’t comment too much, since I am unaware as to how the QVM works internally. It could be the case that it has more optimal performance at higher qubit numbers or for deeper circuits, for example.

I have another question. Could QNodeCollection objects also be helpful in parallelizing QNode backward-pass evaluations (e.g. the calculation of parameterized-circuit gradients)?

Unfortunately not – QNodeCollection only parallelizes multiple forward passes.

However, we recently added support for PennyLane devices to perform batch execution. At the moment, one device supports it: the PennyLane-Braket device. It uses this batch-execution capability to execute all the parameter-shift gradient circuits at once.
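Abstractly, the batching idea looks like this (a plain-Python sketch with toy stand-ins for the device and the circuits, not the actual Braket API): all 2·P parameter-shifted circuits are constructed up front, submitted as a single batch, and the results are recombined into the gradient.

```python
import numpy as np

def batch_execute(circuits):
    # stand-in for a device's batch-execution call: run everything in one request
    return [c() for c in circuits]

def circuit(params):
    # toy stand-in for a QNode: <Z> = cos(sum of rotation angles)
    return np.cos(np.sum(params))

params = np.array([0.1, 0.5, 0.9])
s = np.pi / 2

# build all 2 * len(params) shifted circuits up front...
shifted = []
for i in range(len(params)):
    for sign in (+1.0, -1.0):
        p = params.copy()
        p[i] += sign * s
        shifted.append(lambda p=p: circuit(p))

# ...then execute them as a single batch and recombine into gradient entries
vals = batch_execute(shifted)
grads = [(vals[2 * i] - vals[2 * i + 1]) / (2 * np.sin(s)) for i in range(len(params))]

# each entry should equal the analytic derivative -sin(sum(params))
assert np.allclose(grads, -np.sin(np.sum(params)))
```

On a remote device the single batched request amortizes network round-trip latency, which is exactly where the parameter-shift backward pass spends most of its time.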


Hi @josh,

Thanks for your advice! Actually, I participated in QHack 2021 and used AWS credits to get access to the AWS Braket service. I trained a hybrid quantum-classical model using the SV1 simulator in parallel mode, but found the training was much slower than with a local simulator (e.g. default.qubit). I later found from the documentation that the remote device has more of an advantage when computing large circuits rather than small ones, considering the transmission time between the local user and the remote server. In my case, the circuit to compute is always very small (e.g. a circuit with four qubits). So I’m guessing maybe I had better use local simulators even though they do not support batch execution.

Anyway, I think I should share an example of the code I’m currently working on, so that you or anyone else could give me further suggestions on how to reduce the running time.

No worries @gojiita_ku! Yes, that would definitely be helpful.

Basically, I am working on an image segmentation task based on quantum convolution. The original model is very complicated, so I’ll show you only the code for the quantum part, which I find most difficult to optimize. The code performs a quantum version of a 1\times1 convolution on a feature map with the shape (batch size, channel, height, width), i.e. the values from all channels at the same spatial position are fed into a quantum circuit each time. In my case, the input feature map has the shape (batch size, 4, 128, 128), where the batch size can be flexible. I ran the code with a batch size of 2 and it took about 15 minutes on my laptop. The dataset in my task is very large, so I guess it would take a very, very long time to train the model… So could you please help me with this code? Many thanks!

import pennylane as qml
import torch
import torch.nn as nn
import time

n_qubits = 4
dev = qml.device('default.qubit', wires=n_qubits)

class QuanvLayer(nn.Module):
    def __init__(self, batch_size, height, width, channel):
        super(QuanvLayer, self).__init__()
        self.q_params = nn.Parameter(torch.randn(2, channel))
        qnodes = qml.QNodeCollection()
        for i in range(batch_size * height * width):
            @qml.qnode(dev, interface='torch')
            def circuit(inputs, weights, i=i):
                for j in range(channel):
                    qml.RY(inputs[i][j], wires=j)
                return [qml.expval(qml.PauliZ(wires=j)) for j in range(4)]
            qnodes.append(circuit)

        self.qnodes = qnodes

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.permute(1, 0, 2, 3)  # (c,b,h,w)
        x = torch.flatten(x, start_dim=1)  # (c,b*h*w)
        x = x.permute(1, 0)  # (b*h*w,c)

        x = self.qnodes(x, self.q_params, parallel=True)  # (b*h*w,c)

        x = x.reshape((b, h, w, c))  # (b,h,w,c)
        x = x.permute(0, 3, 2, 1)  # (b,c,w,h)
        x = x.permute(0, 1, 3, 2)  # (b,c,h,w)

        return x

batch_size = 2
height = 128
width = 128
channel = 4
t0 = time.time()
x = torch.randn(batch_size, channel, height, width)
layer = QuanvLayer(batch_size, height, width, channel)
out = layer(x)
t1 = time.time()
print("Running time", (t1 - t0))
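As a sanity check on the tensor reshuffling in forward(), independent of any quantum device: the round trip (b, c, h, w) → (b*h*w, c) → (b, c, h, w) should preserve every value. A small NumPy sketch (numpy transpose standing in for torch permute):

```python
import numpy as np

b, c, h, w = 2, 4, 3, 3
x = np.arange(b * c * h * w).reshape(b, c, h, w)

# forward() reshuffle: (b,c,h,w) -> (c,b,h,w) -> (c, b*h*w) -> (b*h*w, c)
flat = x.transpose(1, 0, 2, 3).reshape(c, b * h * w).T

# identity "circuit": pass the inputs through unchanged, then undo the reshuffle
y = flat.reshape(b, h, w, c).transpose(0, 3, 2, 1).transpose(0, 1, 3, 2)

# every value survives the round trip
assert np.array_equal(x, y)
```

If this invariant holds, any remaining slowness comes from the per-pixel QNode evaluations themselves rather than the tensor bookkeeping.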