How to use Dask to parallelize QNode computations?

Hi everyone. I’m working on quantum neural networks for computer vision. I wonder if there is a better or more efficient way to implement a quantum version of convolution than using a number of for loops, like the one demonstrated in this tutorial? Josh suggested that I use Dask to parallelize QNode computations (thanks again, @josh!). However, it seems I did not use it correctly: when I executed the demo code provided in the Dask documentation, the computation became much slower. Here is the code I ran:

import time
import dask
import numpy as np
def inc(x):
    return x + 1
def double(x):
    return x * 2
def add(x, y):
    return x + y
data = np.random.rand(100)
t0 = time.time()
output = []
for x in data:
    a = dask.delayed(inc)(x)
    b = dask.delayed(double)(x)
    c = dask.delayed(add)(a, b)
    output.append(c)
total = dask.delayed(sum)(output)
total.compute() 
t1 = time.time()
print("Time: ", (t1 - t0) )
t0 = time.time()
output = []
for x in data:
    a = (inc)(x)
    b = (double)(x)
    c = (add)(a, b)
    output.append(c)
total = (sum)(output)
t1 = time.time()
print("Time: ", (t1 - t0) )
Time:  0.034467220306396484
Time:  0.0003325939178466797

Could anyone help me with it? Many thanks!

Hi @gojiita_ku,

Welcome! :slightly_smiling_face:

QNodeCollection objects could be helpful in parallelizing QNode evaluations. A QNodeCollection supports Dask through the parallel=True option (see the Asynchronous evaluation section in the related docs).

It is worth noting, however, that as the documentation mentions, this option is only useful with some devices (e.g., it can be beneficial with the QVM device from the Forest plugin, as used in this tutorial).

Let us know how this goes! :slightly_smiling_face:

Hi @antalszava,

Thanks for your answer! I’ll try QNodeCollection for my case and see if it works. :grinning:

I’m very excited to see that PennyLane is being updated so quickly! I’m currently only using PennyLane for QML.

Hi @antalszava,

I checked the documentation of QNodeCollection. It seems that all QNodes within a QNodeCollection must have the same input. But if we use a QNodeCollection for quantum convolution, all QNodes would have different inputs (e.g. different patches of the two-dimensional image). Does that mean QNodeCollection would not work for this case? I’m not sure if I have interpreted the QNodeCollection class correctly.

Hi @gojiita_ku,

Indeed, that’s right. A QNodeCollection assumes that the inputs are the same for all the QNodes.

  1. Could all QNode inputs potentially be concatenated into a single object, which is then passed to the QNodeCollection? Indexing into the input could then be a way of distributing the input parameters. Something along the lines of:
import pennylane as qml
import numpy as np

dev = qml.device('default.qubit', wires=2)

@qml.qnode(dev)
def circ1(par):
    qml.RY(par[0], wires=0)  # <---- 1. parameter
    return qml.expval(qml.PauliZ(0))

@qml.qnode(dev)
def circ2(par):
    qml.RX(par[1], wires=0)  # <---- 2. parameter
    return qml.expval(qml.PauliZ(0))

qnodes = qml.QNodeCollection([circ1, circ2])

par1 = np.array([0.1234])
par2 = np.array([0.4323])

pars = np.concatenate([par1, par2]) # <---- Concatenating input parameters

qnodes(pars)

There is a historical reason why a QNodeCollection behaves like this: it was created mainly with circuits of similar structure in mind.

  2. The Dask logic in QNodeCollection is as follows:
for q in self.qnodes:
    results.append(dask.delayed(q)(*args, **kwargs))

return dask.compute(*results, scheduler=_scheduler)

Here, _scheduler is the Dask scheduler to use; it can be, for example, "threads".

This logic could also serve as the basis for a custom solution that uses Dask with multiple QNodes. The input parameters for each QNode could then be zipped with the QNode itself (just an idea for a potential approach):

for args, q in zip(arg_lists, qnodes):
    results.append(dask.delayed(q)(*args, **kwargs))

where arg_lists would be an ordered list of arguments for each qnode and qnodes are the QNodes to evaluate.
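
As a rough illustration of the zipped approach, here is a minimal, self-contained sketch. To keep it runnable without a quantum device, a plain Python function stands in for the QNodes: it returns cos(patch + weight), which is the analytic expectation of PauliZ after RY(weight) followed by RY(patch) on |0⟩. The names patch_circuit and evaluate_each and the input values are hypothetical, chosen only for illustration, and the code falls back to a plain loop if Dask is not installed.

```python
import math

try:
    import dask  # optional dependency; a sequential fallback is used without it
except ImportError:
    dask = None


def patch_circuit(patch, weight):
    """Stand-in for a QNode: analytic <Z> after RY(weight), RY(patch) on |0>."""
    return math.cos(patch + weight)


def evaluate_each(callables, arg_lists, scheduler="threads"):
    """Call callables[k](*arg_lists[k]) for every k, via Dask when available."""
    if dask is None:
        return tuple(q(*args) for q, args in zip(callables, arg_lists))
    delayed = [dask.delayed(q)(*args) for q, args in zip(callables, arg_lists)]
    return dask.compute(*delayed, scheduler=scheduler)


patches = [0.1234, 0.4323, 0.9]             # one input per "QNode" (illustrative values)
callables = [patch_circuit] * len(patches)  # in practice: a list of QNodes
arg_lists = [(p, 0.1) for p in patches]     # ordered arguments for each callable

results = evaluate_each(callables, arg_lists)
print(results)
```

With real QNodes, callables would be the list of QNodes and each entry of arg_lists would hold the image patch and weights for that QNode. Whether this gives a speedup will still depend on the device.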

Hope some of this is helpful, let us know how it goes! :slightly_smiling_face:

Hi @antalszava,

Thank you so much! I’ll try both of your suggestions and get back to you soon. :grinning:

Sounds good! :slightly_smiling_face:

Hi @antalszava,

I’m so sorry for getting back to you so late… I was quite busy with my work for the past few weeks :sob:.

I tried the first approach you suggested and here is the code

def test(simulator,parallel):
    dev = qml.device(simulator, wires=2)
    t0 = time.time()
    pars = np.random.rand(128)
    qnodes = qml.QNodeCollection()
    for i in range(128):
        @qml.qnode(dev,interface='torch')
        def circ(inputs,weights,i=i):
            qml.RY(weights, wires=0)
            qml.RY(inputs[i], wires=0)  
            return qml.expval(qml.PauliZ(0))
        qnodes.append(circ)

    res = qnodes(pars,0.1,parallel=parallel)
    t1 = time.time()
    print("Running time", (t1 - t0))

I was a bit confused by the result, as the running time is even higher when I set parallel=True. Also, the documentation recommends using an external simulator for asynchronous mode, so I tried a Qiskit simulator (qiskit.aer), but received the error shown below:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-76b5e0198504> in <module>
----> 1 test('qiskit.aer',False)

<ipython-input-8-be9b5cfd8a9d> in test(simulator, parallel)
      1 def test(simulator,parallel):
----> 2     dev = qml.device(simulator, wires=2)
      3     t0 = time.time()
      4     pars = np.random.rand(128)
      5     qnodes = qml.QNodeCollection()

~/anaconda3/lib/python3.7/site-packages/pennylane/__init__.py in device(name, *args, **kwargs)
    246 
    247         # loads the device class
--> 248         plugin_device_class = plugin_devices[name].load()
    249 
    250         if Version(version()) not in Spec(plugin_device_class.pennylane_requires):

~/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py in load(self, require, *args, **kwargs)
   2432         if require:
   2433             self.require(*args, **kwargs)
-> 2434         return self.resolve()
   2435 
   2436     def resolve(self):

~/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py in resolve(self)
   2438         Resolve the entry point from its module and attrs.
   2439         """
-> 2440         module = __import__(self.module_name, fromlist=['__name__'], level=0)
   2441         try:
   2442             return functools.reduce(getattr, self.attrs, module)

~/anaconda3/lib/python3.7/site-packages/pennylane_qiskit/__init__.py in <module>
     17 from .aer import AerDevice
     18 from .basic_aer import BasicAerDevice
---> 19 from .converter import load, load_qasm, load_qasm_from_file
     20 from .ibmq import IBMQDevice

~/anaconda3/lib/python3.7/site-packages/pennylane_qiskit/converter.py in <module>
     33 
     34 
---> 35 def _check_parameter_bound(param: Parameter, var_ref_map: Dict[Parameter, qml.variable.Variable]):
     36     """Utility function determining if a certain parameter in a QuantumCircuit has
     37     been bound.

AttributeError: module 'pennylane' has no attribute 'variable'

Could you please help me with these problems? Many thanks!

Hi @gojiita_ku,

No worries, hope things are going well! :slightly_smiling_face:

Nice! Indeed, for default.qubit setting parallel=True will likely not have the desired effect. We’ve experienced better performance in particular with the QVM device and welcome all insight on further observations! :slightly_smiling_face:

As for the error, could you make sure that you have the latest released version of both PennyLane and PennyLane-Qiskit? Both should be version 0.16.0. The version number can be checked by looking at the output of qml.about() and the packages can be upgraded by e.g., pip install pennylane --upgrade.
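
For a quick side-by-side check of both versions, something like the following sketch may help. It only uses the standard library (importlib.metadata, available from Python 3.8, so it would not run on the Python 3.7 installation shown in the traceback); installed_version is a hypothetical helper name, and the distribution names are assumed to match the ones published on PyPI.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI (assumed here).
for name in ("PennyLane", "PennyLane-qiskit"):
    print(name, installed_version(name))
```

If the two printed versions differ, upgrading both packages together is the first thing to try.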

Hi @antalszava,

Thanks for your reply!

Following your advice, I upgraded PennyLane-Qiskit to version 0.16.0, and now my code works with the Qiskit simulator. Thanks for that! But the running time is still higher when I set parallel=True :rofl:. I then ran the code on the QVM device, since you mentioned you’ve experienced better performance with it in particular. But I receive an error when I set parallel=True, which I could not understand… Here is the code:

def test(simulator,parallel):
    dev = qml.device(simulator,device='4q-pyqvm')
    t0 = time.time()
    pars = np.random.rand(12)
    qnodes = qml.QNodeCollection()
    for i in range(12):
        @qml.qnode(dev,interface='torch')
        def circ(inputs,weights,i=i):
            qml.RY(weights, wires=0)
            qml.RY(inputs[i], wires=0)  
            return qml.expval(qml.PauliZ(0))
        qnodes.append(circ)

    res = qnodes(pars,0.1,parallel=parallel)
    t1 = time.time()
    print("Running time", (t1 - t0) )
test('forest.qvm',True)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-996259bf4fe9> in <module>
----> 1 test('forest.qvm',True)

<ipython-input-2-de01ad33e3df> in test(simulator, parallel)
     12         qnodes.append(circ)
     13 
---> 14     res = qnodes(pars,0.1,parallel=parallel)
     15     t1 = time.time()
     16     print("Running time", (t1 - t0) )

~/anaconda3/lib/python3.7/site-packages/pennylane/collections/qnode_collection.py in __call__(self, *args, **kwargs)
    274 
    275     def __call__(self, *args, **kwargs):
--> 276         results = self.evaluate(args, kwargs)
    277         return self.convert_results(results, self.interface)
    278 

~/anaconda3/lib/python3.7/site-packages/pennylane/collections/qnode_collection.py in evaluate(self, args, kwargs)
    227                 results.append(dask.delayed(q)(*args, **kwargs))
    228 
--> 229             return dask.compute(*results, scheduler=_scheduler)
    230 
    231         for q in self.qnodes:

~/anaconda3/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
    395     keys = [x.__dask_keys__() for x in collections]
    396     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 397     results = schedule(dsk, keys, **kwargs)
    398     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    399 

~/anaconda3/lib/python3.7/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     74     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     75                         cache=cache, get_id=_thread_get_id,
---> 76                         pack_exception=pack_exception, **kwargs)
     77 
     78     # Cleanup pools associated to dead threads

~/anaconda3/lib/python3.7/site-packages/dask/local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    499                         _execute_task(task, data)  # Re-execute locally
    500                     else:
--> 501                         raise_exception(exc, tb)
    502                 res, worker_id = loads(res_info)
    503                 state['cache'][key] = res

~/anaconda3/lib/python3.7/site-packages/dask/compatibility.py in reraise(exc, tb)
    110         if exc.__traceback__ is not tb:
    111             raise exc.with_traceback(tb)
--> 112         raise exc
    113 
    114 else:

~/anaconda3/lib/python3.7/site-packages/dask/local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    270     try:
    271         task, data = loads(task_info)
--> 272         result = _execute_task(task, data)
    273         id = get_id()
    274         result = dumps((result, id))

~/anaconda3/lib/python3.7/site-packages/dask/local.py in _execute_task(arg, cache, dsk)
    251         func, args = arg[0], arg[1:]
    252         args2 = [_execute_task(a, cache) for a in args]
--> 253         return func(*args2)
    254     elif not ishashable(arg):
    255         return arg

~/anaconda3/lib/python3.7/site-packages/pennylane/qnode.py in __call__(self, *args, **kwargs)
    596 
    597         # execute the tape
--> 598         res = self.qtape.execute(device=self.device)
    599 
    600         # if shots was changed

~/anaconda3/lib/python3.7/site-packages/pennylane/tape/tape.py in execute(self, device, params)
   1321             params = self.get_parameters()
   1322 
-> 1323         return self._execute(params, device=device)
   1324 
   1325     def execute_device(self, params, device):

~/anaconda3/lib/python3.7/site-packages/pennylane/interfaces/torch.py in _execute(self, params, **kwargs)
    266     def _execute(self, params, **kwargs):
    267         kwargs["tape"] = self
--> 268         res = _TorchInterface.apply(kwargs, *params)
    269         return res
    270 

~/anaconda3/lib/python3.7/site-packages/pennylane/interfaces/torch.py in forward(ctx, input_kwargs, *input_)
     71         # evaluate the tape
     72         tape.set_parameters(ctx.all_params_unwrapped, trainable_only=False)
---> 73         res = tape.execute_device(ctx.args, device)
     74         tape.set_parameters(ctx.all_params, trainable_only=False)
     75 

~/anaconda3/lib/python3.7/site-packages/pennylane/tape/tape.py in execute_device(self, params, device)
   1352 
   1353         if isinstance(device, qml.QubitDevice):
-> 1354             res = device.execute(self)
   1355         else:
   1356             res = device.execute(self.operations, self.observables, {})

~/anaconda3/lib/python3.7/site-packages/pennylane_forest/qvm.py in execute(self, circuit, **kwargs)
    156             self._circuit_hash = circuit.graph.hash
    157 
--> 158         return super().execute(circuit, **kwargs)
    159 
    160     def apply(self, operations, **kwargs):

~/anaconda3/lib/python3.7/site-packages/pennylane/_qubit_device.py in execute(self, circuit, **kwargs)
    194         # generate computational basis samples
    195         if self.shots is not None or circuit.is_sampled:
--> 196             self._samples = self.generate_samples()
    197 
    198         multiple_sampled_jobs = circuit.is_sampled and self._has_partitioned_shots()

~/anaconda3/lib/python3.7/site-packages/pennylane_forest/qvm.py in generate_samples(self)
    230     def generate_samples(self):
    231         if "pyqvm" in self.qc.name:
--> 232             return self.qc.run(self.prog, memory_map=self._parameter_map)
    233 
    234         if self.circuit_hash is None:

~/anaconda3/lib/python3.7/site-packages/pyquil/api/_error_reporting.py in wrapper(*args, **kwargs)
    249             global_error_context.log[key] = pre_entry
    250 
--> 251         val = func(*args, **kwargs)
    252 
    253         # poke the return value of that call in

~/anaconda3/lib/python3.7/site-packages/pyquil/api/_quantum_computer.py in run(self, executable, memory_map)
    136             for region_name, values_list in memory_map.items():
    137                 self.qam.write_memory(region_name=region_name, value=values_list)
--> 138         return self.qam.run().wait().read_memory(region_name="ro")
    139 
    140     @_record_call

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in run(self)
    265         for _ in range(self.program.num_shots):
    266             self.wf_simulator.reset()
--> 267             self._execute_program()
    268             for name in self.ram.keys():
    269                 self._memory_results.setdefault(name, list())

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in _execute_program(self)
    482         halted = len(self.program) == 0
    483         while not halted:
--> 484             halted = self.transition()
    485 
    486         return self

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in transition(self)
    311         """
    312         assert self.program is not None
--> 313         instruction = self.program[self.program_counter]
    314 
    315         if isinstance(instruction, Gate):

~/anaconda3/lib/python3.7/site-packages/pyquil/quil.py in __getitem__(self, index)
    893             Program(self.instructions[index])
    894             if isinstance(index, slice)
--> 895             else self.instructions[index]
    896         )
    897 

IndexError: list index out of range

Hi @gojiita_ku, from the following part of the error message,

~/anaconda3/lib/python3.7/site-packages/pyquil/pyqvm.py in transition(self)

it looks like you are using the pyQVM (that is, the Python implementation of the QVM). In the past, we saw the best parallel speedup with the standard QVM - this is a simulator that is written in Lisp, and runs in a separate process to the Python program.
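
One way to see why a simulator running in a separate process matters is the rough sketch below. It is not PennyLane or Dask code: concurrent.futures stands in for Dask's threaded scheduler, and time.sleep stands in for waiting on a reply from an external simulator (external_call and the 0.05 s latency are made-up for illustration). While a thread is waiting it releases Python's global interpreter lock, so the waits can overlap; pure-Python work inside the same process would not overlap in this way.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def external_call(x, latency=0.05):
    """Stand-in for a request to a simulator in another process: mostly waiting."""
    time.sleep(latency)  # sleeping releases the GIL, like waiting on a server reply
    return 2 * x


inputs = list(range(8))

# Sequential: the waits add up.
t0 = time.perf_counter()
sequential = [external_call(x) for x in inputs]
sequential_time = time.perf_counter() - t0

# Threaded: the waits overlap, because each thread is idle while it waits.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(inputs)) as pool:
    threaded = list(pool.map(external_call, inputs))
threaded_time = time.perf_counter() - t0

print(f"sequential: {sequential_time:.3f}s, threaded: {threaded_time:.3f}s")
```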

Perhaps you could try running it again, this time using the QVM? The following demo shows how to do this:

Hi @josh,

Thank you for your reply!

Yes, you are right. I was using the device 4q-pyqvm :rofl:. So I walked through the demo you recommended and switched to the device 4q-qvm. I also downloaded and installed the Forest SDK, and set up a local server for the QVM and quilc on my laptop following the documentation. But when I ran the following code from the terminal:

import pennylane as qml
import numpy as np
import time
import torch


def test(simulator,parallel):
    dev = qml.device(simulator,device='4q-qvm')
    t0 = time.time()
    pars = np.random.rand(12)
    qnodes = qml.QNodeCollection()
    for i in range(12):
        @qml.qnode(dev,interface='torch')
        def circ(inputs,weights,i=i):
            qml.RY(weights, wires=0)
            qml.RY(inputs[i], wires=0)
            return qml.expval(qml.PauliZ(0))
        qnodes.append(circ)

    res = qnodes(pars,0.1,parallel=parallel)
    t1 = time.time()
    print("Running time", (t1 - t0) )


test('forest.qvm',True)

I received an error:

Segmentation fault: 11

Was I missing anything important?

Also, I’m wondering whether, even if we can get a speedup on the QVM device in parallel mode, the running time would still be higher than on the default.qubit simulator. I noticed that running the same code in sequential mode takes only 0.039s on the default.qubit simulator but 1.479s on the QVM device…

Actually, my goal is to train a hybrid quantum-classical neural network, which involves executing $B \times 128 \times 128$ circuits, where $B$ is the batch size (e.g. 4). Both the forward and backward passes need to be computed for this large number of circuits. So I would really like to know whether the QNodeCollection method (particularly in parallel mode) could reduce the running time to a level that is acceptable for production.

Hi @gojiita_ku,

> Was I missing anything important?

From what I can see, your code looks good! Which makes the segmentation fault more puzzling :thinking:

Unfortunately, the segmentation fault indicates it might be a memory issue in the QVM itself. The only thing I can suggest is perhaps opening an issue on the QVM GitHub repository.

> I noticed that running the same code in the sequential mode takes only 0.039s on the default.qubit simulator while taking 1.479s on the QVM device…

One thing to keep in mind is that the supported methods of computing quantum gradients can differ between devices, which can affect total optimization time :slight_smile:

For example, default.qubit defaults to backpropagation, which adds some overhead to the forwards pass, but the backwards pass can be done in constant time.

In contrast, when using an external device like the QVM, only parameter-shift is supported. This means that, while the forward pass is faster than with default.qubit, the backward pass may be slower, since it scales with the number of parameters in the circuit!
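
To make the scaling concrete, here is a small sketch of the two-term parameter-shift rule in plain Python. It is only an illustration under simple assumptions, not PennyLane's implementation: expval_z is a made-up stand-in for a circuit that applies RY(p) to one qubit for each parameter p and measures PauliZ, so its value is cos of the summed angles. Each parameter needs two extra circuit evaluations.

```python
import math


def expval_z(params):
    """Analytic <Z> on one wire after applying RY(p) for each p in params to |0>."""
    return math.cos(sum(params))


def parameter_shift_grad(f, params, shift=math.pi / 2):
    """Gradient of f via the two-term parameter-shift rule; returns (grad, n_evals)."""
    grad, n_evals = [], 0
    for k in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[k] += shift
        minus[k] -= shift
        grad.append((f(plus) - f(minus)) / 2)  # two circuit evaluations per parameter
        n_evals += 2
    return grad, n_evals


params = [0.1, 0.2, 0.3]
grad, n_evals = parameter_shift_grad(expval_z, params)
print(grad, n_evals)
```

For this toy circuit every gradient entry equals -sin(0.1 + 0.2 + 0.3), and the number of evaluations is twice the number of parameters.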

You can check out the backprop demo for more details if you are interested!

Hi @josh,

> Unfortunately, the segmentation fault indicates it might be a memory issue in the QVM itself. The only thing I can suggest is perhaps opening an issue on the QVM GitHub repository.

Ok. I will open an issue on the QVM GitHub repository.

> For example, default.qubit defaults to backpropagation, which adds some overhead to the forwards pass, but the backwards pass can be done in constant time.
> In contrast, when using an external device like the QVM, only parameter-shift is supported. This means that, while the forward pass is faster than with default.qubit, the backward pass may be slower, since it scales with the number of parameters in the circuit!

I see what you mean: I need to consider the running time of both the forward and backward passes. Thanks for your explanation! But what I want to show is that default.qubit performs much better than the QVM device in terms of running time, even for the forward pass alone. It takes only 0.039s to run the above code (which only computes the forward pass) on the default.qubit simulator, but 1.479s on the QVM device. As you mentioned, the backward pass may be slower still on the QVM device, since only parameter-shift is supported and its cost scales with the number of parameters in the circuit. So I wonder whether default.qubit would be the best choice for minimizing the total computation time.

I have another question. Could QNodeCollection objects also help in parallelizing QNode backward pass evaluations (e.g. computing the gradients of parameterized circuits)?

> It takes only 0.039s to run the above code (which only computes the forward pass) on the default.qubit simulator while taking 1.479s on the QVM device.

This is really interesting! Unfortunately I can’t comment too much, since I am unaware of how the QVM works internally. It could be the case that it performs better at higher qubit numbers or greater circuit depths, for example.

> I have another question. Could QNodeCollection objects also help in parallelizing QNode backward pass evaluations (e.g. computing the gradients of parameterized circuits)?

Unfortunately not – QNodeCollection only parallelizes multiple forward passes.

However, we recently added support for PennyLane devices to perform batch execution. At the moment, one device supports it — the PennyLane-Braket device. It uses this batch execution capability to batch execute all the parameter-shift gradients at once.

For more info, there is a nice write up in this demonstration: Computing gradients in parallel with Amazon Braket | PennyLane Demos

Hi @josh,

Thanks for your advice! Actually, I participated in QHack 2021 and used AWS credits to access the Amazon Braket service. I trained a hybrid quantum-classical model using the SV1 simulator in parallel mode, but found that training was much slower than with a local simulator (e.g. default.qubit). I later found in the documentation that a remote device is more advantageous for large circuits than for small ones, given the communication time between the local user and the remote server. In my case, the circuits are always very small (e.g. four qubits). So I’m guessing I had better use local simulators, even though they do not support batch execution :sob:.

Anyway, I think I should share some example code I’m currently working on, so that you or anyone else can give me further guidance on how to reduce the running time :blush:.

No worries @gojiita_ku! Yes, that would definitely be helpful.

Hi, @josh @antalszava

Sorry for my late reply!

Basically, I am working on an image segmentation task based on quantum convolution. The original model is very complicated, so I am only showing the code for the quantum part, which I find the most difficult to optimize :sob:. The code performs a quantum version of $1 \times 1$ convolution on a feature map of shape (batch size, channel, height, width); that is, the values from all channels at the same spatial position are fed into one quantum circuit at a time. In my case, the input feature map has shape (batch size, 4, 128, 128), where the batch size is flexible. I ran the code with a batch size of 2 and it took about 15 minutes on my laptop. The dataset for my task is very large, so I guess it would take a very long time to train the model… So could you please help me with this code? Many thanks!

import pennylane as qml
import torch
import torch.nn as nn
import time

n_qubits = 4
dev = qml.device('default.qubit', wires=n_qubits)

class QuanvLayer(nn.Module):
    def __init__(self, batch_size, height, width, channel):
        super(QuanvLayer, self).__init__()
        self.q_params = nn.Parameter(torch.randn(2, channel))
        qnodes = qml.QNodeCollection()
        for i in range(batch_size*height*width):
            @qml.qnode(dev, interface="torch",method='adjoint',mutable=False)
            def circuit(inputs,weights,i=i):
    
                for j in range(channel):
                    qml.RY(inputs[i][j], wires=j)
         
                return [qml.expval(qml.PauliZ(wires=j)) for j in range(4)]
            qnodes.append(circuit)

        self.qnodes = qnodes


    def forward(self, x):
        b,c,h,w = x.shape
        x = x.permute(1,0,2,3)  # (c,b,h,w)
        x = torch.flatten(x,start_dim=1) # (c,b*h*w)
        x = x.permute(1,0) # (b*h*w,c)

        x = self.qnodes(x,self.q_params,parallel=True)  #(b*h*w,c)

        x = x.reshape((b,h,w,c)) # (b,h,w,c)

        x = x.permute(0,3,2,1) # (b,c,w,h)
        
        x = x.permute(0,1,3,2) # (b,c,h,w)
        
        return x



batch_size = 2
height = 128
width = 128
channel = 4
t0 = time.time()
x = torch.randn(batch_size, channel, height, width, requires_grad=True)
loss = QuanvLayer(batch_size,height,width, channel)(x).sum()
loss.backward()
t1 = time.time()
print(t1-t0)