Speeding up grad computation

kushkhosla · May 25, 2019, 7:14pm

Hi!

As I mentioned in my previous post, I’m trying to write a quantum circuit to learn MNIST classifications. Because images are too large for a quantum circuit, the data is first run through an autoencoder to reduce the dimensionality from 28 * 28 to just one vector of length 10. I then run through a circuit with 10 wires, and use the expectation of each wire as the score for that class. I’ve got it all working, but it’s pretty slow.

This is what the circuit, cost and grad code look like. I’ve omitted setup and imports for the sake of brevity, but can post that if it might affect things.

# x will be a length ENCODING_SIZE vector
# that represents the encoding of a MNIST image
# thetas is of size 2 * NUM_QUBITS
@qml.qnode(dev)
def circuit(x, thetas):
    for i in range(ENCODING_SIZE):
        RX(x[i], wires=i)
    for i in range(NUM_QUBITS - 1):
        CNOT(wires=[i, i+1])
    for i in range(NUM_QUBITS):
        RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
        RY(thetas[i], wires=(i - NUM_QUBITS))
    return tuple(qml.expval.PauliZ(wires=i) for i in range(NUM_QUBITS))

# X is of size (b, 10), actual_labels is size (b,)
# thetas if of size 2 * NUM_QUBITS.
# implements cross-entropy classification loss
# as described here:
# https://pytorch.org/docs/stable/nn.html#crossentropyloss
# with numerical stability
def cost(X, actual_labels, thetas):
    b = X.shape[0]
    yhats = []
    for i in range(b):
        yhat = circuit(X[i], thetas)
        yhats.append(yhat)
    st = np.stack(yhats)
    actual_class_vals = st[range(b), actual_labels]
    shifted = st - np.max(st, axis=1)[:, np.newaxis]
    the_sum = np.log(np.sum(np.exp(shifted), axis=1))
    return np.mean(-actual_class_vals + the_sum)

# loaded the data in batches of size 4, so
# X is of size (4, 10)
X = encoder(inputs.view(len(labels), -1))
start = time.time()
qml.grad(cost, argnum=2)(X.numpy(), labels.numpy(), thetas)
print(time.time() - start)

this operation takes about 200 seconds (and scales linearly with the size of the batch, so 50 seconds per example). at this speed, it would take a month to do the entire 60000 image dataset. Is there anything I can do to speed this up, or is this just the nature of the implementation and there is not much that can be done about this? the reason I ask is because this is for a class project (CS269Q at Stanford) and we only have about two weeks remaining.

I have two thoughts so far on why it is slow:

the cost function is semi complicated so calculating the gradient is quite a hassle. however, I feel like in any classification task it’s going to be like this. should I try to switch to some dataset on which I can perform regression instead?
there are too many wires. I could try to reduce the number of wires, but the reason I picked 10 was so each wire could correspond to one of the 10 classes. if I need to reduce the number of wires to say 5, how would I classify after that? I guess I could attach it to a simple matrix multiplication that maps from the 5 wires to the 10 classes, and also learn that 5 x 10 matrix. the only problem is that really increases the number of parameters to learn, which may or may not be a problem. I’m not sure.

any thoughts on this would be very much appreciated. thanks so much!

josh · May 25, 2019, 8:45pm

Hi @kushkhosla! Thanks for your question.

Can I ask what simulator/device you are using for your quantum simulation? I would like to run some benchmarking on my side, to work out the best approach/work out where the speed up would be most effective.

kushkhosla · May 25, 2019, 9:16pm

hi @josh! thanks so much for your help, I really appreciate it.

I’m just using the standard qml.device(‘default.qubit’) simulator. as for actual hardware, I’m on my laptop’s CPU.

josh · May 27, 2019, 2:15pm

Hi @kushkhosla, before looking at the scaling issue, I decided to try benchmarking the different simulators. I used the following IPython script:

import pennylane as qml
from pennylane import numpy as np

ENCODING_SIZE = 10
NUM_QUBITS = 10

def circuit(x, thetas):
    for i in range(ENCODING_SIZE):
        qml.RX(x[i], wires=i)
    for i in range(NUM_QUBITS - 1):
        qml.CNOT(wires=[i, i + 1])
    for i in range(NUM_QUBITS):
        qml.RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
        qml.RY(thetas[i], wires=(i - NUM_QUBITS))
    return tuple(qml.expval.PauliZ(wires=i) for i in range(NUM_QUBITS))


x = np.random.random([ENCODING_SIZE])
thetas = np.random.random(2 * NUM_QUBITS)

devices = [
    qml.device("default.qubit", wires=NUM_QUBITS),
    qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
    qml.device("forest.wavefunction", wires=NUM_QUBITS),
    qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
    qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
    qml.device("qiskit.basicaer", wires=NUM_QUBITS),
    qml.device("qiskit.aer", wires=NUM_QUBITS),
    qml.device("projectq.simulator", wires=NUM_QUBITS),
    # qml.device("microsoft.QubitSimulator", wires=NUM_QUBITS),
]

print("Encoding size: {}".format(ENCODING_SIZE))
print("Number of qubits: {}".format(NUM_QUBITS))

for dev in devices:
    print("\nDevice: {}".format(dev.name))
    qnode = qml.QNode(circuit, dev)
    %timeit qnode(x, thetas)

Running this script with ipython timing.ipy, gives the following results:

Encoding size: 10
Number of qubits: 10

Device: Default qubit PennyLane plugin
2.35 s ± 236 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: pyQVM NumpyWavefunction Simulator Device
293 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Forest Wavefunction Simulator Device
350 ms ± 65.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Forest QVM Device (10q-qvm, 1024 shots)
5.6 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Forest pyQVM Device (10q-pyqvm, 1024 shots)
6.71 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Device: Qiskit Basic Aer (1024 shots)
179 ms ± 4.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Qiskit Aer (1024 shots)
162 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Device: ProjectQ PennyLane plugin (1024 shots)
60.5 ms ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(NB: I modified the print statement above to give more device information. The devices that specify shots are hardware simulators, so increasing the number of shots increases accuracy, but also increases runtime.)

A couple of things to note:

The default.qubit device is quite slow. This is not intentional, but at the same time, the default.qubit device is not meant for production code — it is a reference plugin designed to show developers how a PennyLane plugin is coded.
We recommend instead that you use a plugin for an external high-performance qubit simulator. From the rough benchmarking above, it appears that for 10 qubits, the Rigetti Forest pyQVM simulator destroys the competition So you should see some significant improvements using
```
 qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS), shots=1024)
```
Alternatively, you can use the NumPy wavefunction simulator for exact expectation values:
```
 qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS)
```
Both of these devices can be installed via
```
 git clone https://github.com/rigetti/pennylane-forest
 cd pennylane-forest
 pip install -e .
```

In terms of the scaling, note that the above times t are for a single circuit evaluation. To determine the gradient for M free parameters, PennyLane must query the quantum device 2M times; so the expected time taken per optimization step should be \sim 2Mt.

For more details on why this is the case, see @nathan’s great answer here:

Note that we are working on alleviating the optimization runtime scaling! This will likely be through a combination of:

Extending PennyLane to perform the gradient computations for each parameter in parallel, and
Implementing efficiency gains that can be achieved assuming the underlying device is a simulator (and not hardware).

kushkhosla · May 28, 2019, 4:48am

@josh, thank you so much for the suggestion and work you put into this. we took your advice, and performance sped up by a factor of about 200. we are now able to train in a reasonable amount of time, and only have to work on the circuit architecture.

again, thanks so much!

kareem_essafty · May 31, 2019, 6:53pm

@josh
can you please tell the version of each lib?

josh · May 31, 2019, 7:09pm

Certainly.

PennyLane: latest master version.
All PennyLane plugins: I am running the latest master version.
pyQuil: 2.7
Qiskit: 0.10.1
Qiskit-aer: 0.2.0
ProjectQ: 0.4.1
Q#: 0.5.1904.1302
QVM: 1.8.2 [94d402b]
Quilc: 1.8.2 [85e2290]

kareem_essafty · May 31, 2019, 7:15pm

there must be something in my environment that causes the pauliZ to produce an error

mxn.wls · June 28, 2019, 9:40pm

On this is there any way to get some performance gains here?

I’m running pyqvm & currently using a 2^N x 2^N hermitian cost operator. I was working with several different circuits (as I had ‘collisions’ between variables) so had to initialize 2 circuits for the hamitlonian H = z1 + z1z2, however this scales with the number of collisions. If we move to a hermitian cost operator though we get this exponential slowdown in the computation, is there something going on with the plugin?

Thanks for any help! Example below.

(This code runs ~1000x slower than measuring the expectations in the commented out return.)

import pennylane as qml
from pennylane import numpy as np
from homegrown.timing.utils import time_it

ENCODING_SIZE = 10
NUM_QUBITS = 10


# loop over problem

def circuit(x, thetas):
    for i in range(ENCODING_SIZE):
        qml.RX(x[i], wires=i)
    for i in range(NUM_QUBITS - 1):
        qml.CNOT(wires=[i, i + 1])
    for i in range(NUM_QUBITS):
        qml.RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
        qml.RY(thetas[i], wires=(i - NUM_QUBITS))
    # return tuple(qml.expval.PauliZ(wires=i) for i in range(NUM_QUBITS))
    wires = list(range(NUM_QUBITS))
    observable = np.zeros((2**NUM_QUBITS, 2**NUM_QUBITS)).astype(np.float32)
    return qml.expval.Hermitian(observable, wires=wires)

x = np.random.random([ENCODING_SIZE])
thetas = np.random.random(2 * NUM_QUBITS)

devices = [
    # qml.device("default.qubit", wires=NUM_QUBITS),
    # qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
    # qml.device("forest.wavefunction", wires=NUM_QUBITS),
    # qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
    qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
    # qml.device("qiskit.basicaer", wires=NUM_QUBITS),
    # qml.device("qiskit.aer", wires=NUM_QUBITS),
    # qml.device("projectq.simulator", wires=NUM_QUBITS),
    # qml.device("microsoft.QubitSimulator", wires=NUM_QUBITS),
]

print("Encoding size: {}".format(ENCODING_SIZE))
print("Number of qubits: {}".format(NUM_QUBITS))

for dev in devices:
    print("\nDevice: {}".format(dev.name))
    qnode = time_it(qml.QNode(circuit, dev))
    qnode(x, thetas)

mxn.wls · June 28, 2019, 9:42pm

It’d be just super grand if we could hack the state out of the pyqvm simulation and work with that instead?

mxn.wls · June 28, 2019, 10:52pm

this is eval times for the same circuit, using 2 different methods of exp val evaluation (a)‘piecewise’ and ‘entire Hamiltonian’ for a) when the exp vals are divided into smaller chunks and b) aggregated into a single Hermitian matrix).

There are the cost evals on there and then the gradient computation for the entire hermitian matrix goes nuts.

The piecewise gradient computation circuits seem to scale as expected 2MN.

Each circuit has around 6 params.

by ‘gradient’ I mean a computation which calls an optimizer for 1 step.

nathan · June 29, 2019, 12:06am

Hi @mxn.wls,

I don’t have the codebase in front of me at the moment, but if I had to guess, it might be that there is a dense matrix multiplication somewhere in the plugin code whose complexity blows up as you increase number of qubits.

You could maybe try the latest version of PL (0.4) and use the default.qubit simulator. Not 100% it may solve your problem, but we did recently make some improvements there which significantly speed up the Hermitian expval.

mxn.wls · June 29, 2019, 4:38pm

This seems very slow, could you explain a little how the simulation works?

i.e. if I simulate a 2^10 state vector and 100 unitary operations exactly in numpy this takes 0.04 seconds. As far as I understand the gradient computations require 2 new circuits for each parameter. So let’s call gradient + classical processing ~1s (generously), where is all the overhead coming from?

Even if I simulate the whole probability density matrix instead of the statevector (100 operations, 8 qubits) I get <0.07s circuit evals, so should be getting <1s (roughly) circuit + gradient evals at 8 qubits, whereas the piecewise eval is O(10s). Actually thinking about it this makes sense (kind of) because the in the piecewise implementation number of circuits scales with the number of ‘common’ variables… so am I right in saying the simulator uses the whole density matrix?

mxn.wls · June 29, 2019, 4:45pm

Thanks for the tip I’ll give it a try (will have to be Monday) and let you know.

mxn.wls · July 1, 2019, 3:31pm

Hey, so the forest wavefunction simulator is particularly bad & there is clearly something going on (details below), but I’m still not understanding the circuit run times for the other simulators, 10s seems like a lot for 8 qubits (even with grad computations). What do you think?

Example
import pennylane as qml
from pennylane import numpy as np

NUM_QUBITS = 8

def circuit(thetas):
for i in range(NUM_QUBITS):
    qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS - 1):
    qml.CNOT(wires=[i, i + 1])
for i in range(NUM_QUBITS):
    qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
    qml.RY(thetas[i], wires=(i - NUM_QUBITS))
observable = np.zeros((2**NUM_QUBITS, 2**NUM_QUBITS))
wires = list(range(NUM_QUBITS))
return qml.expval.Hermitian(observable, wires=wires)

x = np.random.random([NUM_QUBITS])

devices = [
qml.device("default.qubit", wires=NUM_QUBITS),
qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
qml.device("forest.wavefunction", wires=NUM_QUBITS),
qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
]

print("Encoding size: {}".format(NUM_QUBITS))
print("Number of qubits: {}".format(NUM_QUBITS))

from time import time

for dev in devices:
print("\nDevice: {}".format(dev.name))
t0 = time()
opt = qml.AdamOptimizer(stepsize=0.1)
qnode = qml.QNode(circuit, dev)
thetas = np.random.random(2 * NUM_QUBITS)
thetas = opt.step(qnode, thetas)
print('Time: ', (time() - t0))

Output
Device: Default qubit PennyLane plugin
Time: 10.741438150405884

Device: pyQVM NumpyWavefunction Simulator Device
Time: 10.3768470287323

Device: Forest Wavefunction Simulator Device
Time: 12.075280666351318

Device: Forest QVM Device
Time: 112.82866406440735

Device: Forest QVM Device
Time: 20.80642604827881

josh · July 1, 2019, 3:47pm

Hi @mxn.wls,

The results look to be expected, once the number of parameters are taken into account. For example, consider the following modification to the script, which times both a single QNode evaluation, as well as an optimization step:

from time import time
import pennylane as qml
from pennylane import numpy as np

NUM_QUBITS = 8

def circuit(thetas):
    for i in range(NUM_QUBITS):
        qml.RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS - 1):
        qml.CNOT(wires=[i, i + 1])
    for i in range(NUM_QUBITS):
        qml.RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
        qml.RY(thetas[i], wires=(i - NUM_QUBITS))
    observable = np.zeros((2**NUM_QUBITS, 2**NUM_QUBITS))
    wires = list(range(NUM_QUBITS))
    return qml.expval(qml.Hermitian(observable, wires=wires))

x = np.random.random([NUM_QUBITS])

devices = [
qml.device("default.qubit", wires=NUM_QUBITS),
qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
qml.device("forest.wavefunction", wires=NUM_QUBITS),
qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
]

thetas = np.random.random(2 * NUM_QUBITS)

print("Encoding size: {}".format(NUM_QUBITS))
print("Number of qubits: {}".format(NUM_QUBITS))
print("Number of parameters:", len(thetas))

for dev in devices:
    print("\nDevice: {}".format(dev.name))
    qnode = qml.QNode(circuit, dev)

    t0 = time()
    qnode(thetas)
    t1 = time()

    print("Forward pass:", t1-t0)

    opt = qml.AdamOptimizer(stepsize=0.1)

    t0 = time()
    thetas = opt.step(qnode, thetas)
    t1 = time()

    print('Backwards pass: ', (t1 - t0))

For me, this gives the following result:

Encoding size: 8
Number of qubits: 8
Number of parameters: 16

Device: Default qubit PennyLane plugin

Forward pass: 0.5093185901641846
Backwards pass:  18.87824559211731

(timings will vary depending on CPU/memory).

Since PennyLane treats all devices as hardware devices, a single optimization step using a gradient-based optimizer requires (at minimum) 2 circuit evaluations per parameter. In this case, with 16 parameters, a single gradient descent step should take

\sim 2\times 16\times 0.5s \approx 16 s

which is approximately what we see above.

There are a couple of things we are working on to speed up this process:

Make default.qubit more efficient (recent work has resulted in a \sim 2 orders of magnitude improvement in default.qubit).

One improvement has been the move away from dense matrix multiplication, to using np.tensordot for matrix-vector multiplication. Currently, forest.wavefunction continues to use dense matrix multiplication for qml.Hermitian, hence the exponential growth in your plot above.
Parallelize the gradient computations. Since the gradient of each parameter is independently computed, this could be done in parallel, scaling with the number of cores available on the system. This is a bit difficult due to Python, however.
Provide a plugin device that can natively perform backpropagation through the quantum simulation, without requiring multiple quantum evaluations. For example, a simulator coded using TensorFlow or PyTorch.

josh · July 1, 2019, 3:52pm

Note: playing around with the timing script above, I can verify that the size of the observable passed to qml.Hermitian significantly affects the speed of the simulation.

Perhaps the faster path to improvement is rewriting how qml.Hermitian is handled in the default.qubit simulator, in order to chase down any inefficiencies.

mxn.wls · July 1, 2019, 4:01pm

Thanks for the detailed response.

On (1.) above you say that the problem comes from the forest.wavefunction simulator continues to use dense matrix multiplication.

Does the same problem apply to the pyqvm?
Is this something on the pennylane plugin side or the forest side?

I ask because the pyqvm simulator is usually very fast, whereas the other simulators scale badly.

(ex)
import pennylane as qml
from pennylane import numpy as np

NUM_QUBITS = 12

def circuit(thetas):
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS - 1):
qml.CNOT(wires=[i, i + 1])
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
qml.RY(thetas[i], wires=(i - NUM_QUBITS))
observable = np.zeros((2NUM_QUBITS, 2NUM_QUBITS))
wires = list(range(NUM_QUBITS))
# return qml.expval.Hermitian(observable, wires=wires)
return [qml.expval.PauliZ(wires=[i]) for i in range(NUM_QUBITS)]

x = np.random.random([NUM_QUBITS])

devices = [
qml.device(“default.qubit”, wires=NUM_QUBITS),
qml.device(“forest.numpy_wavefunction”, wires=NUM_QUBITS),
qml.device(“forest.wavefunction”, wires=NUM_QUBITS),
qml.device(“forest.qvm”, device="{}q-qvm".format(NUM_QUBITS)),
qml.device(“forest.qvm”, device="{}q-pyqvm".format(NUM_QUBITS)),
]

print(“Encoding size: {}”.format(NUM_QUBITS))
print(“Number of qubits: {}”.format(NUM_QUBITS))

from time import time

for dev in devices:
print("\nDevice: {}".format(dev.name))
qnode = qml.QNode(circuit, dev)
thetas = np.random.random(2 * NUM_QUBITS)

# t0 = time()
# opt = qml.AdamOptimizer(stepsize=0.1)
# thetas = opt.step(qnode, thetas)
# print('Time gradients: ', (time() - t0))

t0 = time()
qnode(thetas)
print('Time cost: ', (time() - t0))

out
Device: Default qubit PennyLane plugin
Time cost: 6.934019327163696

Device: pyQVM NumpyWavefunction Simulator Device
Time cost: 6.896592617034912

Device: Forest Wavefunction Simulator Device
Time cost: 6.934622526168823

Device: Forest QVM Device
Time cost: 14.904531717300415

Device: Forest QVM Device
Time cost: 0.00637364387512207

josh · July 1, 2019, 4:32pm

Since the pyQVM is a hardware simulator, it doesn’t provide access to the wavefunction, just output samples. So no dense matrix multiplication needs to take place.

On the other hand, due to sampling the accuracy of the pyQVM will not be exact, like forest.wavefunction or default.qubit. Instead, as you increase the number of shots, you will start to approach the exact expectation value.

mxn.wls · July 1, 2019, 5:18pm

Ok, so the problem won’t be with the dense matrix multiplication which is good. It looks like the issue is just with how the pyqvm handles the ‘hermitian’ observables, where do you think the issue is here? (i.e. the crazy scaling of the pyqvm when using the hermitian operator)

Topic		Replies	Views
Qualcs device and qml.grad() timings PennyLane Help	6	656	March 10, 2022
Batching Inputs to Quantum Circuit PennyLane Help	14	3308	September 10, 2019
Backward function takes long and batches PennyLane Help	14	2599	July 4, 2023
Running a VQA with Qiskit Runtime PennyLane Qiskit	6	936	December 6, 2023
How to use Dask to parallelize QNode computations? PennyLane Help	26	1604	August 10, 2021

Speeding up grad computation

Related topics