Speeding up grad computation



As I mentioned in my previous post, I’m trying to write a quantum circuit to learn MNIST classifications. Because images are too large for a quantum circuit, the data is first run through an autoencoder to reduce the dimensionality from 28 * 28 to just one vector of length 10. I then run through a circuit with 10 wires, and use the expectation of each wire as the score for that class. I’ve got it all working, but it’s pretty slow.

This is what the circuit, cost and grad code look like. I’ve omitted setup and imports for the sake of brevity, but can post that if it might affect things.

# x will be a length ENCODING_SIZE vector
# that represents the encoding of a MNIST image
# thetas is of size 2 * NUM_QUBITS
def circuit(x, thetas):
    for i in range(ENCODING_SIZE):
        RX(x[i], wires=i)
    for i in range(NUM_QUBITS - 1):
        CNOT(wires=[i, i+1])
    for i in range(NUM_QUBITS):
        RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
        RY(thetas[i], wires=(i - NUM_QUBITS))
    return tuple(qml.expval.PauliZ(wires=i) for i in range(NUM_QUBITS))

# X is of size (b, 10), actual_labels is size (b,)
# thetas if of size 2 * NUM_QUBITS.
# implements cross-entropy classification loss
# as described here:
# https://pytorch.org/docs/stable/nn.html#crossentropyloss
# with numerical stability
def cost(X, actual_labels, thetas):
    b = X.shape[0]
    yhats = []
    for i in range(b):
        yhat = circuit(X[i], thetas)
    st = np.stack(yhats)
    actual_class_vals = st[range(b), actual_labels]
    shifted = st - np.max(st, axis=1)[:, np.newaxis]
    the_sum = np.log(np.sum(np.exp(shifted), axis=1))
    return np.mean(-actual_class_vals + the_sum)

# loaded the data in batches of size 4, so
# X is of size (4, 10)
X = encoder(inputs.view(len(labels), -1))
start = time.time()
qml.grad(cost, argnum=2)(X.numpy(), labels.numpy(), thetas)
print(time.time() - start)

this operation takes about 200 seconds (and scales linearly with the size of the batch, so 50 seconds per example). at this speed, it would take a month to do the entire 60000 image dataset. Is there anything I can do to speed this up, or is this just the nature of the implementation and there is not much that can be done about this? the reason I ask is because this is for a class project (CS269Q at Stanford) and we only have about two weeks remaining.

I have two thoughts so far on why it is slow:

  1. the cost function is semi complicated so calculating the gradient is quite a hassle. however, I feel like in any classification task it’s going to be like this. should I try to switch to some dataset on which I can perform regression instead?
  2. there are too many wires. I could try to reduce the number of wires, but the reason I picked 10 was so each wire could correspond to one of the 10 classes. if I need to reduce the number of wires to say 5, how would I classify after that? I guess I could attach it to a simple matrix multiplication that maps from the 5 wires to the 10 classes, and also learn that 5 x 10 matrix. the only problem is that really increases the number of parameters to learn, which may or may not be a problem. I’m not sure.

any thoughts on this would be very much appreciated. thanks so much!


Hi @kushkhosla! Thanks for your question.

Can I ask what simulator/device you are using for your quantum simulation? I would like to run some benchmarking on my side, to work out the best approach/work out where the speed up would be most effective.


hi @josh! thanks so much for your help, I really appreciate it.

I’m just using the standard qml.device(‘default.qubit’) simulator. as for actual hardware, I’m on my laptop’s CPU.


Hi @kushkhosla, before looking at the scaling issue, I decided to try benchmarking the different simulators. I used the following IPython script:

import pennylane as qml
from pennylane import numpy as np


def circuit(x, thetas):
    for i in range(ENCODING_SIZE):
        qml.RX(x[i], wires=i)
    for i in range(NUM_QUBITS - 1):
        qml.CNOT(wires=[i, i + 1])
    for i in range(NUM_QUBITS):
        qml.RX(thetas[i], wires=i)
    for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
        qml.RY(thetas[i], wires=(i - NUM_QUBITS))
    return tuple(qml.expval.PauliZ(wires=i) for i in range(NUM_QUBITS))

x = np.random.random([ENCODING_SIZE])
thetas = np.random.random(2 * NUM_QUBITS)

devices = [
    qml.device("default.qubit", wires=NUM_QUBITS),
    qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
    qml.device("forest.wavefunction", wires=NUM_QUBITS),
    qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
    qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
    qml.device("qiskit.basicaer", wires=NUM_QUBITS),
    qml.device("qiskit.aer", wires=NUM_QUBITS),
    qml.device("projectq.simulator", wires=NUM_QUBITS),
    # qml.device("microsoft.QubitSimulator", wires=NUM_QUBITS),

print("Encoding size: {}".format(ENCODING_SIZE))
print("Number of qubits: {}".format(NUM_QUBITS))

for dev in devices:
    print("\nDevice: {}".format(dev.name))
    qnode = qml.QNode(circuit, dev)
    %timeit qnode(x, thetas)

Running this script with ipython timing.ipy, gives the following results:

Encoding size: 10
Number of qubits: 10

Device: Default qubit PennyLane plugin
2.35 s ± 236 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: pyQVM NumpyWavefunction Simulator Device
293 ms ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Forest Wavefunction Simulator Device
350 ms ± 65.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Forest QVM Device (10q-qvm, 1024 shots)
5.6 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Forest pyQVM Device (10q-pyqvm, 1024 shots)
6.71 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Device: Qiskit Basic Aer (1024 shots)
179 ms ± 4.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Qiskit Aer (1024 shots)
162 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Device: ProjectQ PennyLane plugin (1024 shots)
60.5 ms ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(NB: I modified the print statement above to give more device information. The devices that specify shots are hardware simulators, so increasing the number of shots increases accuracy, but also increases runtime.)

A couple of things to note:

  • The default.qubit device is quite slow. This is not intentional, but at the same time, the default.qubit device is not meant for production code — it is a reference plugin designed to show developers how a PennyLane plugin is coded.

  • We recommend instead that you use a plugin for an external high-performance qubit simulator. From the rough benchmarking above, it appears that for 10 qubits, the Rigetti Forest pyQVM simulator destroys the competition :slightly_smiling_face: So you should see some significant improvements using

     qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS), shots=1024)

    Alternatively, you can use the NumPy wavefunction simulator for exact expectation values:

     qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS)

    Both of these devices can be installed via

     git clone https://github.com/rigetti/pennylane-forest
     cd pennylane-forest
     pip install -e .

In terms of the scaling, note that the above times t are for a single circuit evaluation. To determine the gradient for M free parameters, PennyLane must query the quantum device 2M times; so the expected time taken per optimization step should be \sim 2Mt.

For more details on why this is the case, see @nathan’s great answer here:

Note that we are working on alleviating the optimization runtime scaling! This will likely be through a combination of:

  • Extending PennyLane to perform the gradient computations for each parameter in parallel, and

  • Implementing efficiency gains that can be achieved assuming the underlying device is a simulator (and not hardware).


@josh, thank you so much for the suggestion and work you put into this. we took your advice, and performance sped up by a factor of about 200. we are now able to train in a reasonable amount of time, and only have to work on the circuit architecture.

again, thanks so much!


can you please tell the version of each lib?



  • PennyLane: latest master version.

  • All PennyLane plugins: I am running the latest master version.

  • pyQuil: 2.7

  • Qiskit: 0.10.1

  • Qiskit-aer: 0.2.0

  • ProjectQ: 0.4.1

  • Q#: 0.5.1904.1302

  • QVM: 1.8.2 [94d402b]

  • Quilc: 1.8.2 [85e2290]


there must be something in my environment that causes the pauliZ to produce an error :frowning: