# Speeding up grad computation

Thanks for the tip I’ll give it a try (will have to be Monday) and let you know.

Hey, so the forest wavefunction simulator is particularly bad & there is clearly something going on (details below), but I’m still not understanding the circuit run times for the other simulators, 10s seems like a lot for 8 qubits (even with grad computations). What do you think?

Example
import pennylane as qml
from pennylane import numpy as np

NUM_QUBITS = 8

def circuit(thetas):
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS - 1):
qml.CNOT(wires=[i, i + 1])
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
qml.RY(thetas[i], wires=(i - NUM_QUBITS))
observable = np.zeros((2**NUM_QUBITS, 2**NUM_QUBITS))
wires = list(range(NUM_QUBITS))
return qml.expval.Hermitian(observable, wires=wires)

x = np.random.random([NUM_QUBITS])

devices = [
qml.device("default.qubit", wires=NUM_QUBITS),
qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
qml.device("forest.wavefunction", wires=NUM_QUBITS),
qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
]

print("Encoding size: {}".format(NUM_QUBITS))
print("Number of qubits: {}".format(NUM_QUBITS))

from time import time

for dev in devices:
print("\nDevice: {}".format(dev.name))
t0 = time()
opt = qml.AdamOptimizer(stepsize=0.1)
qnode = qml.QNode(circuit, dev)
thetas = np.random.random(2 * NUM_QUBITS)
thetas = opt.step(qnode, thetas)
print('Time: ', (time() - t0))


Output
Device: Default qubit PennyLane plugin
Time: 10.741438150405884

Device: pyQVM NumpyWavefunction Simulator Device
Time: 10.3768470287323

Device: Forest Wavefunction Simulator Device
Time: 12.075280666351318

Device: Forest QVM Device
Time: 112.82866406440735

Device: Forest QVM Device
Time: 20.80642604827881

Hi @mxn.wls,

The results look to be expected, once the number of parameters are taken into account. For example, consider the following modification to the script, which times both a single QNode evaluation, as well as an optimization step:

from time import time
import pennylane as qml
from pennylane import numpy as np

NUM_QUBITS = 8

def circuit(thetas):
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS - 1):
qml.CNOT(wires=[i, i + 1])
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
qml.RY(thetas[i], wires=(i - NUM_QUBITS))
observable = np.zeros((2**NUM_QUBITS, 2**NUM_QUBITS))
wires = list(range(NUM_QUBITS))
return qml.expval(qml.Hermitian(observable, wires=wires))

x = np.random.random([NUM_QUBITS])

devices = [
qml.device("default.qubit", wires=NUM_QUBITS),
qml.device("forest.numpy_wavefunction", wires=NUM_QUBITS),
qml.device("forest.wavefunction", wires=NUM_QUBITS),
qml.device("forest.qvm", device="{}q-qvm".format(NUM_QUBITS)),
qml.device("forest.qvm", device="{}q-pyqvm".format(NUM_QUBITS)),
]

thetas = np.random.random(2 * NUM_QUBITS)

print("Encoding size: {}".format(NUM_QUBITS))
print("Number of qubits: {}".format(NUM_QUBITS))
print("Number of parameters:", len(thetas))

for dev in devices:
print("\nDevice: {}".format(dev.name))
qnode = qml.QNode(circuit, dev)

t0 = time()
qnode(thetas)
t1 = time()

print("Forward pass:", t1-t0)

opt = qml.AdamOptimizer(stepsize=0.1)

t0 = time()
thetas = opt.step(qnode, thetas)
t1 = time()

print('Backwards pass: ', (t1 - t0))


For me, this gives the following result:

Encoding size: 8
Number of qubits: 8
Number of parameters: 16

Device: Default qubit PennyLane plugin

Forward pass: 0.5093185901641846
Backwards pass:  18.87824559211731


(timings will vary depending on CPU/memory).

Since PennyLane treats all devices as hardware devices, a single optimization step using a gradient-based optimizer requires (at minimum) 2 circuit evaluations per parameter. In this case, with 16 parameters, a single gradient descent step should take

\sim 2\times 16\times 0.5s \approx 16 s

which is approximately what we see above.

There are a couple of things we are working on to speed up this process:

1. Make default.qubit more efficient (recent work has resulted in a \sim 2 orders of magnitude improvement in default.qubit).

One improvement has been the move away from dense matrix multiplication, to using np.tensordot for matrix-vector multiplication. Currently, forest.wavefunction continues to use dense matrix multiplication for qml.Hermitian, hence the exponential growth in your plot above.

2. Parallelize the gradient computations. Since the gradient of each parameter is independently computed, this could be done in parallel, scaling with the number of cores available on the system. This is a bit difficult due to Python, however.

3. Provide a plugin device that can natively perform backpropagation through the quantum simulation, without requiring multiple quantum evaluations. For example, a simulator coded using TensorFlow or PyTorch.

Note: playing around with the timing script above, I can verify that the size of the observable passed to qml.Hermitian significantly affects the speed of the simulation.

Perhaps the faster path to improvement is rewriting how qml.Hermitian is handled in the default.qubit simulator, in order to chase down any inefficiencies.

Thanks for the detailed response.

On (1.) above you say that the problem comes from the forest.wavefunction simulator continues to use dense matrix multiplication.

1. Does the same problem apply to the pyqvm?
2. Is this something on the pennylane plugin side or the forest side?

I ask because the pyqvm simulator is usually very fast, whereas the other simulators scale badly.

(ex)
import pennylane as qml
from pennylane import numpy as np

NUM_QUBITS = 12

def circuit(thetas):
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS - 1):
qml.CNOT(wires=[i,&#32;i&#32;+&#32;1])
for i in range(NUM_QUBITS):
qml.RX(thetas[i], wires=i)
for i in range(NUM_QUBITS, 2 * NUM_QUBITS):
qml.RY(thetas[i], wires=(i - NUM_QUBITS))
observable = np.zeros((2NUM_QUBITS, 2NUM_QUBITS))
wires = list(range(NUM_QUBITS))
# return qml.expval.Hermitian(observable, wires=wires)
return [qml.expval.PauliZ(wires=[i]) for i in range(NUM_QUBITS)]

x = np.random.random([NUM_QUBITS])

devices = [
qml.device(“default.qubit”, wires=NUM_QUBITS),
qml.device(“forest.numpy_wavefunction”, wires=NUM_QUBITS),
qml.device(“forest.wavefunction”, wires=NUM_QUBITS),
qml.device(“forest.qvm”, device="{}q-qvm".format(NUM_QUBITS)),
qml.device(“forest.qvm”, device="{}q-pyqvm".format(NUM_QUBITS)),
]

print(“Encoding size: {}”.format(NUM_QUBITS))
print(“Number of qubits: {}”.format(NUM_QUBITS))

from time import time

for dev in devices:
print("\nDevice: {}".format(dev.name))
qnode = qml.QNode(circuit, dev)
thetas = np.random.random(2 * NUM_QUBITS)

# t0 = time()
# opt = qml.AdamOptimizer(stepsize=0.1)
# thetas = opt.step(qnode, thetas)
# print('Time gradients: ', (time() - t0))

t0 = time()
qnode(thetas)
print('Time cost: ', (time() - t0))

out
Device: Default qubit PennyLane plugin
Time cost: 6.934019327163696

Device: pyQVM NumpyWavefunction Simulator Device
Time cost: 6.896592617034912

Device: Forest Wavefunction Simulator Device
Time cost: 6.934622526168823

Device: Forest QVM Device
Time cost: 14.904531717300415

Device: Forest QVM Device
Time cost: 0.00637364387512207

Since the pyQVM is a hardware simulator, it doesn’t provide access to the wavefunction, just output samples. So no dense matrix multiplication needs to take place.

On the other hand, due to sampling the accuracy of the pyQVM will not be exact, like forest.wavefunction or default.qubit. Instead, as you increase the number of shots, you will start to approach the exact expectation value.

Ok, so the problem won’t be with the dense matrix multiplication which is good. It looks like the issue is just with how the pyqvm handles the ‘hermitian’ observables, where do you think the issue is here? (i.e. the crazy scaling of the pyqvm when using the hermitian operator)

Also, is it 2 new circuits per parameter if that parameter only corresponds to one gate? If it isn’t, then that means it scales as 2*number of gates with parameters (applying the parameter shift rule to all gates containing that parameter then applying the product rule)

Ah, that might be the key. In order to perform arbitrary measurements of Hermitian observable A on a hardware device, this is the process that needs to be undertaken:

1. Calculate the unitary matrix U, comprised of the orthonormal eigenvectors of A down the columns.

2. Apply the unitary U^\dagger to the quantum state |\psi\rangle, just prior to measurement.

3. From the resulting Pauli-Z measurement samples, calculate the probability of measuring each computational basis state |\langle i | U^\dagger |\psi\rangle|^2.

4. The resulting expectation value is given by:

\langle \psi | A | \psi \rangle = \sum_i \lambda_i |\langle i | U^\dagger |\psi\rangle|^2

where \lambda_i are the eigenvalues of A.

So some of the problem comes from the number of gates. Though the number of parameters is O(10) the number of gates using these parameters is O(100).

The exponential issue with the pyqvm is less obvious to me. Maybe it comes from computing U^(dag)? I appreciate the description above but can’t immediately see why it would slow down the computation so much.

default is default.qubit
pyqvm is forest.qvm
entire is evaluating the whole hamiltonian
piecewise is splitting into pieces and evaluating different circuits for ‘collisions’ between expectations/variables

I also parallelized and wrote my own gradient computation, given the large overhead of the number of gates, so these speeds will be v machine dependent. I still think there is something funny going on with the arbitrary hermitian evals in the pyqvm so will be interested to hear what you guys come out with. Thx for all the info so far

This is my thought as well, however the scaling of np.linalg.eigh would need to be investigated to see if this is in fact the cause. Another reason could be how PyQuil implements arbitrary unitaries — whether it’s via direct matrix multiplication, or whether they decompose the unitary first (which might have some cost associated with it).

The plots look great! We are always looking to speed up the optimization and work on the performance of PennyLane, so if you have noticed anything while writing your own parallelized gradient computation, please feel free to submit a PR to the PennyLane GitHub repository

I’ve encountered the same issue while training a QNN with high-dimensional state vectors. In an attempt to speed it up, I implemented a PennyLane device for Qulacs. It’s more than 2 times faster than the default PennyLane plugin (which was the fastest before from my experiments):

Encoding size: 10
Number of qubits: 10

Device: Default qubit PennyLane plugin
6.89 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Device: pyQVM NumpyWavefunction Simulator Device
437 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Qiskit PennyLane plugin
232 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Qiskit PennyLane plugin
217 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: ProjectQ PennyLane plugin
10.2 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Device: Qulacs device
2.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


(I don’t have QVM installed at the moment)

I put it on GitHub for others to use: https://github.com/soudy/pennylane-qulacs/. Note that it is still a very basic implementation tailored to my needs, but it can easily be expanded.

1 Like

That is awesome, @soudy! Is the Qulacs plugin something you would consider open-sourcing? We could also link to it on https://pennylane.ai/plugins.html

Definitely, see my edit. So far only basic gates and expectation value of Paulis and arbitrary Hermitians are implemented. Maybe in the near future I’ll find some time to expand it, or someone else will so you can link to it.
I do think it’s a good first contender for a simple, faster back-end. Qulacs also has GPU and OpenMP support which I haven’t explored yet, so it could even give a larger speedup.

1 Like

Further update: I did some benchmarks comparing CPU and GPU with the Qulacs plugin:

(tested on Intel Xeon Gold 5118 CPUs and an NVIDIA TITAN RTX GPU)

The testing script I used if you want to compare with other devices:

import time
import sys
import torch
import numpy as np
import pennylane as qml

device = torch.device('cuda:0')

torch.tensor([1, 2], device=device)

qubits = range(2, 26)

REPEATS = 50
N_LAYERS = 8

np.random.seed(23326)

for n in qubits:
VAR = 0.01 * np.random.randn(N_LAYERS, n, 3)

def layer(w):
for i in range(n):
qml.Rot(w[i, 0], w[i, 1], w[i, 2], wires=i)

for i in range(n-1):
qml.CNOT(wires=[i, i+1])

qml.CNOT(wires=[n-1, 0])

# =============== Qulacs GPU
dev = qml.device('qulacs.simulator', gpu=True, wires=n)

@qml.qnode(dev)
def circuit(weights):
for w in weights:
layer(w)

return qml.expval(qml.PauliZ(0))

time_start = time.time()
vals = []
for i in range(REPEATS):
val = circuit(VAR)
vals.append(val)

time_end = time.time()
gpu_duration = time_end - time_start
print(f'Qulacs GPU {n} qubits took {gpu_duration}s')

gpu_average = np.sum(vals) / len(vals)

# =============== Qulacs CPU
dev2 = qml.device('qulacs.simulator', wires=n)

@qml.qnode(dev2)
def qulacs_circuit(weights):
for w in weights:
layer(w)

return qml.expval(qml.PauliZ(0))

time_start = time.time()
vals = []
for i in range(REPEATS):
val = qulacs_circuit(VAR)
vals.append(val)

time_end = time.time()
cpu_duration = time_end - time_start
print(f'Qulacs CPU {n} qubits took {cpu_duration}s')

cpu_average = np.sum(vals) / len(vals)

assert np.allclose(gpu_average, cpu_average), 'GPU and CPU results dont agree'


Note that due to some weird bug with Qulacs, the torch.tensor call is needed to make it recognize the GPU (https://github.com/qulacs/qulacs/issues/195). So the GPU scales nicely for higher number of qubits, while single-core is best for 12 qubits and fewer.

Thanks for sharing the benchmarks @soudy!

Note that we’ve been having some issues with the discourse server lately, so some of the uploaded images are no longer available

Will this setup also be effective when I switch the interface to torch?

Hi @ycchen1989, yes it should be relatively similar between the default interface and the torch interface (most bottlenecks are in the simulation of the quantum circuits and computation of gradients).

Also note that in the latest version of pennylane, we’ve added an experimental device expt.tensornet.tf which should provide faster gradient computation and training than the default plugin for most cases.

1 Like

Hi, continuing on this thread I have some benchmarks regarding gradient computation comparing JIT vs no JIT (in Jax). Using backprop we get gradients faster ~3 orders of magnitude. Using a finite number of shots, with or without JIT, its quite a bit slower:

JIT True | shots None | time taken 5.002456e-05
JIT False | shots None | time taken 2.144583e-02

JIT True | shots 500 | time taken 7.670196e-02
JIT False | shots 500 | time taken 7.490716e-02

Here is the code snippet. I hope this helps to get a clear picture and for some of the future benchmarking:

import pennylane as qml

import jax
import jax.numpy as jnp
from jax import config

config.update("jax_enable_x64", True)

import time

def timeit(func, params):
"""Time the function.

Args:
func (Callable): A function to call.
params (array): The inputs to the function.

Returns:
float: The time taken to run the function.
"""
tic = time.perf_counter()
func(params)
toc = time.perf_counter()
return toc - tic

N = 5
variational_ansatz = qml.BasicEntanglerLayers
n_layers = 5

def get_grad_fn(shots=None, diff_method='best', jit=True):
"""Get the gradient function with a combination of shots, diff_method and JIT

Args:
shots (int, optional): Number of shots. Defaults to None.
diff_method (str, optional): The method to use for gradient computation.
Defaults to 'best'.
jit (bool, optional): Should the gradient function be JIT compiled.
Defaults to True.

Returns:
Callable: A callable function that computes gradients
"""
dev = qml.device("default.qubit.jax", wires=N, shots=shots)
variational_ansatz = qml.BasicEntanglerLayers

@jax.jit
@qml.qnode(dev, interface="jax", diff_method=diff_method)
def circuit(params: jnp.array):
"""Variational circuit that we want to optimize

Args:
params (jnp.array): _description_

Returns:
float: Expecatation value
"""
variational_ansatz(params, wires=range(N))
return qml.expval(qml.PauliZ(0))

grad_x = jax.grad(circuit)

if jit:
return jax.jit(grad_x)
else:
return grad_x

for shots in [None, 500]:
for jit in [True, False]:
grad_x = get_grad_fn(shots=shots, jit=jit)

key = jax.random.PRNGKey(42)
x = jax.random.uniform(key, variational_ansatz.shape(n_layers=n_layers, n_wires=N))

# Run the grad function once to compile
grad_x(x)

num_repeat = 100
times = np.empty(num_repeat)

for i in range(100):
x = jax.random.uniform(key, variational_ansatz.shape(n_layers=n_layers, n_wires=N))
times[i] = timeit(grad_x, x)

print(f"JIT {jit} | shots {shots} | time taken ", "{:e}".format(jnp.mean(times)))

1 Like