Also, is it 2 new circuits per parameter if that parameter only corresponds to one gate? If it isn’t, then that means it scales as 2*number of gates with parameters (applying the parameter shift rule to all gates containing that parameter then applying the product rule)
Speeding up grad computation
Ah, that might be the key. In order to perform arbitrary measurements of Hermitian observable A on a hardware device, this is the process that needs to be undertaken:

Calculate the unitary matrix U, comprised of the orthonormal eigenvectors of A down the columns.

Apply the unitary U^\dagger to the quantum state \psi\rangle, just prior to measurement.

From the resulting PauliZ measurement samples, calculate the probability of measuring each computational basis state \langle i  U^\dagger \psi\rangle^2.

The resulting expectation value is given by:
\langle \psi  A  \psi \rangle = \sum_i \lambda_i \langle i  U^\dagger \psi\rangle^2
where \lambda_i are the eigenvalues of A.
So some of the problem comes from the number of gates. Though the number of parameters is O(10) the number of gates using these parameters is O(100).
The exponential issue with the pyqvm is less obvious to me. Maybe it comes from computing U^(dag)? I appreciate the description above but can’t immediately see why it would slow down the computation so much.
default is default.qubit
pyqvm is forest.qvm
entire is evaluating the whole hamiltonian
piecewise is splitting into pieces and evaluating different circuits for ‘collisions’ between expectations/variables
I also parallelized and wrote my own gradient computation, given the large overhead of the number of gates, so these speeds will be v machine dependent. I still think there is something funny going on with the arbitrary hermitian evals in the pyqvm so will be interested to hear what you guys come out with. Thx for all the info so far
This is my thought as well, however the scaling of np.linalg.eigh
would need to be investigated to see if this is in fact the cause. Another reason could be how PyQuil implements arbitrary unitaries — whether it’s via direct matrix multiplication, or whether they decompose the unitary first (which might have some cost associated with it).
The plots look great! We are always looking to speed up the optimization and work on the performance of PennyLane, so if you have noticed anything while writing your own parallelized gradient computation, please feel free to submit a PR to the PennyLane GitHub repository
I’ve encountered the same issue while training a QNN with highdimensional state vectors. In an attempt to speed it up, I implemented a PennyLane device for Qulacs. It’s more than 2 times faster than the default PennyLane plugin (which was the fastest before from my experiments):
Encoding size: 10
Number of qubits: 10
Device: Default qubit PennyLane plugin
6.89 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Device: pyQVM NumpyWavefunction Simulator Device
437 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Device: Qiskit PennyLane plugin
232 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Device: Qiskit PennyLane plugin
217 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Device: ProjectQ PennyLane plugin
10.2 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Device: Qulacs device
2.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(I don’t have QVM installed at the moment)
I put it on GitHub for others to use: https://github.com/soudy/pennylanequlacs/. Note that it is still a very basic implementation tailored to my needs, but it can easily be expanded.
That is awesome, @soudy! Is the Qulacs plugin something you would consider opensourcing? We could also link to it on https://pennylane.ai/plugins.html
Definitely, see my edit. So far only basic gates and expectation value of Paulis and arbitrary Hermitians are implemented. Maybe in the near future I’ll find some time to expand it, or someone else will so you can link to it.
I do think it’s a good first contender for a simple, faster backend. Qulacs also has GPU and OpenMP support which I haven’t explored yet, so it could even give a larger speedup.
Further update: I did some benchmarks comparing CPU and GPU with the Qulacs plugin:
(tested on Intel Xeon Gold 5118 CPUs and an NVIDIA TITAN RTX GPU)
The testing script I used if you want to compare with other devices:
import time
import sys
import torch
import numpy as np
import pennylane as qml
device = torch.device('cuda:0')
torch.tensor([1, 2], device=device)
qubits = range(2, 26)
REPEATS = 50
N_LAYERS = 8
np.random.seed(23326)
for n in qubits:
VAR = 0.01 * np.random.randn(N_LAYERS, n, 3)
def layer(w):
for i in range(n):
qml.Rot(w[i, 0], w[i, 1], w[i, 2], wires=i)
for i in range(n1):
qml.CNOT(wires=[i, i+1])
qml.CNOT(wires=[n1, 0])
# =============== Qulacs GPU
dev = qml.device('qulacs.simulator', gpu=True, wires=n)
@qml.qnode(dev)
def circuit(weights):
for w in weights:
layer(w)
return qml.expval(qml.PauliZ(0))
time_start = time.time()
vals = []
for i in range(REPEATS):
val = circuit(VAR)
vals.append(val)
time_end = time.time()
gpu_duration = time_end  time_start
print(f'Qulacs GPU {n} qubits took {gpu_duration}s')
gpu_average = np.sum(vals) / len(vals)
# =============== Qulacs CPU
dev2 = qml.device('qulacs.simulator', wires=n)
@qml.qnode(dev2)
def qulacs_circuit(weights):
for w in weights:
layer(w)
return qml.expval(qml.PauliZ(0))
time_start = time.time()
vals = []
for i in range(REPEATS):
val = qulacs_circuit(VAR)
vals.append(val)
time_end = time.time()
cpu_duration = time_end  time_start
print(f'Qulacs CPU {n} qubits took {cpu_duration}s')
cpu_average = np.sum(vals) / len(vals)
assert np.allclose(gpu_average, cpu_average), 'GPU and CPU results dont agree'
Note that due to some weird bug with Qulacs, the torch.tensor
call is needed to make it recognize the GPU (https://github.com/qulacs/qulacs/issues/195). So the GPU scales nicely for higher number of qubits, while singlecore is best for 12 qubits and fewer.
Thanks for sharing the benchmarks @soudy!
Note that we’ve been having some issues with the discourse server lately, so some of the uploaded images are no longer available
Will this setup also be effective when I switch the interface to torch?
Hi @ycchen1989, yes it should be relatively similar between the default interface and the torch interface (most bottlenecks are in the simulation of the quantum circuits and computation of gradients).
Also note that in the latest version of pennylane, we’ve added an experimental device expt.tensornet.tf
which should provide faster gradient computation and training than the default plugin for most cases.
Hi, continuing on this thread I have some benchmarks regarding gradient computation comparing JIT vs no JIT (in Jax). Using backprop we get gradients faster ~3 orders of magnitude. Using a finite number of shots, with or without JIT, its quite a bit slower:
JIT True  shots None  time taken 5.002456e05
JIT False  shots None  time taken 2.144583e02
JIT True  shots 500  time taken 7.670196e02
JIT False  shots 500  time taken 7.490716e02
Here is the code snippet. I hope this helps to get a clear picture and for some of the future benchmarking:
import pennylane as qml
import jax
import jax.numpy as jnp
from jax import config
config.update("jax_enable_x64", True)
import time
def timeit(func, params):
"""Time the function.
Args:
func (Callable): A function to call.
params (array): The inputs to the function.
Returns:
float: The time taken to run the function.
"""
tic = time.perf_counter()
func(params)
toc = time.perf_counter()
return toc  tic
N = 5
variational_ansatz = qml.BasicEntanglerLayers
n_layers = 5
def get_grad_fn(shots=None, diff_method='best', jit=True):
"""Get the gradient function with a combination of shots, diff_method and JIT
Args:
shots (int, optional): Number of shots. Defaults to None.
diff_method (str, optional): The method to use for gradient computation.
Defaults to 'best'.
jit (bool, optional): Should the gradient function be JIT compiled.
Defaults to True.
Returns:
Callable: A callable function that computes gradients
"""
dev = qml.device("default.qubit.jax", wires=N, shots=shots)
variational_ansatz = qml.BasicEntanglerLayers
@jax.jit
@qml.qnode(dev, interface="jax", diff_method=diff_method)
def circuit(params: jnp.array):
"""Variational circuit that we want to optimize
Args:
params (jnp.array): _description_
Returns:
float: Expecatation value
"""
variational_ansatz(params, wires=range(N))
return qml.expval(qml.PauliZ(0))
grad_x = jax.grad(circuit)
if jit:
return jax.jit(grad_x)
else:
return grad_x
for shots in [None, 500]:
for jit in [True, False]:
grad_x = get_grad_fn(shots=shots, jit=jit)
key = jax.random.PRNGKey(42)
x = jax.random.uniform(key, variational_ansatz.shape(n_layers=n_layers, n_wires=N))
# Run the grad function once to compile
grad_x(x)
num_repeat = 100
times = np.empty(num_repeat)
for i in range(100):
x = jax.random.uniform(key, variational_ansatz.shape(n_layers=n_layers, n_wires=N))
times[i] = timeit(grad_x, x)
print(f"JIT {jit}  shots {shots}  time taken ", "{:e}".format(jnp.mean(times)))