Speeding up grad computation

Also, is it 2 new circuits per parameter only if that parameter corresponds to a single gate? If not, then the cost scales as 2 × (number of gates with parameters): the parameter-shift rule is applied to every gate containing that parameter, and the results are combined via the product rule.
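For concreteness: if a single parameter \theta enters the circuit through k gates, write the expectation as a function f(\theta_1, \ldots, \theta_k) of the k occurrences. The product rule then gives

    \frac{\partial f}{\partial \theta} = \sum_{j=1}^{k} \frac{\partial f}{\partial \theta_j} = \sum_{j=1}^{k} \frac{f(\ldots, \theta_j + \pi/2, \ldots) - f(\ldots, \theta_j - \pi/2, \ldots)}{2}

(assuming standard Pauli-rotation gates, where the shift is \pi/2), i.e. 2k circuit evaluations for that one parameter.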

Ah, that might be the key. In order to measure an arbitrary Hermitian observable A on a hardware device, the following process has to be undertaken:

  1. Calculate the unitary matrix U, whose columns are the orthonormal eigenvectors of A.

  2. Apply the unitary U^\dagger to the quantum state |\psi\rangle, just prior to measurement.

  3. From the resulting Pauli-Z measurement samples, calculate the probability |\langle i | U^\dagger |\psi\rangle|^2 of measuring each computational basis state |i\rangle.

  4. The resulting expectation value is given by:

    \langle \psi | A | \psi \rangle = \sum_i \lambda_i |\langle i | U^\dagger |\psi\rangle|^2

    where \lambda_i are the eigenvalues of A.
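As a sanity check, the whole procedure fits in a few lines of NumPy. This is a minimal sketch on an arbitrary single-qubit example, with the hardware sampling step replaced by exact state-vector probabilities:

import numpy as np

# An arbitrary Hermitian observable A and a state |psi>
A = np.array([[1.0, 1 - 1j],
              [1 + 1j, -1.0]])
psi = np.array([1.0, 1.0]) / np.sqrt(2)

# 1. Diagonalize A: the columns of U are its orthonormal eigenvectors
eigvals, U = np.linalg.eigh(A)

# 2. Apply U^dagger to the state just prior to "measurement"
rotated = U.conj().T @ psi

# 3. Probabilities of each computational basis state, |<i|U^dagger|psi>|^2
probs = np.abs(rotated) ** 2

# 4. Expectation value <psi|A|psi> = sum_i lambda_i |<i|U^dagger|psi>|^2
expval = np.sum(eigvals * probs)

assert np.allclose(expval, np.real(psi.conj() @ A @ psi))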

So some of the problem comes from the number of gates: though the number of parameters is O(10), the number of gates using these parameters is O(100).

The exponential issue with the pyqvm is less obvious to me. Maybe it comes from computing U^\dagger? I appreciate the description above, but I can't immediately see why it would slow down the computation so much.

- default is the default.qubit device
- pyqvm is the forest.qvm device
- entire is evaluating the whole Hamiltonian in a single circuit
- piecewise is splitting the Hamiltonian into pieces and evaluating separate circuits for "collisions" between expectations/variables (sketched below)
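For reference, here is a minimal sketch of the two strategies on a toy two-qubit Hamiltonian H = Z_0 Z_1 + 0.5 X_0 (the circuit and Hamiltonian are illustrative only):

import numpy as np
import pennylane as qml

dev = qml.device('default.qubit', wires=2)

# H = Z(0) Z(1) + 0.5 X(0) as an explicit 4x4 Hermitian matrix
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.kron(Z, Z) + 0.5 * np.kron(X, np.eye(2))

def ansatz(theta):
    qml.RY(theta, wires=0)
    qml.CNOT(wires=[0, 1])

# 'entire': one circuit measuring the whole Hamiltonian as an arbitrary Hermitian
@qml.qnode(dev)
def entire(theta):
    ansatz(theta)
    return qml.expval(qml.Hermitian(H, wires=[0, 1]))

# 'piecewise': one circuit per Pauli term, summed classically
@qml.qnode(dev)
def term_zz(theta):
    ansatz(theta)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

@qml.qnode(dev)
def term_x(theta):
    ansatz(theta)
    return qml.expval(qml.PauliX(0))

theta = 0.3
assert np.allclose(entire(theta), term_zz(theta) + 0.5 * term_x(theta))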

I also parallelized it and wrote my own gradient computation, given the large overhead from the number of gates, so these speeds will be very machine-dependent. I still think there is something funny going on with the arbitrary Hermitian evaluations in the pyqvm, so I will be interested to hear what you guys come up with. Thanks for all the info so far.
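One way to parallelize this is to farm out all the shifted-circuit evaluations at once. Below is only a sketch: evaluate stands for a hypothetical, picklable function that runs a single circuit with the given parameters and returns its expectation value (it is not a PennyLane API), and the \pi/2 shift assumes Pauli-rotation gates:

from concurrent.futures import ProcessPoolExecutor

import numpy as np

def parameter_shift_grad(evaluate, params, shift=np.pi / 2):
    """Estimate the gradient by running all shifted circuits in parallel."""
    # Build the 2 * len(params) shifted parameter vectors
    shifted = []
    for i in range(len(params)):
        plus, minus = params.copy(), params.copy()
        plus[i] += shift
        minus[i] -= shift
        shifted += [plus, minus]

    # Every shifted circuit is independent, so evaluate them concurrently;
    # `evaluate` is a hypothetical user-supplied circuit runner
    with ProcessPoolExecutor() as pool:
        values = np.array(list(pool.map(evaluate, shifted))).reshape(-1, 2)

    # Pair up the (+, -) evaluations: grad_i = (f_i^+ - f_i^-) / 2
    return (values[:, 0] - values[:, 1]) / 2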

This is my thought as well; however, the scaling of np.linalg.eigh would need to be investigated to see if this is in fact the cause. Another reason could be how pyQuil implements arbitrary unitaries: whether it applies them via direct matrix multiplication, or decomposes the unitary into gates first (which might have some cost associated with it).
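A quick way to check the first hypothesis is to time np.linalg.eigh directly on random Hermitian matrices of the sizes an n-qubit observable would have; eigh on a 2^n x 2^n matrix costs O(8^n), so it is at least a plausible exponential culprit:

import time
import numpy as np

for n in range(2, 12):
    dim = 2 ** n

    # Random Hermitian matrix of the size of an n-qubit observable
    M = np.random.randn(dim, dim) + 1j * np.random.randn(dim, dim)
    A = (M + M.conj().T) / 2

    tic = time.perf_counter()
    np.linalg.eigh(A)
    toc = time.perf_counter()
    print(f'{n} qubits (dim {dim}): eigh took {toc - tic:.4f}s')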

The plots look great! We are always looking to speed up the optimization and work on the performance of PennyLane, so if you have noticed anything while writing your own parallelized gradient computation, please feel free to submit a PR to the PennyLane GitHub repository :slight_smile:

I’ve encountered the same issue while training a QNN with high-dimensional state vectors. In an attempt to speed it up, I implemented a PennyLane device for Qulacs. It’s more than 2 times faster than the default PennyLane plugin (which had been the fastest in my experiments):

Encoding size: 10
Number of qubits: 10

Device: Default qubit PennyLane plugin
6.89 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Device: pyQVM NumpyWavefunction Simulator Device
437 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Qiskit PennyLane plugin
232 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: Qiskit PennyLane plugin
217 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Device: ProjectQ PennyLane plugin
10.2 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Device: Qulacs device
2.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(I don’t have QVM installed at the moment)

I put it on GitHub for others to use: https://github.com/soudy/pennylane-qulacs/. Note that it is still a very basic implementation tailored to my needs, but it can easily be expanded.


That is awesome, @soudy! Is the Qulacs plugin something you would consider open-sourcing? We could also link to it on https://pennylane.ai/plugins.html

Definitely, see my edit. So far only basic gates and expectation values of Paulis and arbitrary Hermitians are implemented. Maybe in the near future I’ll find some time to expand it, or someone else will, and then you can link to it.
I do think it’s a good first contender for a simple, faster back-end. Qulacs also has GPU and OpenMP support, which I haven’t explored yet, so it could give an even larger speedup.


Further update: I did some benchmarks comparing CPU and GPU with the Qulacs plugin:
[image: qnn_cpu_vs_gpu]
[image: qulacs_table]
(tested on Intel Xeon Gold 5118 CPUs and an NVIDIA TITAN RTX GPU)

The testing script I used if you want to compare with other devices:

import time
import torch
import numpy as np
import pennylane as qml

# Workaround for a Qulacs bug: something has to be loaded into GPU memory
# first, or Qulacs fails to recognize the GPU (see the note below the script)
device = torch.device('cuda:0')
torch.tensor([1, 2], device=device)

qubits = range(2, 26)

REPEATS = 50
N_LAYERS = 8

np.random.seed(23326)

for n in qubits:
    VAR = 0.01 * np.random.randn(N_LAYERS, n, 3)

    # One layer: a general rotation on every qubit followed by a ring of CNOTs
    def layer(w):
        for i in range(n):
            qml.Rot(w[i, 0], w[i, 1], w[i, 2], wires=i)

        for i in range(n-1):
            qml.CNOT(wires=[i, i+1])

        qml.CNOT(wires=[n-1, 0])


    # =============== Qulacs GPU
    dev = qml.device('qulacs.simulator', gpu=True, wires=n)


    @qml.qnode(dev)
    def circuit(weights):
        for w in weights:
            layer(w)

        return qml.expval(qml.PauliZ(0))


    time_start = time.time()
    vals = []
    for i in range(REPEATS):
        val = circuit(VAR)
        vals.append(val)

    time_end = time.time()
    gpu_duration = time_end - time_start
    print(f'Qulacs GPU {n} qubits took {gpu_duration}s')

    gpu_average = np.sum(vals) / len(vals)

    # =============== Qulacs CPU
    dev2 = qml.device('qulacs.simulator', wires=n)

    @qml.qnode(dev2)
    def qulacs_circuit(weights):
        for w in weights:
            layer(w)

        return qml.expval(qml.PauliZ(0))


    time_start = time.time()
    vals = []
    for i in range(REPEATS):
        val = qulacs_circuit(VAR)
        vals.append(val)

    time_end = time.time()
    cpu_duration = time_end - time_start
    print(f'Qulacs CPU {n} qubits took {cpu_duration}s')

    cpu_average = np.sum(vals) / len(vals)

    assert np.allclose(gpu_average, cpu_average), "GPU and CPU results don't agree"

Note that due to a weird bug in Qulacs, the torch.tensor call is needed to make it recognize the GPU (see CUDA error without loading something into GPU memory first · Issue #195 · qulacs/qulacs · GitHub). So the GPU scales nicely for higher numbers of qubits, while single-core CPU is best for 12 qubits and fewer.

Thanks for sharing the benchmarks @soudy!

Note that we’ve been having some issues with the discourse server lately, so some of the uploaded images are no longer available :frowning:

Will this setup also be effective when I switch the interface to torch?

Hi @ycchen1989, yes, it should be relatively similar between the default interface and the torch interface (most bottlenecks are in the simulation of the quantum circuits and the computation of gradients).

Also note that in the latest version of PennyLane, we’ve added an experimental device, expt.tensornet.tf, which should provide faster gradient computation and training than the default plugin in most cases.
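For example, switching is mostly a matter of the device and interface arguments. A small sketch (the ansatz and shapes are illustrative; the experimental device name is as given above and depends on your installed PennyLane version):

import pennylane as qml
import torch

# Torch interface on the default device: the QNode accepts and returns
# torch tensors, and gradients flow through torch's autograd
dev = qml.device("default.qubit", wires=4)

@qml.qnode(dev, interface="torch")
def circuit(weights):
    qml.BasicEntanglerLayers(weights, wires=range(4))
    return qml.expval(qml.PauliZ(0))

weights = torch.randn(2, 4, requires_grad=True)  # (n_layers, n_wires)
circuit(weights).backward()  # gradients land in weights.grad

# The experimental tensor-network device would be created the same way:
# dev_tn = qml.device("expt.tensornet.tf", wires=4)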


Hi, continuing this thread, I have some benchmarks of gradient computation comparing JIT vs. no JIT in JAX. Using backprop, we get gradients ~3 orders of magnitude faster with JIT. Using a finite number of shots, with or without JIT, it’s quite a bit slower:

JIT True | shots None | time taken 5.002456e-05
JIT False | shots None | time taken 2.144583e-02

JIT True | shots 500 | time taken 7.670196e-02
JIT False | shots 500 | time taken 7.490716e-02

Here is the code snippet; I hope it helps give a clearer picture and is useful for future benchmarking:

import time

import jax
import jax.numpy as jnp
import numpy as np
import pennylane as qml
from jax import config

config.update("jax_enable_x64", True)


def timeit(func, params):
    """Time the function.

    Args:
        func (Callable): A function to call.
        params (array): The inputs to the function.

    Returns:
        float: The time taken to run the function.
    """
    tic = time.perf_counter()
    func(params)
    toc = time.perf_counter()
    return toc - tic


N = 5
variational_ansatz = qml.BasicEntanglerLayers
n_layers = 5


def get_grad_fn(shots=None, diff_method='best', jit=True):
    """Get the gradient function with a combination of shots, diff_method and JIT

    Args:
        shots (int, optional): Number of shots. Defaults to None.
        diff_method (str, optional): The method to use for gradient computation.
                                     Defaults to 'best'.
        jit (bool, optional): Should the gradient function be JIT compiled.
                              Defaults to True.

    Returns:
        Callable: A callable function that computes gradients
    """
    dev = qml.device("default.qubit.jax", wires=N, shots=shots)

    @jax.jit
    @qml.qnode(dev, interface="jax", diff_method=diff_method)
    def circuit(params: jnp.ndarray):
        """Variational circuit that we want to optimize.

        Args:
            params (jnp.ndarray): The variational circuit parameters.

        Returns:
            float: Expectation value
        """
        variational_ansatz(params, wires=range(N))
        return qml.expval(qml.PauliZ(0))

    grad_x = jax.grad(circuit)

    if jit:
        return jax.jit(grad_x)
    else:
        return grad_x


for shots in [None, 500]:
    for jit in [True, False]:
        grad_x = get_grad_fn(shots=shots, jit=jit)

        key = jax.random.PRNGKey(42)
        x = jax.random.uniform(key, variational_ansatz.shape(n_layers=n_layers, n_wires=N))

        # Run the grad function once to compile
        grad_x(x)

        num_repeat = 100
        times = np.empty(num_repeat)

        for i in range(num_repeat):
            # Split the key so each repetition sees fresh random parameters
            key, subkey = jax.random.split(key)
            x = jax.random.uniform(subkey, variational_ansatz.shape(n_layers=n_layers, n_wires=N))
            times[i] = timeit(grad_x, x)

        print(f"JIT {jit} | shots {shots} | time taken ", "{:e}".format(jnp.mean(times)))