Batching Inputs to Quantum Circuit

Is there a way to batch inputs to a circuit? I’m trying to use Pennylane to train a circuit to do image classification on MNIST, and as it is right now for each batch I pass each input individually, gather their gradients together, and then do an optimizer step. According to back of the envelope calculations, to get through the 60000 / 128 batches it will take around 16 hours. I can post code if need be.

Thanks so much!

Hi kushkhosla,

At the moment, there is no direct way to batch inputs. Obviously batching of outputs can be done by asking for more shots, but no such trick works for inputs. What you are doing now is the recommended method.

If the device backends supported batching natively (either hardware or software), we could potentially add functionality to PennyLane to leverage this. Alternatively, we could implement batching internally in PennyLane itself as a convenience to the user.

The second one is more likely to be implemented sooner (since it is under our control), but might also necessity larger compute power in order to run faster (e.g., parallel computing).

I know this doesn’t help with your current problem, but it gives you an idea of what we’re thinking.

1 Like

Hi @nathan, This is an important feature so I hope you will provide this soon.

Agreed @cubicgate,

Unfortunately, there isn’t really a pattern in quantum computing of batching inputs like there is in machine learning. It’s not a feature supported by most of the frameworks our plugins rely on. For this reason, we are thinking of ways we can build it into a framework ourselves.

Agree @nathan, Thanks!

Hi,

I had a similar problem when working with TensorFlow. I was having performance issues when I was testing my models on larger data sets. The main bottleneck is that batch training using the GPU is not really possible with the PennyLane-TensorFlow Eager interface.

To solve this I wrote my own Device class that uses only TensorFlow operations. By popping operations from the QNode circuit queue and translating those operations to Tensorflow Ops, you get a model that can be executed on the GPU (with batches).

What I have now is quite hacky, and I’m quite sure that it violates some of the design principles of PennyLane, but it seems to work so far. More testing is required.

1 Like

Hi @rooler ,
Sounds cool. As we’re still thinking about how to implement this feature, it might be useful to see how you managed it. Feel free to share any code (even if incomplete)!

Can I just paste my device class and driver code here?

Sure, that’s a good start.

Setup:

Python: 3.7
pennylane: dev-0.5
tensorflow: 1.14 with CUDA: 10.0

GPU: GeForce GTX 950M


This is default_qubit_par.py located in pennylane.plugins. I have yet to implement variance, BasisState and sampling.

import warnings

import numpy as np
from scipy.linalg import eigh

from pennylane import Device
import tensorflow as tf

# tolerance for numerical errors
tolerance = 1e-10


# ========================================================
#  utilities
# ========================================================

def tf_kron(a: tf.Tensor, b: tf.Tensor) -> tf.Tensor:
    """
    Implementation of Kronecker product for tensorflow
    Args:
        a: tf.Tensor of size N x M
        b: tf.Tensor of size P x K

    Returns: tf.Tensor of size N * P x M * K

    """
    a_shape = [a.shape[0].value, a.shape[1].value]
    b_shape = [b.shape[0].value, b.shape[1].value]
    return tf.reshape(tf.reshape(a, [a_shape[0], 1, a_shape[1], 1]) * tf.reshape(b, [1, b_shape[0], 1, b_shape[1]]),
                      [a_shape[0] * b_shape[0], a_shape[1] * b_shape[1]])


def spectral_decomposition(A):
    r"""Spectral decomposition of a Hermitian matrix.

    Args:
        A (array): Hermitian matrix

    Returns:
        (vector[float], list[array[complex]]): (a, P): eigenvalues and hermitian projectors
            such that :math:`A = \sum_k a_k P_k`.
    """
    d, v = eigh(A)
    P = []
    for k in range(d.shape[0]):
        temp = v[:, k]
        P.append(np.outer(temp, temp.conj()))
    return d, P


# ========================================================
#  fixed gates
# ========================================================

I = tf.convert_to_tensor(np.eye(2), dtype=tf.complex128)
# Pauli matrices
X = tf.convert_to_tensor(np.array([[0, 1], [1, 0]]), dtype=tf.complex128)  #: Pauli-X matrix
Y = tf.convert_to_tensor(np.array([[0, -1j], [1j, 0]]), dtype=tf.complex128)  #: Pauli-Y matrix
Z = tf.convert_to_tensor(np.array([[1, 0], [0, -1]]), dtype=tf.complex128)  #: Pauli-Z matrix

H = tf.convert_to_tensor(np.array([[1, 1], [1, -1]]) / np.sqrt(2), dtype=tf.complex128)  #: Hadamard gate
# Two qubit gates
CNOT = tf.convert_to_tensor(np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]),
                            dtype=tf.complex128)  #: CNOT gate
SWAP = tf.convert_to_tensor(np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]),
                            dtype=tf.complex128)  #: SWAP gate
CZ = tf.convert_to_tensor(np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]),
                          dtype=tf.complex128)  #: CZ gate


# ========================================================
#  parametrized gates
# ========================================================

def Rphi(phi) -> tf.Tensor:
    r"""One-qubit phase shift.

    Args:
        phi (float): phase shift angle
    Returns:
        array: unitary 2x2 phase shift matrix
    """
    return tf.convert_to_tensor([[1, 0], [0, np.exp(1j * phi)]], dtype=tf.complex128)


def Rotx(theta) -> tf.Tensor:
    r"""One-qubit rotation about the x axis.

    Args:
        theta (float): rotation angle
    Returns:
        array: unitary 2x2 rotation matrix :math:`e^{-i \sigma_x \theta/2}`
    """
    return tf.math.cos(theta / 2) * I + 1j * tf.math.sin(-theta / 2) * X


def Roty(theta) -> tf.Tensor:
    r"""One-qubit rotation about the y axis.

    Args:
        theta (float): rotation angle
    Returns:
        array: unitary 2x2 rotation matrix :math:`e^{-i \sigma_y \theta/2}`
    """
    return tf.math.cos(theta / 2) * I + 1j * tf.math.sin(-theta / 2) * Y


def Rotz(theta) -> tf.Tensor:
    r"""One-qubit rotation about the z axis.

    Args:
        theta (float): rotation angle
    Returns:
        array: unitary 2x2 rotation matrix :math:`e^{-i \sigma_z \theta/2}`
    """
    return tf.math.cos(theta / 2) * I + 1j * tf.math.sin(-theta / 2) * Z


def Rot3(params) -> tf.Tensor:
    r"""Arbitrary one-qubit rotation using three Euler angles.

    Args:
        a,b,c (float): rotation angles
    Returns:
        array: unitary 2x2 rotation matrix ``rz(c) @ ry(b) @ rz(a)``
    """
    a = params[0]
    b = params[1]
    c = params[2]
    return Rotz(c) @ (Roty(b) @ Rotz(a))


def CRotx(theta) -> tf.Tensor:
    r"""Two-qubit controlled rotation about the x axis.

    Args:
        theta (float): rotation angle
    Returns:
        array: unitary 4x4 rotation matrix :math:`|0\rangle\langle 0|\otimes \mathbb{I}+|1\rangle\langle 1|\otimes R_x(\theta)`
    """
    return tf_kron(tf.convert_to_tensor(np.array([[1, 0], [0, 0]]), dtype=tf.complex128), I) + \
           tf_kron(tf.convert_to_tensor(np.array([[0, 0], [0, 1]]), dtype=tf.complex128), Rotx(theta))


def CRoty(theta) -> tf.Tensor:
    r"""Two-qubit controlled rotation about the y axis.

    Args:
        theta (float): rotation angle
    Returns:
        array: unitary 4x4 rotation matrix :math:`|0\rangle\langle 0|\otimes \mathbb{I}+|1\rangle\langle 1|\otimes R_y(\theta)`
    """
    return tf_kron(tf.convert_to_tensor(np.array([[1, 0], [0, 0]]), dtype=tf.complex128), I) + \
           tf_kron(tf.convert_to_tensor(np.array([[0, 0], [0, 1]]), dtype=tf.complex128), Roty(theta))


def CRotz(theta) -> tf.Tensor:
    r"""Two-qubit controlled rotation about the z axis.

    Args:
        theta (float): rotation angle
    Returns:
        array: unitary 4x4 rotation matrix :math:`|0\rangle\langle 0|\otimes \mathbb{I}+|1\rangle\langle 1|\otimes R_z(\theta)`
    """
    return tf_kron(tf.convert_to_tensor(np.array([[1, 0], [0, 0]]), dtype=tf.complex128), I) + \
           tf_kron(tf.convert_to_tensor(np.array([[0, 0], [0, 1]]), dtype=tf.complex128), Rotz(theta))


def CRot3(params) -> tf.Tensor:
    r"""Arbitrary two-qubit controlled rotation using three Euler angles.

    Args:
        a,b,c (float): rotation angles
    Returns:
        array: unitary 4x4 rotation matrix :math:`|0\rangle\langle 0|\otimes \mathbb{I}+|1\rangle\langle 1|\otimes R(a,b,c)`
    """
    return tf_kron(tf.convert_to_tensor(np.array([[1, 0], [0, 0]]), dtype=tf.complex128), I) + \
           tf_kron(tf.convert_to_tensor(np.array([[0, 0], [0, 1]]), dtype=tf.complex128), Rot3(params))


# ========================================================
#  Arbitrary states and operators
# ========================================================

def unitary(*args):
    r"""Input validation for an arbitary unitary operation.

    Args:
        args (array): square unitary matrix

    Returns:
        array: square unitary matrix
    """
    U = np.asarray(args[0])

    if U.shape[0] != U.shape[1]:
        raise ValueError("Operator must be a square matrix.")

    if not np.allclose(U @ U.conj().T, np.identity(U.shape[0])):
        raise ValueError("Operator must be unitary.")

    return tf.convert_to_tensor(U, dtype=tf.complex128)


def hermitian(*args):
    r"""Input validation for an arbitary Hermitian expectation.

    Args:
        args (array): square hermitian matrix

    Returns:
        array: square hermitian matrix
    """
    A = np.asarray(args[0])

    if A.shape[0] != A.shape[1]:
        raise ValueError("Expectation must be a square matrix.")

    if not np.allclose(A, A.conj().T):
        raise ValueError("Expectation must be Hermitian.")

    return tf.convert_to_tensor(A, dtype=tf.complex128)


# ========================================================
#  device
# ========================================================


class DefaultQubitPar(Device):
    """Default qubit device for PennyLane.

    Args:
        wires (int): the number of modes to initialize the device in
        shots (int): How many times the circuit should be evaluated (or sampled) to estimate
            the expectation values. A value of 0 yields the exact result.
    """
    name = 'Default qubit PennyLane plugin with tensorflow parallelization'
    short_name = 'default.qubit_par'
    pennylane_requires = '0.5'
    version = '0.4.0'
    author = 'Xanadu Inc.'

    # Note: BasisState and QubitStateVector don't
    # map to any particular function, as they modify
    # the internal device state directly.
    _operation_map = {
        'BasisState': None,
        'QubitStateVector': None,
        'QubitUnitary': unitary,
        'PauliX': X,
        'PauliY': Y,
        'PauliZ': Z,
        'Hadamard': H,
        'CNOT': CNOT,
        'SWAP': SWAP,
        'CZ': CZ,
        'PhaseShift': Rphi,
        'RX': Rotx,
        'RY': Roty,
        'RZ': Rotz,
        'Rot': Rot3,
        'CRX': CRotx,
        'CRY': CRoty,
        'CRZ': CRotz,
        'CRot': CRot3
    }

    _observable_map = {
        'PauliX': X,
        'PauliY': Y,
        'PauliZ': Z,
        'Hadamard': H,
        'Hermitian': hermitian,
        'Identity': I
    }

    def __init__(self, wires, *, shots=0):
        super().__init__(wires, shots)
        self.eng = None
        self._state = None

    def pre_apply(self):
        self.reset()

    def apply(self, operation, wires, par):
        assert isinstance(par, (
            tf.Tensor, tf.Variable)), "Parameters must be TF Tensor or TF Variable, got {} instead".format(type(par))
        par = tf.to_complex128(par)
        if operation == 'QubitStateVector':
            state = par
            if len(state.shape) == 1 and state.shape[0] == 2 ** self.num_wires:
                self._state = state
            else:
                raise ValueError('State vector must be of length 2**wires.')
            if wires is not None and wires != [] and list(wires) != list(range(self.num_wires)):
                raise ValueError(
                    "The default.qubit plugin can apply QubitStateVector only to all of the {} wires.".format(
                        self.num_wires))
            return
        if operation == 'BasisState':
            raise NotImplementedError
        #     n = len(par[0])
        #     # get computational basis state number
        #     if n > self.num_wires or not (set(par[0]) == {0, 1} or set(par[0]) == {0} or set(par[0]) == {1}):
        #         raise ValueError(
        #             "BasisState parameter must be an array of 0 or 1 integers of length at most {}.".format(
        #                 self.num_wires))
        #     if wires is not None and wires != [] and list(wires) != list(range(self.num_wires)):
        #         raise ValueError(
        #             "The default.qubit plugin can apply BasisState only to all of the {} wires.".format(self.num_wires))
        #
        #     num = int(np.sum(np.array(par[0]) * 2 ** np.arange(n - 1, -1, -1)))
        #
        #     self._state = np.zeros_like(self._state)
        #     self._state[num] = 1.
        #     self._state[num] = tf.convert_to_tensor(self._state[num], stype=tf.complex128)
        #     return

        A = self._get_operator_matrix(operation, par)
        self._state = self.mat_vec_product(A, self._state, wires)

    def mat_vec_product(self, mat, vec, wires):
        r"""Apply multiplication of a matrix to subsystems of the quantum state.

        Args:
            mat (array): matrix to multiply
            vec (array): state vector to multiply
            wires (Sequence[int]): target subsystems

        Returns:
            array: output vector after applying ``mat`` to input ``vec`` on specified subsystems
        """
        # TODO: use multi-index vectors/matrices to represent states/gates internally
        mat = tf.reshape(mat, [2] * len(wires) * 2)
        vec = tf.reshape(vec, [2] * self.num_wires)
        axes = (np.arange(len(wires), 2 * len(wires)), wires)
        tdot = tf.tensordot(mat, vec, axes=axes)

        # tensordot causes the axes given in `wires` to end up in the first positions
        # of the resulting tensor. This corresponds to a (partial) transpose of
        # the correct output state
        # We'll need to invert this permutation to put the indices in the correct place
        unused_idxs = [idx for idx in range(self.num_wires) if idx not in wires]
        perm = wires + unused_idxs
        inv_perm = np.argsort(perm)  # argsort gives inverse permutation
        state_multi_index = tf.transpose(tdot, inv_perm)
        return tf.reshape(state_multi_index, [2 ** self.num_wires, ])

    def expval(self, observable, wires, par):
        if self.shots == 0:
            # exact expectation value
            A = self._get_operator_matrix(observable, par)
            ev = self.ev(A, wires)
        else:
            # estimate the ev
            ev = tf.reduce_mean(self.sample(observable, wires, par, self.shots))

        return ev

    def var(self, observable, wires, par):
        return NotImplementedError
        # if self.shots == 0:
        #     # exact expectation value
        #     A = self._get_operator_matrix(observable, par)
        #     var = self.ev(A @ A, wires) - self.ev(A, wires) ** 2
        # else:
        #     # estimate the ev
        #     var = np.var(self.sample(observable, wires, par, self.shots))
        #
        # return var

    def sample(self, observable, wires, par, n=None):
        return NotImplementedError
        # if n is None:
        #     n = self.shots
        #
        # if n == 0:
        #     raise ValueError("Calling sample with n = 0 is not possible.")
        # if n < 0 or not isinstance(n, int):
        #     raise ValueError("The number of samples must be a positive integer.")
        #
        # A = self._get_operator_matrix(observable, par)
        # a, P = spectral_decomposition(A)
        #
        # p = np.zeros(a.shape)
        # for idx, Pi in enumerate(P):
        #     p[idx] = self.ev(Pi, wires)
        #
        # return np.random.choice(a, n, p=p)

    def _get_operator_matrix(self, operation, par):
        """Get the operator matrix for a given operation or observable.

        Args:
          operation    (str): name of the operation/observable
          par (tuple[float]): parameter values
        Returns:
          array: matrix representation.
        """
        A = {**self._operation_map, **self._observable_map}[operation]
        if not callable(A):
            return A
        return A(par)

    def ev(self, A, wires):
        r"""Expectation value of observable on specified wires.

         Args:
          A (array[float]): the observable matrix as array
          wires (Sequence[int]): target subsystems
         Returns:
          float: expectation value :math:`\expect{A} = \bra{\psi}A\ket{\psi}`
        """
        As = self.mat_vec_product(A, self._state, wires)
        # expectation = np.vdot(self._state, As)
        expectation = tf.tensordot(tf.conj(self._state), As, axes=1)

        # if tf.abs(tf.imag(expectation)) > tolerance:
        #     warnings.warn('Nonvanishing imaginary part {} in expectation value.'.format(expectation.imag),
        #                   RuntimeWarning)
        return tf.real(expectation)

    def reset(self):
        """Reset the device"""
        # init the state vector to |00..0>
        self._state = np.zeros(2 ** self.num_wires, dtype=complex)
        self._state[0] = 1
        self._state = tf.convert_to_tensor(self._state, dtype=tf.complex128)

    @property
    def operations(self):
        return set(self._operation_map.keys())

    @property
    def observables(self):
        return set(self._observable_map.keys())

Example two qubit circuit that uses this Device:

import pennylane as qml
from pennylane import numpy as np
from pennylane.plugins.default_qubit_par import DefaultQubitPar
from pennylane.ops.qubit import QubitStateVector
import tensorflow as tf

# get the standard device and parallel device
dev = qml.device('default.qubit', wires=2)
par_dev = DefaultQubitPar(wires=2)

# simple test circuit
def circuit(var, state=None):
    QubitStateVector(state, wires=[0,1])
    qml.RX(var[0], wires=0)
    qml.RY(var[1], wires=1)
    qml.CNOT(wires=[0,1])
    return qml.expval(qml.PauliZ(0))

# construct the circuit using the standard device QNode
circ = qml.QNode(circuit, device=dev)
circ.construct([(1, 2)], {'state': np.array([1,0,0,0])})

# graph inputs
inputs = tf.placeholder(tf.complex128, [None, 4], name='inputs')

def construct_graph_from_circuit(input_state):
    """
    Pop operations from the queue and convert them into a TF graph using the default_dev_par class

    Args:
        input_state: data vector

    Returns: expectation value of the quantum cicuit.
    """
    # reset state
    par_dev.pre_apply()
    for op in circ.queue:
        # encode data into qubit state vector
        if op.name == 'QubitStateVector':
            input_state /= tf.norm(input_state)
            par_dev.apply(op.name, op.wires, input_state)

        elif op.name == 'BasisState':
            raise NotImplementedError
        # gate parameters are TF vars
        else:
            par_dev.apply(op.name, op.wires, tf.Variable(initial_value=op.parameters, dtype=tf.float64))
    # only works for single EV at the moment
    measurement = circ.ev[0]
    return par_dev.expval(measurement.name, measurement.wires, measurement.parameters)

# map function onto inputs in parallel.
outputs = tf.map_fn(construct_graph_from_circuit, inputs, parallel_iterations=120, dtype=tf.double)

# simple example loss function
loss = tf.abs(outputs)
# optimizer
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
# Perform minimization with respect to the trainable variables.
train_step = opt.minimize(loss, var_list=[var for var in tf.trainable_variables()])

# Tensorflow session
sess = tf.Session(config=tf.ConfigProto(log_device_placement=False))
sess.run(tf.initialize_all_variables())

# random data
x = np.random.randn(10000, 4)

# use the standard PennyLane circuit so that we can compare
dev_circ = qml.QNode(circuit, device=dev)
# after one training step the parameters are updated so TF and PennyLane should not agree anymore
for i in range(2):
    res, o, = sess.run([train_step, outputs, ], feed_dict={inputs: x,})
    pennylane_out = np.array([dev_circ((1, 2), state=x[j] / np.linalg.norm(x[j])) for j in range(x.shape[0])])
    print("At step {}, PennyLane and TF agree: {}".format(i, all(np.isclose(pennylane_out, o))))

I have checked a couple of things between the PennyLane and TF device:

  • Output from some test circuits is the same (single and this double qubit example)
  • Parameterized gate matrices are the same
  • When running Tensorboard it shows that most of the computations are done on the GPU

Let me know what you think.

1 Like

Hi @rooler,

Looks very cool! Thanks for sharing the code. Aside from the ability to support batches, where did you find the biggest performance improvements when implementing the quantum ops in TF?

I know @josh has some thoughts about this as well.

Well, I think that enabling batching is the most important reason to use the TF ops. However, when I am not batching inputs and using the CPU, there still seems to be a performance increase with respect to PennyLane. I considered the following examples:

  1. Create a quantum circuit with 2 wires, initalize |0\rangle, apply N RX-gates and obtain \langle \sigma_0^z\rangle. Feed a vector x with N parameters to the circuit as gate parameters, and calculate the average execution time over 100 runs.

  2. Create a quantum circuit with N wires, do a state preparation |\psi\rangle and obtain \langle \sigma_0^z\rangle after applying two single qubit gates and a CNOT. Feed a vector x of size 2^N into |\psi\rangle and calculate the average execution time over 100 runs.

The vectors are fed sequentially to both the PennyLane and TF-PennyLane circuits, with a python for loop. Also, I verify that the outputs match. The following figures show an increase in performance with the TensorFlow interface:
pennylane_tf_comp_gates
Figure 1. Average circuit execution time for example 1. Tensorflow is faster when applying gate operations that are parametrized by an input variable x, even as the number of gates increases.
pennylane_tf_comp_state
Figure 2. Average circuit execution time for example 2. TensorFlow remains faster for large state vectors.

I am not yet sure where this improvement in performance comes from. Since Numpy and TF should be comparable in terms of performance for linear algebra stuff, some overhead must be incurred in the PennyLane interface. My code strips away this interface, and reduces the circuit to feeding a single (or batch) numpy array into a TF computational graph.

When I have the time, I will try to do some profiling to determine what could be improved.

Code to reproduce Figure 1:

import pennylane as qml
from pennylane import numpy as np
from pennylane.plugins.default_qubit_par import DefaultQubitPar
import tensorflow as tf
import matplotlib.pyplot as plt

import time


def construct_graph_from_circuit(input, quantum_circuit, parallel_device):
    """
    Pop operations from the queue and convert them into a TF graph using the default_dev_par class

    Args:
        input: data vector

    Returns: expectation value of the quantum cicuit.

    """
    # reset state
    parallel_device.pre_apply()
    for i, op in enumerate(quantum_circuit.queue):
        # if RX gate, add parameter to graph
        if len(op.parameters) == 1:
            parallel_device.apply(op.name, op.wires, input[i])
        # its a CNOT
        else:
            parallel_device.apply(op.name, op.wires, tf.zeros(1))
    # only works for single EV at the moment
    measurement = quantum_circuit.ev[0]
    return parallel_device.expval(measurement.name, measurement.wires, measurement.parameters)


grid = np.arange(1, 50, 5)
grid_time_tf = []
grid_time_pl = []
burn_in = 10
nsamples = 100 + burn_in

for ngates in grid:
    # get the standard device and parallel device
    dev = qml.device('default.qubit', wires=2)
    par_dev = DefaultQubitPar(wires=2)

    # simple test circuit
    def circuit(var):
        for i in range(len(var)):
            qml.RX(var[i], wires=0)
        qml.CNOT(wires=[0, 1])
        return qml.expval(qml.PauliZ(0))

    # construct the circuit using the standard device QNode
    circ = qml.QNode(circuit, device=dev, cache=True)
    circ.construct([np.random.randn(ngates)])
    # random data
    x = np.random.randn(nsamples, ngates)
    # initalize stuff
    tf_out = []
    pl_out = []
    start = None
    with tf.device('CPU:0'):
        # make tf placeholder for data input
        inputs = tf.placeholder(tf.complex128, [ngates], name='inputs')
        # construct graph
        outputs = construct_graph_from_circuit(inputs, circ, par_dev)
        # Tensorflow session
        sess = tf.Session(config=tf.ConfigProto(log_device_placement=False))
        sess.run(tf.initialize_all_variables())
        t = []
        # calculate average over several runs
        for i in range(nsamples):
            # exclude graph construction and optimizations
            if i >= burn_in:
                start = time.time()
            tf_out.append(sess.run([outputs], feed_dict={inputs: x[i]})[0])
            if i >= burn_in:
                t.append(time.time() - start)
        # store time
        grid_time_tf.append(np.mean(t))
    t = []
    for i in range(nsamples):
        # exclude optimizations
        if i >= burn_in:
            start = time.time()
        pl_out.append(circ(x[i]))
        if i >= burn_in:
            t.append(time.time() - start)
    # store time
    grid_time_pl.append(np.mean(t))

    assert np.all(np.isclose(tf_out, pl_out)), "TensorFlow and PennyLane do not agree!"

pl_c = np.array([64, 92, 77]) / 255
tf_c = np.array([255, 144, 0]) / 255
plt.plot(grid, grid_time_tf, c=tf_c, label='TensorFlow')
plt.plot(grid, grid_time_pl, c=pl_c, label='PennyLane')
plt.xlabel('$N_{gates}$')
plt.ylabel('mean time elapsed (seconds)')
plt.yscale('log')
plt.xticks(grid, grid)
plt.title('{} runs of QCircuit'.format(nsamples - burn_in))
plt.legend()
plt.show()


Code to reproduce Figure 2:

import pennylane as qml
from pennylane import numpy as np
from pennylane.plugins.default_qubit_par import DefaultQubitPar
from pennylane.ops.qubit import QubitStateVector
import tensorflow as tf
import matplotlib.pyplot as plt

import time


def construct_graph_from_circuit(input_state, quantum_circuit, parallel_device):
    """
    Pop operations from the queue and convert them into a TF graph using the default_dev_par class

    Args:
        input_state: data vector

    Returns: expectation value of the quantum cicuit.

    """
    # reset state
    parallel_device.pre_apply()
    for op in quantum_circuit.queue:
        # encode data into qubit state vector
        if op.name == 'QubitStateVector':
            input_state /= tf.norm(input_state)
            parallel_device.apply(op.name, op.wires, input_state)

        elif op.name == 'BasisState':
            raise NotImplementedError
        # gate parameters are TF vars
        else:
            parallel_device.apply(op.name, op.wires, tf.Variable(initial_value=op.parameters, dtype=tf.float64))
    # only works for single EV at the moment
    measurement = quantum_circuit.ev[0]
    return parallel_device.expval(measurement.name, measurement.wires, measurement.parameters)


grid = list(range(2, 16))
grid_time_tf = []
grid_time_pl = []
burn_in = 10
nsamples = 100 + burn_in

for nwires in grid:
    # get the standard device and parallel device
    dev = qml.device('default.qubit', wires=nwires)
    par_dev = DefaultQubitPar(wires=nwires)


    # simple test circuit
    def circuit(var, state=None):
        QubitStateVector(state, wires=list(range(nwires)))
        qml.RX(var[0], wires=0)
        qml.RY(var[1], wires=1)
        qml.CNOT(wires=[0, 1])
        return qml.expval(qml.PauliZ(0))


    # construct the circuit using the standard device QNode
    circ = qml.QNode(circuit, device=dev, cache=True)
    phi = np.zeros(2 ** nwires)
    circ.construct([(1, 2)], {'state': phi})
    # random data
    x = np.random.randn(nsamples, 2 ** nwires)
    # initalize stuff
    tf_out = []
    pl_out = []
    start = None
    with tf.device('CPU:0'):
        # graph inputs, feed on vector at a time
        inputs = tf.placeholder(tf.complex128, [2 ** nwires], name='inputs')
        # construct graph
        outputs = construct_graph_from_circuit(inputs, circ, par_dev)
        # Tensorflow session
        sess = tf.Session(config=tf.ConfigProto(log_device_placement=False))
        sess.run(tf.initialize_all_variables())
        t = []
        for i in range(nsamples):
            # exclude graph construction and optimizations
            if i >= burn_in:
                start = time.time()
            tf_out.append(sess.run([outputs], feed_dict={inputs: x[i], })[0])
            if i >= burn_in:
                t.append(time.time() - start)
        # store time
        grid_time_tf.append(np.mean(t))
    t = []
    for i in range(nsamples):
        # exclude optimizations
        if i >= burn_in:
            start = time.time()
        pl_out.append(circ((1, 2), state=x[i] / np.linalg.norm(x[i])))
        if i >= burn_in:
            t.append(time.time() - start)
    # store time
    grid_time_pl.append(np.mean(t))

    assert np.all(np.isclose(tf_out, pl_out)), "TensorFlow and PennyLane do not agree!"

pl_c = np.array([64, 92, 77]) / 255
tf_c = np.array([255, 144, 0]) / 255
plt.plot(grid, grid_time_tf, c=tf_c, label='TensorFlow')
plt.plot(grid, grid_time_pl, c=pl_c, label='PennyLane')
plt.xlabel('$N_{wires}$')
plt.ylabel('mean time elapsed (seconds)')
plt.yscale('log')
plt.xticks(grid, grid)
plt.title('{} runs of QCircuit'.format(nsamples - burn_in))
plt.legend()
plt.show()

Hi @rooler,

The default.qubit device was never designed for speed. It was just meant as a reference plugin.

However, despite all the other plugins/simulators available, we’ve observed lots of users still use that plugin. We are looking for ways to speed it up further in the future, but it is definitely not highly optimized. :laughing:

By the way, forgot to mention: if you do end up doing any profiling, we’d be happy to hear your conclusions or feedback.

Hi,

As you mentioned above, a parallelized QPU seems to be far away, so I understand that batching is not really a priority for quantum simulators right now. Nonetheless, I think that offering a plugin that would be capable of concurrent simulation would be really cool.

By the way, forgot to mention: if you do end up doing any profiling, we’d be happy to hear your conclusions or feedback.

I will do that! I will also try to tackle some of the current Github issues when I have the time.