Running quantum circuits on a GPU

Hello all,

I am having some difficulties trying to run a standard QNode in a hybrid neural network (the same as in the QML "Turning quantum nodes into Keras Layers" tutorial) on the GPU instead of the CPU. Is there an easy way to do this?

P.S. I have tried using CUDA, but the @jit decorator doesn't work for quantum nodes.

Thank you in advance

Hi @NikSchet, thanks for the question. Do you have some example code we can look at to help identify the problem?

Thank you for your reply. Yes, please check my code here:

The problem I am ultimately trying to solve is to increase the number of qubits I am using to 25, because my new dataset has 25 features. The maximum I am currently able to use is 17 qubits on a desktop computer (GPU: NVIDIA 3060, CPU: AMD Ryzen 7).

Hi @NikSchet, I was able to get your code to run on my GPU (a 1060) with minimal changes.

I will attach a modified notebook with some comments: PennyLane_GPU_Example.txt (22.1 KB) (rename from .txt to .ipynb).

Some of the changes are as follows (a combined sketch follows this list):

  • I added the following to allow selective control of the GPU device. 0 should be your default GPU, and -1 should hide the GPU from TensorFlow:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
  • Next, I added @tf.function to ensure the QNode is JIT compiled:
@tf.function
@qml.qnode(dev, interface="tf", diff_method="backprop")
def qnode(inputs, weights):
  • I converted all of the data in Section 6 to np.float32.
  • In addition, the following lines allow selective choice between CPU and GPU (pick whichever you wish to run on):
with tf.device('/device:CPU:0'):
with tf.device('/device:GPU:0'):
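
Putting the pieces together, a minimal sketch of these changes might look like the following. Note that the circuit body, qubit count, and data shapes here are illustrative placeholders, not the code from the attached notebook:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set to "-1" to hide the GPU from TensorFlow

import tensorflow as tf
import pennylane as qml

n_qubits = 2  # placeholder qubit count
dev = qml.device("default.qubit.tf", wires=n_qubits)

@tf.function  # JIT-compile the QNode
@qml.qnode(dev, interface="tf", diff_method="backprop")
def qnode(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))

# Single-precision data, as in Section 6 of the notebook
inputs = tf.random.uniform((n_qubits,), dtype=tf.float32)
weights = tf.Variable(tf.random.uniform((1, n_qubits), dtype=tf.float32))

with tf.device('/device:GPU:0'):  # or '/device:CPU:0'
    print(qnode(inputs, weights))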

This should allow you to get the code running on your GPU. Now, you mentioned having memory issues also. Backprop is a notoriously memory-hungry algorithm for derivatives, and unfortunately may not allow you to run large-qubit algorithms without access to a large workstation / cluster. Generally, we use high-memory systems and supercomputer / cloud-grade GPUs for the upper-20-qubit regime.
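
For a rough sense of scale (simple arithmetic, not a benchmark): a single complex128 state vector needs 2^n × 16 bytes, and backprop records intermediate states on top of that:

# Memory for one complex128 state vector of n qubits: 2**n amplitudes x 16 bytes
for n in (17, 21, 25):
    print(f"{n} qubits: {2**n * 16 / 2**20:.0f} MiB per state vector")
# 17 qubits: 2 MiB; 21 qubits: 32 MiB; 25 qubits: 512 MiB.
# Backprop multiplies this by the number of intermediate states it keeps.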

However, if you are willing to wait a little longer, you can run larger optimization problems on the CPU using the lightning.qubit device and the adjoint differentiation method. This trades memory for compute time, and it is OpenMP-parallelized on Linux / macOS machines (if you are running Windows, you can use WSL to get the parallelized version). It should be possible to swap one device for the other if the memory wall becomes a problem when reaching the 25-qubit regime with backprop. A sketch of the swap follows.
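
For reference, a minimal sketch of that swap (the circuit body here is again a placeholder; note that the adjoint method differentiates expectation values):

import pennylane as qml
import tensorflow as tf

n_qubits = 25
dev = qml.device("lightning.qubit", wires=n_qubits)

# Adjoint differentiation: low memory footprint, more compute time
@qml.qnode(dev, interface="tf", diff_method="adjoint")
def qnode(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]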

Let us know if you require any further assistance.

Thank you very much for your time, that worked! 🙂
P.S. The code runs smoothly for 21 qubits.

Hi @mlxd, I just wanted to iterate on this thread since it's related. I'm trying to implement something similar in my framework, and it works nicely with "default.qubit.tf" + backprop, but when I switch to qiskit.aer + parameter-shift I cannot get any gradient: tf.GradientTape() gives me zero all the time. Is there another way to use Qiskit's simulator with tf.vectorized_map?

Thanks

Hi @jackaraz, if you have a minimal working example of your code, we can take a look. Feel free to drop it here.

Hi @mlxd, here is a minimal example. I believe the problem occurs due to parameter-shift; since Qiskit does not allow backprop, it doesn't work properly.

import tensorflow as tf
import pennylane as qml

# Shot-based Qiskit simulator vs. exact TF simulator
dev1 = qml.device("qiskit.aer", wires=2, shots=10, backend='qasm_simulator')
dev2 = qml.device("default.qubit.tf", wires=2, shots=None)

@qml.qnode(dev2, diff_method="backprop", interface="tf")
def circuit2(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(2), rotation="Y")
    qml.RY(weights[0], wires=0)
    qml.RY(weights[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.probs(op=qml.PauliZ(1))

@qml.qnode(dev1, diff_method="parameter-shift", interface="tf")
def circuit1(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(2), rotation="Y")
    qml.RY(weights[0], wires=0)
    qml.RY(weights[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.probs(op=qml.PauliZ(1))

weights = tf.Variable(tf.random.uniform((2,), dtype=tf.float64), trainable=True)
inputs = tf.random.uniform((10, 2), dtype=tf.float64)

# Batched evaluation via tf.vectorized_map on the backprop device
circ = tf.function(circuit2)
contract = lambda inpts: tf.vectorized_map(lambda vec: circ(vec, weights), inpts)
with tf.GradientTape() as tape:
    yhat = tf.reduce_mean(contract(inputs))
tape.gradient(yhat, weights)
# Output: <tf.Tensor: shape=(2,), dtype=float64, numpy=array([ 1.38777878e-17, -2.77555756e-17])>

# Same batched evaluation on the parameter-shift Qiskit device
circ = tf.function(circuit1)
contract = lambda inpts: tf.vectorized_map(lambda vec: circ(vec, weights), inpts)
with tf.GradientTape() as tape:
    yhat = tf.reduce_mean(contract(inputs))
tape.gradient(yhat, weights)
# Output: <tf.Tensor: shape=(2,), dtype=float64, numpy=array([0., 0.])>

My yhat is poorly chosen, but I believe it shows what's happening here. circuit1 always gives a zero gradient no matter what, but I can get proper results from circuit2. Also, if instead of vectorized_map I use the batching function I wrote here, it gives me a good gradient result as well, but this does not parallelize the execution on the GPU. So I believe I need to use vectorized_map to parallelize the batch execution, or is there another way you can suggest? A sketch of what I mean by sequential batching follows.
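
Continuing from the example above, this is roughly what I mean by sequential batching (a sketch, not the exact function from the link):

# Sequential alternative to tf.vectorized_map: one circuit execution per
# sample, stacked at the end. No batch parallelism, but each call is taped.
def batched(circuit, inputs, weights):
    return tf.stack([circuit(vec, weights) for vec in tf.unstack(inputs)])

with tf.GradientTape() as tape:
    yhat = tf.reduce_mean(batched(circuit1, inputs, weights))
print(tape.gradient(yhat, weights))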

Thanks

Hi @jackaraz, I had a quick look at your example, but it is not clear to me what should be happening on the GPU side. I think this would be better created as an issue in the PennyLane repo, as it will allow the rest of the team to have a look.

Hi @mlxd, thanks, I'll move it to GitHub then.