Running PennyLane parallel on several CPU cores

Hi! I have read somewhere in the forum that GPU is not supported yet for PennyLane. When I run my code, CPU usage is only about 20% on an 8-core CPU. I might be wrong, but I think the code runs on a single core. Is there a trick to make the code run on several cores so that it executes faster? Thanks!

Hi @eraraya-ricardo!

This depends on the underlying device being used. GPU support is available for the qulacs plugin and should also be possible using devices like default.qubit.tf (although probably not optimized).

We are also thinking of improving our options for optimized/parallelized CPU simulation. You could consider using Qulacs again (see here). There may also be improvements found by carefully building your version of NumPy from source and using default.qubit (but I’m not sure if that includes parallelization).

Hi @Tom_Bromley, thank you for replying to many of my questions, I really really appreciate it!

I see. I will try the GPU approach with Qulacs and see whether there is an improvement in runtime or not. I will post a follow up for this if I am able to run the code in Qulacs.

In the meantime, I am trying to find an alternative way to make my code run faster. I tried the tutorial on running PennyLane code on Amazon Braket here (https://pennylane.ai/qml/demos/braket-parallel-gradients.html) and it worked. The tutorial showcases an amazing improvement in runtime and I would like to try this for my project.

But when I then tried to run my quantum circuit (wrapped as a Keras layer) on an Amazon Braket device, it didn’t work. This is my code:

  1. The quantum circuit with the Amazon Braket device
n_qubits = n_class

#dev = qml.device("default.qubit", wires=n_qubits)

dev_remote = qml.device(
    "braket.aws.qubit",
    device_arn=device_arn,
    wires=n_qubits,
    s3_destination_folder=s3_folder,
    parallel=True,
)

@qml.qnode(dev_remote)
def qcircuit(params, inputs):
    """A variational quantum circuit representing the DRC.

    Args:
        params (array[float]): array of parameters
        inputs = [x, y]
        x (array[float]): 1-d input vector
        y (array[float]): single output state density matrix

    Returns:
        float: fidelity between output state and input
    """
    
    # layer iteration
    for l in range(len(params[0])):
        # qubit iteration
        for q in range(n_qubits):
            # gate iteration
            for g in range(int(len(inputs)/3)):
                qml.Rot(*(params[0][l][3*g:3*(g+1)] * inputs[3*g:3*(g+1)] + params[1][l][3*g:3*(g+1)]), wires=q)
    
    return [qml.expval(qml.Hermitian(density_matrix(state_labels[i]), wires=[i])) for i in range(n_qubits)]

  2. Keras layer and model
class class_weights(tf.keras.layers.Layer):
    def __init__(self):
        super(class_weights, self).__init__()
        w_init = tf.random_normal_initializer()
        self.w = tf.Variable(
            initial_value=w_init(shape=(1, n_class), dtype="float32"),
            trainable=True,
        )

    def call(self, inputs):
        return (inputs * self.w)

X = tf.keras.Input(shape=(27,27,1))

conv_layer_1 = tf.keras.layers.Conv2D(filters=1, kernel_size=[3,3], strides=[2,2], name='Conv_Layer_1')(X)
conv_layer_2 = tf.keras.layers.Conv2D(filters=1, kernel_size=[3,3], strides=[2,2], name='Conv_Layer_2')(conv_layer_1)
max__pool_layer = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=None, name='Max_Pool_Layer')(conv_layer_2)
reshapor_layer = tf.keras.layers.Reshape((9,), name='Reshapor_Layer')(max__pool_layer)

qlayer = qml.qnn.KerasLayer(qcircuit, {"params": (2, 1, 9)}, output_dim=n_class, name='Quantum_Layer')(reshapor_layer)

class_weights_layer = class_weights()(qlayer)

model = tf.keras.Model(inputs=X, outputs=class_weights_layer, name='Conv DRC')

I tried a forward pass like this:

model(X_train[0:32])

It did not work. It only works when I use the regular default.qubit instead of the Amazon Braket device. I got this error message:

TypeError: Object of type 'float32' is not JSON serializable

My first guess is that the Amazon plugin failed to communicate either the training data or the weights to or from the Keras layer. Have you ever experienced this? Any ideas?
Thank you in advance, Tom

Btw, ignore all the comments in the code. It is adapted from the Data Reuploading Classifier tutorial and I forgot to erase the comments haha

Hey @eraraya-ricardo, thanks for noticing this! The PL-Braket plugin was launched recently so there are probably a few imperfections we still need to iron out.

The issue seems to be with the dtype of the tensor used, for example the following works:

import pennylane as qml
import tensorflow as tf
from pennylane import numpy as np

qml.enable_tape()

wires = 2
dev = qml.device(
    "braket.aws.qubit",
    device_arn=device_arn,
    wires=wires,
    s3_destination_folder=s3_folder,
    parallel=True,
)

@qml.qnode(dev, interface="tf")
def qnode(weights, inputs):
    qml.templates.AngleEmbedding(inputs, wires=range(wires))
    qml.templates.StronglyEntanglingLayers(weights, wires=range(wires))
    return [qml.expval(qml.PauliZ(i)) for i in range(wires)]

x = tf.ones(2, dtype=tf.float64)
weights = tf.Variable(np.random.random((4, 2, 3)))

qnode(weights, x)

However, if you set x = tf.ones(2, dtype=tf.float32), it doesn’t work.

I still need to understand what’s going on here, and it may be a bug in PL or the plugin. Practically, for your code, I’d recommend checking whether the error remains when you make sure everything is of dtype=tf.float64 (by default, many things are dtype=tf.float32 in Keras for efficiency).
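As a guess at the mechanism (an assumption, not something confirmed from the plugin source): if the gate parameters reach Python's json module as NumPy scalars, float64 values pass because np.float64 subclasses Python's float, while np.float32 does not and is rejected. The sketch below only uses json and NumPy, not the Braket plugin, and the helper name is_json_serializable is made up for illustration.

```python
import json

import numpy as np


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False if it raises TypeError."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


x64 = np.float64(0.5)  # subclass of the built-in float
x32 = np.float32(0.5)  # NOT a subclass of the built-in float

print("float64:", isinstance(x64, float), is_json_serializable(x64))
print("float32:", isinstance(x32, float), is_json_serializable(x32))

# Converting to a plain Python float sidesteps the problem.
print("float(float32):", is_json_serializable(float(x32)))
```

If this is indeed the cause, casting inputs and weights to float64 before they reach the device would explain why the error goes away.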

I tried to set everything to tf.float64 and the error disappeared.

However, the run took a very long time: the code had not finished even after I waited several minutes. I was only doing a forward pass on 32 samples, so I don’t think it should take that long.

Great! We’re still thinking about how this can best be fixed to support float32.

However, I’m not sure that switching from float32 to float64 should make things notably slower. How many qubits do you have in your model? The remote simulator on Braket is best suited to circuits with quite a few qubits (e.g., around 20 or more). For smaller numbers of qubits, local simulators will probably be faster because they don’t have the added latency of sending each job to the remote device.
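For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory a dense state vector needs, assuming 2**n complex128 amplitudes at 16 bytes each. The helper name state_vector_bytes is hypothetical, and the estimate ignores any simulator overhead.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector: 2**n amplitudes, 16 bytes each for complex128."""
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (3, 10, 20, 30):
    size = state_vector_bytes(n)
    print(f"{n:2d} qubits: {2 ** n:>13,d} amplitudes, {size / 2**20:,.4f} MiB")
```

A circuit with a handful of qubits needs only a few hundred bytes of state, so the per-job round trip to a remote service is likely to dominate the runtime. Around 20 qubits the state is already 16 MiB, and at 30 qubits it is 16 GiB, which is where a remote simulator starts to pay off.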