Is parameter broadcasting considered parallel or concurrent processing when using PennyLane?

I am currently using parameter broadcasting in PennyLane to evaluate the same quantum circuit on multiple input patches. I have not explicitly set max_workers or configured any parallel backend, since my workflow requires backpropagation and max_workers is not compatible with backprop on the default.qubit device.

I would like to clarify:
Does this approach count as concurrent or parallel execution (e.g., across multiple CPU cores)?
My goal is to compute several identical quantum circuits simultaneously on different CPU cores, but from what I observe, it seems that parameter broadcasting does not leverage multi-core CPU parallelism by default.

Is there a recommended way to achieve actual parallel execution across CPUs for this type of use case in PennyLane?

Thanks!

Hi @charliechiou , welcome to the Forum!

My colleague Josh pointed out that default.qubit utilizes NumPy broadcasting under the hood:

Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python.
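
As a rough sketch of what that means in practice (a toy single-qubit circuit, not your exact workflow), a single QNode call can take a whole batch of parameters and return a batch of results:

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev)
def circuit(theta):
    qml.RY(theta, wires=0)
    return qml.expval(qml.PauliZ(0))

# Pass a batch of 5 angles in one call; the loop over the batch happens
# inside NumPy (in C) rather than in a Python for loop.
thetas = np.linspace(0, np.pi, 5)
print(circuit(thetas))  # array of 5 expectation values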

I’ve shared your question with my colleagues so they may be able to provide a more detailed answer in a few days.


Hi, thanks for your reply. I also read the discussion in Questions about parallel execution, where one of the responses mentioned that default.qubit utilizes NumPy broadcasting. It seems that NumPy performs concurrent processing rather than parallel processing.

To further investigate this, I monitored the CPU usage (using htop on Linux) while executing the program, and it showed that only a single CPU core was active. Therefore, I believe the execution mechanism is concurrent rather than truly parallel. Thanks! :slightly_smiling_face:

Hi, I have recently switched my device to default.qubit.torch, and I would like to better understand the underlying mechanism of GPU-based computation in this context. Thanks!

Hi @charliechiou ,

By default, default.qubit.torch runs on the CPU rather than the GPU. The .torch suffix indicates that it uses the Torch interface, meaning that the inputs and outputs of the QNode will be Torch tensors.

If you want to use GPUs you should use lightning.gpu or lightning.kokkos depending on the GPU that you have. You can learn more about PennyLane simulators on our performance page.
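
For example, assuming pennylane-lightning-gpu is installed and you have a compatible NVIDIA GPU, switching simulators is essentially a one-line change (a sketch, not your full setup):

import pennylane as qml

# lightning.gpu keeps the statevector on the GPU and supports adjoint differentiation
dev = qml.device("lightning.gpu", wires=4)

@qml.qnode(dev, diff_method="adjoint")
def circuit(theta):
    qml.RY(theta, wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

print(circuit(0.5))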

I hope this helps!


Hi, @CatalinaAlbornoz,

Is there any way for me to train the quantum part of my QNN using the GPU for parallel processing? Right now, I’m training my hybrid QNN with PyTorch using CUDA on the GPU, and I’m wondering if it’s possible to also train the quantum part on the GPU to speed things up. I noticed that I can set torch_device='cuda'; does that mean my current computation is on the GPU? My device setting is as below:
qml.device(self.qml_device, wires=self.qubit_num, torch_device='cuda')

Thanks for your help! :folded_hands:

P.S. Please also let me know if there is any way to view or show where the computation actually happens, so I can examine it myself.
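
For example, would checking the device of the returned tensor be a reliable way to confirm this? The following is just my own guess at how to check, using a tiny standalone circuit rather than my full model:

import torch
import pennylane as qml

dev = qml.device("default.qubit.torch", wires=2, torch_device="cuda")

@qml.qnode(dev, interface="torch")
def circuit(theta):
    qml.RY(theta, wires=0)
    return qml.expval(qml.PauliZ(0))

out = circuit(torch.tensor(0.3, device="cuda"))
print(out.device)                          # expecting cuda:0 if the result stayed on the GPU
print(torch.cuda.memory_allocated() > 0)   # expecting True if GPU memory was actually used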

Hi @charliechiou , what GPU are you using?

Hi @CatalinaAlbornoz , I’m using NVIDIA GeForce RTX 4090.

Thanks for confirming @charliechiou .

I think this should be compatible. If you’re on Linux you can install pennylane-lightning-gpu with MPI from source.

Please create a virtual environment with venv to avoid any package conflict issues.
Please read the instructions here in the docs to learn how to install lightning.gpu in your environment.
Please pay special attention to the notes in green.

Let us know if you have any questions or issues!

Hi @CatalinaAlbornoz, thanks again for your help. I’m going to give it a try. Meanwhile, I’m trying to understand the mechanism behind this line:

torch_device = qml.device('default.qubit.torch', wires=self.qubit_num, torch_device='cuda')

Specifically, I’m wondering whether the quantum circuit runs in parallel or concurrently when using the 'cuda' backend. Thanks!

Hi @charliechiou ,

Could you please share a minimal reproducible example of your code here?

My colleague Lee shared a few insights:
Torch will run on a single GPU, but running on a GPU counts as single program, multiple data (SPMD), so it is a form of parallel execution at the hardware level.

Depending on your workload it may end up using multiple threads, and if you have different observables you could get gradients running on multiple cores via OpenMP. However, this is entirely dependent on what code you’re running, so we’d need to see the code in order to understand what can (or cannot) run in parallel. It’s also very helpful to know your end goal with the code. For example, is this for benchmarking algorithms? We have a repo for qml benchmarks that could be useful.

I hope this helps!

Hi, @CatalinaAlbornoz
The following is a minimal reproducible example of my code.
My goal is to train a hybrid quantum neural network (QNN) that combines a classical MLP with a quantum circuit implemented using PennyLane.

import torch
import torch.nn as nn
import pennylane as qml
import matplotlib.pyplot as plt

class QNN(nn.Module):
    def __init__(self):
        super(QNN, self).__init__()
        self.qubit_num = 4
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        if torch.cuda.is_available():
            print("Using GPU")
            q_device = qml.device("default.qubit.torch", wires=self.qubit_num, torch_device="cuda")
        else:
            print("Using CPU")
            q_device = qml.device("default.qubit", wires=self.qubit_num)

        self.QNode = qml.QNode(self.qnn_circuit, q_device, interface="torch", diff_method="backprop")
        self.measure_set = [0, 2]

        self.p_rotation = nn.Parameter(torch.rand(12) * torch.pi, requires_grad=True)

    def qnn_circuit(self, embedding):
        qml.AngleEmbedding(features=embedding, wires=range(4))
        for i in range(4):
            qml.RY(self.p_rotation[i], wires=i)
        return [qml.expval(qml.PauliZ(w)) for w in self.measure_set]

    def forward(self, x):
        x = x / torch.norm(x, dim=1, keepdim=True)  # normalize
        outputs = [torch.tensor(self.QNode(xi), device=self.device) for xi in x]
        return torch.stack(outputs)

model = QNN()
x = torch.randn(5, 4)  
output = model(x)
print("Output shape:", output.shape)
print(output)

Thank you!

Thanks for sharing @charliechiou .

I’m assuming you’re not using OpenMP, so you’re using a single core for everything.
Since you’re using different observables you could in theory parallelize this, but it may end up slower than it currently is.

Your code runs quite fast for me so I was wondering how slow it is for you, and why you’re looking into parallelizing this.

Hi @CatalinaAlbornoz, thanks for the clarification.

The execution speed on my side isn’t particularly slow either. However, since I’m using this as part of training a quantum neural network (QNN), reducing the computational cost per iteration could significantly improve the overall training time.

My motivation for further parallelization is mainly to scale across different features, since each observable corresponds to a distinct feature. Structuring the computations this way could help improve scalability as the feature space grows.

While it may appear parallelized due to the use of different observables, I’d like to confirm whether the backend is truly executing them in parallel on the GPU, or if it’s still serial under the hood.

Hi @charliechiou ,

At the moment you don’t have any broadcasting in your circuit nor anything that would make it run in parallel.

Broadcasting itself doesn’t parallelize, but it makes the looping (e.g. the Python for loops over datapoints) happen in C instead of Python, which is more efficient.

The code below shows how to do actual broadcasting.

import torch
import torch.nn as nn
import pennylane as qml
import matplotlib.pyplot as plt

class QNN(nn.Module):
    def __init__(self):
        super(QNN, self).__init__()
        self.qubit_num = 4
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        if torch.cuda.is_available():
            print("Using GPU")
            q_device = qml.device("default.qubit.torch", wires=self.qubit_num, torch_device="cuda")
        else:
            print("Using CPU")
            # Set max_workers depending on the number of cores you have available
            q_device = qml.device("default.qubit", wires=self.qubit_num, max_workers=2)

        # Set diff_method="adjoint"
        self.QNode = qml.QNode(self.qnn_circuit, q_device, interface="torch", diff_method="adjoint")
        self.measure_set = [0, 2]

        self.p_rotation = nn.Parameter(torch.rand(12) * torch.pi, requires_grad=True)

    def qnn_circuit(self, embedding):
        qml.AngleEmbedding(features=embedding, wires=range(4))
        for i in range(4):
            qml.RY(self.p_rotation[i], wires=i)
        return [qml.expval(qml.PauliZ(w)) for w in self.measure_set]

    def forward(self, x):
        x = x / torch.norm(x, dim=1, keepdim=True)  # normalize
        # You don't need to split the dataset into the different datapoints
        #outputs = [torch.tensor(self.QNode(xi), device=self.device) for xi in x]

        # You can use broadcasting so that the looping occurs in C which is more efficient
        outputs = self.QNode(x)
        return torch.stack(outputs, dim=1) # Add the dimension so that the results get stacked properly

model = QNN()
x = torch.randn(5,4)  
output = model(x)
#print("Output shape:", output.shape)
print(output)

In addition to broadcasting …

Since you’re using default.qubit you can set max_workers, as described in the docs, to use a pool of at most max_workers processes asynchronously. This doesn’t guarantee that they will all be used; it is just a maximum. Note that backprop doesn’t work with this, so you’ll need to change the diff_method to diff_method="adjoint".
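
In isolation, and just as a sketch (the exact max_workers value depends on how many cores you have), those two changes look like this:

import pennylane as qml
from pennylane import numpy as np

# A pool of at most 2 worker processes; whether they all get used depends on the workload
dev = qml.device("default.qubit", wires=4, max_workers=2)

# backprop doesn't work with max_workers, so use adjoint differentiation instead
@qml.qnode(dev, diff_method="adjoint")
def circuit(features, weights):
    qml.AngleEmbedding(features=features, wires=range(4))
    for i in range(4):
        qml.RY(weights[i], wires=i)
    return [qml.expval(qml.PauliZ(0)), qml.expval(qml.PauliZ(2))]

features = np.array([0.1, 0.2, 0.3, 0.4], requires_grad=False)
weights = np.array([0.5, 0.6, 0.7, 0.8], requires_grad=True)
print(circuit(features, weights))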

Take a look at the code above and let me know if you have any further questions. You can also check out our performance page to learn about other simulators we have.

I hope this helps!