Batching circuits

Hello,

I am trying to run a machine learning experiment using a small circuit. In my optimization loop, I will execute the same circuit many times, but with different input parameters (the argument x in the following code snippet). I want to run on real hardware using the pennylane-qiskit plugin. I have access to a Heron QPU with 156 qubits, so I could fit multiple copies of my small circuit, each with different parameters, into the same execution to save time.

I was hoping that one of these two transforms would achieve that, but without success so far.

Here is a simple example of what I tried.

import pennylane as qml
import numpy as np
from qiskit_ibm_runtime import QiskitRuntimeService

service = QiskitRuntimeService()
backend = service.least_busy()

device = qml.device("qiskit.remote", wires=2, backend=backend)


@qml.batch_input(argnum=[0, 1])
@qml.set_shots(32)
@qml.qnode(device)
def circuit(x, y):
    qml.RX(x[0], wires=0)
    qml.RX(x[1], wires=1)
    qml.CNOT(wires=[0, 1])
    qml.RY(y[0], wires=0)
    return qml.expval(qml.Z(wires=0) + qml.Z(wires=1))


x = np.random.random((2, 5))
print(f"x = {x}")

y = np.random.random(1)
print(f"y = {y}")

print(f"Outcome = {circuit(x, y)}")
print(qml.draw(circuit)(x=x, y=y))

I have the following output

➜  python main.py

x = [[0.  0.1 0.2 0.3 0.4]
[0.5 0.6 0.7 0.8 0.9]]

y = [2.]

Outcome = [0.47190482 0.40949323 0.37722648 0.44045717 0.09474952]

0: ──RX(0.00)─╭●──RY(2.00)── β•­<𝓗>
1: ──RX(0.50)─╰X──────────── β•°<𝓗>

0: ──RX(0.10)─╭●──RY(2.00)── β•­<𝓗>
1: ──RX(0.60)─╰X──────────── β•°<𝓗>

0: ──RX(0.20)─╭●──RY(2.00)── β•­<𝓗>
1: ──RX(0.70)─╰X──────────── β•°<𝓗>

0: ──RX(0.30)─╭●──RY(2.00)── β•­<𝓗>
1: ──RX(0.80)─╰X──────────── β•°<𝓗>

0: ──RX(0.40)─╭●──RY(2.00)── β•­<𝓗>
1: ──RX(0.90)─╰X──────────── β•°<𝓗>

Also, in the IBM Quantum Cloud dashboard, I do see that 5 jobs were executed, each for a single 2-qubit circuit, instead of one larger 10-qubit circuit that would batch all the inputs.

I know I can do it manually, but I was hoping there was already an implemented way to achieve that behavior. It would be even nicer if it were smart enough to batch the inputs only when running on real hardware and not when executing on a simulator.

Hi @maxime, welcome to the Forum!

This is a really nice use case. Thanks for sharing it here.
I don’t think we have something designed for this but I’ll double check with our team just in case.

This might actually be a really nice contribution in case we don’t have something. I’ll have to check with our team to validate this, but let me know if you’d be potentially interested in contributing a feature that does this.

Hi @maxime,

I checked with our team and it looks like we don’t have anything out-of-the-box for this. I’ve added it as a suggestion for improvement though. Thanks for pointing this out! Let us know if you have trouble setting this up by hand.

I was able to configure it by hand in a somewhat hacky way. Here are some decisions I made.

  • I initialize a device with `wires=156` (156 is the number of qubits on the IBM device I am using). This way, the same device can be used regardless of how many circuits I batch together within a single larger circuit, as long as it fits in 156 wires.
  • I detect whether I am using a simulator or a real device outside the QNode / quantum function declaration. Thus, I have two functions, one for the batched version and one for the regular version. I don’t know how to select the right implementation based on the device type inside a single QNode / quantum function.
  • When my total number of circuits is not exactly divisible by the batch size, I decided to call the batched execution one extra time with a smaller batch size for the remaining inputs. However, one could also pad the inputs so that all batches are exactly the same size. I believe this would combine better with jax and maybe lead to faster execution, since all the calls have the same shape and no recompilation is required.
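
For readers who want to try the padding variant from the last bullet, here is a minimal sketch in plain Python. It is not the code I actually ran: `max_copies` and `pad_and_chunk` are hypothetical helper names, and the padding value is an arbitrary choice whose results are simply discarded afterwards.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def max_copies(device_wires: int, wires_per_circuit: int) -> int:
    """How many independent copies of a small circuit fit on the device."""
    return device_wires // wires_per_circuit


def pad_and_chunk(
    inputs: Sequence[T], batch_size: int, pad_value: T
) -> Tuple[List[List[T]], int]:
    """Split `inputs` into batches of exactly `batch_size` items.

    The last batch is filled with `pad_value` when needed. The number of
    real (non-padding) inputs is returned so the caller can drop the
    results that belong to padding.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_valid = len(inputs)
    padded = list(inputs)
    remainder = n_valid % batch_size
    if remainder:
        padded.extend([pad_value] * (batch_size - remainder))
    batches = [
        padded[i : i + batch_size] for i in range(0, len(padded), batch_size)
    ]
    return batches, n_valid


# Example: 2-qubit circuits on a 156-qubit device, 200 inputs to evaluate.
batch_size = max_copies(156, 2)
batches, n_valid = pad_and_chunk(list(range(200)), batch_size, pad_value=0)
```

With these numbers every batch has the same length, so each call to the batched circuit has the same shape, and only the first `n_valid` results (in flattened order) correspond to real inputs.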

I could help contribute this feature to PennyLane if I have some guidance.

It’s great to hear that you were able to hack a solution @maxime !

Let me check with the team to see if they have any recommendations here.

Hi @maxime

This would be a pretty cool addition to PennyLane. It seems like what you want is a variant of broadcast_expand where, instead of splitting your program into separate sequential executions, you combine it into a single parallelized execution. Please feel free to open a PR to PennyLane to implement such a transform. We’d be happy to review your work once it is ready. I would suggest checking out the documentation and implementation of broadcast_expand, which I think is a good starting point. Instead of creating multiple tapes, you would create a single tape with “copies” of your original tape that act on different sets of wires, and distribute your input data across those copies accordingly. Another useful transform to help you with re-mapping the wires might be map_wires (docs, source).
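
To make the wire bookkeeping concrete, here is a rough sketch in plain Python of the two pieces such a transform would need: one wire map per copy, and the per-copy slice of the batched input. The helper names `copy_wire_maps` and `split_batched_input` are hypothetical and not part of PennyLane. Each dictionary maps an original wire to a new wire, which is the general shape of mapping that map_wires works with, but treat the exact integration as an assumption to check against the docs.

```python
from typing import Dict, Hashable, List, Sequence


def copy_wire_maps(
    circuit_wires: Sequence[Hashable], n_copies: int
) -> List[Dict[Hashable, int]]:
    """Build one wire map per copy of the circuit.

    Copy k is shifted by k * len(circuit_wires), so no two copies share
    a wire and all copies can sit side by side in one larger circuit.
    """
    width = len(circuit_wires)
    return [
        {wire: k * width + i for i, wire in enumerate(circuit_wires)}
        for k in range(n_copies)
    ]


def split_batched_input(x: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn a (n_params, batch) array into one parameter list per copy.

    Column k of `x` holds the parameters for copy k, matching the layout
    of `x` in the example at the top of this thread.
    """
    return [list(column) for column in zip(*x)]


# Same layout as the printed x above: 2 parameters, batch of 5.
x = [[0.0, 0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8, 0.9]]
wire_maps = copy_wire_maps([0, 1], n_copies=5)
params_per_copy = split_batched_input(x)
```

Here copy k would use `params_per_copy[k]` on the wires given by `wire_maps[k]`, so the five 2-qubit circuits from the drawing would occupy wires 0 to 9 of a single 10-qubit circuit.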

If you have any other questions, feel free to ask here. Although I would suggest opening a draft PR and continuing discussion there. It will be easier for the dev team to keep an eye on the discussion on GitHub.
