I’m currently benchmarking several QML models with different architectures, trying to identify the fastest simulation backend for batched inputs, especially since some of the models are hybrid and built with TorchLayer.
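For context, the hybrid models follow the standard TorchLayer pattern, roughly like this (a simplified sketch based on the docs, not my actual architectures):

import pennylane as qml
import torch

n_qubits = 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def qnode(inputs, weights):
    # TorchLayer feeds the classical features in through `inputs`
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (2, n_qubits)}
qlayer = qml.qnn.TorchLayer(qnode, weight_shapes)
model = torch.nn.Sequential(torch.nn.Linear(3, n_qubits), qlayer)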
I noticed that even though default.qubit, lightning.qubit, and lightning.gpu all accept batched inputs, only default.qubit seems to actually benefit from batching in terms of execution speed: the execution time drops significantly, and the speedup scales with batch size.
To test this, I ran the following minimal example:
import time

import pennylane as qml
import torch

dev = qml.device("lightning.gpu", wires=3)

@qml.qnode(dev, interface="torch")
def circuit(x):
    qml.AngleEmbedding(x, wires=[0, 1, 2])
    return qml.expval(qml.PauliZ(0))

x_batch = torch.randn(5000, 3)

# --- Batched execution: one call with a (5000, 3) input ---
start_batch = time.time()
results_batch = circuit(x_batch)
end_batch = time.time()

# --- Sequential execution: 5000 separate single-input calls ---
start_seq = time.time()
results_seq = torch.stack([circuit(x) for x in x_batch])
end_seq = time.time()

print(f"Batched time:    {end_batch - start_batch:.4f} s")
print(f"Sequential time: {end_seq - start_seq:.4f} s")
The timings for lightning.qubit and lightning.gpu are only slightly different between the batched and sequential versions. This makes me wonder: do these devices actually support parameter broadcasting in the sense of processing batched inputs in parallel, or are the inputs unrolled and evaluated sequentially under the hood despite the batched input shapes being accepted?
Thanks a lot!