Torch Tensor Batching Speed-up

Hello. I am using the PennyLane-Torch interface.
I was expecting a linear speed-up with respect to the number of batch elements when batching with a Torch tensor (as long as my device can run all the batches in parallel), but it seems like I am only getting a constant speed-up. Is there any way I can get around this?

Here is a minimal working example.

import pennylane as qml
from pennylane import numpy as np
import torch
import time

dev = qml.device('default.mixed', wires=1)

# rho = |+><+|, a pure state written as a density matrix
rho = np.array([[0.5, 0.5], [0.5, 0.5]])

@qml.qnode(dev, interface="torch")
def circ_torch(theta):
    qml.QubitDensityMatrix(rho, wires=0)
    qml.RZ(-1 * theta, wires=0)
    return [qml.expval(qml.Identity(0)), qml.expval(qml.PauliZ(0))]


@qml.qnode(dev)
def circ(theta):
    qml.QubitDensityMatrix(rho, wires=0)
    qml.RZ(-1 * theta, wires=0)
    return [qml.expval(qml.Identity(0)), qml.expval(qml.PauliZ(0))]

thetas = torch.linspace(0, 1, 1000)

# One batched call with all 1000 angles at once
time1 = time.time()
circ_torch(thetas)
time2 = time.time()

# 1000 individual calls in a Python loop
for i in range(len(thetas)):
    x = circ(thetas[i])
time3 = time.time()

print(time2 - time1)
print(time3 - time2)

Output is

0.5288441181182861
0.7281041145324707

I expected a ~1000x speed-up from the single batched Torch call.

Hey @takh2324, welcome back!

default.mixed actually doesn't support parameter broadcasting natively; you can pass an operator an argument with a leading batch dimension, but what happens under the hood is a glorified for loop.

You can see this with the following example:

import pennylane as qml
import torch

# Swap in default.qubit here to compare the tracker output below
# dev = qml.device('default.qubit', wires=1)
dev = qml.device('default.mixed', wires=1)

@qml.qnode(dev)
def circ_torch(theta):
    qml.RZ(theta, wires=0)
    return qml.expval(qml.Z(0))

# The tracker records what the device actually executes
with qml.Tracker(dev) as tracker:
    circ_torch(torch.tensor([0.1, 0.2, 0.3]))

print(tracker.totals)

With default.qubit, the number of times the device executes something can be determined with the simulations key:

{'batches': 1, 'simulations': 1, 'executions': 3}

With default.mixed, the number of times the device executes something can be determined with the executions key:

{'executions': 3, 'batches': 1, 'batch_len': 3}

As you can see, the device executes something three times, whereas default.qubit only needs a single broadcast simulation. At the moment, default.mixed is still using our old device API. It will be updated soon! At that point, broadcasting should be natively supported.
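
In the meantime, since the rho in your example is pure (it's |+><+|), one workaround is to prepare the corresponding state vector on default.qubit, which does broadcast natively. Here is a minimal sketch under that assumption; circ_sv and dev_sv are just illustrative names, and StatePrep stands in for your QubitDensityMatrix call:

import pennylane as qml
import numpy as np
import torch

# default.qubit supports native parameter broadcasting
dev_sv = qml.device('default.qubit', wires=1)

@qml.qnode(dev_sv, interface="torch")
def circ_sv(theta):
    # |+> is the pure state whose density matrix is the rho above
    qml.StatePrep(np.array([1.0, 1.0]) / np.sqrt(2), wires=0)
    qml.RZ(-1 * theta, wires=0)
    return qml.expval(qml.Identity(0)), qml.expval(qml.PauliZ(0))

# All 1000 angles run as one broadcast simulation
circ_sv(torch.linspace(0, 1, 1000))

This only works when your density matrix is rank one, of course; for genuinely mixed states you'd have to wait for the default.mixed update.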

Now, as for why you're seeing a speed-up at all, I'm not 100% sure :thinking:. It's probably better to use time.process_time() instead of time.time(). The former is CPU time, so it measures only the time your process spends computing. The latter is wall-clock time, so it also depends on whatever else is running on your laptop :sweat_smile:.
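
For example, here is a minimal sketch of the same comparison using time.process_time(), reusing circ_torch, circ, and thetas from your example:

import time

# CPU time counts only what this process spends computing,
# so other programs running on the machine won't skew the numbers
start = time.process_time()
circ_torch(thetas)  # single batched call
print("batched:", time.process_time() - start)

start = time.process_time()
for i in range(len(thetas)):  # one device execution per angle
    x = circ(thetas[i])
print("looped:", time.process_time() - start)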

Let me know if this helps!