I’d like to simulate very long circuits with many mid-circuit measurements as space- and time-efficiently as possible, and I’m having trouble figuring out which simulator settings are the best approach.

As a toy example, here is some code that generates a parameterized circuit that can potentially have very many mid-circuit measurements.

```
import pennylane as qml
import torch


def qnode_gen(device, diff_method='backprop', postselect=None):
    """Build a QNode that measures and resets wire 0 once per input value."""
    @qml.qnode(device, interface='torch', diff_method=diff_method)
    def qnode(inputs, weight):
        measurements = []
        for real in inputs:
            qml.RY(real, 0)
            qml.RY(weight, 0)
            # Mid-circuit measurement; reset=True returns the wire to |0>.
            m = qml.measure(0, reset=True, postselect=postselect)
            measurements.append(m)
        return tuple(qml.expval(m) for m in measurements)
    return qnode


ITERATIONS = 4
torch.manual_seed(6)
inputs = torch.rand(ITERATIONS)
weight = torch.rand(1).requires_grad_()
print(inputs)

qnode = qnode_gen(qml.device('default.qubit', wires=1))
fig, _ = qml.draw_mpl(qnode)(inputs, weight)
```

This produces the following circuit diagram for a circuit with four measurements:

If you’re interested enough to see a testing script for this toy example, I have one here.

I’ve tested a number of combinations of different torch and pennylane device settings without finding one that runs satisfyingly quickly for larger values of `ITERATIONS`. In summary:

1. `qml.device('default.qubit', diff_method='backprop')` has exponentially-sized saved tensors in the computation graph (saved by `BmmBackward0` and `PowBackward0`). This makes sense since it has to maintain a statistical mixture of measurement outcomes.
2. `qml.device('default.qubit', diff_method='parameter-shift')` doesn’t have exponentially-sized saved tensors in the computation graph, but still scales poorly. I expect for the same reason: the forward pass requires maintaining the statistical mixture.
3. `qml.device('default.qubit', diff_method='parameter-shift', shots=10)` gives me a “probabilities do not sum to 1” error.
4. `qml.device('default.qubit.torch')` with either method causes an error where the fixed number of qubits isn’t enough to support the additional qubits that the automatic call to `defer_measurements` requires.

- Using a `torch.device('cuda')` instead of `cpu` presents an issue in case 1) but not in case 2) involving not all tensors being on the same device.
- It occurs to me that if all these measurements are post-selected, then there shouldn’t be an exponential scaling issue in cases 1) and 2), which makes me think my justification is wrong.
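
To put rough numbers on the scaling concern in the list above, here is a small back-of-the-envelope sketch. It assumes (my understanding, not something I have confirmed in the PennyLane source) that when a measured wire is reset or reused, each deferred mid-circuit measurement occupies its own extra wire, so the one-qubit toy circuit with n measurements needs about n + 1 wires. It also assumes a dense `complex128` statevector at 16 bytes per amplitude. `deferred_statevector_bytes` is a hypothetical helper, not a PennyLane function.

```python
def deferred_statevector_bytes(n_measurements, n_circuit_wires=1,
                               bytes_per_amplitude=16):
    """Estimate dense statevector memory under the assumption that every
    mid-circuit measurement is deferred onto its own extra wire."""
    total_wires = n_circuit_wires + n_measurements
    # A dense statevector on k wires stores 2**k complex amplitudes.
    return (2 ** total_wires) * bytes_per_amplitude


for n in (4, 10, 20, 30):
    size = deferred_statevector_bytes(n)
    print(f"{n} measurements -> {n + 1} wires -> {size:,} bytes")
```

Under these assumptions, 30 measurements would already need 2**31 amplitudes, i.e. 32 GiB for the state alone, which would be consistent with the poor scaling in cases 1) and 2) and the wire-count error in case 4).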

To summarize further:

I have a number of issues or errors trying different options and figured it would be more time-efficient to ask what the best approach is before going down the debugging rabbit hole:

**What is the best simulator and torch device configuration for circuits with many mid-circuit measurements, like the example above?**

I suspect it will be `qml.device('default.qubit', diff_method='parameter-shift', shots=some_int)` on `torch.device('cuda')`.
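
To illustrate why I would expect a finite-shot run to avoid the exponential blow-up, here is a minimal plain-Python sketch (standard library only, and not PennyLane's implementation) that samples the toy circuit one shot at a time. It relies on two facts specific to this toy example: the two `RY` rotations on wire 0 compose into `RY(x + weight)`, and `reset=True` returns the wire to |0> after every measurement, so each outcome has probability sin²((x + weight)/2) of being 1. The function names `sample_shot`, `estimate_expvals`, and `parameter_shift_grad` are hypothetical. A general circuit would need a full statevector per shot rather than this closed-form shortcut.

```python
import math
import random


def sample_shot(inputs, weight, rng):
    """Sample one trajectory of the toy circuit; returns a list of 0/1 outcomes."""
    outcomes = []
    for x in inputs:
        # Wire 0 starts each step in |0> (because of the reset), so after
        # RY(x) then RY(weight) the probability of measuring 1 is:
        p_one = math.sin((x + weight) / 2) ** 2
        outcomes.append(1 if rng.random() < p_one else 0)
    return outcomes


def estimate_expvals(inputs, weight, shots, seed=0):
    """Average sampled outcomes over `shots` independent trajectories."""
    rng = random.Random(seed)
    totals = [0] * len(inputs)
    for _ in range(shots):
        for i, outcome in enumerate(sample_shot(inputs, weight, rng)):
            totals[i] += outcome
    return [t / shots for t in totals]


def parameter_shift_grad(inputs, weight, shots, seed=0):
    """Two-term parameter-shift estimate of d<m_i>/d(weight) for each output.

    Shifting the shared weight everywhere at once is only valid here because
    each output depends on exactly one RY(weight) gate (the reset decouples
    the steps). In general each gate occurrence must be shifted separately.
    """
    plus = estimate_expvals(inputs, weight + math.pi / 2, shots, seed)
    minus = estimate_expvals(inputs, weight - math.pi / 2, shots, seed + 1)
    return [(p - m) / 2 for p, m in zip(plus, minus)]


if __name__ == "__main__":
    xs = [0.1, 0.5, 0.9, 1.3]
    print(estimate_expvals(xs, 0.4, shots=2000))
    print(parameter_shift_grad(xs, 0.4, shots=2000))
```

The work per shot is linear in the number of measurements, so the total cost grows as shots × measurements rather than exponentially. Whether `default.qubit` actually takes this per-shot path with `shots=some_int` is exactly what I am unsure about.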

Thank you very much for reading this, and for any advice you can provide!