Running a 30-qubit highly entangled circuit

Dear Pennylane team,
I have a 30-qubit circuit where all qubits are inter-entangled. The total number of gates is somewhere around 8000. I estimate that this circuit will take about 12 days to run on 4 cores.
I would like to know if there are ways to speed up the computation.

  • For example, would it make sense to run the circuit on a GCP instance with a GPU or an instance with no GPU but many cores? What is the best hardware configuration for this case?

  • Would you recommend the AWS simulators instead of the default.qubit device running on a GCP instance? Would there be a significant speedup for executing the circuit?

Hi @Einar_Gabbassov, it’s great that you’re tackling a big problem!

The first thing I would suggest is changing the device from default.qubit to lightning.qubit. This will increase your speed by three times or more.

I will look into the other questions or see if someone else has an answer.

Hi @Einar_Gabbassov.

It depends on the problem (and circuit) you want to solve.

As you said, the SV1 simulator on AWS might be faster for a single execution of a circuit (i.e. one that doesn’t require any iterations).

But for variational quantum algorithms (e.g. VQE, QAOA, QNN), in which quantum and classical computation are iterated, the communication delay between SV1 and your laptop accumulates and becomes a significant problem.
This is critical for QNNs, because a QNN may require far more iterations than VQE or QAOA.

Moreover, as far as I know, SV1 currently supports only the slower parameter-shift rule for gradient calculation.
This is critical if your circuit has many variational parameters.
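A rough back-of-the-envelope sketch of why this adds up, assuming the standard two-term parameter-shift rule (two shifted circuit executions per parameter per gradient) and assuming each execution is sent as a separate, sequential remote job. The parameter count, step count, and per-job latency below are hypothetical numbers, not measurements of SV1.

```python
def gradient_executions(n_params: int, n_steps: int, shifts_per_param: int = 2) -> int:
    """Circuit executions needed for the gradients alone over a whole optimisation."""
    return shifts_per_param * n_params * n_steps


def latency_overhead_seconds(n_executions: int, latency_per_execution_s: float) -> float:
    """Time spent purely on round trips if every execution is its own sequential job."""
    return n_executions * latency_per_execution_s


n_params = 90     # hypothetical number of variational parameters
n_steps = 100     # hypothetical number of optimisation steps
latency_s = 2.0   # hypothetical round-trip latency per remote job, in seconds

executions = gradient_executions(n_params, n_steps)
overhead_hours = latency_overhead_seconds(executions, latency_s) / 3600
print(f"{executions} circuit executions, about {overhead_hours:.1f} h of pure latency")
```

With these made-up numbers the latency alone comes to about 10 hours, before any simulation time. Batching several circuits into one job, where the backend allows it, would reduce this.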

In my experience, below 15 qubits a local simulator on your laptop is sometimes faster than SV1 for variational quantum algorithms.
30 qubits might be a good size at which to compare SV1 and a local simulator.

@CatalinaAlbornoz @Kuma-quant thank you for your replies.
Fortunately, this big circuit does not require any iterations on the parameters. I just need to run it and sample the device.

If I use the lightning.qubit device (or even default.qubit), what is the best hardware configuration for running my circuit (e.g. more CPU cores, or maybe one core but a powerful GPU)?

My apologies for such a basic question, but I could not find any info on parallelization of the PennyLane simulators.

P.S. I guess, simulators do a lot of matrix multiplications so it seems like a GPU could be pretty handy?

Hey @Einar_Gabbassov,

to also add my 2 cents, I benchmarked AWS’ SV1 simulator as a PennyLane device a few months back, and the setting you describe seems very well suited to it. I used much shorter circuits, but in my case the runtime grew surprisingly little with the number of qubits in the regime of 25-32 qubits. Also, with that much entanglement, I doubt you would get an advantage from a tensor network simulator, and running lightning.qubit locally will reach its limits, even if you could put it onto the GPU.
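To give a feeling for where those local limits come from, here is a small sketch of the memory a full state vector needs. It assumes double-precision complex amplitudes (16 bytes each) and counts only the state itself. A real simulator needs extra working memory on top, so treat these as lower bounds. The function name is made up for this example.

```python
def statevector_bytes(n_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for 2**n_qubits amplitudes (complex128 by default)."""
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (20, 25, 30, 32):
    gib = statevector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:.4f} GiB")
```

Under these assumptions, 30 qubits already need 16 GiB for the state alone and 32 qubits need 64 GiB, which is more than many single GPUs or laptops offer.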

As @Kuma-quant says, the latency is large and optimisation with many different sequential runs can be cumbersome at this stage for remote simulators, but if it is a wide&deep circuit you want to be simulated only once, the massive parallelisation of the backend should be exactly what you are looking for (and the few seconds of sending the job will not make a difference).

Maybe try and see if it fits the bill?

For this defined qnode:

@qml.qnode(dev, interface="tf", diff_method="backprop")
def qnode(inputs, weights):
    for i in range(blocks):
        qml.templates.AngleEmbedding(inputs, wires=range(n_qubits))
        qml.templates.StronglyEntanglingLayers(weights[i], wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

When I set

dev = qml.device("", wires=n_qubits) 

I get a much faster speed per epoch than with the suggested

dev = qml.device("lightning.qubit", wires=n_qubits) 

Any idea why? Thanks in advance!

P.S. In the second case I have to remove interface="tf", diff_method="backprop" from the QNode so that it runs properly.

P.S. 2: The maximum number of qubits I can run on my PC is 17 (more qubits result in an error). Is there a way I can push to 20 qubits? [just wondering]

Hi @NikSchet, this is very unusual. We are looking into why this may be happening.

As for pushing to 20 qubits, it really depends on the machine you’re using. You could try running on a GPU if you have one.

Thank you very much. Do I need to make certain changes to run on the GPU instead of the CPU? Somehow I cannot find anything relevant on the website (sorry if it was obvious and I missed it).

Hi @NikSchet,

Thank you for your interest in PennyLane! Can you try running lightning.qubit with diff_method="adjoint"? This might speed it up.

Regarding the 17-qubit problem, this is a bit weird; you should be able to simulate more on a modern desktop. Can you post the error message you get when you try to increase the number of qubits, along with your environment specs (e.g. PennyLane version, Lightning version, OS, CPU model, RAM)? Also, if it’s not too much trouble, can you post the full Python script?


Thank you for the fast reply. I used the suggested diff_method and indeed got much faster speeds. What is the difference between

1.

dev = qml.device("lightning.qubit", wires=n_qubits)
@qml.qnode(dev, interface="tf", diff_method="backprop")

and 2.

dev = qml.device("lightning.qubit", wires=n_qubits)
@qml.qnode(dev, interface="tf", diff_method="adjoint")

With the second option I was able to push up to 19 qubits (Windows 10, AMD Ryzen 7 5800X, Nvidia 3060 8 GB, G.Skill 32 GB RAM [ultrafast]). PennyLane version 0.18.0 (I do not use PyTorch Lightning).
Beyond that number of qubits the kernel just dies or I get the error “MemoryError: bad allocation” in a Jupyter notebook opened in the Edge browser (Anaconda distribution).

Maybe I can push to a higher number of qubits if I change the batch size?

The script I am running can be found here:
[note that this is a transfer learning script: I first pre-train (hot-start) the classical part and then train the hybrid model in section 4.3]

Hi @NikSchet,

Really cool script, thank you for providing it. We will analyze it for potential memory problems. One thing you can do is upgrade pennylane and pennylane-lightning to their newest versions; this might help with the 19-qubit problem. Also, if you have access to Linux, pennylane-lightning should be even faster there than on Windows because it will have OpenMP support. Changing the batch size may help as well, as you suggest.

Regarding the difference between “backprop” and “adjoint”: these are two different methods of computing the gradient. PennyLane-Lightning has a heavily optimized adjoint method, so we prefer its use for now. You can also try diff_method="best", which will default to “adjoint” for Lightning. More detailed explanations of backprop and adjoint are here: and here: