Slow Training and Low CPU Usage When Training Deep Circuit

Hello! I’m trying to train a 4-qubit special unitary circuit (SU(16)). The structure of the circuit is from this article. The training process is very slow, and the CPU usage is low (about 10% to 20%) during training. The problem becomes more severe when I use the circuit as a component of a larger circuit: in a 6-qubit or 7-qubit circuit containing about 10 to 15 SU(16) blocks, the CPU usage cannot even reach 10%, and the training is extremely slow. Here is my code:

import cmath
import math
import matplotlib
import matplotlib.pyplot as plt
import pennylane as qml
from pennylane import numpy as np
matplotlib.use('TkAgg')

dev=qml.device('lightning.qubit', wires=4)
# dev=qml.devices.DefaultQubit(wires=4)

def getstates(n, complex=False):
    # generate an arbitrary (normalized) quantum state on n qubits
    state_r=np.random.rand(2**n)
    while np.linalg.norm(state_r)<10**(-8):
        state_r=np.random.rand(2**n)
    if complex:
        state_i=np.random.rand(2**n)
        while np.linalg.norm(state_i)<10**(-8):
            state_i=np.random.rand(2**n)
        state=state_r+1j*state_i
    else:
        state=state_r
    state=state/np.linalg.norm(state)
    return state

def SU_N(n):
    """
    Generate an n-order special unitary matrix (SU(n)) as the target to learn.
    
    Args:
        n (int): The order of the matrix.
        
    Returns:
        np.ndarray: An n x n unitary matrix with determinant 1.
    """
    # Step 1: Generate a random complex matrix
    z = (np.random.randn(n, n) + 1j * np.random.randn(n, n)) / np.sqrt(2)
    
    # Step 2: Perform QR decomposition to get a unitary matrix
    q, r = np.linalg.qr(z)
    
    # Step 3: Multiply Q by the phases of R's diagonal (QR is only unique up to these phases; this makes Q Haar-uniform)
    d = np.diagonal(r)
    ph = d / np.abs(d)  # Normalize to unit modulus
    q = q @ np.diag(ph)
    
    # Step 4: Ensure determinant is 1
    det_q = np.linalg.det(q)
    m, p=cmath.polar(det_q)
    c=cmath.rect(math.pow(m, -1/n), -p/n)
    for i in range(n):
        q[i] = q[i] * c
    return q

def uniform_CR(params, c_wires, target, op):
    # the component of the SU(16) circuit from the paper, which is a multi-qubit controlled rotation.
    if type(c_wires) == int or type(c_wires) == np.int64:
        c_wires=[c_wires]
    n = 2 ** len(c_wires)
    if len(params) < n:
        raise ValueError(f'The number of parameters is not enough. Need {n} parameters, given {len(params)}.')
    counter = 0
    for i in range(n):
        table = bin(i)[2:].zfill(len(c_wires))
        match op:
            case 'X':
                qml.ctrl(qml.RX(params[counter], target), c_wires, [bool(int(x)) for x in table])
            case 'Y':
                qml.ctrl(qml.RY(params[counter], target), c_wires, [bool(int(x)) for x in table])
            case 'Z':
                qml.ctrl(qml.RZ(params[counter], target), c_wires, [bool(int(x)) for x in table])
            case _:
                raise ValueError('Operations should be X, Y, or Z.')
        counter += 1

def Fmk(params, m, k, wires, op):
    # build the uniformly controlled rotation F_{m,k} from the paper: the target is wire m (1-indexed) and the controls are the k wires preceding it
    m=m-1
    target = wires[m]
    c_wires = [wires[m-i] for i in range(1, k+1)]
    uniform_CR(params, c_wires, target, op)

def SU16(params, wires):
    # the SU(16) circuit, which is a 4-qubit circuit with 255 trainable parameters.
    if len(params)<255:
        raise ValueError('The number of parameters should be 255 to construct a SU(16).')
    qml.RZ(params[0], wires[0])
    #B8
    Fmk(params[1:3], 2, 1, wires, 'Z')
    Fmk(params[3:7], 3, 2, wires, 'Z')
    Fmk(params[7:15], 4, 3, wires, 'Z')
    Fmk(params[15:23], 4, 3, wires, 'Y')
    Fmk(params[247:255], 4, 3, wires, 'Z')
    #BA7
    Fmk(params[23:31], 1, 3, wires, 'Y')
    Fmk(params[31:39], 1, 3, wires, 'Z')
    Fmk(params[39:47], 4, 3, wires, 'Y')
    Fmk(params[47:55], 4, 3, wires, 'Z')
    #BA6
    Fmk(params[55:63], 2, 3, wires, 'Y')
    Fmk(params[63:71], 2, 3, wires, 'Z')
    Fmk(params[71:79], 4, 3, wires, 'Y')
    Fmk(params[79:87], 4, 3, wires, 'Z')
    #BA5
    Fmk(params[87:95], 1, 3, wires, 'Y')
    Fmk(params[95:103], 1, 3, wires, 'Z')
    Fmk(params[103:111], 4, 3, wires, 'Y')
    Fmk(params[111:119], 4, 3, wires, 'Z')
    #BA4
    Fmk(params[119:127], 3, 3, wires, 'Y')
    Fmk(params[127:135], 3, 3, wires, 'Z')
    Fmk(params[135:143], 4, 3, wires, 'Y')
    Fmk(params[143:151], 4, 3, wires, 'Z')
    #BA3
    Fmk(params[151:159], 1, 3, wires, 'Y')
    Fmk(params[159:167], 1, 3, wires, 'Z')
    Fmk(params[167:175], 4, 3, wires, 'Y')
    Fmk(params[175:183], 4, 3, wires, 'Z')
    #BA2
    Fmk(params[183:191], 2, 3, wires, 'Y')
    Fmk(params[191:199], 2, 3, wires, 'Z')
    Fmk(params[199:207], 4, 3, wires, 'Y')
    Fmk(params[207:215], 4, 3, wires, 'Z')
    #BA1
    Fmk(params[215:223], 1, 3, wires, 'Y')
    Fmk(params[223:231], 1, 3, wires, 'Z')
    Fmk(params[231:239], 4, 3, wires, 'Y')
    Fmk(params[239:247], 4, 3, wires, 'Z')

@qml.qnode(dev)
def circuit(params, ini_state, tar_state):
    # the quantum circuit to train; it returns the fidelity between the circuit's output state and the target state (the cost is 1 - fidelity)
    qml.StatePrep(ini_state, range(4))
    SU16(params, range(4))
    return qml.expval(qml.Hermitian(qml.math.dm_from_state_vector(tar_state), range(4)))

def cost(params, ini_state_batch, tar_state_batch):
    c = 0
    for ini_state, tar_state in zip(ini_state_batch, tar_state_batch):
        c += 1 - circuit(params, ini_state, tar_state)
    return c / len(ini_state_batch)

if __name__ == '__main__':
    batch_size = 16
    target = SU_N(16)
    ini_state_batch = []
    tar_state_batch = []
    for i in range(batch_size):
        ini_state = getstates(4, True)
        tar_state = target @ ini_state
        ini_state_batch.append(ini_state.tolist()) # tolist() makes PennyLane treat the states as non-trainable; otherwise I often get warnings that the parameters of qml.Hermitian cannot be trained, even though I set requires_grad to False.
        # ini_state_batch.requires_grad=False
        tar_state_batch.append(tar_state.tolist())
    params=np.random.uniform(0, 2*np.pi, 255)
    opt=qml.AdamOptimizer(0.04)
    converge_count=0 # This variable is used to check whether the cost converges
    i=1
    cur_cost=cost(params, ini_state_batch, tar_state_batch)
    print(f'Initial fidelity is {1-cur_cost}')
    loop=[0]
    F=[1-cur_cost]
    prev_cost=1
    while converge_count<5 and i<400:
        if abs(cur_cost-prev_cost)<0.003 and cur_cost<0.1:
            converge_count+=1
        else:
            converge_count=0
        prev_cost=cur_cost
        ini_state_batch=[]
        tar_state_batch=[]
        for j in range(batch_size):
            ini_state = getstates(4, True)
            ini_state_batch.append(ini_state.tolist())
            tar_state_batch.append((target @ ini_state).tolist())
        params, cur_cost=opt.step_and_cost(lambda p: cost(p, ini_state_batch, tar_state_batch), params)
        loop.append(i)
        F.append(1-cur_cost)
        i += 1
        if not (i % 10):
            print(f'Epoch {i}, current fidelity is {1-cur_cost}')

    print(f'Total round number is {i}, the final fidelity is {1-cur_cost}')
    fig, ax=plt.subplots()
    ax.plot(loop, F)
    ax.set_xlabel('Training epoch')
    ax.set_ylabel('1-Fidelity')
    ax.set_title('Training of SU(16) Block')
    plt.show()
    plt.clf()

The methods I have tried include the following:

  • Using default.qubit, default.mixed, qml.devices.DefaultQubit and lightning.qubit as the device. lightning.qubit is a bit faster than the others, but it is still slow with low CPU usage.
  • Using JAX, parallelizing over the batch with jax.vmap in the cost function, and decorating the qnode with @jax.jit. This improved neither the training time nor the CPU usage.
  • Using joblib to parallelize over the batch, which constantly raises errors.
  • Using lightning.kokkos as the device. It significantly increased the CPU usage to about 75%, but the training time was not reduced much.

I read the discussions in Topic 1304, where a PennyLane developer said that parameter-shift can be slow in some cases and that backprop differentiation may be much faster. I also noticed that the lightning.qubit device supports the adjoint differentiation method. However, I could not find out how to choose which differentiation method to use.

The CPU I’m using is an AMD Ryzen 7 5800H (8 cores / 16 threads). Here is my qml.about() output (the lightning.kokkos device was run in a WSL environment, so the output below does not include it):

Name: PennyLane
Version: 0.41.1
Summary: PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
Home-page: https://github.com/PennyLaneAI/pennylane
Author:
Author-email:
License: Apache License 2.0
Location: C:\Users\Timmy\AppData\Local\Programs\Python\Python312\Lib\site-packages
Requires: appdirs, autograd, autoray, cachetools, diastatic-malt, networkx, numpy, packaging, pennylane-lightning, requests, rustworkx, scipy, tomlkit, typing-extensions
Required-by: PennyLane_Lightning

Platform info:           Windows-11-10.0.26200-SP0
Python version:          3.12.7
Numpy version:           1.26.4
Scipy version:           1.12.0
Installed devices:
- default.clifford (PennyLane-0.41.1)
- default.gaussian (PennyLane-0.41.1)
- default.mixed (PennyLane-0.41.1)
- default.qubit (PennyLane-0.41.1)
- default.qutrit (PennyLane-0.41.1)
- default.qutrit.mixed (PennyLane-0.41.1)
- default.tensor (PennyLane-0.41.1)
- null.qubit (PennyLane-0.41.1)
- reference.qubit (PennyLane-0.41.1)
- lightning.qubit (PennyLane_Lightning-0.41.1)

Hi @Tz_19 ,

It looks like you recoded a version of Mottonen State Preparation. We have a state preparation template that does this in PennyLane (see the docs for qml.MottonenStatePreparation).

That being said, this template is usually only used as a last resort; it’s generally less efficient than the other state preparation methods available in PennyLane. You can usually obtain better results by using qml.StatePrep.
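
For example, here’s a minimal sketch of both options (the uniform state below is just a placeholder for your own state vector):

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=4)
state = np.ones(16) / 4  # placeholder: any normalized 16-dimensional state vector

@qml.qnode(dev)
def with_stateprep(state):
    qml.StatePrep(state, wires=range(4))  # load the state vector directly
    return qml.state()

@qml.qnode(dev)
def with_mottonen(state):
    qml.MottonenStatePreparation(state, wires=range(4))  # decompose into rotations and CNOTs
    return qml.state()

print(with_stateprep(state))
print(with_mottonen(state))  # both prepare the same state (possibly up to a global phase)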

My guess is that the low CPU usage you’re seeing is due to inefficiencies inherent in the code, but that’s just a guess.

In terms of devices, a good rule of thumb is that default.qubit is better for up to 10-15 qubits, and lightning.qubit is better in the 10-20 qubit range.

Regarding the differentiation method, as long as you don’t manually set it, the device will automatically choose the best method to use. If you want to set it manually, you can do so when you define the qnode, e.g.

@qml.qnode(dev, diff_method="backprop")
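
For instance, here’s a minimal toy sketch (not your SU(16) block) showing how to set adjoint on lightning.qubit and backprop on default.qubit; which method is fastest depends on the circuit:

import pennylane as qml
from pennylane import numpy as np

dev_lightning = qml.device("lightning.qubit", wires=2)
dev_default = qml.device("default.qubit", wires=2)

@qml.qnode(dev_lightning, diff_method="adjoint")   # adjoint differentiation on lightning.qubit
def circuit_adjoint(params):
    qml.RY(params[0], wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

@qml.qnode(dev_default, diff_method="backprop")    # backpropagation on default.qubit
def circuit_backprop(params):
    qml.RY(params[0], wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))

params = np.array([0.5], requires_grad=True)
print(qml.grad(circuit_adjoint)(params))
print(qml.grad(circuit_backprop)(params))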

It might help to use a profiler such as SnakeViz to see where the slowdown is happening.
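
For example, a rough sketch of profiling a single cost evaluation with the standard-library cProfile (the file name train.prof is arbitrary) and then inspecting the result, either with pstats or by running snakeviz train.prof from a terminal:

import cProfile
import pstats

# Profile one evaluation of your cost function and dump the statistics to a file.
cProfile.run("cost(params, ini_state_batch, tar_state_batch)", "train.prof")

# Print the 20 most expensive calls by cumulative time.
pstats.Stats("train.prof").sort_stats("cumulative").print_stats(20)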

Let me know if this helps or if you have any follow up questions!

Hi @CatalinaAlbornoz, thank you for your reply! I see that PennyLane offers several state preparation methods, including qml.StatePrep, but according to the documentation they are not always differentiable. I need a fully differentiable PQC that can perform an arbitrary SU(16), because I want to use it as a component of a larger circuit with more qubits. Does PennyLane offer templates like this, or should I implement it manually? In any case, I will first try a profiler to find where the slowdown is.

Hello @CatalinaAlbornoz, here is the profile of a run of my code (due to the file-format restriction, I changed the file extension from .prof to .txt).
train3.txt (1.1 MB)
According to the profile, most of the time was spent in controlled.py and capture_meta.py. It seems that creating the controlled operators takes a lot of time when my code runs, but I cannot figure out what exactly is happening inside that process. I also tried replacing my circuit with a simple qml.Rot, i.e. replacing the circuit() function with the following:

@qml.qnode(dev)
def circuit(params, ini_state, tar_state):
    # the quantum circuit to train; it returns the fidelity between the circuit's output state and the target state (the cost is 1 - fidelity)
    qml.StatePrep(ini_state, range(1))
    qml.Rot(params[0], params[1], params[2], wires=0)
    return qml.expval(qml.Hermitian(qml.math.dm_from_state_vector(tar_state), range(1)))

As expected, training this circuit is much faster than training the original circuit, but the CPU usage is still very low.

As I mentioned in the previous post, I want to implement a fully differentiable PQC that performs an arbitrary SU(16), like the two_qubit_decomp() in this documentation. Is there an efficient way to implement that in PennyLane?

Hi @Tz_19 , thanks for sharing this info!

I’m glad that things are much faster now.

We do have some tools for two-qubit decomposition. We have qml.ops.two_qubit_decomposition, which allows you to decompose a two-qubit unitary in terms of elementary operations.
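
For example, a minimal sketch of using it (the random SciPy unitary is just a stand-in for your own 4x4 matrix):

import numpy as np
import pennylane as qml
from scipy.stats import unitary_group

U = unitary_group.rvs(4)  # placeholder: a random two-qubit (4x4) unitary

# Decompose U into elementary single-qubit rotations and CNOTs acting on wires 0 and 1.
ops = qml.ops.two_qubit_decomposition(np.array(U), wires=[0, 1])
for op in ops:
    print(op)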

We also have a page on Two-qubit Synthesis in our Quantum Compilation hub. Feel free to check them out to see if you find something else that can help you. Also, please let us know in case you were looking for something else.

Let us know if this helps!

Er… things have not actually gotten faster yet; my previous post may have been misleading. My original SU(16) circuit is still slow, and I observe low CPU usage even when I train a simple circuit containing just a single qml.Rot operation, so any inefficiencies inherent in my code seem to be at most part of the reason for the slow training. The ‘faster’ in my previous post meant that the simple single-qubit rotation circuit trains much faster than my original SU(16) circuit. However, the single-qubit rotation circuit is not what I need.

I noticed that PennyLane offers methods to decompose two-qubit unitaries, but I need to decompose 3-qubit or 4-qubit unitaries. Does PennyLane offer any solutions for this?

Hi @Tz_19 ,

I have two ideas for what might help you.

On one hand you could try using qml.SpecialUnitary. We have a demo on SU(N) which uses this functionality so I would recommend taking a look at the demo and the docs for qml.SpecialUnitary to see if this is what you need. Please make note of the warnings and notes in the docs since this method has some specific limitations.
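
For example, a minimal sketch of what your SU(16) block could look like with qml.SpecialUnitary; the 255-entry parameter vector (4**4 - 1) matches the parameter count of your current circuit. How best to differentiate it, and on which device/interface, is exactly where the warnings in the docs apply:

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=4)

@qml.qnode(dev)
def su16_circuit(theta, ini_state, tar_state):
    qml.StatePrep(ini_state, wires=range(4))
    # One fully parametrized SU(16); theta needs 4**4 - 1 = 255 entries,
    # the same parameter count as the hand-built SU16 block above.
    qml.SpecialUnitary(theta, wires=range(4))
    return qml.expval(qml.Hermitian(qml.math.dm_from_state_vector(tar_state), wires=range(4)))

theta = np.random.uniform(0, 0.1, 4**4 - 1)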

On the other hand maybe what you’re looking for is something like KAK or Cartan decompositions. These methods allow you to do unitary synthesis. We have a demo on KAK decompositions, one on Unitary synthesis with recursive KAK decompositions, and one on Cartan decompositions. We also have a preprint on Recursive Cartan decompositions for unitary synthesis in case you’re looking for more info on this.

I hope one of these two options is what you’re looking for. Let me know in case this isn’t what you need.

Bonus: If you’re interested in Lie algebras you can check out our qml.liealg module, which has some useful functions for this.

I hope this helps!