QAOA optimization with default_qubit_tf device


I’m trying to do a QAOA optimization using the default_qubit_tf device. I have the following QAOA implementation

dev = qml.device('', wires = len(G.nodes), shots = 5000)
@qml.qnode(dev, interface = "tf", diff_method = "backprop")
def circuit(params, **kwargs):
    params = params.numpy()
    for i in range(len(G.nodes)):
        qml.Hadamard(wires = i)
    for j in range(p):
        qaoa.mixer_layer(params[1][j], mixer_h)
    return qml.probs(wires = list(range(len(G.nodes))))

where U_C implements the cost Hamiltonian unitary operations and qaoa.mixer_layer the mixer Hamiltonian ones. The parameters params are initially implemented as

params = tf.Variable(0.01*np.random.rand(2, p))

so that’s why I redefine them inside the circuit, so I can provide them to the gates as numpy variables (I don’t know if this is appropriate). I provide the cost function as

def cost_function(params):
    result = circuit(params).numpy()[0]

    counts = {}
    for i in range(len(result)):
        counts[f"{i:010b}"] = result[i]

    E = 0        
    for bitstring in counts.keys():
        x = string_to_list(bitstring)
        energy = MaxCut_cost(G, x)
        E += -energy*counts[bitstring]
    return tf.convert_to_tensor(np.array([E]))

which does the following

  1. Takes the probability outcomes coming from the circuit, and turns them into numpy variables.
  2. Defines a dictionary that saves the probabilities obtained in the circuit.
  3. For each bitstring obtained, computes the energy with MaxCut_costwhich defines the MaxCut cost funtion for the graph G I’ve defined.
  4. Finally, returns the energy as a tensor variable, exactly equal to the one I’m introducing through the parameters.

First of all, I’m not very familiar with tensorflow so that’s why (probably) my implementation is terrible. Secondly, I tried to do the optimization in a similar fashion to what I’m used to do with the default_qubit device, i.e., I define an optimizer opt = Optimizer(lr = 0.1) and then calculate the new parameters as params = opt.step(cost_function, params), but it gives me the following error

TypeError: Can't differentiate w.r.t. type <class 'tensorflow.python.ops.resource_variable_ops.ResourceVariable'>

I would really appreciate your help. The reason why I’m trying to use this tensorflow approach is because I want to optimize for large values of p, i.e., the number of repetitions defined in QAOA, and in the documentation it’s specified that this device is the appropriate one when I want to work with a big number of parameters.

If you have further questions about my implementation, I’ll be happy to specify them in more detail. Thanks in advance!



Hi @Javier,

Thanks for reaching out! :slight_smile:

That seems odd indeed. Which optimizer did you go for using, could you further include the parts of the code where the optimization is done and the optimizer is being used? (Also a notebook/complete snippet close to the original could be helpful for reproducing the error.)

Thank you!


Hi @Javier!

I would really appreciate your help. The reason why I’m trying to use this tensorflow approach is because I want to optimize for large values of p , i.e., the number of repetitions defined in QAOA, and in the documentation it’s specified that this device is the appropriate one when I want to work with a big number of parameters.

The reason we suggest for a large number of parameters is (as you show in your code snippet), since it is written in TensorFlow, it allows us to use backprop for differentiation (diff_method="backprop").

While backprop mode only works on specific simulators, it is much more efficient for training a large number of parameters compared to other simulator methods (such as the parameter-shift rule—for more details, see Quantum Gradients with Backpropagation).

In fact, we now have two devices that support backpropagation!

  •, which supports diff_method="backprop" when used with the tf interface, and

  • default.qubit.autograd, which supports diff_method="backprop" when used with the default autograd interface.

So if you would like, you can swap to default.qubit.autograd, which uses the Autograd-enabled version of NumPy that comes with PennyLane.

Secondly, I tried to do the optimization in a similar fashion to what I’m used to do with the default_qubit device, i.e., I define an optimizer opt = Optimizer(lr = 0.1) and then calculate the new parameters as params = opt.step(cost_function, params)

Note that the optimizers you must use depend entirely on your interface.

  • If using the autograd interface (the default one in PennyLane), you can use the provided PennyLane optimizers.

  • However, if you are using the tf interface, then you will need to leverage one of the TensorFlow optimizers (found in tf.keras.optimizers ). Only the TensorFlow optimizers know how to optimize TensorFlow code.

We have a small guide on our TensorFlow interface page, showing how to optimize in TensorFlow:

phi = tf.Variable([0.5, 0.1], dtype=tf.float64)
theta = tf.Variable(0.2, dtype=tf.float64)

opt = tf.keras.optimizers.SGD(learning_rate=0.1)
steps = 200

for i in range(steps):
    with tf.GradientTape() as tape:
        loss = tf.abs(circuit4(phi, theta) - 0.5)**2

    gradients = tape.gradient(loss, [phi, theta])
    opt.apply_gradients(zip(gradients, [phi, theta]))

Hope that helps! Let us know if you have any other questions.

thank you very much for your response @antalszava and @josh. I tried to change now to the default.qubit.autograd option to deal with numpy, and now I get the following error

TypeError: loop of ufunc does not support argument 0 of type ArrayBox which has no callable exp me

It appears when I perform the optimization step, using the adaptive gradient descent optimizer qml.AdagradOptimizer(lr). I provide you a notebook with all the details that you can find in my github (I couldn’t upload it here), if you want to take a look to it ( Otherwise, I don’t have any problem in specifying further details of the problem here.

Thank you very much for your time and efforts!

Hi @Josh,

I tried this implementation and seems to work pretty well whenever I add my “homemade” mixer Hamiltonian, i.e., if I define the mixer Hamiltonian by myself implementing the gates “by hand”. However, when I use the qml.qaoa option, the following error arises: TypeError: Cannot convert (-0-1j) to EagerTensor of dtype double. Do you know why? I don’t know if the qml.qaoaoption implements the gates in such a way that that tensorflowdoesn’t know how to interpret them.

Thank you very much!

Hi @Javier, could you post a minimal (non)-working code example that produces that error? That will help me track down this potential bug :slightly_smiling_face: Thanks!

Hi @Josh,

sure, I will write it here part by part (each item corresponds in my notebook to a separate cell):

  1. Packages that I’m introducing

         import pennylane as qml
         import cirq
         from pennylane_cirq import ops as cirq_ops
         from pennylane import qaoa
         import tensorflow as tf
         import autograd.numpy as np
         import networkx as nx
         import random
         import torch
  2. Graph definition

     def dregular_graph(n,d,mu,sigma):
         G= nx.generators.random_graphs.random_regular_graph(d,n)
         for e in list(G.edges):
             G[e[0]][e[1]]["weight"] = round(random.gauss(mu,sigma),2)
         return G
     G = dregular_graph(15,8,0.1, 1)
  3. Circuit definition

      def U_C(gamma):
          for e in list(G.edges):
             wire1 = int(e[0])
             wire2 = int(e[1])
             qml.CNOT(wires=[wire1, wire2])
             qml.RZ(G[wire1][wire2]["weight"]*gamma, wires=wire2)
             qml.CNOT(wires=[wire1, wire2])
      cost_h, mixer_h = qaoa.maxcut(G)
      def qaoa_mixer(alpha):
          qaoa.mixer_layer(alpha, mixer_h)
      p = 4
      dev = qml.device('', wires = len(G.nodes))
      @qml.qnode(dev, interface = "tf", diff_method = "backprop")
      def circuit(gamma, beta, **kwargs):
          for i in range(len(G.nodes)):
              qml.Hadamard(wires = i)
          for j in range(p):
          return qml.probs(wires = list(range(len(G.nodes))))
  4. Minimization

      gamma = tf.Variable([2*random.random() for _ in range(p)], dtype=tf.float64)
      beta = tf.Variable([2*random.random() for _ in range(p)], dtype=tf.float64)
      opt = tf.keras.optimizers.Adagrad(learning_rate=0.1)
      for _ in range(50):
          with tf.GradientTape() as tape:
              cost = sum(cost_function(gamma, beta))
          gradients = tape.gradient(cost, [gamma, beta])
          opt.apply_gradients(zip(gradients, [gamma, beta]))

On the other hand, if I forget about the mixer_h defined by the qaoa function and write it myself introducing the rotations explicitly, then it works perfectly.

By the way, when running the code double-check that the identation is correct because I had to modify it a little bit here so that I could write it as a preformatted text.

Thank you very much!

Hi @Javier! :slight_smile:

Thanks so much for your code! It helped spot a small issue when using default.qubit.autograd. We’ve had a go with patching it. Until this change makes it into the master version of PennyLane, you could checkout the branch with the update.

Another small modification that was needed in your snippet was related to how the gradient function is defined for the optimization. The gradient function had to be passed explicitly as grad_fn=qml.jacobian(cost_function) making the line for optimization:

params = opt.step(cost_function, params, grad_fn=qml.jacobian(cost_function))

Hope this helps! Otherwise, let us know if there are further things that we could help out with :slight_smile:


Hi @Javier,

This is now fixed, if you install PennyLane from the master branch then it should work nicely. :slight_smile:


Thank you very much for your solutions and efforts!

I wanted to ask you another thing that might not be very related with what I was asking here because probably we can’t handle it with backpropagation. I want to define a non-analytic circuit with some shots that at the end of the day samples the PauliZ operation over all the qubits, so I get an array full of 1 and -1. This I know it can be done with the qml.sample measurement. With that, I want to compute the cost function as C(x) = 1/N \sum_x E(x), with E(x) the energy of configuration x (sampled from the circuit) and N the number of shots.

The question is then, can I define a circuit with some device that is able to optimize with this definition of the measurement and the cost function? I know that qml.sample cannot be used for analytic optimization because it’s stochastic, but is there any other form I can implement?

I was trying to implement this with the lightning.qubit (I want to go to a big number of qubits) and computing the cost function as np.sum(*,,x))) with x* the transpose of the matrix given by the circuit and adj the adjacency matrix defined by my graph, but during the opt.step with AdagradOptimizer I get an error saying that Python cannot differentiate w.r.t. type <class 'numpy.int64'>.

Once again, thank you very much for your efforts and your answers!

Hi @Javier,

I suspect that the error you’re getting might be from returning qml.sample(...) from your QNode. Would you be able to share a minimal code example of this, so that I can have a look at what might be causing this error?

The question is then, can I define a circuit with some device that is able to optimize with this definition of the measurement and the cost function?

As you mentioned, it doesn’t really make sense to use qml.sample with analytic optimizations, hence it doesn’t support gradient calculations. You could try using a gradient-free optimizer, which would be able to run; just keep in mind that the sample return is stochastic, and it might thus not make much sense.

Feel free to share parts of your code here and we could have a closer look at what you’re doing and what’s not currently working!

Hi @Javier,

One side note about lightning.qubit is that it will default to using numpy when used for more than 16 qubits (and hence its performance will be identical to default.qubit in such cases).

Hi @theodor and @antalszava, thank you very much for your response.

With respect to @antalszava comment, I wanted to go to sizes between 16 and 20 qubits. In fact with the tensorflow option I can run in a reasonable amount of time 16 qubits, but for bigger systems, as the Hilbert space scales exponentially, I thought that one of the main bottlenecks was the computation of the cost function, since I was outputting qml.probs, calculate the energy for each possible bitstring, multiply both tensors and sum. Thus, I thought that approximating the energy as I specified in my previous message will enhance the operations. Is there any difference in using with parameter shift instead of default.qubit for these system sizes?

With respect to @theodor comment, here I attach you the code I was using. I’ll split it in several steps as before

  1. The graph definition is the same as the one shown in the code I sent in one of my previous messages to Josh.

  2. Circuit definition

    G = dregular_graph(4,2,0.1, 1)
    p = 1
    dev = qml.device('lightning.qubit', analytic = False, shots = 100, 
    wires = len(G.nodes))
    def circuit(params, **kwargs):
        for i in range(len(G.nodes)):
            qml.Hadamard(wires = i)
        for j in range(p):
        return [qml.sample(qml.PauliZ(k)) for k in G.nodes]
  3. Adjacency matrix and cost function definition

     def adjacency_matrix_np(G):
         adj = np.zeros((len(G.nodes),len(G.nodes)))
         for edge in G.edges:
             i = edge[0]
             j = edge[1]
             adj[i,j] = G[i][j]["weight"]
             adj[j,i] = G[j][i]["weight"]
         return adj
     adj = adjacency_matrix_np(G)
     def cost_function_light(params):
         result = circuit(params)
         prod =, result)
         final_prod =, prod)
         return np.sum(final_prod)/100
  4. Optimization

     opt = qml.AdagradOptimizer(0.1)
     steps = 20
     params = 0.01*np.random.rand(2, p)
     for _ in range(steps):
         params = opt.step(cost_function_light, params)

With the packages I sent in the previous message, in principle should reproduce the error I’m getting.

Thank you very much for your efforts and answers!

There is a mistake in the definition of the cost function (cost_function_light in the code). The second operation, i.e. the one defined by final prod should be defined as result*prod. As you mentioned, with gradient free optimizers like qml.RotosolveOptimizer I don’t get any error since no gradients are being calculated. However, the problem doesn’t get optimized but this is something I’ll try to figure out.

Sorry for the mistake and thank you very much for your time and efforts!

HI @Javier. The problem is that your cost-function seems to be highly stochastic, i.e. it outputs completely different values even when using the same input, due to the use of qml.sample. Even though the gradient cannot be calculated when returning samples, it would most likely not help with the optimization. If I may ask; why do you want to return the samples and optimize over them? It might make more sense to use the expected value in this case. :slight_smile:

Hi @theodor, thank you very much for your response!

I wanted to use this method of obtaining samples because I wanted to approximate the energy as C = 1/N \sum_{i=0}^N E(x_i) with N the number of samples. The reason for doing this is because I want to go to a big number of qubits (between 20 and 18). Initially I was doing these operations exactly with qml.probs, calculating the cost function as the product between the probability vector and another vector containing the energies of each possible sample (a 2^N size vector). For 16 qubits this works relatively well, i.e., in a reasonable amount of time (using tensorflow and backprop), but for 18 qubits already takes a a lot of time to optimize a QAOA with p = 1 (more than 4 hours), and apart form the circuit computation, one of the main bottlenecks I could see is the energy computation I was doing. Thus, with the previous approximation I expected to do these computations quicker.

Thank you very much!

Hey Javier,

While, as @theodor said, the gradient of a stochastic output is somewhat ill-defined, there is still a way to do it in practice. The idea is to return the expectation of the PauliZ operator expval(PauliZ(wires=...), but setting dev.shots=1. This, in turn, is theoretically well defined, because it is the gradient of an expectation value, estimated using a single measurement. In other words, the output of the quantum function is now deterministic, but its evaluation is a stochastic estimate.

Just stating this here for completeness, I don’t know how feasible it is in terms of runtime, where I assume this will use a lot more resources than returning an expectation (especially on a simulator, where you have access to the exact expectation, and sampling is less natural). We are planning to make the shots API a bit more flexible so that your use case may be supported better in future.

1 Like

Hi @Maria_Schuld

I wanted to implement something similar to what Javier has done here, and tried to use the same approach as you mentioned. I am curious as to what is actually measured in the backend when using the dev.shots = 1 approach? Is it that the parameter shift rule (or something similar) is applied to all the pauli_Z expectation values over all the available qubits? I.e if one had a 2 qubit system, one would perform parameter-shift on <z_1> and <z_2> to calculate the gradient with respect to some parameters gamma?

Also, has the shots API been edited to allow for something similar to what javier was requesting?

Hi @Viro,

If you’re working on hardware then you will be using the parameter-shift rule. If you’re working on a simulator you will need to specify the differentiation method when you define your qnode. If you don’t define the differentiation method then PennyLane will pick the default method for that particular simulator. This is how you can specify that you want to use the parameter-shift rule to get the gradient.

@qml.qnode(dev, diff_method="parameter-shift")

The parameter-shift rule is then applied to the entire circuit. If you want to measure the expectation value for more than one qubit you can use the tensor product of the PauliZ on different wires, and get the gradient with respect to your parameters. Here’s an example:

dev = qml.device("default.qubit", wires = 2, shots=1)

@qml.qnode(dev, diff_method="parameter-shift")
def my_circuit(data,params):
    # Encode the input data as an RX rotation
    qml.RX(data[0], wires=0)
    qml.RX(data[1], wires=1)
    # Create a rotation based on the angles in "params"
    qml.Rot(params[0], params[1], params[2], wires=0)
    qml.Rot(params[3], params[4], params[5], wires=1)
    # We return the expected value of a measurement along the Z axis 
    return qml.expval(qml.PauliZ(0)@qml.PauliZ(1))

data = np.array([0.1,0.1],requires_grad=False)
params = np.array([0.1,0.1,0.1,0.1,0.1,0.1],requires_grad=True)

grad_function = qml.grad(my_circuit)

Since you only use 1 shot then your answer will change every time you run it.

About the shots API, what functionality are you looking for?

Please let me know if this helps!