How does this hybrid layer learn

If I use the pennylane.qnn.keras layer and use the loss function in tensorflow for this layer:

import pennylane as qml
import tensorflow as tf

# `env` is presumably a Gym-style environment with discrete spaces, created elsewhere.
wires = 2
n_quantum_layers = 1

dev = qml.device("strawberryfields.fock", wires=wires, cutoff_dim=15)

@qml.qnode(dev)
def layer(inputs, w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10):
    # Encode the classical inputs as displacements, then apply the CV neural network layer.
    qml.templates.DisplacementEmbedding(inputs, wires=range(wires))
    qml.templates.CVNeuralNetLayers(w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, wires=range(wires))
    return [qml.expval(qml.X(wires=i)) for i in range(wires)]

# The initial weights are only used here to read off the shape of each weight array.
weights = qml.init.cvqnn_layers_all(n_quantum_layers, wires, seed=None)
weight_shapes = {"w{}".format(i): w.shape for i, w in enumerate(weights)}
n_actions = env.action_space.n
input_dim = env.observation_space.n
qlayer = qml.qnn.KerasLayer(layer, weight_shapes, output_dim=wires)
clayer_in = tf.keras.layers.Dense(wires, input_dim=input_dim)
clayer_out = tf.keras.layers.Dense(n_actions, activation="linear")
model = tf.keras.models.Sequential([clayer_in, qlayer, clayer_out])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

How are the weights actually being trained using a classical tool like the loss function for classical input? Is there more to this training of the parameters? More embedding or something?

For fun here is a visual presentation of the entire layer (input is a 9-dim vector and output is a 4-dim vector):

Hi @Shawn,

After creating the qml.qnn.KerasLayer, which simply wraps a QNode into a layer that’s compatible with TensorFlow and Keras, TensorFlow is able to optimize the network classically, as if the layer were a classical one such as a tf.keras.layers.Dense layer. The gradient for the quantum part of the network is supplied by the QNode and is calculated by different means depending on the device used (in the strawberryfields.fock case it’s calculated by finite differences), while all other gradients are calculated classically by TensorFlow.

So, in short, training the weights is handled by TensorFlow as if everything were classical, while the QNode handles all the “quantum stuff” and returns classical outputs and gradients to the optimizer.

Thanks @theodor! I’m a little confused – you say the parameters are calculated by TensorFlow but the gradient for the quantum part is supplied by the QNode… isn’t that contradictory? There should only be one way to update the parameters, right?

“returns classical outputs and gradients to the optimizer.”

Optimizer as in the optimizer I provide for the tf layer?

Hi @Shawn. So, what TensorFlow does is compute the output of the network layer by layer, feeding each layer’s output forward into the next layer. For the qlayer it gets these values from the QNode (which is responsible for calculating them in that specific context).

Similarly for the gradients, each layer has a gradient that TensorFlow either calculates itself (e.g. by using automatic differentiation) or, in the case of the quantum circuit, simply collects from the QNode, which again does all the heavy lifting.

What happens inside the QNode (the “quantum stuff”) isn’t known by TensorFlow, which only sees the output, i.e. the measured values which have been transformed into classical data (which is what happens when a measurement occurs). So, for TensorFlow to be able to optimize this network, it simply attempts to minimize the cost function using the classical data it gets from the QNode (supplied via the qml.qnn.KerasLayer), without caring about how it was calculated, along with any other TensorFlow-specific parts of the network.

“Optimizer as in the optimizer I provide for the tf layer?”

Yes, that would be the optimizer=tf.keras.optimizers.Adam() that you’ve defined in the model.compile() call.

I hope this makes sense. :slight_smile:

Hi @theodor thank you for the insight. From the tf side that makes sense. But I am more curious about the actual quantum parameters and how they get trained. How do those get trained exactly?

You wrote early that tf trains those but now I’m under the impression that that’s not the case.

Hi @Shawn. TensorFlow trains those parameters/weights in the sense that it varies them according to the gradient (i.e. the gradient with respect to those specific weights), that is supplied by the QNode, in an attempt to minimize the specified cost function.

Since the input weights are classical and the output is classical, TensorFlow can attempt to find the best weights for minimizing this cost no matter what actually happens internally in the QNode.

A simple way to understand this would be to imagine an arbitrary function f(x) and trying to minimize it by hand (perhaps just by trial and error; the more inputs/weights x you try, the closer you would get to a minimum). You could still optimize and “train” the weights without ever knowing anything about what happens inside the function. If the function also gave you some hints as to where the minimum could be, though, you could use them to find the optimal weights (i.e. train the function) more quickly. This is what the QNode does by handing over the gradients, allowing faster and better optimization methods (e.g. the Adam optimization algorithm you’re using above).
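To make the analogy concrete, here is a minimal sketch in plain Python. The function `black_box` and both search routines are made up for illustration; they stand in for the QNode and the optimizer and are not PennyLane or TensorFlow API.

```python
import random

def black_box(x):
    """Opaque function: returns a value and a gradient 'hint'."""
    value = (x - 1.5) ** 2 + 0.5
    grad = 2.0 * (x - 1.5)
    return value, grad

def trial_and_error(n_trials, seed=0):
    """Try random inputs and keep the best one; only the values are used."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(-5.0, 5.0)
        val, _ = black_box(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x

def gradient_descent(x0, lr=0.1, n_steps=50):
    """Follow the gradient hint downhill, never looking inside black_box."""
    x = x0
    for _ in range(n_steps):
        _, grad = black_box(x)
        x -= lr * grad
    return x

x_random = trial_and_error(50)
x_grad = gradient_descent(x0=-5.0)
```

Both routines use the same number of function calls and neither knows how `black_box` works internally. The gradient-based one simply uses the extra information it is handed to home in on the minimum at x = 1.5.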

Hi @theodor yea I get that but I’m more stuck on the fact that in the classical neural network, backpropagation is used to train the network. This is obviously not the case for the quantum neural network. I get what you are explaining but it’s a bit unsatisfactory because tf is indirectly updating the weights, not directly. And there seems to be something peculiar happening inside the qnode that is actually doing the dirty work to update the weights. Tf is just taking the results and optimizing that. That’s an added benefit but the Qnode is actually updating the weights. Or am I wrong here?

I’m referencing what you meant regarding the strawberryfields.fock statement. Is that what is updating the weights inside the qnode or how exactly do the weights inside the qnode get updated? It’s still not clear to me. Apologies if this is cumbersome.

Hi @Shawn. Backpropagation can still be used even if there are quantum layers in the network. It simply works by calculating the gradient for one layer at a time, starting with the last one. When reaching the qlayer the algorithm simply gets handed the gradient rather than doing the calculations itself.
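As a rough illustration of that hand-over, the sketch below backpropagates through a classical layer, an opaque middle layer, and another classical layer, using scalars for brevity. The names `dense`, `quantum_like` and `loss_and_grads` are hypothetical. Unlike the QNode in this thread, the stand-in has no trainable weights of its own; the real QNode would also return gradients with respect to w0 to w10.

```python
import math

def dense(x, w, b):
    """Classical layer: y = w * x + b."""
    return w * x + b

def quantum_like(theta):
    """Stand-in for the QNode: returns its output and its own gradient."""
    return math.cos(theta), -math.sin(theta)

def loss_and_grads(x, target, w_in, b_in, w_out, b_out):
    # Forward pass, layer by layer.
    theta = dense(x, w_in, b_in)
    q, dq_dtheta = quantum_like(theta)  # gradient is handed over, not derived here
    y = dense(q, w_out, b_out)
    loss = (y - target) ** 2

    # Backward pass, starting with the last layer.
    dloss_dy = 2.0 * (y - target)
    grads = {"w_out": dloss_dy * q, "b_out": dloss_dy}
    dloss_dq = dloss_dy * w_out
    dloss_dtheta = dloss_dq * dq_dtheta  # chain rule through the opaque layer
    grads["w_in"] = dloss_dtheta * x
    grads["b_in"] = dloss_dtheta
    return loss, grads

params = dict(w_in=0.7, b_in=-0.2, w_out=1.3, b_out=0.1)
loss, grads = loss_and_grads(0.5, 0.3, **params)
```

The backward pass only needs `dq_dtheta` as a number. How that number was obtained (finite differences, parameter shift, or anything else) does not matter to the surrounding layers.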

TensorFlow is updating the weights by supplying the QNode function with the new weights (which are in turn simply applied to the circuit which you have defined and either simulated, e.g. by using heavy matrix/tensor calculations, or run on hardware, where the weights are translated into physical hardware parameters).

Exactly how the weights are applied to the quantum circuit depends on the device and the quantum operations. For example, CVNeuralNetLayers would use the weights as phase, transmittivity and rotation angles for a set of interferometers, along with the squeezing, displacement and Kerr values that are needed (please read the docs for more details).

So what’s happening here is basically:

  1. The weights are updated (classically) and supplied to the QNode function

  2. The QNode uses these exact values as parameters for whatever operations are used in the circuit (e.g. CVNeuralNetLayers)

  3. The QNode returns an output (after applying the circuit operations) along with a calculated gradient (by e.g. using finite differences or the parameter shift rule)
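The three steps above can be sketched as a small loop. This is a toy under stated assumptions: `ToyQNode` is a hypothetical single-weight stand-in whose output is cos(weight * x), and the update is plain gradient descent rather than Adam.

```python
import math

class ToyQNode:
    """Hypothetical stand-in for a QNode with a single weight."""

    def __call__(self, x, weight):
        # Step 2: the weight is used directly as a gate parameter.
        angle = weight * x
        output = math.cos(angle)
        # Step 3: return the output together with d(output)/d(weight).
        grad = -math.sin(angle) * x
        return output, grad

def train(qnode, x, target, weight, lr=0.2, n_steps=200):
    for _ in range(n_steps):
        output, doutput_dw = qnode(x, weight)
        dloss_dw = 2.0 * (output - target) * doutput_dw  # squared-error loss
        # Step 1: the classical optimizer updates the weight.
        weight -= lr * dloss_dw
    return weight

qnode = ToyQNode()
trained = train(qnode, x=1.0, target=0.0, weight=0.3)
final_output, _ = qnode(1.0, trained)
```

The weight itself lives on the classical side the whole time. The stand-in only ever receives a value, uses it, and reports back an output and a gradient.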

There’s nothing more peculiar happening inside the QNodes. :slight_smile:

@theodor You’re the man! Thank you. One part that is still confusing me – how does the qnode calculate the gradient? And how via finite differences? I haven’t seen any docs or any info on this in the source code.

@Shawn I’m glad I could be of help. As I mentioned above, there are several different ways to calculate the gradient of a QNode with respect to its parameters; the method depends on the device used. Some devices support exact gradient calculations using automatic differentiation (simulators only) or the parameter shift rule, while other devices use numerical methods like finite differences, which always work but can be unstable and inexact. Both parameter shift and finite differences work by evaluating the QNode at shifted parameter values, typically two evaluations per parameter, and combining the results into the gradient (see references for details).
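A small sketch of the two approaches, assuming a toy one-parameter qubit circuit whose expectation value is cos(theta). The two-term shift formula used here is the standard one for qubit rotation gates; the continuous-variable gates in this thread follow a different shift formula, so treat this only as an illustration of the idea. All function names are hypothetical.

```python
import math

def expval(theta):
    """Toy circuit: expectation value <Z> after an RX(theta) rotation."""
    return math.cos(theta)

def finite_difference(f, theta, h=1e-3):
    """Forward difference: approximate, accuracy depends on the step h."""
    return (f(theta + h) - f(theta)) / h

def parameter_shift(f, theta, shift=math.pi / 2):
    """Two-term shift rule: exact for this kind of gate."""
    return (f(theta + shift) - f(theta - shift)) / 2.0

theta = 0.4
exact = -math.sin(theta)
fd = finite_difference(expval, theta)
ps = parameter_shift(expval, theta)
```

Each method calls `expval` twice for the single parameter. The shift rule reproduces the analytic derivative up to floating-point rounding, while the finite-difference estimate carries an error of order `h`.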

If you want to know exactly how this is done by the QNode decorator, as well as which methods the devices use/support, you could have a look in the pennylane/qnodes/decorator.py file here, as well as in the specific devices device.capabilities methods.

Thanks @theodor. That makes sense now. I just read about the sampling of a qnode to get the derivative. I checked how many times my qlayer is sampled and it shows 500 times per call of the qlayer. Is the purpose of the 500 times just to create the finite-difference gradient? Why is it called so many times?

And is that what we expect to do on hardware or is this just a simulation procedure to calculate the gradient?

Edit: Why does the sf.fock device only support finite differences? From this post my use case could support analytic differentiation, i.e. the parameter-shift rule, since the Kerr gate is at the end of the layer.

Hi @Shawn. That seems a bit strange. It might be because of the finite differences calculation, although 500 times seems a bit excessive. The hardware would also use finite differences unless it fully supports the parameter shift rule (which is seldom the case).

The strawberryfields.fock device could actually support using the parameter shift rule if all potential non-Gaussian gates in the circuit precede the differentiated gate. In your case, the Kerr gate is at the end of the circuit, i.e. after the gates being differentiated, so the parameter shift rule is not supported.

@theodor yea, I also find it strange, especially since my program would run much faster if the layer wasn’t called so many times. I also asked this question in Slack and am waiting for a response. I’m not even sure why sampling is needed in the finite-difference case when just an infinitesimally small step of the parameters would work.

Do you know if sampling would be needed with hardware or is this just a simulation need?

How did you get the number of samples/calls? Would you mind sharing where you got that number from? Regarding the hardware, as far as I know it should work in pretty much the same way: by sampling and then calculating the gradient via finite differences. Since the hardware generally needs more samples to get a good estimate of the expectation value (since it’s stochastic), there might be even more calls than for the simulator. Exactly how many, I cannot say.
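To illustrate why a stochastic device needs repeated runs, here is a minimal sketch that estimates an expectation value from simulated measurement outcomes. It assumes a toy two-outcome measurement (+1 or -1) with true mean cos(theta), which is simpler than the continuous quadrature measurement in this thread; `sample_expval` is a made-up helper.

```python
import math
import random

def sample_expval(theta, shots, rng):
    """Estimate <Z> = cos(theta) from `shots` simulated +1/-1 outcomes."""
    p_plus = (1.0 + math.cos(theta)) / 2.0  # probability of measuring +1
    total = 0
    for _ in range(shots):
        total += 1 if rng.random() < p_plus else -1
    return total / shots

rng = random.Random(42)
theta = 0.8
true_value = math.cos(theta)
few = sample_expval(theta, shots=10, rng=rng)
many = sample_expval(theta, shots=20000, rng=rng)
```

The statistical error of such an estimate shrinks only as one over the square root of the number of shots. A finite-difference gradient divides the difference of two noisy estimates by a small step, so the noise is amplified, which is one reason a very small step does not remove the need for many samples.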

Hi @theodor. So would the quantum speed-up of a photonic quantum computer then be misconstrued here, since the quantum gates first need to produce the gradient before sending information to the next layer? Even if the gates take 0.1 ns, if they need to be run many times, the speed-up would be much lower, especially if my simulation is running 500 times and you say hardware could need even more samples.

What I did to find this out:

@qml.qnode(dev)
def layer(inputs, w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10):
    global g
    g+=1
    print(g)
    qml.templates.DisplacementEmbedding(inputs, wires=range(wires))
    qml.templates.CVNeuralNetLayers(w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, wires=range(wires))
    return [qml.expval(qml.X(wires=i)) for i in range(wires)]

In my use case there is a while loop that takes one step per iteration until a terminal step ends the loop. Each step makes a single call to the neural network, and g = 0 is set outside the while loop.

Hi @Shawn. Even if the quantum hardware would need to be sampled many times there could still be a quantum speed-up, since it doesn’t necessarily come from just performing operations faster; rather it performs the computations in a different way that, in a sense, parallelizes the calculations. Though, at the moment, we’re not sure how great this speed-up could be, and it’ll probably take some time before we get to the point when artificial neural networks, such as the one in your example, will be able to benefit from this.

@theodor of course there should be a speed up…I just wrote it would be much lower due to the sampling. Do you know what is up with the 500 runs of the layer?

Why there are 500 calls I cannot really say. It depends on the number of parameters you’re using, as well as your batch size. It might also be something that Keras is doing behind the scenes, since it’s handling the optimizations, although I’m not sure exactly why that many calls are being made.
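A back-of-the-envelope count can show how the number of parameters and the batch size multiply. The sketch below rests on several assumptions: the weight shapes are taken to be those documented for CVNeuralNetLayers (four arrays with M(M-1)/2 entries per layer and seven arrays with M entries per layer, for M wires), the two inputs are assumed to be differentiated as well, and one circuit run per shifted parameter is assumed. Both helper functions are hypothetical, and the result is not claimed to reproduce the figure of 500.

```python
def count_cvqnn_weights(n_wires, n_layers):
    """Total entries in the 11 weight arrays, under the assumed shapes."""
    k = n_wires * (n_wires - 1) // 2  # interferometer parameters per layer
    return n_layers * (4 * k + 7 * n_wires)

def evaluations_per_gradient(n_params, batch_size, central=False):
    """Circuit runs for one finite-difference gradient over a batch."""
    per_sample = 2 * n_params if central else n_params + 1
    return per_sample * batch_size

n_weights = count_cvqnn_weights(n_wires=2, n_layers=1)
n_params = n_weights + 2  # plus the two inputs coming from the Dense layer
forward_runs = evaluations_per_gradient(n_params, batch_size=1)
batch_runs = evaluations_per_gradient(n_params, batch_size=32)
```

Under these assumptions a single input already costs a couple of dozen circuit runs per gradient, and a hypothetical batch of 32 multiplies that accordingly, so call counts in the hundreds are plausible even for a two-wire, one-layer circuit.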