Adjoint and backpropagation differentiation methods give different results

Running the following hybrid model with diff_method='adjoint' and diff_method='backprop' gives different results during training.

import numpy as np
from sklearn.datasets import make_moons
import torch
import matplotlib.pyplot as plt
import torch.nn as nn
import pennylane as qml
import sys
from time import perf_counter

class Model(nn.Module):
    def __init__(self, dev, diff_method="backprop"):
        super().__init__()

        self.cnet_in = self.cnet()
        self.qcircuit = qml.qnode(dev, interface="torch", 
                                  diff_method=diff_method)(self.qnode)
        
        weight_shape = {"weights":(2,)}
        self.qlayer = qml.qnn.TorchLayer(self.qcircuit, weight_shape)
        self.cnet_out = self.cnet()

    def cnet(self):
        layers = [nn.Linear(2,256), nn.ReLU(True), nn.Linear(256,2), nn.Tanh()]
        return nn.Sequential(*layers)   

    def qnode(self, inputs, weights):
        # Data encoding:
        for x in range(len(inputs)):
            qml.Hadamard(wires=x)
            qml.RZ(2.0 * inputs[x], wires=x)
        # Trainable part:
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[0], wires=0)
        qml.RY(weights[1], wires=1)
        return [qml.expval(qml.PauliZ(wires=0)), qml.expval(qml.PauliZ(wires=1))]

    def forward(self, x):
        x1 = self.cnet_in(x)
        x2 = self.qlayer(x1)
        x_output = self.cnet_out(x2)
        return x_output

def train(X, y_hot, dev_name, diff_method):
    
    dev = qml.device(dev_name, wires=2, shots=None)
    model  = Model(dev, diff_method)
    
    # Train the model
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss = torch.nn.L1Loss()

    X = torch.tensor(X, requires_grad=False).float()
    y_hot = y_hot.float()

    batch_size = 5
    batches = 200 // batch_size

    data_loader = torch.utils.data.DataLoader(
        list(zip(X, y_hot)), batch_size=batch_size, shuffle=True, drop_last=True
    )

    epochs = 6

    for epoch in range(epochs):

        running_loss = 0

        for xs, ys in data_loader:
            opt.zero_grad()

            loss_evaluated = loss(model(xs), ys)
            loss_evaluated.backward()

            opt.step()

            running_loss += loss_evaluated

        avg_loss = running_loss / batches
        print("Average loss over epoch {}: {:.10f}".format(epoch + 1, avg_loss))

    y_pred = model(X)
    predictions = torch.argmax(y_pred, axis=1).detach().numpy()
    # NOTE: `y` is not an argument of train(); it is picked up from the
    # module-level scope defined under __main__ below.
    correct = [1 if p == p_true else 0 for p, p_true in zip(predictions, y)]
    accuracy = sum(correct) / len(correct)
    print(f"Accuracy: {accuracy * 100}%")

if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(42)
    X, y = make_moons(n_samples=200, noise=0.1)
    y_ = torch.unsqueeze(torch.tensor(y), 1)  # used for one-hot encoded labels
    y_hot = torch.scatter(torch.zeros((200, 2)), 1, y_, 1)
    begin_time = perf_counter()
    train(X, y_hot, str(sys.argv[1]), str(sys.argv[2]))
    end_time = perf_counter()
    runtime = end_time-begin_time
    print(f'Runtime: {runtime:.2e} s or {(runtime/60):.2e} min.')
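
The script takes the device name and diff method as command-line arguments, so it is run as, e.g., python script.py default.qubit backprop (the filename here is arbitrary).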

From the documentation and the literature, I understand that the two differentiation methods are both analytic and should therefore be equivalent. Is there an explanation for the discrepancy I observe when running the above code? Here are the outputs for backpropagation and adjoint differentiation, respectively.

Backprop:

/work/vabelis/miniconda3/envs/ae_qml_pnl/lib/python3.8/site-packages/torch/autograd/__init__.py:154:
UserWarning: Casting complex values to real discards the imaginary part
(Triggered internally at /opt/conda/conda-bld/pytorch_1640811757556/work/aten/src/ATen/native/Copy.cpp:244.)
  Variable._execution_engine.run_backward(
Average loss over epoch 1: 0.3586139083
Average loss over epoch 2: 0.1699218154
Average loss over epoch 3: 0.0833731964
Average loss over epoch 4: 0.1283999979
Average loss over epoch 5: 0.0720961764
Average loss over epoch 6: 0.0696858093
Accuracy: 98.5%
Runtime: 7.94e+00 s or 1.32e-01 min.

Adjoint:
Average loss over epoch 1: 0.3586135507
Average loss over epoch 2: 0.1701160818
Average loss over epoch 3: 0.1003372073
Average loss over epoch 4: 0.0771262795
Average loss over epoch 5: 0.0591097102
Average loss over epoch 6: 0.0415146127
Accuracy: 100.0%
Runtime: 6.75e+00 s or 1.12e-01 min.

Versions:

  • pytorch=1.10.2
  • pennylane=0.21.0
  • numpy=1.22.2


Thanks in advance for your answers!

Hi @vabelis, thank you for asking this question. The results should in fact be identical. We will investigate this issue to see what’s going on.

@vabelis

I’m looking into it but haven’t been able to isolate the problem yet. Given that "default.qubit" + "adjoint", "lightning.qubit" + "adjoint", and both devices with "parameter-shift" all give the same result, I’m more inclined to trust those results.

If you manage to isolate particular parameter values where the two methods diverge for just the qnode derivative, that would make debugging the issue much easier.
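
For instance, a comparison along these lines (a minimal sketch with arbitrary input and weight values, reusing the circuit structure from the script above) would do:

import torch
import pennylane as qml

def make_circuit(diff_method):
    dev = qml.device("default.qubit", wires=2, shots=None)

    @qml.qnode(dev, interface="torch", diff_method=diff_method)
    def circuit(inputs, weights):
        for x in range(2):
            qml.Hadamard(wires=x)
            qml.RZ(2.0 * inputs[x], wires=x)
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[0], wires=0)
        qml.RY(weights[1], wires=1)
        return qml.expval(qml.PauliZ(wires=0))

    return circuit

inputs = torch.tensor([0.1, -0.4])  # arbitrary fixed inputs
for method in ("backprop", "adjoint"):
    weights = torch.tensor([0.7, 1.3], requires_grad=True)  # arbitrary weights
    make_circuit(method)(inputs, weights).backward()
    print(method, weights.grad)  # the two gradients should agree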

Hey @christina and @CatalinaAlbornoz,

Apologies for the late answer. The problem with adjoint differentiation is that, for our larger-scale models, it always overtrains no matter which hyperparameters we give it. In contrast, backprop gives reasonable results, so we need to use backprop at the moment.

Here’s a plot of the parameter values for training the simple model above:

Notice that after the first epoch the weight values are the same. However, after the second half of the second epoch, the weights obtained with the two differentiation methods begin to diverge, and the trend worsens as training continues. The differences here are not that large, but for the real models we are working with they manifest more profoundly: the model trained with adjoint differentiation does not learn, while the one trained with backprop does.

Should we open a corresponding issue in the pennylane repo?

Here’s a link to the data that we used in the plot:
weight_data.txt (1.8 KB)

I’m wondering if there is a build-up of floating-point errors that introduces chaotic behaviour.

I know from playing around with the seed that the behaviour was very sensitive to initial conditions.

I’m betting that, for any one set of parameters, adjoint and backprop match to good precision, but tiny variances can make a big difference in certain situations.

Quantum cost functions can be strange, so it’s entirely possible this one is chaotic.

Switching to torch.nn.MSELoss, adjoint and backprop give the same results.

This gives credence to the hypothesis that the differences were due to an interaction between floating-point errors and the discontinuity in the derivative of the L1 loss: the L1 gradient is sign(pred - target), which jumps at zero, whereas the MSE gradient is smooth.
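
As a toy illustration of that interaction (a sketch, not code from the thread):

import torch

target = torch.tensor([0.5])
# A tiny perturbation that moves the prediction across the target flips the
# L1 gradient from -1 to +1; the MSE gradient has no such jump.
for eps in (-1e-7, 1e-7):  # hypothetical float-level perturbations
    pred = torch.tensor([0.5 + eps], requires_grad=True)
    torch.nn.L1Loss()(pred, target).backward()
    print(f"eps = {eps:+.0e} -> grad = {pred.grad.item():+.1f}")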

Hey @christina,

Concerning your previous comment, we’ve tried looking at MSE and we still see a divergence as epochs progress:

We believe this divergence cannot simply be floating-point chaos, but more on this in a bit.

Additionally, we tried some models with a Sigmoid activation to also test binary cross-entropy (BCE) loss, conventionally used for classifier models. Here’s the evolution of the first weight of the model for BCE and MSE, respectively:


There no longer seems to be a problem here. However, in our more complex models the backprop and adjoint methods diverge again. When trained with backprop the loss curves look reasonable, but when trained with adjoint they fluctuate heavily (unreasonably). These results are consistent across different hyperparameters and architectures; that is, from our tests the adjoint method always gives problematic results, whereas backprop doesn’t.

From this point of view, it doesn’t seem to be just a floating-point mismatch between the two methods. The first plot below is a model trained with backprop, and the second plot is the same model (same seed and hyperparameters) trained with adjoint:


Thus, we think there is a deeper discrepancy than floating-point errors mixed with a discontinuity in the derivative, one that manifests when the models are complex/large enough.

Sorry to spam you with this, but this issue is really throwing a spanner in the works of our research.

Do you think you could track down a minimal example that reproduces the problem?

For example, an isolated qnode and a single set of weights that give different gradients under the two methods.

That would help me enormously in tracking down what’s going wrong.

Hi @christina! :smile:

Here is a new minimal instance of a model that clearly reproduces the discrepancy between the adjoint and backprop methods. We are not sure how to print the gradients of the TorchLayer, but the code below shows that there is an issue with adjoint.
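
(For reference, TorchLayer registers the circuit weights as ordinary torch parameters, so their gradients can be read from named_parameters() after a backward pass; a self-contained sketch, separate from the script below:)

import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch", diff_method="adjoint")
def circuit(inputs, weights):
    qml.RY(inputs[0], wires=0)
    qml.RY(weights[0], wires=0)
    return qml.expval(qml.PauliZ(wires=0))

qlayer = qml.qnn.TorchLayer(circuit, {"weights": (1,)})
qlayer(torch.tensor([[0.3]])).sum().backward()
for name, param in qlayer.named_parameters():
    print(name, param.grad)  # gradient of the "weights" parameter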

import torch.nn as nn
import numpy as np
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
import torch
import pennylane as qml
import sys
from time import perf_counter

class Model(nn.Module):
    def __init__(self, dev, diff_method="backprop"):
        super().__init__()

        self.cnet_in = self.get_cnet_in()
        self.qcircuit = qml.qnode(dev, interface="torch", 
                                  diff_method=diff_method)(self.qnode)
        
        weight_shape = {"weights":(2,)}
        self.qlayer = qml.qnn.TorchLayer(self.qcircuit, weight_shape)

    def get_cnet_in(self):
        layers = [nn.Linear(2,16), nn.ReLU(True), nn.Linear(16,2), nn.Sigmoid()]
        return nn.Sequential(*layers)     
    
    def qnode(self, inputs, weights):
        # Data encoding:
        for x in range(len(inputs)):
            qml.Hadamard(wires=x)
            qml.RZ(2.0 * inputs[x], wires=x)
        # Trainable part:
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[0], wires=0)
        qml.RY(weights[1], wires=1)
        # Custom observable: the projector |0><0| built as an outer product
        # (note that the resulting matrix has integer dtype)
        o = [[1], [0]] * np.conj([[1], [0]]).T
        return qml.expval(qml.Hermitian(o, wires=[0]))
        #return qml.expval(qml.PauliZ(wires=0))

    def forward(self, x):
        x = self.cnet_in(x)
        x = self.qlayer(x)
        return x

def train(X, y, dev_name, diff_method):
    
    dev = qml.device(dev_name, wires=2, shots=None)
    model  = Model(dev, diff_method)
    
    # Train the model
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss = torch.nn.MSELoss()

    X = torch.tensor(X, requires_grad=False).float()
    y = torch.Tensor(y).float()

    batch_size = 5
    batches = 200 // batch_size

    data_loader = torch.utils.data.DataLoader(
        list(zip(X, y)), batch_size=batch_size, shuffle=True, drop_last=True
    )

    epochs = 20
    avg_loss = []
    for epoch in range(epochs):

        running_loss = 0

        for xs, ys in data_loader:
            opt.zero_grad()

            loss_evaluated = loss(model(xs), ys)
            loss_evaluated.backward()
            opt.step()
            #print('Weights:', model.state_dict()['qlayer.weights'])
            running_loss += loss_evaluated
        loss_value = running_loss / batches
        avg_loss.append(loss_value.detach().numpy())
        print("Average loss over epoch {}: {:.10f}".format(epoch + 1, loss_value))

    y_pred = model(X)
    print(y_pred)
    predictions = (y_pred >= 0.5).type(torch.uint8)
    correct = [1 if p == p_true else 0 for p, p_true in zip(predictions, y)]
    accuracy = sum(correct) / len(correct)
    print(f"Accuracy: {accuracy * 100}%")
    
    return avg_loss


if __name__ == "__main__":
    torch.manual_seed(42)
    np.random.seed(42)
    X, y = make_moons(200)
    begin_time = perf_counter()
    avg_loss = train(X, y, str(sys.argv[1]), str(sys.argv[2]))
    end_time = perf_counter()
    runtime = end_time-begin_time
    print(f'Runtime: {runtime:.2e} s or {(runtime/60):.2e} min.')

Specifically, we are using MSELoss and a custom observable, qml.expval(qml.Hermitian(o, wires=[0])). With backprop the model learns, while with adjoint it doesn’t. In the following files you can find the output of the two methods; the tensor([…]) within these files is the output of the model.
adjoint_output.txt (2.6 KB)
backprop_output.txt (2.6 KB)

If we change to a built-in observable such as PauliZ (the commented-out return in the qnode above), we observe that adjoint and backprop give identical results, as you can see below.
backprop_output_pauliZ.txt (3.7 KB)
adjoint_output_pauliZ.txt (3.7 KB)

The motivation for using a custom observable is not obvious here, but our more complex models use BCE loss, which doesn’t work unless we specify an observable whose expectation value lies in [0, 1] (a Pauli expectation lies in [-1, 1]), among other reasons.
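
(For completeness, a possible alternative sketch, not the approach used in the thread: since |0><0| = (I + Z)/2, the same [0, 1]-valued quantity can be obtained from a built-in PauliZ expectation by a classical rescaling. A minimal, self-contained example with a hypothetical parameter theta:)

import torch
import pennylane as qml

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, interface="torch", diff_method="adjoint")
def circuit(theta):
    qml.RY(theta, wires=0)
    return qml.expval(qml.PauliZ(wires=0))

theta = torch.tensor(0.3, requires_grad=True)
prob_zero = (1.0 + circuit(theta)) / 2.0  # expectation of |0><0| on wire 0
prob_zero.backward()
print(prob_zero.item(), theta.grad)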

How should we proceed? Should we open an issue?
Thanks in advance for your answer!

Hi @vabelis, thank you for sharing this info. We’re taking a look at it.

Hi @vabelis, this definitely looks like a bug. Can you please open an issue? If you can’t, just let me know.

Thanks again for bringing this to our attention.

Hi @vabelis,

Thanks for posting a minimal example. We’re still trying to understand the root cause of the problem, but you should be able to get things working in adjoint mode by casting the contents of qml.Hermitian() to be complex:

    def qnode(self, inputs, weights):
        # Data encoding:
        for x in range(len(inputs)):
            qml.Hadamard(wires=x)
            qml.RZ(2.0 * inputs[x], wires=x)
        # Trainable part:
        qml.CNOT(wires=[0, 1])
        qml.RY(weights[0], wires=0)
        qml.RY(weights[1], wires=1)
        o = [[1], [0]] * np.conj([[1], [0]]).T
        o = o.astype(np.complex128)  # cast the observable matrix to complex
        return qml.expval(qml.Hermitian(o, wires=[0]))

Hi @Tom_Bromley,

Thanks for the reply. We ran the code including the casting o = o.astype(np.complex128), but adjoint and backprop still give different results.

Here are the results for 10 epochs with backprop:

Average loss over epoch 1: 0.2288654149
Average loss over epoch 2: 0.1421572864
Average loss over epoch 3: 0.1162629128
Average loss over epoch 4: 0.1119743958
Average loss over epoch 5: 0.1090452522
Average loss over epoch 6: 0.1062508821
Average loss over epoch 7: 0.1051507965
Average loss over epoch 8: 0.1046194881
Average loss over epoch 9: 0.1010731906
Average loss over epoch 10: 0.0989995301
tensor([0.3172, 0.3445, 0.2998, 0.9954, 0.9485, 0.5929, 0.2951, 0.9955, 0.9943,
        0.2962, 0.2949, 0.9836, 0.9960, 0.2968, 0.9609, 0.9921, 0.2949, 0.9939,
        0.2982, 0.2950, 0.8944, 0.3357, 0.9931, 0.3117, 0.2948, 0.2947, 0.2994,
        0.2949, 0.9116, 0.9727, 0.4481, 0.3231, 0.2949, 0.2960, 0.9936, 0.3450,
        0.9817, 0.9854, 0.9960, 0.2963, 0.2971, 0.6765, 0.2971, 0.3060, 0.8479,
        0.7562, 0.3005, 0.3190, 0.9911, 0.3010, 0.5553, 0.9947, 0.5835, 0.9262,
        0.6721, 0.2957, 0.9889, 0.3289, 0.9950, 0.3205, 0.9957, 0.3244, 0.2993,
        0.3836, 0.2978, 0.7501, 0.2964, 0.9233, 0.3058, 0.9837, 0.2954, 0.9384,
        0.9959, 0.2966, 0.7118, 0.5202, 0.3033, 0.9880, 0.3191, 0.4590, 0.7099,
        0.2965, 0.2961, 0.9509, 0.8825, 0.3002, 0.9958, 0.2975, 0.9047, 0.2947,
        0.3098, 0.9897, 0.3239, 0.2984, 0.4018, 0.7929, 0.9956, 0.5446, 0.3023,
        0.7861, 0.6247, 0.2988, 0.9904, 0.9921, 0.4111, 0.9916, 0.4232, 0.2947,
        0.4879, 0.9925, 0.9960, 0.7906, 0.6357, 0.9940, 0.2952, 0.8734, 0.2948,
        0.2962, 0.9960, 0.3013, 0.2967, 0.3921, 0.9947, 0.9949, 0.3046, 0.3076,
        0.9750, 0.9928, 0.2978, 0.2969, 0.9958, 0.2961, 0.3142, 0.9943, 0.2963,
        0.9868, 0.6672, 0.3547, 0.2951, 0.9763, 0.9880, 0.9910, 0.9688, 0.2957,
        0.2974, 0.2981, 0.5089, 0.4334, 0.3075, 0.2988, 0.9930, 0.9860, 0.3044,
        0.4766, 0.3682, 0.2947, 0.9953, 0.9684, 0.3287, 0.2955, 0.9938, 0.9792,
        0.9959, 0.9799, 0.2947, 0.2950, 0.3778, 0.2955, 0.7171, 0.2948, 0.2959,
        0.3031, 0.3124, 0.2948, 0.9957, 0.8564, 0.9568, 0.2973, 0.9952, 0.8265,
        0.7514, 0.2952, 0.9934, 0.9633, 0.9938, 0.3367, 0.3341, 0.6321, 0.3020,
        0.2959, 0.8188, 0.5956, 0.3154, 0.9385, 0.2953, 0.3654, 0.3553, 0.3095,
        0.9897, 0.3297], grad_fn=<StackBackward0>)
Accuracy: 89.5%
Runtime: 1.17e+01 s or 1.94e-01 min.

and adjoint:

Average loss over epoch 1: 0.2639860809
Average loss over epoch 2: 0.2047304213
Average loss over epoch 3: 0.2003999949
Average loss over epoch 4: 0.2093879730
Average loss over epoch 5: 0.2228839397
Average loss over epoch 6: 0.2330239564
Average loss over epoch 7: 0.2397201359
Average loss over epoch 8: 0.2444564104
Average loss over epoch 9: 0.2473183572
Average loss over epoch 10: 0.2487374842
tensor([0.5011, 0.5013, 0.5000, 0.5020, 0.5023, 0.5022, 0.4997, 0.5019, 0.5019,
        0.4996, 0.4996, 0.5018, 0.5020, 0.5000, 0.5018, 0.5018, 0.4995, 0.5021,
        0.5002, 0.4997, 0.5024, 0.5012, 0.5021, 0.5009, 0.4996, 0.4995, 0.5003,
        0.4996, 0.5024, 0.5023, 0.5021, 0.5009, 0.4995, 0.4999, 0.5021, 0.5007,
        0.5022, 0.5022, 0.5020, 0.4999, 0.5001, 0.5019, 0.4998, 0.5003, 0.5024,
        0.5019, 0.5000, 0.5008, 0.5022, 0.5004, 0.5021, 0.5019, 0.5022, 0.5024,
        0.5023, 0.4998, 0.5022, 0.5015, 0.5019, 0.5012, 0.5019, 0.5014, 0.5000,
        0.5018, 0.4998, 0.5024, 0.4997, 0.5018, 0.5007, 0.5022, 0.4995, 0.5023,
        0.5020, 0.4997, 0.5024, 0.5020, 0.5002, 0.5022, 0.5004, 0.5018, 0.5022,
        0.5000, 0.4996, 0.5018, 0.5019, 0.5004, 0.5020, 0.4998, 0.5019, 0.4995,
        0.5005, 0.5022, 0.5004, 0.4999, 0.5019, 0.5019, 0.5020, 0.5022, 0.5001,
        0.5024, 0.5022, 0.4999, 0.5022, 0.5021, 0.5015, 0.5022, 0.5020, 0.4996,
        0.5019, 0.5021, 0.5020, 0.5022, 0.5019, 0.5021, 0.4997, 0.5024, 0.4996,
        0.4996, 0.5020, 0.5001, 0.4997, 0.5013, 0.5021, 0.5021, 0.5002, 0.5004,
        0.5018, 0.5021, 0.5002, 0.4997, 0.5019, 0.4996, 0.5010, 0.5021, 0.4996,
        0.5022, 0.5022, 0.5008, 0.4995, 0.5023, 0.5018, 0.5018, 0.5018, 0.4996,
        0.5001, 0.4999, 0.5021, 0.5017, 0.5007, 0.5003, 0.5018, 0.5018, 0.5006,
        0.5021, 0.5016, 0.4995, 0.5019, 0.5023, 0.5010, 0.4998, 0.5018, 0.5022,
        0.5019, 0.5018, 0.4996, 0.4995, 0.5011, 0.4995, 0.5019, 0.4995, 0.4996,
        0.5006, 0.5006, 0.4995, 0.5020, 0.5019, 0.5023, 0.4998, 0.5021, 0.5019,
        0.5022, 0.4995, 0.5021, 0.5023, 0.5021, 0.5006, 0.5016, 0.5023, 0.5005,
        0.4998, 0.5024, 0.5019, 0.5007, 0.5018, 0.4997, 0.5009, 0.5015, 0.5008,
        0.5018, 0.5005], grad_fn=<StackBackward0>)
Accuracy: 73.5%
Runtime: 1.18e+01 s or 1.96e-01 min.

The above was tested with both PennyLane 0.20.0 and PennyLane 0.22.2.
Looking forward to further replies! :slightly_smiling_face:

Hi @vabelis!

Having dug a bit deeper, I agree that this is still an issue; apologies that the suggested fix didn’t work.

We posted a few more details about the issue here and will aim to fix it as soon as possible.

Thanks,
Tom