How is the gradient measured on a real device?

I wonder how the gradient is determined when one uses a real device instead of a simulation. It seems that on a real device, measuring derivatives via finite differences would be very noisy?

In the case of finite differences:

grad ≈ (output(parameter_bump_up) - output(parameter_bump_down)) / (2 * bump_size)

where output is a measurement of the output node one cares about.

Let’s say ‘output’ is just a final node that measures either 0 or 1 with some probabilities. It seems rather difficult to get a good estimate of the expectation of the difference between two already-noisy measurements.
What is the mechanism that makes this still work?
Hope the question makes sense, thanks

Andreas

Hi @nyquant!

PennyLane does not use finite differences to compute the gradient. Instead, it uses the parameter-shift rule.

The parameter-shift rule provides exact gradients, whereas numerical differentiation such as finite differences provides only an approximation to the gradient.

Furthermore, the parameter-shift rule is simply a linear combination of quantum circuit evaluations, allowing it to be executed on near-term hardware. We even have a degree of freedom in choosing where in parameter space we perform these evaluations. Typically, on noisy near-term devices, we maximise the distance between the evaluations, since evaluating expectation values further apart in parameter space limits the effect of shot noise.
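To make this concrete, here is a rough sketch in PennyLane (the single-qubit circuit and the parameter value are just for illustration); switching between the two approaches amounts to choosing the QNode's diff_method:

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

# The same circuit, differentiated two ways.
@qml.qnode(dev, diff_method="parameter-shift")
def circuit_ps(x):
    qml.RX(x, wires=0)
    return qml.expval(qml.PauliZ(0))

@qml.qnode(dev, diff_method="finite-diff")
def circuit_fd(x):
    qml.RX(x, wires=0)
    return qml.expval(qml.PauliZ(0))

x = np.array(0.4, requires_grad=True)
print(qml.grad(circuit_ps)(x))  # exact gradient: -sin(0.4)
print(qml.grad(circuit_fd)(x))  # approximation that depends on the step size
```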


As an example, we can work out the parameter-shift rule for a simple function such as f(x)=\sin(x), and compare it to finite differences :slight_smile:

Exact gradient: parameter-shift rule

We know that the gradient is given by f'(x)=\cos(x). By making use of the trig identity

\cos(x)=\frac{\sin(x+s)-\sin(x-s)}{2\sin(s)}

we can now write the gradient as

f'(x)=\frac{f(x+s)-f(x-s)}{2\sin(s)}

That is, the exact gradient of f(x) is given by evaluating the function at the points x+s and x-s, and computing a linear combination. This is exact for any value of s.
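As a quick sanity check (plain NumPy, with an arbitrary test point x = 0.7), the shifted-evaluation formula reproduces \cos(x) for very different choices of s:

```python
import numpy as np

x = 0.7
for s in [0.01, 0.5, np.pi / 2]:
    grad = (np.sin(x + s) - np.sin(x - s)) / (2 * np.sin(s))
    # matches cos(x) for every s, up to floating-point error
    print(s, grad, np.cos(x))
```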

Compare this instead to finite-differences:

f'(x)=\frac{\sin(x+h)-\sin(x-h)}{2h}+O(h^2)

This is an approximation that is only accurate for h\ll 1. Furthermore, numerical differentiation is prone to numerical instability, unlike the parameter-shift rule, which is exact.
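Running the analogous check for the central finite-difference formula, the error shrinks roughly like h^2 as h decreases, but never vanishes exactly:

```python
import numpy as np

x = 0.7
for h in [0.5, 0.1, 0.01]:
    fd = (np.sin(x + h) - np.sin(x - h)) / (2 * h)
    # error decreases roughly like h**2, but is never exactly zero
    print(h, fd, fd - np.cos(x))
```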


So as you can see from the example, the parameter-shift rule takes into account structural information about f(x) to allow the exact gradient of f(x) to be computed by simply taking additional function evaluations.

However, this requires f(x) to satisfy particular conditions, so it is not universal!

Thanks @josh for your quick and detailed reply. The differencing reminds me of so-called non-standard exact schemes from numerical PDE solving, and it seems that this is what pins the computed gradient to its true theoretical value.

Taking your cos(x) example, the method basically consists of scaling by an adjustment factor of 1/sin(s); however, it still seems to rely on two separate measurements, at x-s and x+s.
It seems that if you run on a real device, those would need two separate shots (runs), just like with traditional finite differencing, and you are back at the initial question regarding the noise in estimating the derivative from taking differences. Am I still missing anything from the big picture?

Hi @nyquant

It’s true that we’d need two evaluations in that case.

Assuming that we have perfect systems and a perfect device (e.g. a simulator), as @josh was suggesting, calculating gradients using the parameter-shift rule provides the exact solution, whereas using finite-differences we can only estimate the gradient. That is, although there is no noise when performing the quantum computation itself, the result obtained when using finite-differences might still deviate from the analytic value of the gradient (as it’s an approximation).

When using a real device, the additional noise of the quantum computation would further influence the accuracy of the result. For the parameter-shift rule, however, that noise is the only source of inaccuracy in the gradient; there is no approximation error on top of it.

Thanks @antalszava. I think it’s clearer to me now. Please correct me if I’m wrong.

Say that for the parameter setting x+s the probability of the output state 1 is p=60%, while for x-s it is p=40%. So in an ideal, lucky case, taking 10 measurements each, one gets results like

Measurements(x+s) = [0, 1, 1, 0, 1, 0, 1, 0, 1, 1]
Measurements(x-s) = [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]

with 2 extra “1” measurements reflecting a 20% uplift in p due to the parameter shift. Because of quantum randomness it’s not guaranteed to always get a perfect 6-to-4 count, but there will be some variance.

If the parameter shift were very small, so that we were comparing very similar probabilities, like p=51% vs p=49%, the lift would be difficult to detect because of the measurement variance. It seems that’s where the exact gradient method comes in: it allows a large parameter shift to overcome the noise, while the correction factor compensates for the bias that such a shift would introduce in a naive finite difference.
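To check that intuition, here is a small Monte Carlo sketch (the encoding f(x)=\sin(x), the shift values, and the shot counts are all made up for illustration): the gradient is estimated from Bernoulli samples of the outcome probability, once with a tiny shift and once with the maximal shift s=\pi/2.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_grad(x, s, shots):
    # f(x) = sin(x), encoded as the probability p(x) = (1 + sin(x)) / 2 of measuring 1
    p_plus = (1 + np.sin(x + s)) / 2
    p_minus = (1 + np.sin(x - s)) / 2
    f_plus = 2 * rng.binomial(shots, p_plus) / shots - 1
    f_minus = 2 * rng.binomial(shots, p_minus) / shots - 1
    return (f_plus - f_minus) / (2 * np.sin(s))

x, shots = 0.7, 1000
for s in [0.01, np.pi / 2]:
    estimates = [estimate_grad(x, s, shots) for _ in range(200)]
    # both shifts are centred on cos(0.7) ~ 0.765, but the tiny shift has a far larger spread
    print(f"s={s:.3f}: mean={np.mean(estimates):.3f}, std={np.std(estimates):.3f}")
```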

Because of quantum randomness it’s not guaranteed to always get a perfect 6-to-4 count, but there will be some variance.

That’s correct. Note that as we increase the number of shots (samples) we take, the variance in the gradient calculation will reduce accordingly.
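As a rough illustration (reusing the 60%/40% probabilities from your example, with made-up shot counts), the spread of the estimated probability difference shrinks roughly like 1/\sqrt{N_{\text{shots}}}:

```python
import numpy as np

rng = np.random.default_rng(42)
p_plus, p_minus = 0.6, 0.4  # probabilities of measuring 1 at x+s and x-s

for shots in [10, 100, 1000, 10000]:
    diffs = [rng.binomial(shots, p_plus) / shots - rng.binomial(shots, p_minus) / shots
             for _ in range(500)]
    # the mean stays near 0.2, while the spread shrinks roughly like 1/sqrt(shots)
    print(f"shots={shots:>5}: mean={np.mean(diffs):.3f}, std={np.std(diffs):.3f}")
```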