Parameter-shift rules

The documentation for “Parameter-shift rules” talks about constructing partial derivatives by evaluating the same circuit with different parameters.

How is this any different than numerical differentiation though, mentioned here?

I see one mention towards this on the page:
“While this construction bears some resemblance to the numerical finite-difference method for computing derivatives, here s is finite rather than infinitesimal.”

What’s a concrete example of s (i.e. what’s the significance of calling it out as finite vs. infinitesimal)?

Adding a little bit more to this as I read more through the documentation. The page on quantum gradient mentions:

“This means that quantum gradients can be computed by quantum computers,”

By computed here does it mean that the gradient function itself is being output as the result of the computation or that simply the gradient is being computed on a quantum computer by calling the circuit multiple times with a parameter shift?

Hi @amir!

How is this any different than numerical differentiation though, mentioned here?

The parameter-shift rule provides exact, analytic gradients, whereas numerical differentiation such as finite-difference provides only an approximation to the gradient.

As an example, we can compute the parameter-shift gradient rule for a simple function such as f(x) = \sin(x), and compare this to finite-differences :slight_smile:

We know that the gradient is given by:

\begin{align} f'(x) = \cos(x) \end{align}

By making use of the trig identity

\begin{align} \cos(x) = \frac{\sin(x+s) - \sin(x - s)}{2\sin(s)} \end{align}

we can now write the gradient as

\begin{align} f'(x) = \frac{f(x+s) - f(x - s)}{2\sin(s)} \end{align}

That is, the exact gradient of f(x) is given by evaluating the function at the points x+s and x-s, and computing a linear combination. This is exact for any finite value of s (as long as \sin(s) \neq 0).
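To make this concrete, here is a minimal NumPy sketch (just an illustration, not from the documentation) checking that the rule above reproduces \cos(x) to machine precision for several different finite shifts:

```python
import numpy as np

def param_shift_grad(f, x, s):
    # Valid here because f(x) = sin(x) satisfies the identity above
    return (f(x + s) - f(x - s)) / (2 * np.sin(s))

x = 0.7
for s in [0.1, 0.5, np.pi / 2]:
    # Difference from the true derivative cos(x) is ~1e-16 for every finite shift
    print(s, param_shift_grad(np.sin, x, s) - np.cos(x))
```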

Compare this instead to finite-differences:

\begin{align} \cos(x) = \frac{\sin(x+h) - \sin(x - h)}{2h} +\mathcal{O}(h^2) \end{align}

This is an approximation that is only valid for h\ll 1. Furthermore, numerical differentiation is prone to numerical instability, unlike the parameter-shift rule which is exact.
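Both effects show up in a small sketch (again just illustrative NumPy, using the same f(x) = \sin(x) example): the truncation error shrinks like h^2, but once h becomes too small, floating-point cancellation makes the estimate worse again:

```python
import numpy as np

x = 0.7
for h in [1e-1, 1e-3, 1e-6, 1e-12]:
    fd = (np.sin(x + h) - np.sin(x - h)) / (2 * h)
    # Error first shrinks like h**2, then grows again once cancellation dominates
    print(h, abs(fd - np.cos(x)))
```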

So as you can see from the example, the parameter-shift rule takes into account structural information about f(x) to allow the exact gradient of f(x) to be computed by simply taking additional function evaluations.

However, this requires f(x) to satisfy particular conditions, so is not universal!

What’s a concrete example of s (i.e. what’s the significance of calling it out as finite vs. infinitesimal)?

Within PennyLane, we typically take s=\pi/2. It is advantageous to maximise the shift on near-term noisy quantum devices, as this allows us to compute expectation values further apart in parameter-space and to limit the effects of shot-noise.

simply the gradient is being computed on a quantum computer by calling the circuit multiple times with a parameter shift?

Yes this is exactly it!
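If it helps, here is a small PennyLane sketch of what that looks like (the single-qubit circuit is just an illustrative choice): the gradient returned by qml.grad is assembled from evaluations of the same circuit at x + \pi/2 and x - \pi/2.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, diff_method="parameter-shift")
def circuit(x):
    qml.RX(x, wires=0)
    return qml.expval(qml.PauliZ(0))

x = np.array(0.5, requires_grad=True)

# The gradient is just the same circuit evaluated at shifted parameters
print(qml.grad(circuit)(x))                                   # -sin(0.5)
print((circuit(x + np.pi / 2) - circuit(x - np.pi / 2)) / 2)  # same value, shifts done by hand
```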

Thanks for your comprehensive answer. One note:

Should
\begin{align} f'(x) = \frac{f(x+s) - f(x - s)}{2\sin(s)} \end{align}

actually be:
\begin{align} f'(x) = \frac{f(x+s) - f(x - s)}{2f(s)} \end{align}
?

Follow-up question - is symbolic differentiation the same as what you provide with the exact, analytic gradient in the example above?

Also, do you think it might be helpful to others if I updated the documentation with a PR including what you’ve mentioned?

To answer your first question, no — while both equations are correct, I deliberately left the denominator as \sin(s) as it is independent of x.

In general, we want to limit the number of calls to our function f(x), as it might be expensive to compute. In this case, we only need two evaluations of f to compute the gradient; the denominator \sin(s) can be easily and efficiently computed independently :slightly_smiling_face:

Follow-up question - is symbolic differentiation the same as what you provide with the exact, analytic gradient in the example above?

No, symbolic differentiation is different yet again! We have three ways to compute derivatives:

  • Numerical differentiation: Numerical approximations to the gradient that can be used with arbitrary functions, for example finite-differences. Can be unstable, and can introduce floating point error.

  • Automatic differentiation: An exact, numerical approach to computing derivatives. Each component of the computation provides a rule for its derivative with respect to the input. At the end of the computation, the chain rule is used to combine all these gradient rules and determine the total gradient.

  • Symbolic differentiation: Manipulates mathematical expressions directly to determine the gradient rule/function. This is similar to how you would compute the gradient manually by hand, or using a computer algebra system such as Mathematica.

For more explanation, this Stack Overflow question is quite helpful: Difference between symbolic differentiation and automatic differentiation? - Stack Overflow
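As a toy illustration of all three (just a sketch, assuming NumPy, JAX, and SymPy are available; none of this is PennyLane-specific):

```python
import numpy as np
import jax
import jax.numpy as jnp
import sympy

x0 = 0.3

# 1. Numerical differentiation: central finite difference, an approximation
fd = (np.sin(x0 + 1e-4) - np.sin(x0 - 1e-4)) / (2 * 1e-4)

# 2. Automatic differentiation: exact up to float precision, returns a number
ad = jax.grad(lambda x: jnp.sin(x))(x0)

# 3. Symbolic differentiation: returns the derivative as an expression, cos(x)
x_sym = sympy.Symbol("x")
sd_expr = sympy.diff(sympy.sin(x_sym), x_sym)

print(fd, ad, float(sd_expr.evalf(subs={x_sym: x0})))  # all ~cos(0.3)
```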

Also, do you think it might be helpful to others if I updated the documentation with a PR including what you’ve mentioned?

For sure! PRs adding improvements to our documentation are always encouraged :slightly_smiling_face:

Okay, so I’m looking for a little more precision in the definitions here. In your original response you said that you’re employing “exact, analytic gradients”. Of the three ways to compute derivatives, the one used by PennyLane is automatic differentiation, which you describe as an “exact, numerical approach”.

So, is it numerical or analytic? When I read numerical I think numerical differentiation. When I read analytic I think symbolic differentiation. Something I’m intuiting though is that automatic differentiation is this sweet spot between both.

One more thought on this, too. Would it hypothetically be possible to go double-wide on the circuit, calculate both f(x+s) and f(x - s), and take the expectation value of that difference in a single circuit, which would cut the number of circuit calls in half?

Hi Amir,

Yes, the use of the word “analytic” may be a bit confusing. Let me try to clarify.

One method to compute gradients is finite difference. Here we approximate the gradient of a function for example as

\frac{d f(x)}{dx}\approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}.

I believe this is what Josh meant by numerical differentiation. Note that it gives an approximation of the derivative, whose quality depends on \epsilon. That’s why it can be unstable.

Another method is the parameter-shift rule. The idea here is to express the gradient as a linear combination of the function evaluated at two separate points. Consider f(x)=\sin(x). Its derivative, \cos(x), can be written as

\cos(x) = \frac{\sin(x+\pi/4)-\sin(x-\pi/4)}{\sqrt{2}},

so in this case we have

\frac{d f(x)}{dx}=\frac{f(x+\pi/4)-f(x-\pi/4)}{\sqrt{2}}.

Notice that the “shift” from x is not tiny (it’s \pi/4), and the equation is exact, i.e., it is not an approximation. A similar trick can be used for gradients of quantum gates. This is what Josh meant regarding exact, analytic gradients.

Finally, symbolic differentiation is just that, it’s literally writing the derivative as a mathematical formula.

To answer your final question, typically you can’t do this. If x is a parameter of the circuit, then you need to change the parameter to x+s and x-s separately, and individually calculate expectation values.

The only exception I can think of is if you can “absorb” the shift into an observable. So if you can define an observable O_1 whose expectation value is f(x+s) and another observable O_2 whose expectation value is f(x-s), where the expectation is taken with respect to the circuit in the same configuration, then from linearity you can define the observable O_3 = O_1-O_2 whose expectation value is f(x+s)-f(x-s), which you can calculate or estimate directly. I’m not sure this is even possible.

Hope that helps!

Hi @amir

Just to add to what @jmarrazola mentioned: you could take a circuit on N qubits and run two copies of it on a “double wide” set of 2N qubits “in parallel” (i.e., the first N qubits and the last N qubits never interact). You can effectively think of that as having two independent QPUs.

However, it’s a lot cheaper to double the number of runs than to double the number of qubits, so this strategy is not very resource efficient :slightly_smiling_face:
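For what it’s worth, here is a rough PennyLane sketch of that “double wide” idea for a single-qubit circuit (the RX gate and Pauli-Z observable are just illustrative assumptions): the two shifted copies live on disjoint wires, and the subtraction still happens classically afterwards.

```python
import pennylane as qml
import numpy as np

s = np.pi / 2
dev = qml.device("default.qubit", wires=2)   # 2N = 2 qubits for an N = 1 qubit circuit

@qml.qnode(dev)
def both_shifts(x):
    qml.RX(x + s, wires=0)   # copy 1, shifted up
    qml.RX(x - s, wires=1)   # copy 2, shifted down; the two wires never interact
    return qml.expval(qml.PauliZ(0)), qml.expval(qml.PauliZ(1))

up, down = both_shifts(0.5)
print((up - down) / 2)       # the parameter-shift gradient -sin(0.5), from one run
```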