Hi @amir!
How is this any different than numerical differentiation though, mentioned here?
The parameter-shift rule provides exact, analytic gradients, whereas numerical differentiation, such as the finite-difference method, provides only an approximation to the gradient.
As an example, we can apply the parameter-shift rule to a simple function such as f(x) = \sin(x), and compare the result to finite differences.
We know that the gradient is given by:
\begin{align}
f'(x) = \cos(x)
\end{align}
By making use of the trig identity
\begin{align}
\cos(x) = \frac{\sin(x+s) - \sin(x - s)}{2\sin(s)}
\end{align}
we can now write the gradient as
\begin{align}
f'(x) = \frac{f(x+s) - f(x - s)}{2\sin(s)}
\end{align}
That is, the exact gradient of f(x) is obtained by evaluating the function at the shifted points x+s and x-s and taking a linear combination; this holds for any value of s.
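To make this concrete, here is a minimal NumPy sketch (the function names are just for illustration) that evaluates the shift rule above for several values of s:

```python
import numpy as np

def f(x):
    return np.sin(x)

def parameter_shift(x, s):
    # Exact gradient from two shifted evaluations: (f(x+s) - f(x-s)) / (2 sin s)
    return (f(x + s) - f(x - s)) / (2 * np.sin(s))

x = 0.5
print("exact cos(x):", np.cos(x))
for s in [0.1, 1.0, np.pi / 2]:
    # Agrees with cos(x) up to floating-point precision for *any* shift s
    print(f"s = {s:.4f}: {parameter_shift(x, s):.12f}")
```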
Compare this instead to the central finite-difference approximation:
\begin{align}
\cos(x) = \frac{\sin(x+h) - \sin(x - h)}{2h} +\mathcal{O}(h^2)
\end{align}
This is an approximation that is only valid for h\ll 1. Furthermore, numerical differentiation is prone to numerical instability: as h shrinks, subtracting two nearly equal function values amplifies floating-point round-off and shot noise, unlike the parameter-shift rule, which is exact.
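A quick sketch of the same comparison numerically, showing that the finite-difference error only shrinks gradually with h rather than vanishing outright:

```python
import numpy as np

x = 0.5
exact = np.cos(x)

for h in [0.5, 0.1, 0.01]:
    # Central difference: accurate only to O(h^2)
    fd = (np.sin(x + h) - np.sin(x - h)) / (2 * h)
    print(f"h = {h}: estimate = {fd:.12f}, error = {abs(fd - exact):.2e}")
```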
So, as you can see from the example, the parameter-shift rule exploits structural information about f(x) (here, that it is a sinusoid in the parameter) to compute the exact gradient of f(x) simply from additional function evaluations.
However, this requires f(x) to satisfy particular conditions, so it is not universal!
What’s a concrete example of s (i.e. what’s the significance of calling it out as finite vs. infinitesimal)?
Within PennyLane, we typically take s=\pi/2, which maximises the denominator 2\sin(s). Maximising the shift is advantageous on near-term noisy quantum devices, as it allows us to compute expectation values that are further apart in parameter space, limiting the effects of shot noise.
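In particular, at s=\pi/2 we have 2\sin(\pi/2) = 2, so the shift rule reduces to
\begin{align}
f'(x) = \frac{f(x+\pi/2) - f(x - \pi/2)}{2}
\end{align}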
simply the gradient is being computed on a quantum computer by calling the circuit multiple times with a parameter shift?
Yes, this is exactly it!
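As a minimal sketch (using a simulator device, and a single-qubit circuit chosen purely for illustration), you can either combine the two shifted circuit evaluations by hand, or let PennyLane do it via diff_method="parameter-shift":

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, diff_method="parameter-shift")
def circuit(x):
    qml.RX(x, wires=0)
    return qml.expval(qml.PauliZ(0))  # <Z> = cos(x) for this circuit

x = np.array(0.5, requires_grad=True)

# PennyLane applies the parameter-shift rule under the hood
print(qml.grad(circuit)(x))

# Equivalently, combine two shifted circuit evaluations by hand (s = pi/2)
s = np.pi / 2
print((circuit(x + s) - circuit(x - s)) / (2 * np.sin(s)))

# Both agree with the analytic gradient -sin(x)
print(-np.sin(x))
```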