# Single Shot Stochastic Gradient Descent

I recently went over the tutorial write-up " Doubly stochastic gradient descent"(https://pennylane.ai/qml/demos/tutorial_doubly_stochastic.html#doubly-stochastic-gradient-descent)

and there are a few things that I don’t understand. But here I just want to ask about the single-shot result.

How can a single-shot give you anything here? A single shot gives you a single eigenstate that either belongs into the +1 or -1 eigenspace of P_i where P_i \in {I,X,Y,Z}^{\otimes n}. It does not give you any info on the expectation of P_i, \langle P_i \rangle.

Thank you.

Hi @KAJ226! When computing expectation values of observables on hardware, we can only sample a finite number of shots N from the quantum device. For each shot, we are measuring an eigenvalue of the observable, \lambda, so we will get an output vector

\Lambda = (\lambda^{(0)}, \lambda^{(1)}, \dots, \lambda^{(N-1)}).

So when we compute the expectation value, we are always computing an approximation to the expectation value:

\langle P_i \rangle \approx \frac{1}{N} \sum_{n=0}^{N-1} \Lambda_n.

(This is in fact an unbiased estimator of the expectation value.) For a single shot, we have N=1. So we can interpret a single shot measurement as a 1-shot approximation to the expectation value.

This is exactly what happens in PennyLane if we ask for an expectation value with only a single shot:

dev = qml.device("default.qubit", wires=1, analytic=False, shots=1)

@qml.qnode(dev)
def circuit():
return qml.expval(qml.PauliZ(0)), qml.sample(qml.PauliZ(0))

>>> circuit()
>>> circuit()


Hope that helps, let me know if you have any other questions!

2 Likes

Hi @Josh! Thank you so much for the response.

Yes, I understand that you don’t get exact expectation and only in the limit of infinity that you achieve the true expectation here.

Suppose I have the Pauli term P_1 = XX. Now, to calculate \langle P_1 \rangle with respect to a state |\psi \rangle, I would do :

\langle \psi|(H\otimes H) (Z \otimes Z) (H\otimes H) |\psi \rangle = \langle \psi H\otimes H| Z\otimes Z |H\otimes H \psi \rangle

Now , if I do a measurement/shot on the device, I collapsed my state into either the state |00\rangle, |01\rangle, |10\rangle, or |11\rangle. If it is |00\rangle or |11\rangle, then I record it in the +1 eigenspace. If it is |01\rangle or |10\rangle then I record it in the -1 eigenspace. Now, suppose |\psi\rangle is in the state |00\rangle which we know that \langle \psi | XX | \psi \rangle = 0 . However, a single shot wouldn’t give us this information since we will just get a +1 or -1. If we had do 1000 shots, then the results will add up close to 0.

So to me a single shot is essentially useless as I only retrieve either a +1 or -1… I guess I am missing something here… thanks for the response again!

It definitely is a bit unintuitive!

However, it is worth noting that this result does not come purely from the quantum mechanics; this is a result that holds in classical settings whenever we use stochastic gradient descent (SGD).

In SGD, we typically consider a cost function of the following form:

C(w) = \frac{1}{N} \sum_{n=0}^N C_n(w),

where we are summing over multiple ‘observations’ C_i. (Note that this cost function has the same form as computing expectation values of quantum observables!).

In typical gradient descent, we perform iterative update steps on the parameters w to minimize the cost function:

w \rightarrow w - \eta \nabla C(w) = w - \frac{\eta}{N} \sum_{n=0}^N \nabla C_n(w).

However, the beauty of stochastic gradient descent is that we can replace the gradient of the cost function with the gradient of a single observation, randomly chosen for each update step, and convergence to a local minimum is still guaranteed (assuming the learning rate \eta decreases appropriately):

w \rightarrow w - \eta \nabla C_n(w).

This is a really nice result, especially if the full gradient \nabla C(w) is expensive to compute.

In Stochastic gradient descent for hybrid quantum-classical optimization, the authors generalize this result to the quantum case (after noting that C can be interpreted as an expectation value over finite shots, and the stochasticity coming from the process of measuring the quantum system):

So to me a single shot is essentially useless as I only retrieve either a +1 or −1 … I guess I am missing something here… thanks for the response again!

So here, the single shot expectation value is useful in the specific case of quantum gradient descent.

However, convergence will definitely be faster by increasing the shot number! It gives us a nice strategy though; we can begin the minimization process by starting with single shots in order to get a better approximate solution. As our approximation improves, we can increase the number of shots to fine-tune and converge towards the local minimum.

Note that this doesn’t apply to evaluating the expectation value in a general setting, as you correctly point out. We will need to increase the number of shots to get a better estimation.

1 Like

@josh Thanks for the explanation!

Is the term “Stochastic” Gradient Descent instead of Gradient Descent being used here is because we use estimate the expectation from our shots counts instead of doing it analytically? So if we have used statevector simulator and able to calculate the gradient exactly, then we would just say that it is Gradient Descent? I saw Vanilla Gradient Descent and Stochastic Gradient Descent being mentioned a lot… just want to make sure I understand them correctly.

Hi @KAJ226,

The term stochastic gradient descent is actually borrowed heavily from the classical ML literature. It usually means producing an (unbiased) estimator for the gradient in cases where computing the gradient exactly would be costly.

In ML, usually this means taking a minibatch of your training data to build your cost function, instead of using the entire dataset at once.

In the quantum case, there are a number of elements that could be stochastic (e.g., noise, finite-shot expectation values, etc). In the case where you use a simulator, then it is not a stochastic estimate, but the exact gradient (up to machine precision).

2 Likes