It definitely is a bit unintuitive!
However, it is worth noting that this result does not come purely from quantum mechanics; it also holds in classical settings whenever we use stochastic gradient descent (SGD).
In SGD, we typically consider a cost function of the following form:
C(w) = \frac{1}{N} \sum_{n=1}^{N} C_n(w),
where we are averaging over multiple ‘observations’ C_n(w). (Note that this cost function has the same form as the expectation value of a quantum observable!)
In typical gradient descent, we perform iterative update steps on the parameters w to minimize the cost function:
w \rightarrow w - \eta \nabla C(w) = w - \frac{\eta}{N} \sum_{n=1}^{N} \nabla C_n(w).
However, the beauty of stochastic gradient descent is that we can replace the gradient of the cost function with the gradient of a single observation, randomly chosen for each update step, and convergence to a local minimum is still guaranteed (assuming the learning rate \eta decreases appropriately):
w \rightarrow w - \eta \nabla C_n(w).
This is a really nice result, especially if the full gradient \nabla C(w) is expensive to compute.
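To make both updates concrete, here is a minimal NumPy sketch on a toy least-squares cost; the data, learning-rate schedules, and step counts are just example choices for illustration, not anything from the paper:

```python
import numpy as np

# Toy least-squares cost: C(w) = (1/N) * sum_n (x_n * w - y_n)^2, so each
# observation C_n(w) = (x_n * w - y_n)^2 has gradient 2 * x_n * (x_n * w - y_n).
rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 3.0 * x + 0.1 * rng.normal(size=N)     # the true slope is roughly 3

# Full gradient descent: every update averages the gradient over all N observations.
w_gd, eta = 0.0, 0.1
for step in range(100):
    w_gd -= eta * np.mean(2.0 * x * (x * w_gd - y))

# Stochastic gradient descent: every update uses a single randomly chosen
# observation, with a learning rate that decreases over time.
w_sgd = 0.0
for step in range(1, 2001):
    eta = 1.0 / (10 + step)
    n = rng.integers(N)
    w_sgd -= eta * 2.0 * x[n] * (x[n] * w_sgd - y[n])

print(f"full gradient descent: w = {w_gd:.3f}, SGD: w = {w_sgd:.3f}")
```

Both end up close to the minimizer of the full cost, even though the stochastic version only ever looks at one observation per step.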
In “Stochastic gradient descent for hybrid quantum-classical optimization”, the authors generalize this result to the quantum case, after noting that C can be interpreted as an expectation value estimated from a finite number of shots, with the stochasticity coming from the process of measuring the quantum system.
You mention: “So to me a single shot is essentially useless as I only retrieve either a +1 or −1 … I guess I am missing something here… thanks for the response again!”
So here, the single-shot expectation value is indeed useful in the specific case of quantum gradient descent.
However, convergence will be faster if we increase the number of shots. This suggests a nice strategy: we can begin the minimization with single shots to cheaply reach a rough approximate solution, and as the approximation improves, increase the number of shots to fine-tune and converge towards the local minimum.
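As a toy sketch of this idea (my own illustration, not the construction from the paper), we can simulate the shot noise classically for a single-qubit cost C(\theta) = \langle Z \rangle = \cos\theta, estimate the gradient with the parameter-shift rule, and switch from single shots to more shots as the optimization progresses; the specific schedule and learning rate below are just example choices:

```python
import numpy as np

rng = np.random.default_rng(42)

def measure_z(theta, shots):
    """Simulate `shots` single-shot Z measurements on the state RY(theta)|0>.
    Each shot gives +1 with probability (1 + cos(theta)) / 2 and -1 otherwise,
    so the shot average is an unbiased estimate of <Z> = cos(theta)."""
    p_plus = 0.5 * (1.0 + np.cos(theta))
    outcomes = np.where(rng.random(shots) < p_plus, 1.0, -1.0)
    return outcomes.mean()

def grad_estimate(theta, shots):
    # Parameter-shift rule: dC/dtheta = [C(theta + pi/2) - C(theta - pi/2)] / 2,
    # with each expectation value estimated from a finite number of shots.
    return 0.5 * (measure_z(theta + np.pi / 2, shots)
                  - measure_z(theta - np.pi / 2, shots))

theta = 2.0                                # initial parameter
for step in range(1, 501):
    eta = 1.0 / (10 + step)                # decreasing learning rate
    shots = 1 if step <= 200 else 100      # single shots early, more shots to fine-tune
    theta -= eta * grad_estimate(theta, shots)

print(f"theta = {theta:.3f}, cost = {np.cos(theta):.3f} (minimum is -1 at theta = pi)")
```

Even though each single-shot measurement only returns +1 or −1, the gradient estimates are unbiased, so the early single-shot updates already drift towards the minimum; the larger shot counts later on simply reduce the noise for the final fine-tuning.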
Note that this doesn’t apply to evaluating an expectation value in a general setting, as you correctly point out; there we do need to increase the number of shots to get a better estimate.