There are some obvious cases where PennyLane could parallelize the execution on quantum devices, such as the computation of gradients.
Due to memory constraints, it might not be a good idea to compute the expectation values needed for a gradient in parallel when a simulator is used. When a hardware backend is in use, however, it would be good to queue all 2 * n * shots device runs needed to compute n gradients at once, instead of making 2 * n sequential calls to the quantum hardware.
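To make the idea concrete, here is a minimal sketch in plain Python, not PennyLane's actual API. It assumes the gradient comes from the two-term parameter-shift rule (two evaluations per parameter, shifted by ±π/2), which is one way to arrive at the 2 * n count above. The names `run_on_device`, `shifted` and `parameter_shift_gradient` are hypothetical, and `run_on_device` is a toy stand-in for a remote hardware job rather than a real backend call.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def run_on_device(params):
    """Hypothetical stand-in for one hardware job returning an expectation value.

    Toy model: a product of cosines, for which the parameter-shift rule is exact.
    """
    return math.prod(math.cos(p) for p in params)


def shifted(params, index, shift):
    """Return a copy of params with one entry shifted."""
    out = list(params)
    out[index] += shift
    return out


def parameter_shift_gradient(evaluate, params, max_workers=8):
    """Submit all 2 * n shifted evaluations together, then combine the results."""
    shift = math.pi / 2
    # Build the full job list up front: (+shift, -shift) for each parameter.
    jobs = []
    for i in range(len(params)):
        jobs.append(shifted(params, i, +shift))
        jobs.append(shifted(params, i, -shift))

    # Threads suit this case because a hardware call mostly waits on I/O.
    # pool.map returns results in the same order as the submitted jobs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, jobs))

    # Two-term parameter-shift rule: (f(x + s) - f(x - s)) / 2 with s = pi/2.
    return [(results[2 * i] - results[2 * i + 1]) / 2 for i in range(len(params))]


if __name__ == "__main__":
    print(parameter_shift_gradient(run_on_device, [0.3, 1.1]))
```

With a simulator, the same pattern would hold 2 * n circuit states in memory at once, which is the memory concern raised above; lowering `max_workers` is one possible way to bound that.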