Tips on how to improve performance on IBM hardware

Hello,
I am trying to get some examples (specifically, variations of the variational quantum classifier) working on IBM hardware. I am managing to send jobs to the simulated hardware, and to the “real” hardware too. However, the issue I am finding is that training time is extremely long.

I am assuming therefore, at the moment, it is only feasible to train very small models (~1000 training points, very small network architectures). Is this a correct assumption?

Does anyone have any suggestions about how to improve the time it takes to train a model (the value I should use for shots, max amount of data, max network size, etc.)? Is it worth sacrificing validation data (I do not mean not using test data) in your training loop to reduce training time? Should I view this as a transfer learning-esque problem, where I train a network classically, then add a very small quantum circuit to the end, to take some pressure off the quantum network and hopefully reduce training time?

I am just curious to see how people are getting around what seems to be long network training times, even using ibmq_qasm_simulator.

Thanks! I’m looking forward to using the package more.

EDIT:
To be more specific, I have a variational quantum classifier with 2 qubits and 3 layers. I tried training for only 1 epoch, on just 1 piece of data. In this example, it sent ibmq_qasm_simulator around 36 jobs and took around 230 seconds. This is just for the optimisation step as well. Are these the sort of numbers I should be expecting to see? Is there a way, by hand, I can calculate how many jobs it will send?

Hi @andrew,

Thank you for your questions and welcome to the Xanadu discussion forum.

Please find below some recommendations that you may find useful:

  • Train on a simulator locally, not a cloud simulator.

  • Use IBM hardware that supports reservations rather than a queuing system (not sure if this is publicly available to everyone)

  • If you can’t reserve, choose a device that’s got an empty queue - sounds obvious but it really helps.

  • The number of shots does noticeably change the training time; using 1000 shots may be a reasonable trade-off between accuracy and speed.

  • Validation can definitely slow things down on each step, especially if it’s over a big validation set.
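
To get a feel for the shots trade-off, here is a rough sketch in plain Python (no PennyLane calls). It assumes a single observable whose outcomes are +1 or -1, in which case the standard error of the sampled expectation value is sqrt((1 - expval^2) / shots). The helper name `shot_noise_std` is purely illustrative and not part of any library.

```python
import math

def shot_noise_std(expval, shots):
    """Standard error of a sampled expectation value for a +/-1 observable.

    Each shot returns +1 or -1, so the single-shot variance is 1 - expval**2
    and the error of the mean shrinks as 1/sqrt(shots).
    """
    if shots <= 0:
        raise ValueError("shots must be positive")
    return math.sqrt((1.0 - expval ** 2) / shots)

# Worst case is expval = 0 (maximum single-shot variance).
for shots in (100, 1000, 10000):
    print(shots, round(shot_noise_std(0.0, shots), 4))
```

Under this assumption, going from 1000 to 10000 shots costs ten times as many samples but only shrinks the worst-case error by a factor of about 3, which is one way to see why 1000 shots can be a sensible middle ground.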

I hope these general recommendations help. Please do not hesitate to get back to us if you have further questions.

Xanadu team.

Thank you very much for the response! I am glad to know that the numbers I am getting during training aren’t out of the ordinary. My aim will then be to use a small network that uses small amounts of test data (<1000 points). I will take into account all of your advice, thank you.

Is there an intuitive explanation as to why 1 datapoint, during training, results in >30 jobs on the IBM machine? Is it just that optimisation for each point is a process that requires a lot of calculations?

Hi @andrew,

If you are doing optimization (even for one data point), PennyLane will compute the gradient with respect to all relevant model parameters. So the number of jobs scales with the number of parameters, not just the number of datapoints :+1:

Hello Nathan, thanks for the reply. Sorry, I think I may be missing something. If I have a network with 2 qubits and three layers, including a bias value, would I not have 7 trainable parameters, therefore only 7 jobs per point?

Hi @andrew,

The default scaling for N variable parameters is something like 2 * N + 1 (2 * N for the gradient computation via the parameter-shift rule, and the 1 represents a single evaluation to compute the cost function).

This is a rough estimate, though. If you wanted a more detailed resource count, you’d have to share the explicit code you used.
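
To illustrate where the 2 * N + 1 figure comes from, here is a minimal sketch in plain Python. It does not use PennyLane; the `circuit` function is a hypothetical stand-in (a product of cosines, which is what you would get for the expectation of Z on every qubit after one RX rotation per qubit), and the `evaluations` counter is only there to make the cost visible. The two-term parameter-shift rule with a shift of pi/2 is exact for this toy function.

```python
import math

evaluations = 0  # number of times the toy "circuit" has been executed

def circuit(params):
    """Toy stand-in for a circuit: expectation of Z on every qubit after one RX each."""
    global evaluations
    evaluations += 1
    return math.prod(math.cos(p) for p in params)

def parameter_shift_grad(params, shift=math.pi / 2):
    """Gradient via the two-term parameter-shift rule (2 runs per parameter)."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += shift
        minus[i] -= shift
        grad.append((circuit(plus) - circuit(minus)) / 2.0)
    return grad

params = [0.1, 0.2, 0.3]
cost = circuit(params)                # 1 evaluation for the cost
grad = parameter_shift_grad(params)   # 2 * N evaluations for the gradient
print(evaluations)                    # 2 * 3 + 1 = 7
```

On real hardware each of those evaluations is a separate circuit execution, so one optimisation step on one data point already needs 2 * N + 1 runs. How those runs are grouped into jobs depends on the device plugin, which is part of why the figure is only an estimate.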

Thank you! I think I misinterpreted for a moment what was meant by trainable parameters - sorry. Each gate has three trainable parameters, so 2*N+1 makes sense to me now.
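
For anyone landing on this thread later, the arithmetic for the setup discussed here can be sketched as below. This assumes one three-parameter rotation gate per qubit per layer, and that the bias is a classical parameter that needs no extra circuit runs. Both are assumptions based on the description in this thread, not on the actual code, and `estimate_circuit_evaluations` is a hypothetical helper rather than a PennyLane function.

```python
def estimate_circuit_evaluations(n_qubits, n_layers, params_per_gate=3,
                                 include_cost_evaluation=True):
    """Rough count of circuit runs for one optimisation step on one data point.

    Assumes one parametrised gate per qubit per layer and the two-term
    parameter-shift rule (2 runs per circuit parameter).
    """
    n_params = n_qubits * n_layers * params_per_gate
    evaluations = 2 * n_params
    if include_cost_evaluation:
        evaluations += 1  # one extra run to evaluate the cost itself
    return n_params, evaluations

n_params, n_evals = estimate_circuit_evaluations(n_qubits=2, n_layers=3)
print(n_params, n_evals)  # 18 37
```

With 2 qubits and 3 layers this gives 18 circuit parameters, so 36 runs for the gradient plus 1 for the cost, which is in line with the roughly 36 jobs reported above.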