Preferred method for long training runs?

I have been wondering: what is everyone’s favorite way to do long training/simulation runs? The kind that take hours or days to finish, and that you’d really rather not have tying up your PC for that long?

I’ve been running an 8-qubit variational binary classifier, and each training iteration takes a few minutes. Depending on the problem it can take hours or even days to converge on a set of variational parameters that classifies with acceptable accuracy. I have been running it locally on my desktop, but recently I have been experimenting with running my circuits on AWS, using both their EC2 and SageMaker services. These work, but I have noticed that my time per iteration will sometimes more than quadruple for no apparent reason. That isn’t the problem I was hoping to address here, though; I was just interested in seeing what methods people like to use for their “long haul” training runs.

Hey @Adrian_Kaczmarczyk,

Interesting question; I am curious to see what people reply. For me, I just start simulations on a Friday and check the results on Monday, and I never run anything for more than a day. A few simple lessons learned:

  • Try to understand whether there is a better hyperparameter setting for your problem. For example, parameter-shift rules may be very slow while backprop differentiation is much faster. You may also be able to use compilation to simplify the circuit, and VQE may be faster if you use observable grouping. Sometimes the default settings are just not the right ones for your problem.
  • Save intermediate results, even if it is just a text file that stores the latest parameters.
  • Spend some days or weeks first investigating smaller problems, and only run the big one once you have a good intuition for the problem. You’ll most likely have to rerun everything a few times anyway, one never gets it right the first time :slight_smile:

Finally, a comment from the development side of things: for PennyLane we are trying to balance a very general, versatile and intuitive framework with speed requirements, and there is usually a trade-off between the two. We are busy making many improvements on the speed side at the moment, especially for remote pipelines, the default.qubit device and VQE problems (so watch this space :slight_smile: ). In other words, running your problem a few months from now on the latest master may already change the time per step from minutes to seconds!

If you have a workflow that is painfully slow, feel free to also report it (here or as a GitHub issue/feature request) and we will see if we can advise!

@Adrian_Kaczmarczyk One thing I will recommend is to save your data. Sometimes the run just fails. I once had a BSOD and had not saved any data; that was the last time I made that mistake.

Overnight runs are the easiest to do :laughing:

Another thing I try before running the big one is to run small batches of data on smaller circuits, just to see what to expect. Having an idea of how much time the full run is going to take makes it easier to be patient.
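
This "time a small version first" idea can be turned into a quick helper: run a few iterations of the scaled-down problem and extrapolate. This is a plain-Python sketch; the function and parameter names are mine, not from any library:

```python
import time

def estimate_runtime(step_fn, params, trial_steps=3, total_steps=500):
    """Time a few training iterations and extrapolate to the full run.

    step_fn: one training iteration, returning the updated parameters.
    Returns (seconds_per_step, estimated_total_seconds).
    """
    start = time.perf_counter()
    for _ in range(trial_steps):
        params = step_fn(params)
    per_step = (time.perf_counter() - start) / trial_steps
    return per_step, per_step * total_steps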
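
This "time a small version first" idea can be turned into a quick helper: run a few iterations of the scaled-down problem and extrapolate. This is a plain-Python sketch; the function and parameter names are mine, not from any library:

```python
import time

def estimate_runtime(step_fn, params, trial_steps=3, total_steps=500):
    """Time a few training iterations and extrapolate to the full run.

    step_fn: one training iteration, returning the updated parameters.
    Returns (seconds_per_step, estimated_total_seconds).
    """
    start = time.perf_counter()
    for _ in range(trial_steps):
        params = step_fn(params)
    per_step = (time.perf_counter() - start) / trial_steps
    return per_step, per_step * total_steps
```

Keep in mind this only gives a lower bound for the big run, since simulation cost grows quickly with qubit count and circuit depth.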

Experimenting with other options might feel wasteful at first, since that time could have been spent running the script, but it always helps in the long run.

Having a secondary machine also helps. You can give the lighter jobs to it and keep it running at all times, which frees up your main machine for exploring.

Basically, I agree with @Maria_Schuld on all points! And I am definitely looking forward to the secret sauce they are planning to release later! :exploding_head:

Thank you for weighing in @Maria_Schuld @ankit27kh
I figured out the “saving intermediate results” idea independently: I have my model write out a file with the trained parameters after each step it completes, so I can resume in the event of a crash, or if I find something promising that just needs a few more iterations to converge on a good classification accuracy. I have also discovered the “start on Friday, let it run over the weekend” approach :slight_smile:
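
For anyone setting up the same "write parameters every step, resume after a crash" pattern, here is a minimal sketch using NumPy. The checkpoint filename and the shape of the training loop are placeholders, not anything from a particular framework:

```python
import os
import numpy as np

CHECKPOINT = "params_checkpoint.npy"  # placeholder filename

def load_or_init(init_params, path=CHECKPOINT):
    # resume from the last saved parameters if a checkpoint exists
    if os.path.exists(path):
        return np.load(path)
    return np.asarray(init_params)

def train(step_fn, init_params, n_steps, path=CHECKPOINT):
    params = load_or_init(init_params, path)
    for _ in range(n_steps):
        params = step_fn(params)
        # overwrite the checkpoint after every completed iteration,
        # so a crash costs at most one step of work
        np.save(path, params)
    return params
```

Calling `train` again with the same path picks up from the last saved parameters instead of the initial ones, which is exactly the "just needs a few more iterations" case.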

The idea of running my training on AWS was essentially the “second machine” idea that @ankit27kh suggested, though it’s turning out to be a bit problematic in practice for the reasons I described in my original post.

Still interested in what other people have to say!

Great question @Adrian_Kaczmarczyk! I’m sure many of us have come across this problem, so I’m also happy to see the different answers. I would add that, if you have the chance to work with a team, it helps to have someone to share your ideas with, and you can also try to split the problem into smaller parts that each of you can run independently.
It’s not possible for every problem, but it can be very helpful sometimes :slight_smile: .