Just some comments on the script: I believe the issue arises when the local probability vectors are returned to the hosts, since the OOM error seems to occur at the host level (if I am mistaken, any additional error logs you can provide may help). In this case, you may be running out of memory on the host as well. For problems like this, the best general option is to increase the resources (for example, doubling the number of MPI ranks and nodes used) and check whether the failure still occurs.
Also, the additional overheads from the CUDA runtime library, the custatevec library, MPI dispatch buffers, and caching mechanisms mean (as @isaacdevlugt mentioned) that we can very easily hit the memory ceiling with such a tight margin. Targeting about 50% memory utilisation for workloads like this is often a good rule of thumb, since it leaves room for everything to fit, including any incurred overheads. With HPC systems, though, fitting the largest possible problem size onto the available resources usually becomes a tuning problem.
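As a rough illustration of the 50% rule of thumb, here is a minimal sketch that estimates how much of each GPU the state vector alone would occupy. The qubit count, rank count, and 40 GiB device size are illustrative assumptions, not values taken from your job, and the helper names are hypothetical. It assumes one rank per GPU, an evenly split state vector, and complex128 amplitudes (16 bytes each). The result is only a lower bound, since it ignores the runtime, library, and buffer overheads mentioned above.

```python
GIB = 1024 ** 3

def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Total bytes for a full state vector (16 bytes = one complex128 amplitude)."""
    return (2 ** num_qubits) * bytes_per_amplitude

def per_rank_utilisation(num_qubits: int, num_ranks: int,
                         device_mem_bytes: int,
                         bytes_per_amplitude: int = 16) -> float:
    """Fraction of one device's memory used by its local slice of the state vector.

    Assumes the state vector is split evenly across ranks, one rank per GPU.
    Overheads (CUDA runtime, custatevec workspace, MPI buffers) are NOT included.
    """
    local_bytes = statevector_bytes(num_qubits, bytes_per_amplitude) / num_ranks
    return local_bytes / device_mem_bytes

if __name__ == "__main__":
    # Hypothetical numbers: 33 qubits on GPUs with 40 GiB each.
    for ranks in (4, 8):
        frac = per_rank_utilisation(33, ranks, 40 * GIB)
        print(f"{ranks} ranks: {frac:.0%} of device memory for the state vector alone")
```

With these assumed numbers, 4 ranks would sit at 80% before any overheads, while doubling to 8 ranks brings the estimate down to 40%, which is closer to the target.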
If increasing the node count doesn’t help here (and note that custatevec requires the number of MPI ranks to be a power of 2), then feel free to let us know and we can see if something else is going on.
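If it is useful when resizing the job, the power-of-2 constraint on the rank count can be checked before submitting. This is a small standalone sketch in plain Python; the function names are hypothetical and it does not call any PennyLane or custatevec API.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0

def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("rank count must be at least 1")
    return 1 << (n - 1).bit_length()

if __name__ == "__main__":
    requested_ranks = 6  # hypothetical value for illustration
    if not is_power_of_two(requested_ranks):
        print(f"{requested_ranks} ranks is not a power of 2; "
              f"consider {next_power_of_two(requested_ranks)} instead")
```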
Hi @mlxd - it used to work with the instructions you shared above, and then it somehow stopped working. I am sure I messed up something in my setup… I started clean, and I still see the same problem. Any thoughts on what could be the issue?
/global/homes/p/prmantha/.local/perlmutter/python-3.10/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:112: UserWarning: [/global/u1/p/prmantha/pennylane-lightning-gpu/pennylane_lightning_gpu/src/util/cuda_helpers.hpp][Line:603][Method:getGPUCount]: Error in PennyLane Lightning: no CUDA-capable device is detected
warn(str(e), UserWarning)
/global/homes/p/prmantha/.local/perlmutter/python-3.10/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:112: UserWarning: [/global/u1/p/prmantha/pennylane-lightning-gpu/pennylane_lightning_gpu/src/util/cuda_helpers.hpp][Line:603][Method:getGPUCount]: Error in PennyLane Lightning: no CUDA-capable device is detected
warn(str(e), UserWarning)
/global/homes/p/prmantha/.local/perlmutter/python-3.10/lib/python3.10/site-packages/pennylane_lightning_gpu/lightning_gpu.py:994: RuntimeWarning:
!!!#####################################################################################
!!!
!!! WARNING: INSUFFICIENT SUPPORT DETECTED FOR GPU DEVICE WITH `lightning.gpu`
!!! DEFAULTING TO CPU DEVICE `lightning.qubit`
!!!
!!!#####################################################################################
warn(