Hi @minn-bj
Thanks for the context. Yes, lightning.gpu is better suited to larger HPC workloads: in our experience it only starts to outperform other approaches beyond roughly 18-20 qubits. We observed the same behaviour with NVIDIA's cuStateVec, the CUDA library that lightning.gpu uses, so I'm not sure switching to cuda_quantum will help here; the same limitation may persist.
I'm not sure how your workload is configured, but there may be some gains from using catalyst.accelerate (see the Catalyst 0.13.0-dev11 documentation). This lets Catalyst offload specific classical functions to the GPU, where one is available, from within a qjit-compiled program.
There's no guarantee, but you may get somewhat better performance with the lightning.kokkos CUDA backend. That said, I suspect it will run into the same issue as cuStateVec, given the small size of your problem.
For CPU scaling, you can try the lightning.kokkos backend, which uses OpenMP by default for gate application and measurements. If you'd like to try lightning.qubit with OpenMP instead, my responses here should help.
Feel free to let us know whether the above helps, or share a minimal example of your workload so we can offer more suggestions.