QAOA layer and CUDA

Hi. I’m looking to leverage PyTorch + CUDA to explore some QAOA simulations. However, I’m unsure whether some aspects of qml.layer are GPU-amenable.

For example, I am attempting an on-GPU calculation similar to this example.

My code is below:

Software versions:

pennylane == 0.19.1
torch == 1.10.0

QAOA circuit setup:

import pennylane as qml
from pennylane import qaoa


def qaoa_circuit_from_graph(graph, n_layers):
    n_wires = len(graph.nodes)
    cost_h, mixer_h = qaoa.maxcut(graph)

    # One QAOA block: cost layer followed by mixer layer
    def qaoa_layer(params):
        gamma, beta = params[0], params[1]
        qaoa.cost_layer(gamma, cost_h)
        qaoa.mixer_layer(beta, mixer_h)

    dev = qml.device("default.qubit", wires=n_wires)

    @qml.qnode(dev, interface='torch', diff_method="backprop")
    def circuit(params):
        # Prepare the uniform superposition
        for w in range(n_wires):
            qml.Hadamard(wires=w)
        qml.layer(qaoa_layer, n_layers, params)
        # One expectation value per term of the cost Hamiltonian
        return [qml.expval(term) for term in cost_h.terms[1]]

    # cost_h.terms is (coeffs, observables) in PennyLane 0.19;
    # the last coefficient is returned as the constant offset
    return circuit, cost_h.terms[0][-1]
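
For context, the graphs list used in the test code below is not defined in this post. Purely as a hypothetical setup for a self-contained run, it could be a few random networkx graphs, e.g.:

import networkx as nx

# Hypothetical stand-in for the graphs list used in the test code: a few random 9-node graphs
graphs = [nx.erdos_renyi_graph(n=9, p=0.5, seed=s) for s in range(3)]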

Test code (CPU):

import torch
from torch.profiler import profile, ProfilerActivity

n_layers = 10
params_shape = [n_layers, 2]
params = torch.rand(params_shape)
# graphs: a list of networkx graphs (not defined in this snippet)
circuit, offset = qaoa_circuit_from_graph(graphs[0], n_layers)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    circuit(params)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                aten::mul        10.26%       6.465ms        28.59%      18.018ms      16.350us          1102  
                 aten::to         7.03%       4.431ms        26.76%      16.863ms       7.194us          2344  
           aten::_to_copy        13.96%       8.797ms        19.73%      12.432ms       7.655us          1624  
              aten::slice        14.06%       8.862ms        16.03%      10.100ms       5.404us          1869  
             aten::einsum         5.46%       3.438ms        15.12%       9.526ms      56.035us           170  
                aten::cat         1.18%     743.000us        11.15%       7.025ms      27.657us           254  
               aten::_cat         5.08%       3.204ms         9.97%       6.282ms      24.732us           254  
               aten::roll         1.56%     982.000us         9.35%       5.894ms      46.778us           126  
              aten::stack         1.44%     909.000us         9.28%       5.847ms      46.405us           126  
                aten::div         2.91%       1.834ms         7.85%       4.945ms      14.544us           340  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 63.022ms

Test code (CUDA):

n_layers = 10
params_shape = [n_layers, 2]
# Same test, but the parameters now live on the GPU
params_cuda = torch.rand(params_shape, device='cuda')
circuit, offset = qaoa_circuit_from_graph(graphs[0], n_layers)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    circuit(params_cuda)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                              aten::mul        13.12%      10.282ms        21.98%      17.220ms      15.626us       2.570ms        48.93%       2.570ms       2.332us          1102  
                                               aten::to         2.03%       1.593ms        17.66%      13.835ms      15.338us       0.000us         0.00%     349.000us       0.387us           902  
                                       cudaLaunchKernel        16.38%      12.831ms        16.38%      12.831ms       5.535us       0.000us         0.00%       0.000us       0.000us          2318  
                                           aten::einsum         4.89%       3.833ms        15.99%      12.529ms      73.700us       0.000us         0.00%     510.000us       3.000us           170  
                                         aten::_to_copy         4.23%       3.317ms        15.63%      12.242ms      32.472us       0.000us         0.00%     349.000us       0.926us           377  
                                            aten::stack         1.29%       1.013ms        12.87%      10.080ms      80.000us       0.000us         0.00%     549.000us       4.357us           126  
                                              aten::cat         0.59%     463.000us        10.28%       8.057ms      62.945us       0.000us         0.00%     559.000us       4.367us           128  
                                             aten::_cat         4.45%       3.488ms         9.69%       7.594ms      59.328us     559.000us        10.64%     559.000us       4.367us           128  
                                            aten::copy_         3.05%       2.388ms         8.53%       6.683ms      17.180us     385.000us         7.33%     385.000us       0.990us           389  
                                            aten::slice         7.17%       5.621ms         8.50%       6.659ms       4.404us       0.000us         0.00%       0.000us       0.000us          1512  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 78.345ms
Self CUDA time total: 5.252ms

Any insights are appreciated. Thanks!

How many qubits are you using? I previously found that for systems with fewer than 15 qubits, the overhead dominates and the CPU is faster. See GPU and CUDA support.
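
If you want to verify this for your own circuit, a quick wall-clock comparison is usually enough. Below is a minimal sketch, assuming the qaoa_circuit_from_graph function and graphs list from your post; the time_circuit helper is only for illustration:

import time
import torch

def time_circuit(device, n_layers=10, reps=20):
    # Build the circuit and place the parameters on the requested device
    circuit, _ = qaoa_circuit_from_graph(graphs[0], n_layers)
    params = torch.rand([n_layers, 2], device=device)
    circuit(params)  # warm-up call (first-run setup, kernel launches)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        circuit(params)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / reps

print(f"CPU : {time_circuit('cpu'):.4f} s per forward pass")
print(f"CUDA: {time_circuit('cuda'):.4f} s per forward pass")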

Regarding your question about qml.layer: even though we do have that function, I would recommend just using a for-loop instead, as sketched below.
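
For example, the qml.layer(qaoa_layer, n_layers, params) call inside the QNode above could be replaced with an explicit loop over the parameter rows; this sketch should behave identically:

@qml.qnode(dev, interface='torch', diff_method="backprop")
def circuit(params):
    for w in range(n_wires):
        qml.Hadamard(wires=w)
    # Explicit loop instead of qml.layer: apply one QAOA block per parameter row
    for layer_params in params:
        qaoa_layer(layer_params)
    return [qml.expval(term) for term in cost_h.terms[1]]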

Hi @christina, I’m using ~9 qubits, so yes, that is likely the issue here. I had read about the overhead but failed to test it myself. Doh!

Is it possible (or does it even make sense) to reduce the overhead by batch-loading these circuits onto the GPU and training there? Essentially, I’d like to understand where the overhead comes from. I assume it’s related to copying between the CPU and GPU?

Thank you!

The overhead should be due to copying information to and from the GPU.
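
One way to see this in the trace you already collected is to sort the same profiler output by CUDA time and look at the copy-related entries (aten::to, aten::_to_copy, aten::copy_). A small sketch, reusing the prof object from your CUDA run:

# Rank ops by total CUDA time instead of CPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

# Totals (in microseconds) for the copy-related ops, which indicate host<->device traffic
for evt in prof.key_averages():
    if evt.key in ("aten::to", "aten::_to_copy", "aten::copy_"):
        print(evt.key, evt.cpu_time_total, evt.cuda_time_total)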

We delegate GPU handling to torch. While our implementation currently works (though we still run into the occasional bug), we haven’t yet investigated or optimized the torch device for performance very much. We hope that work will start soon, so you will likely see performance improvements over time.

If you come across any insights, we’d love to hear about them.