RuntimeError with Mixed Device Tensors when Integrating PennyLane with PyTorch

I’m encountering a runtime error while trying to integrate PennyLane with a PyTorch model for a hybrid quantum-classical neural network project. The error occurs during the forward pass of my model, specifically when executing a quantum circuit defined with PennyLane and integrating its output with PyTorch tensors.

Here’s the error message:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu, cuda:0!

This error arises when I attempt to pass tensors from my PyTorch model, which is on a GPU, to a PennyLane quantum node, which operates on the CPU. I’ve made sure to move all relevant tensors to the CPU before executing the quantum node and back to the GPU afterwards, yet the error persists.
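
One way to narrow this down is to print the device of every tensor that reaches the quantum node, including trainable parameters registered on the module. The sketch below is only an assumption about how such a check could look; `devices_of` and `align_to` are hypothetical helper names, not PennyLane or PyTorch functions. An `nn.Parameter` such as `self.q_params` follows the module when `model.to(device)` is called, so moving only the input to the CPU may still leave the weights on `cuda:0`.

```python
import torch
from torch import nn


def devices_of(*tensors):
    """Return the sorted list of distinct devices the given tensors live on."""
    return sorted({str(t.device) for t in tensors})


def align_to(device, *tensors):
    """Return the tensors moved to (or already on) a single device."""
    return tuple(t.to(device) for t in tensors)


# Stand-ins for one batch element and the flat quantum weights.
elem = torch.zeros(5)
q_params = nn.Parameter(0.01 * torch.randn(15 * 5))

if torch.cuda.is_available():
    # Mimic the situation in the post: weights on the GPU, input on the CPU.
    q_params_gpu = q_params.to("cuda")
    print(devices_of(elem, q_params_gpu))  # two devices means a mismatch

# Move both arguments to the same device before calling the circuit.
elem_cpu, q_params_cpu = align_to("cpu", elem, q_params)
print(devices_of(elem_cpu, q_params_cpu))
```

In the forward pass this would correspond to something like `q_net(elem.cpu(), self.q_params.cpu())`; whether that resolves the error in this model is an untested assumption.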

Below is a simplified version of the code that leads to this error:

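import pandas as pd
import torch
import pennylane as qml
from torch import nn
from pennylane import numpy as np
from sklearn.model_selection import train_test_split
from transformers import AutoModel, BertTokenizerFast
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")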

#%% md
## Load Dataset

There are two columns in the dataset: label and text. The column "text" holds the message content, whereas "label" is a binary variable where 1 indicates that the message is spam and 0 indicates that it is not spam.
df = pd.read_csv("data/spamdata_v2.csv")

# check class distribution
df['label'].value_counts(normalize = True)


#%% md
This dataset will now be divided into three sets: train, validation, and test.

We will fine-tune the model using the train set and the validation set, and make predictions for the test set.
train_text, temp_text, train_labels, temp_labels = train_test_split(df['text'], df['label'], 

# we will use temp_text and temp_labels to create validation and test set
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, 
#%% md
## Import BERT Model and BERT Tokenizer
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
#%% md
## Tokenize the Sentences

Since the messages (text) in the dataset are of variable lengths, we will employ padding to ensure that the length of each message is the same. We can pad messages using the maximum sequence length. To determine the correct padding length, we may also examine the distribution of sequence lengths in the train set.
# get length of all the messages in the train set
seq_len = [len(i.split()) for i in train_text]

pd.Series(seq_len).hist(bins = 30)
#%% md
It is evident that most of the messages contain no more than about 25 words, whereas the maximum length is 175. If we choose 175 as the padding length, all input sequences will have a length of 175 and most of the tokens in those sequences will be padding tokens, which will not help the model learn anything useful and will also slow down training.

We will thus set the padding length to 25.
max_seq_len = 25
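
As a rough cross-check of a value like 25, a padding length can be read off a percentile of the word counts. The sketch below uses made-up message lengths and a hypothetical helper `padding_length`; the 90th percentile is an arbitrary choice for illustration, not something taken from the dataset.

```python
import math


def padding_length(lengths, percent=90):
    """Smallest length that covers at least `percent` % of the messages."""
    ordered = sorted(lengths)
    # Position of the first element at or above the requested percentile.
    index = max(0, math.ceil(percent * len(ordered) / 100) - 1)
    return ordered[index]


# Made-up word counts: mostly short messages plus one long outlier.
example_lengths = [5, 8, 10, 12, 15, 18, 20, 22, 25, 175]
print(padding_length(example_lengths))
```

With these illustrative numbers the single long outlier does not influence the result, which is the same reasoning that motivates padding to 25 rather than 175.
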
# tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(
    max_length = max_seq_len,

# tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(
    max_length = max_seq_len,

# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    max_length = max_seq_len,
#%% md
## Convert the Integer Sequences to Tensors
# for train set
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

# for validation set
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

# for test set
test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())
#%% md
## Create DataLoaders

Now, dataloaders will be created for both the train and validation sets. During the training phase, these dataloaders will send batches of train data and validation data to the model as input. 
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

#define a batch size
batch_size = 32

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)

# dataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)

# sampler for sampling the data during training
val_sampler = SequentialSampler(val_data)

# dataLoader for validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)
#%% md
## Freeze BERT Parameters

We freeze all the layers of the BERT model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.
# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False
#%% md
## Define Model Architecture Hybrid transfer learning model (classical-to-quantum).

Setting of the main parameters of the network model and of the training process.
n_qubits = 5                     # Number of qubits
q_depth = 4                      # Depth of the quantum circuit (number of variational layers)
max_layers = 15                  # Keep 15 even if not all are used.
q_delta = 0.01                   # Initial spread of random quantum weights
#%% md
Let us initialize a PennyLane device with the default simulator.
dev = qml.device('default.qubit', wires=n_qubits)

class Quantumnet(nn.Module):

  def __init__(self, bert):
      super(Quantumnet, self).__init__()

      self.bert = bert

      self.pre_net = nn.Linear(768, n_qubits)
      self.q_params = nn.Parameter(q_delta * torch.randn(max_layers * n_qubits))
      self.post_net = nn.Linear(n_qubits, 2)
  def forward(self, sent_id, mask):

    _, cls_hs = self.bert(sent_id, attention_mask=mask,  return_dict=False)

    pre_out = self.pre_net(cls_hs) 
    q_in = torch.tanh(pre_out) * np.pi / 2.0 

    # Apply the quantum circuit to each element of the batch and append to q_out
    q_out = torch.Tensor(0, n_qubits)
    q_out = q_out.to(device)
    for elem in q_in:
      q_out_elem = q_net(elem,self.q_params).float().unsqueeze(0)
      q_out = torch.cat((q_out, q_out_elem))

    return self.post_net(q_out)

dev = qml.device('default.qubit', wires=n_qubits)

@qml.qnode(dev, interface='torch')
def q_net(q_in, q_weights_flat):
        # Reshape weights
        q_weights = q_weights_flat.reshape(max_layers, n_qubits)
        # Start from state |+> , unbiased w.r.t. |0> and |1>
        # Embed features in the quantum node
        # Sequence of trainable variational layers
        for k in range(q_depth):
            RY_layer(q_weights[k + 1])

        # Expectation values in the Z basis
        return [qml.expval(qml.PauliZ(j)) for j in range(n_qubits)]

# the constructor only takes the BERT model; the other settings are module-level globals
model = Quantumnet(bert)
# push the model to GPU
model = model.to(device)
from transformers import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr = 1e-3)
#%% md

#%% md
## Find Class Weights

Our dataset has an imbalance between classes. The vast majority of messages are not spam. Therefore, we will first calculate class weights for the labels in the train set and then send these weights to the loss function so that the class imbalance is taken care of.
from sklearn.utils.class_weight import compute_class_weight

#compute the class weights
class_wts = compute_class_weight(class_weight = "balanced", 
                                 classes = np.unique(train_labels), 
                                 y = train_labels)
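
For reference, the "balanced" heuristic in scikit-learn weights each class by n_samples / (n_classes * count of that class). The sketch below reproduces that arithmetic on made-up labels; `balanced_weights` is a hypothetical helper, and the label counts are illustrative rather than taken from the spam dataset.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Made-up imbalanced labels: eight "not spam" (0) and two "spam" (1).
example_labels = [0] * 8 + [1] * 2
print(balanced_weights(example_labels))
```

The rarer class receives the larger weight, so its errors count more in the weighted loss.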

# convert class weights to tensor
weights= torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)

# loss function
cross_entropy  = nn.CrossEntropyLoss(weight=weights) 

# number of training epochs
epochs = 10
model = model.to(device)

#%% md
## Fine-Tune BERT

So far, we have described the model's architecture, specified the optimizer and loss function, and prepared the dataloaders. Now we must define two functions, one to train (fine-tune) the model and one to evaluate it.
# function to train the model
def train():

  total_loss, total_accuracy = 0, 0
  # empty list to save model predictions
  total_preds = []
  # iterate over batches
  for step,batch in enumerate(train_dataloader):
    # progress update after every 50 batches.
    if step % 50 == 0 and not step == 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

    # push the batch to gpu
    batch = [r.to(device) for r in batch]
    sent_id, mask, labels = batch
    # clear previously calculated gradients 
    model.zero_grad()

    # get model predictions for the current batch
    preds = model(sent_id, mask)

    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)

    # add on to the total loss
    total_loss = total_loss + loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradients to 1.0. It helps in preventing the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    # model predictions are stored on GPU. So, push it to CPU
    preds = preds.detach().cpu().numpy()

    # append the model predictions
    total_preds.append(preds)

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_dataloader)
  # predictions are in the form of (no. of batches, size of batch, no. of classes).
  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  #returns the loss and predictions
  return avg_loss, total_preds
# function for evaluating the model
def evaluate():
  # deactivate dropout layers
  model.eval()

  total_loss, total_accuracy = 0, 0
  # empty list to save the model predictions
  total_preds = []

  # iterate over batches
  for step,batch in enumerate(val_dataloader):
    # Progress update every 50 batches.
    if step % 50 == 0 and not step == 0:
      # Calculate elapsed time in minutes.
      elapsed = format_time(time.time() - t0)
      # Report progress.
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))

    # push the batch to gpu
    batch = [t.to(device) for t in batch]

    sent_id, mask, labels = batch

    # deactivate autograd
    with torch.no_grad():
      # model predictions
      preds = model(sent_id, mask)

      # compute the validation loss between actual and predicted values
      loss = cross_entropy(preds,labels)

      total_loss = total_loss + loss.item()

      preds = preds.detach().cpu().numpy()
      total_preds.append(preds)


  # compute the validation loss of the epoch
  avg_loss = total_loss / len(val_dataloader) 

  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  return avg_loss, total_preds
#%% md
## Start Model Training
# set initial loss to infinite
best_valid_loss = float('inf')

weights= torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)

# loss function

# empty lists to store training and validation loss of each epoch

#for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    #train model
    train_loss, _ = train()
    #evaluate model
    valid_loss, _ = evaluate()
    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss, '')
    # append training and validation loss
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

I've ensured that all tensors are correctly moved to the CPU before calling the quantum node and then moved back to the GPU as necessary. Despite this, the error suggests there's still a mismatch in device allocation.

Could you please help me understand what might be causing this issue and how to resolve it? Am I missing a step in correctly handling device allocation between PennyLane and PyTorch tensors?

Thank you for your assistance!
pytorch    2.2.0
pennylane-0.35.1 pennylane-lightning-0.35.1
cuda 4060

Hey @linzhongyan1, welcome to the forum! :sunglasses:

I think I’m missing the definition of bert in your code. Can you post your full code for me to copy-paste on my side and try to replicate your problem?

Thank you for your reply. I have posted the full code. After running this segment of code, the following error occurred:

RuntimeError Traceback (most recent call last)
Cell In[464], line 23
20 print(‘\n Epoch {:} / {:}’.format(epoch + 1, epochs))
22 #train model
—> 23 train_loss, _ = train()
25 #evaluate model
26 valid_loss, _ = evaluate()

Cell In[462], line 28, in train()
25 model.zero_grad()
27 # get model predictions for the current batch
—> 28 preds = model(sent_id, mask)
30 # compute the loss between actual and predicted values
31 loss = cross_entropy(preds, labels)

File ~.conda\envs\pytorch\Lib\site-packages\torch\nn\modules\, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
→ 1511 return self._call_impl(*args, **kwargs)

File ~.conda\envs\pytorch\Lib\site-packages\torch\nn\modules\, in Module._call_impl(self, *args, **kwargs)
1515 # If we don’t have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
→ 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

Cell In[457], line 25, in Quantumnet.forward(self, sent_id, mask)
23 # the quantum layer operations are executed on the CPU
24 for elem in‘cpu’):
---> 25 q_out_elem = q_net(elem, self.q_params).float().unsqueeze(0) # the definition of q_net should handle data on the CPU
26 q_out =, q_out_elem), dim=0)
28 # move the output of the quantum layer back to the device chosen for the model

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\, in, *args, **kwargs)
1034 full_transform_program._set_all_argnums(
1035 self, args, kwargs, argnums
1036 ) # pylint: disable=protected-access
1038 # pylint: disable=unexpected-keyword-arg
→ 1039 res = qml.execute(
1040 (self._tape,),
1041 device=self.device,
1042 gradient_fn=self.gradient_fn,
1043 interface=self.interface,
1044 transform_program=full_transform_program,
1045 config=config,
1046 gradient_kwargs=self.gradient_kwargs,
1047 override_shots=override_shots,
1048 **self.execute_kwargs,
1049 )
1051 res = res[0]
1053 # convert result to the interface in case the qfunc has no parameters

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\interfaces\, in execute(tapes, device, gradient_fn, interface, transform_program, config, grad_on_execution, gradient_kwargs, cache, cachesize, max_diff, override_shots, expand_fn, max_expansion, device_batch_transform, device_vjp)
646 # Exiting early if we do not need to deal with an interface boundary
647 if no_interface_boundary_required:
→ 648 results = inner_execute(tapes)
649 return post_processing(results)
651 _grad_on_execution = False

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\interfaces\, in *make_inner_execute…inner_execute(tapes, ***)
259 if numpy_only:
260 tapes = tuple(qml.transforms.convert_to_numpy_parameters(t) for t in tapes)
→ 261 return cached_device_execution(tapes)

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\interfaces\, in cache_execute…wrapper(tapes, **kwargs)
378 return (res, ) if return_tuple else res
380 else:
381 # execute all unique tapes that do not exist in the cache
382 # convert to list as new device interface returns a tuple
→ 383 res = list(fn(tuple(execution_tapes.values()), **kwargs))
385 final_res =
387 for i, tape in enumerate(tapes):

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\, in DefaultQubit.execute(self, circuits, execution_config)
517 interface = (
518 execution_config.interface
519 if execution_config.gradient_method in {“backprop”, None}
520 else None
521 )
522 if max_workers is None:
→ 523 results = tuple(
524 simulate(
525 c,
526 rng=self._rng,
527 prng_key=self._prng_key,
528 debugger=self._debugger,
529 interface=interface,
530 state_cache=self._state_cache,
531 )
532 for c in circuits
533 )
534 else:
535 vanilla_circuits = [convert_to_numpy_parameters(c) for c in circuits]

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\, in (.0)
517 interface = (
518 execution_config.interface
519 if execution_config.gradient_method in {“backprop”, None}
520 else None
521 )
522 if max_workers is None:
523 results = tuple(
→ 524 simulate(
525 c,
526 rng=self._rng,
527 prng_key=self._prng_key,
528 debugger=self._debugger,
529 interface=interface,
530 state_cache=self._state_cache,
531 )
532 for c in circuits
533 )
534 else:
535 vanilla_circuits = [convert_to_numpy_parameters(c) for c in circuits]

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\qubit\, in simulate(circuit, rng, prng_key, debugger, interface, state_cache)
203 def simulate(
204 circuit: qml.tape.QuantumScript,
205 rng=None,
209 state_cache: Optional[dict] = None,
210 ) → Result:
211 “”“Simulate a single quantum script.
213 This is an internal function that will be called by the successor to default.qubit.
239 “””
→ 240 state, is_state_batched = get_final_state(circuit, debugger=debugger, interface=interface)
241 if state_cache is not None:
242 state_cache[circuit.hash] = state

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\qubit\, in get_final_state(circuit, debugger, interface)
126 is_state_batched = bool(prep and prep.batch_size is not None)
127 for op in circuit.operations[bool(prep) :]:
→ 128 state = apply_operation(op, state, is_state_batched=is_state_batched, debugger=debugger)
130 # Handle postselection on mid-circuit measurements
131 if isinstance(op, qml.Projector):

File ~.conda\envs\pytorch\Lib\, in singledispatch…wrapper(*args, **kw)
905 if not args:
906 raise TypeError(f’{funcname} requires at least ’
907 ‘1 positional argument’)
→ 909 return dispatch(args[0].class)(*args, **kw)

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\qubit\, in apply_operation(op, state, is_state_batched, debugger)
150 @singledispatch
151 def apply_operation(
152 op: qml.operation.Operator, state, is_state_batched: bool = False, debugger=None
153 ):
154 “”“Apply and operator to a given state.
156 Args:
197 “””
→ 198 return _apply_operation_default(op, state, is_state_batched, debugger)

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\qubit\, in _apply_operation_default(op, state, is_state_batched, debugger)
202 “”“The default behaviour of apply_operation, accessed through the standard dispatch
203 of apply_operation, as well as conditionally in other dispatches.”“”
204 if (
207 ) or (op.batch_size and is_state_batched):
→ 208 return apply_operation_einsum(op, state, is_state_batched=is_state_batched)
209 return apply_operation_tensordot(op, state, is_state_batched=is_state_batched)

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\devices\qubit\, in apply_operation_einsum(op, state, is_state_batched)
97 op._batch_size = batch_size # pylint:disable=protected-access
98 reshaped_mat = math.reshape(mat, new_mat_shape)
→ 100 return math.einsum(einsum_indices, reshaped_mat, state)

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\math\, in einsum(indices, like, optimize, *operands)
537 if like is None:
538 like = get_interface(*operands)
→ 539 operands = np.coerce(operands, like=like)
540 if optimize is None or like == “torch”:
541 # torch einsum doesn’t support the optimize keyword argument
542 return np.einsum(indices, *operands, like=like)

File ~.conda\envs\pytorch\Lib\site-packages\autoray\, in do(fn, like, *args, **kwargs)
31 “”“Do function named fn on (*args, **kwargs), peforming single
32 dispatch to retrieve fn based on whichever library defines the class of
33 the args[0], or the like keyword argument if specified.
77 <tf.Tensor: id=91, shape=(3, 3), dtype=float32>
78 “””
79 backend = choose_backend(fn, *args, like=like, **kwargs)
—> 80 return get_lib_fn(backend, fn)(*args, **kwargs)

File ~.conda\envs\pytorch\Lib\site-packages\pennylane\math\, in _coerce_types_torch(tensors)
598 if len(device_set) > 1: # pragma: no cover
599 # GPU specific case
600 device_names = ", “.join(str(d) for d in device_set)
→ 601 raise RuntimeError(
602 f"Expected all tensors to be on the same device, but found at least two devices, {device_names}!”
603 )
605 device = device_set.pop() if len(device_set) == 1 else None
606 tensors = [torch.as_tensor(t, device=device) for t in tensors]

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu, cuda:0!

Hey @linzhongyan1,

It’s a little hard to tell what’s going on because your code base is quite large. That, and I can’t run your code because I’m missing your input data. That said, it might be that your training data is the culprit; it might be on the cpu and everything else is on the gpu.

I think if you call torch.set_default_tensor_type('torch.cuda.FloatTensor'), every torch tensor created afterwards will be on the GPU (see the PyTorch Forums thread "How to create a tensor on GPU as default").
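
In recent PyTorch releases `torch.set_default_tensor_type` is deprecated (as of PyTorch 2.1), and `torch.set_default_device` is the suggested replacement. Here is a minimal sketch of that variant, assuming a fallback to the CPU when no GPU is present; the variable names are only for illustration.

```python
import torch

# Pick the GPU when one is available; otherwise stay on the CPU.
target = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(target)

# Factory calls without an explicit device now use the default device.
x = torch.zeros(3)
y = torch.tensor([1.0, 2.0, 3.0])
print(x.device, y.device)

# Tensors that already exist are not moved; they still need .to(...).
torch.set_default_device("cpu")  # restore the usual default
```

This only changes where new tensors are created, so any tensor built earlier (or explicitly moved with `.cpu()`) would still need to be moved by hand.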

Let me know if that helps!