Hi @_risto,
As a general approach, that could be done indeed.
When dealing with image datasets, as was hinted previously, feeding the input data into a convolutional neural network (CNN) is also very popular for image classification. Several architectures use convolutional networks (including ResNet).
For example in the Training a classifier
tutorial includes the creation of a CNN which is then put as a layer before calculating the loss in each step.