Modularizing Deep Learning via Pairwise Learning With Kernels
- Shiyu Duan, Shujian Yu, Jose Principe
- The central idea here is to re-interpret deep networks: treat the nonlinearity not as the output of a layer but as the input of the next one, so that the regression (the weights) is performed on this nonlinear projection.
- In this sense, each re-defined layer implements the 'kernel trick': tasks (like classification) that are difficult in the original linear space become easier when projected into a kernel feature space.
- The kernel allows pairwise comparisons of datapoints. E.g. a radial basis function kernel measures the Gaussian similarity between data points. An SVM is a kernel machine in this sense.
- A natural extension (one the authors have considered) is to use non-pointwise or non-one-to-one kernel functions -- e.g. ones that multiply multiple layer outputs. This is of course standard in kernel machines.
- Because you are comparing projected datapoints pairwise, it's natural to tune each layer's weights with a contrastive loss, maximizing the distance / discrimination between different classes.
- Hence this is semi-supervised contrastive classification, something that is quite popular these days.
- The last layer is tuned with cross-entropy on labels, but only a few labels are required since the data is already well separated.
- Demonstrated on small-ish datasets, concordant with their computational resources ...
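The layerwise pairwise idea can be sketched roughly as below -- a minimal numpy sketch, not the authors' implementation. The tanh feature map, cosine-normalized Gram matrix, ±1 pairwise targets, and random-search "training" are all illustrative assumptions standing in for the paper's actual choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(z):
    # layer nonlinearity, viewed as the feature map of the kernel
    return np.tanh(z)

def layer_kernel(W, X):
    """Pairwise Gram matrix induced by one redefined layer:
    k(x, x') = <phi(W x), phi(W x')>, cosine-normalized."""
    F = phi(X @ W.T)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return F @ F.T

def pairwise_loss(K, y):
    """Contrastive target: kernel ~ +1 for same-class pairs, ~ -1 for
    different-class pairs (an illustrative choice, not the paper's exact
    objective)."""
    T = np.where(y[:, None] == y[None, :], 1.0, -1.0)
    return np.mean((K - T) ** 2)

# toy 2-class data
X = np.vstack([rng.normal(-2, 1, (20, 5)), rng.normal(2, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)

# tune one layer greedily using only pairwise same/different-class labels;
# naive random search stands in for gradient descent
W = rng.normal(size=(8, 5))
init = pairwise_loss(layer_kernel(W, X), y)
best = init
for _ in range(200):
    W_try = W + 0.1 * rng.normal(size=W.shape)
    L = pairwise_loss(layer_kernel(W_try, X), y)
    if L < best:
        W, best = W_try, L
```

Each module is trained this way in isolation (the next layer would take `phi(X @ W.T)` as its input), which is what makes the scheme modular and parallelizable; only the final linear classifier needs a handful of fully labeled examples for cross-entropy.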
I think in general this is an important result, even if it's not wholly unique / somewhat anticipated (the paper is a year old at the time of writing). Modular training of neural networks is great for efficiency, parallelization, and biological implementations! Transport of weights between layers is hence non-essential.
Class labels still are, but I wonder if temporal continuity can solve some of these problems?
(There is plenty of other effort in this area -- see also {1544})