Venue: ICLR 2015
This paper, inspired by Hinton's *Distilling the Knowledge in a Neural Network*, is another solid piece of work. Its main idea: using a deeper and thinner student model yields better results, with the outputs of intermediate layers used as hints during training.
In this paper, we aim to address the network compression problem by taking advantage of depth. We propose a novel approach to train thin and deep networks, called FitNets, to compress wide and shallower (but still deep) networks. The method is rooted in the recently proposed Knowledge Distillation (KD) (Hinton & Dean, 2014) and extends the idea to allow for thinner and deeper student models. We introduce intermediate-level hints from the teacher hidden layers to guide the training process of the student, i.e., we want the student network (FitNet) to learn an intermediate representation that is predictive of the intermediate representations of the teacher network. Hints allow the training of thinner and deeper networks. Results confirm that having deeper models allows us to generalize better, whereas making these models thin helps us reduce the computational burden significantly. We validate the proposed method on MNIST, CIFAR-10, CIFAR-100, SVHN and AFLW benchmark datasets and provide evidence that our method matches or outperforms the teacher's performance, while requiring notably fewer parameters and multiplications.
The method the paper uses is intuitively straightforward: since a very deep network is hard to train directly, adding a loss at an intermediate layer splits training into two stages, which should yield better results. Where does the intermediate loss come from? Naturally, from the teacher network's intermediate feature maps. The paper calls this method Hint-based Training (HT).
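The Stage-1 objective above can be sketched as a squared-L2 distance between the teacher's hint-layer features and the student's guided-layer features after the regressor. This is a minimal, hypothetical plain-Python illustration (flat vectors stand in for real feature maps; `hint_loss` is my own name, not from the paper's code):

```python
# Minimal sketch of the Stage-1 hint loss:
#   L_HT = 0.5 * || teacher_hint - regressed_student ||^2
# Real implementations compare whole feature maps; flat vectors stand in here.

def hint_loss(teacher_hint, regressed_student):
    """Squared-L2 distance between the teacher's hint features and the
    student's guided-layer features after passing through the regressor."""
    return 0.5 * sum((t - s) ** 2 for t, s in zip(teacher_hint, regressed_student))

# Toy example: identical features give zero loss.
print(hint_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

In Stage 1 this loss updates the student's lower layers plus the regressor; Stage 2 then trains the whole student with the usual KD loss on the soft outputs.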
A natural question arises when reading the paper: how can the intermediate layers of the teacher and student models correspond in dimension and meaning? The authors propose a solution: attach a regressor that connects the two intermediate layers and establishes the correspondence. A fully connected regressor is the obvious choice, but it adds substantial computation; a convolutional layer turns out to be the better option. The paper also explains how to choose the convolutional regressor so that its output matches the size of the teacher's intermediate output.
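The size bookkeeping behind the convolutional regressor can be sketched as follows. Assuming stride 1 and no padding (a simplification of the paper's setup; the helper names are my own), a convolution maps spatial size N to N − k + 1, so choosing k = N_student − N_teacher + 1 makes the regressor's output spatial size equal the teacher hint's:

```python
def conv_out_size(in_size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution along one spatial axis."""
    return (in_size + 2 * padding - kernel) // stride + 1

def regressor_kernel(student_size, teacher_size):
    """Kernel size for a stride-1, no-padding convolutional regressor so that
    the regressed student feature map matches the teacher hint's spatial size.
    (Hypothetical helper illustrating the dimension-matching idea.)"""
    return student_size - teacher_size + 1

# Example: student guided layer is 16x16, teacher hint is 12x12 -> 5x5 kernel.
k = regressor_kernel(16, 12)
print(k, conv_out_size(16, k))  # 5 12
```

The channel mismatch is handled separately by the number of filters in the regressor, which is set to the teacher hint's channel count.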
How other papers evaluate FitNets:
Lately, Romero et al. proposed FitNet to compress networks from wide and shallow to thin and deep. In order to learn from the intermediate representations of teacher network, FitNet makes the student mimic the full feature maps of the teacher. However, such assumptions are too strict since the capacities of teacher and student may differ greatly. In certain circumstances, FitNet may adversely affect the performance and convergence.
References: