Parameter Sharing and Typing in Machine Learning
Last Updated: 17 Jun, 2024
We usually constrain or penalise model parameters with respect to a fixed region or point. L2 regularisation (weight decay), for example, penalises model parameters for deviating from the fixed value of zero.
However, we sometimes need other ways to express prior knowledge about suitable parameter values. We may not know precisely what values the parameters should take, but our knowledge of the domain and model architecture tells us that there should be some dependencies between the model parameters.
A common dependency we want to express is that certain parameters should be close to one another.
Parameter Typing
Consider two models performing the same classification task (with the same set of classes) whose input distributions differ somewhat:
- Model A has parameters \boldsymbol{w}^{(A)}
- Model B has parameters \boldsymbol{w}^{(B)}
The two models map the input to two different but related outputs:
\hat{y}^{(A)}=f\left(\boldsymbol{w}^{(A)}, \boldsymbol{x}\right)
and
\hat{y}^{(B)}=g\left(\boldsymbol{w}^{(B)}, \boldsymbol{x}\right)
Assume the tasks are similar enough (perhaps with similar input and output distributions) that the model parameters should be close to each other: for all i, w_{i}^{(A)} should be close to w_{i}^{(B)}. We can exploit this information through regularisation by applying a parameter norm penalty of the form \Omega\left(\boldsymbol{w}^{(A)}, \boldsymbol{w}^{(B)}\right)=\left\|\boldsymbol{w}^{(A)}-\boldsymbol{w}^{(B)}\right\|_{2}^{2}. We used an L2 penalty here, but other choices are possible.
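The penalty above can be sketched in a few lines of NumPy. The parameter values below are made up for illustration; in practice this term would be added to the two task losses, scaled by a regularisation coefficient:

```python
import numpy as np

# Hypothetical parameter vectors for two related models A and B
w_A = np.array([0.9, -1.2, 0.5])
w_B = np.array([1.0, -1.0, 0.4])

def typing_penalty(w_a, w_b):
    """Squared L2 distance between the two parameter vectors:
    Omega(w_A, w_B) = ||w_A - w_B||_2^2."""
    diff = w_a - w_b
    return float(np.dot(diff, diff))

# During training, this term pulls the corresponding entries of
# w_A and w_B towards each other.
penalty = typing_penalty(w_A, w_B)  # = 0.01 + 0.04 + 0.01 = 0.06
```

Minimising the sum of both task losses plus this penalty lets each model fit its own data while staying close to the other model's parameters.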
Parameter Sharing
This approach was used to regularise the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were designed so that many parameters in the classifier model could be paired with corresponding parameters in the unsupervised model.
While a parameter norm penalty is one way to encourage sets of parameters to be close, the more common approach is to use constraints: to force sets of parameters to be equal. Because we interpret the various models or model components as sharing a unique set of parameters, this form of regularisation is commonly referred to as parameter sharing.
A significant advantage of parameter sharing over regularising parameters to be close (through a norm penalty) is that only a subset of the parameters (the unique set) needs to be stored in memory. In certain models, such as the convolutional neural network, this can lead to a substantial reduction in memory footprint.
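As a concrete sketch of hard weight sharing (not an example from the article), a tied-weight autoencoder stores a single matrix W and reuses its transpose in the decoder, so only one copy of the parameters exists in memory:

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared weight matrix, used by both the encoder and the decoder
# (the decoder uses its transpose). This is a hard constraint:
# the two components literally share parameters, so only W is stored.
W = rng.normal(size=(2, 4))  # 2 hidden units, 4 input features

def encode(x):
    return np.tanh(W @ x)

def decode(h):
    # Decoder weights are tied to W.T -- no second matrix is kept.
    return W.T @ h

x = rng.normal(size=4)
x_hat = decode(encode(x))  # reconstruction has the input's shape
```

Any gradient update to W immediately affects both the encoder and the decoder, which is exactly the "shared unique set of parameters" view described above.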
Convolutional neural networks (CNNs) used in computer vision are by far the most widespread and extensive use of parameter sharing. Many statistical properties of natural images are invariant to translation. A photograph of a cat, for example, can be shifted one pixel to the right and still be a photograph of a cat. CNNs take this property into account by sharing parameters across multiple image locations: the same feature (a hidden unit with the same weights) is computed over different locations in the input. This means that whether the cat appears in column i or column i + 1 of the image, we can find it with the same cat detector.
Thanks to parameter sharing, CNNs have been able to dramatically reduce the number of unique model parameters and to grow much larger without requiring a corresponding increase in training data. This remains one of the best illustrations of how domain knowledge can be efficiently integrated into a network architecture.
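The sharing described above can be demonstrated with a toy 1D convolution (strictly, the cross-correlation that deep learning libraries call "convolution"). The kernel and signal values below are invented for illustration; the point is that one shared kernel is applied at every position, so shifting the input shifts the output:

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide one shared kernel across every position of the input:
    the same weights detect the same feature everywhere."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

# A toy edge detector, shared across all positions of the signal.
kernel = np.array([1.0, -1.0])
signal = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

out = conv1d_valid(signal, kernel)          # [0., -1., 0., 1.]

# Translate the input by one position: the detections move with it.
shifted = np.roll(signal, 1)
out_shifted = conv1d_valid(shifted, kernel)  # [0., 0., -1., 0.]
```

The same two weights detect the edge wherever it occurs, which is the 1D analogue of finding the cat with the same detector in column i or column i + 1.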