Index
Limitations in machine learning
What is transfer learning?
How transfer learning solves these limitations
Semi-supervised Learning vs. Transfer Learning
Multiview Learning
Multitask Learning vs. Transfer Learning
Definition
Categorization of Transfer Learning
Data-based Interpretation
Model-based Interpretation
Applications
Limitations in machine learning
Traditional machine learning assumes that abundant labeled training instances are available and that they follow the same distribution as the test data.
However, in many scenarios, collecting sufficient training data is often
expensive, time-consuming, or even unrealistic.
Semi-supervised learning can partly solve this problem by relaxing the need for large amounts of labeled data.
Typically, a semi-supervised approach only requires a limited number of labeled
data, and it utilizes a large amount of unlabeled data to improve the learning
accuracy.
But in many cases, unlabeled instances are also difficult to collect, which usually
makes the resultant traditional models unsatisfactory.
How transfer learning solves these limitations
Transfer learning focuses on transferring the knowledge across domains.
The idea of transfer learning may originally come from educational psychology. According to the generalization theory of transfer, learning to transfer is the result of the generalization of experience.
According to this theory, the prerequisite of transfer is that there needs to be a
connection between two learning activities.
In practice, a person who has learned the violin can learn the piano faster than
others, since both the violin and the piano are musical instruments and may share
some common knowledge.
Transfer learning aims to leverage knowledge from a related domain (called the source domain) to improve learning performance or to minimize the number of labeled examples required in a target domain.
It is worth mentioning that the transferred knowledge does not always bring a
positive impact on new tasks. If there is little in common between domains,
knowledge transfer could be unsuccessful.
For example, although Spanish and French have a close relationship with each
other and both belong to the Romance group of languages, people who learn
Spanish may experience difficulties in learning French, such as using the wrong
vocabulary or conjugation. This occurs because previous successful experience in
Spanish can interfere with learning the word formation, usage, pronunciation,
conjugation, and so on, in French.
According to the discrepancy between domains, transfer learning can be further divided into two categories: homogeneous transfer learning and heterogeneous transfer learning.
Homogeneous transfer learning approaches are developed for handling situations where the domains share the same feature space.
Heterogeneous transfer learning refers to the knowledge transfer process in the
situations where the domains have different feature spaces.
Semi-supervised Learning vs. Transfer Learning
Semi-supervised learning lies between supervised and unsupervised learning: it utilizes abundant unlabeled instances, combined with a limited number of labeled instances, to train a learner. Both the labeled and the unlabeled instances are drawn from the same distribution.
Transfer learning, in contrast, focuses on transferring knowledge across domains, and the data distributions of the source and the target domains are usually different.
Multiview Learning
Multiview learning focuses on machine learning problems with multiview data.
A view represents a distinct feature set.
For example, a video object can be described from two different viewpoints, i.e., the image signal and the audio signal.
Multiview learning describes an object from multiple views, which results in
abundant information.
By properly considering the information from all views, the learner’s
performance can be improved.
There are several strategies adopted in Multiview learning such as subspace
learning, multikernel learning, and co-training.
For example, a multiview transfer learning approach has been proposed for activity learning, which transfers activity knowledge between heterogeneous sensor platforms.
Multitask Learning vs. Transfer Learning
Multitask learning reinforces each task by taking advantage of the interconnections between tasks, that is, by considering both the intertask relevance and the intertask differences. It transfers knowledge by simultaneously learning several related tasks, and it pays equal attention to each task.
Transfer learning focuses on transferring knowledge across domains. It transfers the knowledge contained in the related domains, and it pays more attention to the target task than to the source task.
Besides, they adopt some similar strategies for constructing models, such as feature
transformation and parameter sharing.
Definition
Domain: A domain D is composed of two parts, a feature space 𝒳 and a marginal distribution P(X); in other words, D = {𝒳, P(X)}. The symbol X denotes an instance set, defined as X = {x_i | x_i ∈ 𝒳, i = 1, ..., n}.
Task: A task 𝒯 consists of a label space 𝒴 and a decision function f, i.e., 𝒯 = {𝒴, f}. The decision function f is implicit and is expected to be learned from the sample data.
Categorization of Transfer Learning
Transfer learning problems can be divided into three categories:
1) Transductive transfer learning: the label information only comes from the source domain.
2) Inductive transfer learning: the label information of the target-domain instances is available.
3) Unsupervised transfer learning: the label information is unknown for both the source and the target domains.
Another categorization is based on the consistency between the source and the target feature spaces and label spaces.
If 𝒳_S = 𝒳_T and 𝒴_S = 𝒴_T, the scenario is termed homogeneous transfer learning.
Otherwise, if 𝒳_S ≠ 𝒳_T and/or 𝒴_S ≠ 𝒴_T, the scenario is referred to as heterogeneous transfer learning.
The transfer learning approaches can be categorized into four groups:
instance-based, feature-based, parameter-based, and relational-based approaches.
Instance-based transfer learning approaches are mainly based on the instance weighting strategy.
Feature-based approaches transform the original features to create a new feature
representation; they can be further divided into two subcategories, that is,
asymmetric and symmetric feature-based transfer learning.
Asymmetric approaches transform the source features to match the target ones.
In contrast, symmetric approaches attempt to find a common latent feature
space and then transform both the source and the target features into a new
feature representation.
The parameter-based transfer learning approaches transfer the knowledge at the
model/parameter level.
Relational-based transfer learning approaches mainly focus on the problems in
relational domains. Such approaches transfer the logical relationship or rules
learned in the source domain to the target domain.
Data-based Interpretation
Many transfer learning approaches, especially the data-based approaches, focus on transferring the knowledge via the adjustment and the transformation of data.
The following figure shows the strategies and the objectives of the approaches from the data perspective.
Fig: Strategies and the objectives of the transfer learning approaches from the data perspective.
A) Instance Weighting Strategy
Suppose that a large number of labeled source-domain instances and a limited number of target-domain instances are available, and that the domains differ only in their marginal distributions, i.e., P_S(X) ≠ P_T(X) and P_S(Y|X) = P_T(Y|X).
For example, we need to build a model to diagnose cancer in a specific region
where the elderly are the majority. Limited target-domain instances are given, and
relevant data are available from another region where young people are the
majority. Directly transferring all the data from another region may be
unsuccessful, since the marginal distribution difference exists, and the elderly
have a higher risk of cancer than younger people. In this scenario, it is natural to
consider adapting the marginal distributions. A simple idea is to assign weights to
the source-domain instances in the loss function.
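As a rough illustration of this idea (not the exact method from the text), the sketch below weights source instances in the loss of a scikit-learn classifier via the sample_weight argument; the weights themselves are placeholders that would normally come from a weight-estimation procedure such as the one discussed next.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: Xs/ys from the source region, Xt/yt a small labeled target set.
rng = np.random.default_rng(0)
Xs, ys = rng.normal(0.0, 1.0, (500, 5)), rng.integers(0, 2, 500)
Xt, yt = rng.normal(0.5, 1.0, (30, 5)), rng.integers(0, 2, 30)

# Placeholder importance weights for the source instances; in practice these
# would be estimated so that the weighted source distribution matches the target.
w_src = rng.uniform(0.2, 1.0, size=len(Xs))
w_tgt = np.ones(len(Xt))  # target instances keep full weight

X = np.vstack([Xs, Xt])
y = np.concatenate([ys, yt])
w = np.concatenate([w_src, w_tgt])

# The per-instance weights scale each instance's contribution to the loss.
clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```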
Based on the studies of weight estimation, some instance-based transfer learning frameworks or algorithms have been proposed.
One is a multisource framework termed the two-stage weighting framework for multisource-domain adaptation (2SW-MDA), which has the following two stages.
1) Instance weighting: The source-domain instances are assigned with weights to
reduce the marginal distribution difference, which is similar to KMM (kernel mean matching).
2) Domain weighting: Weights are assigned to each source domain for reducing
the conditional distribution difference based on the smoothness assumption.
Then, the source-domain instances are reweighted based on the instance
weights and the domain weights. These reweighted instances and the labeled
target-domain instances are used to train the target classifier.
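The exact 2SW-MDA weighting is not spelled out here; as a hedged illustration of the KMM idea it builds on, the sketch below estimates source-instance weights by matching kernel means, using a ridge-regularized linear solve instead of the constrained quadratic program of full KMM.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(Xs, Xt, gamma=1.0, ridge=1e-3):
    """Simplified kernel mean matching: weight source instances so that the
    weighted source kernel mean approximates the target kernel mean.
    This drops the box/sum constraints of the full KMM quadratic program."""
    ns, nt = len(Xs), len(Xt)
    K = rbf_kernel(Xs, Xs, gamma=gamma)                   # source-source kernel
    kappa = (ns / nt) * rbf_kernel(Xs, Xt, gamma=gamma).sum(axis=1)
    beta = np.linalg.solve(K + ridge * np.eye(ns), kappa)
    return np.clip(beta, 0.0, None)                       # keep weights nonnegative

# Usage: w_src = kmm_weights(Xs, Xt); pass w_src as sample weights when training.
```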
In addition to directly estimating the weighting parameters, adjusting the weights iteratively is also effective. The key is to design a mechanism to decrease the weights of the instances that have negative effects on the target learner.
A representative work is TrAdaBoost, a framework that extends AdaBoost.
AdaBoost is an effective boosting algorithm designed for traditional machine
learning tasks. In each iteration of AdaBoost, a learner is trained on the instances
with updated weights, which results in a weak classifier. The weighting
mechanism of instances ensures that the instances with incorrect classification are
given more attention. Finally, the resultant weak classifiers are combined to form
a strong classifier.
TrAdaBoost extends AdaBoost to the transfer learning scenario;
A new weighting mechanism is designed to reduce the impact of the distribution
difference.
Specifically, in TrAdaBoost, the labeled source-domain and labeled target-domain
instances are combined as a whole, i.e., a training set, to train the weak classifier.
The weighting operations are different for the source-domain and the target-domain instances.
In each iteration, a temporary variable δ̄, which measures the classification error rate on the labeled target-domain instances, is calculated.
Then, the weights of the target-domain instances are updated based on δ̄ and
the individual classification results, while the weights of the source-domain
instances are updated based on a designed constant and the individual
classification results.
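As a hedged sketch of this weighting mechanism (simplified from the published TrAdaBoost algorithm, with my own choice of base learner), one iteration could look roughly like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_iteration(Xs, ys, Xt, yt, w_s, w_t, n_src, n_iters):
    """One simplified TrAdaBoost-style iteration.
    Xs/ys, Xt/yt: labeled source and target instances; w_s, w_t: their current weights."""
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    w = np.concatenate([w_s, w_t])

    # Weak learner trained on the combined, weighted training set.
    clf = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w / w.sum())
    pred_s, pred_t = clf.predict(Xs), clf.predict(Xt)

    # Error rate measured only on the labeled target-domain instances.
    err_t = np.sum(w_t * (pred_t != yt)) / np.sum(w_t)
    err_t = np.clip(err_t, 1e-10, 0.499)

    beta_t = err_t / (1.0 - err_t)                                   # target update factor
    beta_s = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_iters))    # fixed source constant

    # Misclassified target instances gain weight; misclassified source instances lose weight.
    w_t_new = w_t * beta_t ** (-(pred_t != yt).astype(float))
    w_s_new = w_s * beta_s ** ((pred_s != ys).astype(float))
    return clf, w_s_new, w_t_new
```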
The Multi-Source TrAdaBoost (MsTrAdaBoost) algorithm mainly has the following two steps in each iteration.
following two steps in each iteration.
1) Candidate classifier construction: A group of candidate weak classifiers are
trained on the weighted instances in the pairs of each source domain and the
target domain.
2) Instance weighting: The classifier with the minimal classification error rate on the target-domain instances is selected and then used to update the weights of the instances.
Finally, the selected classifiers from each iteration are combined to form the final
classifier.
B) Feature Transformation Strategy
The feature transformation strategy is often adopted in feature-based approaches.
Feature-based approaches transform each original feature into a new feature
representation for knowledge transfer.
The objectives of constructing a new feature representation include minimizing
the marginal and the conditional distribution difference, preserving the properties
or the potential structures of the data, and finding the correspondence between
features.
The operations of feature transformation can be divided into three types, i.e.,
feature augmentation,
feature reduction,
and feature alignment.
Besides, feature reduction can be further divided into several types such as
feature mapping, feature clustering, feature selection, and feature encoding.
1) Distribution Difference Metric: The objective of feature transformation is to
reduce the distribution difference of the source and the target-domain instances.
2) Feature Augmentation: Feature augmentation operations are widely used in
feature transformation, especially in symmetric feature-based approaches.
A feature augmentation method transforms the original features by simple
feature replication. Specifically, in single-source transfer learning scenario,
the feature space is augmented to three times its original size. The new feature
representation consists of general features, source-specific features, and
target-specific features.
For the transformed source-domain instances, their target-specific features are
set to zero. Similarly, for the transformed target-domain instances, their
source-specific features are set to zero.
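A minimal sketch of this feature-replication idea (with hypothetical array shapes) is shown below; each instance's feature space is tripled into general, source-specific, and target-specific blocks.

```python
import numpy as np

def augment_features(X, domain):
    """Triple the feature space: [general | source-specific | target-specific].
    Source instances get zeros in the target-specific block and vice versa."""
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    elif domain == "target":
        return np.hstack([X, zeros, X])
    raise ValueError("domain must be 'source' or 'target'")

# Example: 4-dimensional instances become 12-dimensional after augmentation.
Xs_aug = augment_features(np.random.rand(100, 4), "source")
Xt_aug = augment_features(np.random.rand(20, 4), "target")
```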
3) Feature Mapping: In traditional machine learning, there are many feasible
mapping-based methods of extracting features such as principal component analysis
(PCA) and kernelized-PCA (KPCA).
However, these methods mainly focus on the data variance rather than the
distribution difference.
In order to reduce the distribution difference, some feature extraction methods have been proposed specifically for transfer learning.
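Many such transfer-oriented feature extraction methods minimize the maximum mean discrepancy (MMD) between domains; as a hedged illustration (not the specific methods referenced above), the empirical MMD between two samples can be computed as follows:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(Xs, Xt, gamma=1.0):
    """Squared (biased) empirical maximum mean discrepancy with an RBF kernel.
    A small value means the two samples look similar in the kernel feature space."""
    k_ss = rbf_kernel(Xs, Xs, gamma=gamma).mean()
    k_tt = rbf_kernel(Xt, Xt, gamma=gamma).mean()
    k_st = rbf_kernel(Xs, Xt, gamma=gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

# A feature mapping for transfer learning would be chosen to make mmd2 small
# while preserving the useful structure of the data.
```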
4) Feature Clustering: Feature clustering aims to find a more abstract feature representation of the original features. Although it can be regarded as a way of feature extraction, it differs from the mapping-based methods described above.
5) Feature Selection: Feature selection is another kind of operation for feature
reduction, which is used to extract the pivot features.
The pivot features are the ones that behave in the same way in different domains.
Due to the stability of these features, they can be used as the bridge to transfer the
knowledge.
A representative approach is structural correspondence learning (SCL). Briefly, SCL consists of the following steps to construct a new feature representation.
1) Feature selection: SCL first performs feature selection operations to obtain
the pivot features.
2) Mapping learning: The pivot features are utilized to find a low-dimensional
common latent feature space by using the structural learning technique.
3) Feature stacking: A new feature representation is constructed by feature
augmentation, i.e., stacking the original features with the obtained low-dimensional features.
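As a rough sketch of these three steps (simplified from the published SCL method; the pivot-selection rule and latent dimensionality here are placeholder choices):

```python
import numpy as np
from sklearn.linear_model import Ridge

def scl_representation(X_src, X_tgt, pivot_idx, h=25):
    """Simplified structural correspondence learning.
    pivot_idx: indices of pivot features (assumed chosen by, e.g., frequency)."""
    X_all = np.vstack([X_src, X_tgt])              # pivot prediction needs no labels
    nonpivot_idx = [j for j in range(X_all.shape[1]) if j not in set(pivot_idx)]

    # Learn linear predictors of each pivot feature from the non-pivot features,
    # then take the top singular vectors of the stacked weights as the shared subspace.
    W = np.column_stack([
        Ridge(alpha=1.0).fit(X_all[:, nonpivot_idx], X_all[:, p]).coef_
        for p in pivot_idx
    ])
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :h]                               # projection to the latent space

    # Stack the original features with the low-dimensional shared features.
    def transform(X):
        return np.hstack([X, X[:, nonpivot_idx] @ theta])
    return transform

# Usage: transform = scl_representation(Xs, Xt, pivot_idx=[0, 3, 7]); Xs_new = transform(Xs)
```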
6) Feature Encoding: In addition to feature extraction and selection, feature encoding
is also an effective tool.
For example, autoencoders, which are often adopted in deep learning area, can be
used for feature encoding.
An autoencoder consists of an encoder and a decoder.
The encoder tries to produce a more abstract representation of the input, while
the decoder aims to map back that representation and to minimize the
reconstruction error.
Autoencoders can be stacked to build a deep learning architecture.
Once an autoencoder completes its training process, another autoencoder can be stacked on top of it.
The newly added autoencoder is then trained by using the encoded output of the previously trained autoencoder as its input. In this way, deep learning architectures can be constructed.
Some transfer learning approaches are developed based on autoencoders, such as the stacked denoising autoencoder (SDA).
denoising autoencoder (SDA).
The denoising autoencoder, which can enhance the robustness, is an extension of the basic
one.
This kind of autoencoder contains a randomly corrupting mechanism that adds noise to the
input before mapping.
For example, an input can be corrupted or partially destroyed by adding a masking noise or
Gaussian noise. The denoising autoencoder is then trained to minimize the denoising
reconstruction error between the original clean input and the output.
The SDA algorithm has the following steps.
1) Autoencoder training: The source-domain and target-domain instances are used to
train a stack of denoising autoencoders in a greedy layer-by-layer way.
2) Feature encoding and stacking: A new feature representation is constructed by
stacking the encoding output of intermediate layers, and the features of the instances are
transformed into the obtained new representation.
3) Learner training: The target classifier is trained on the transformed labeled instances.
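A minimal PyTorch sketch of one denoising-autoencoder layer of this kind of pipeline is given below (the layer sizes, noise level, and optimizer settings are illustrative assumptions, not the published SDA configuration):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One layer of a stacked denoising autoencoder."""
    def __init__(self, in_dim, hid_dim, noise=0.3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.decoder = nn.Linear(hid_dim, in_dim)
        self.noise = noise

    def forward(self, x):
        # Randomly mask a fraction of the inputs (masking noise), then reconstruct.
        mask = (torch.rand_like(x) > self.noise).float()
        return self.decoder(self.encoder(x * mask))

def train_dae(dae, X, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(dae.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dae(X), X)   # denoising reconstruction error
        loss.backward()
        opt.step()
    with torch.no_grad():
        return dae.encoder(X)                      # encoded features for the next layer

# Greedy layer-wise training on combined (unlabeled) source + target instances;
# the stacked encodings are then used to train the target classifier on labeled data.
X_all = torch.randn(200, 50)                       # placeholder source + target features
h1 = train_dae(DenoisingAutoencoder(50, 32), X_all)
h2 = train_dae(DenoisingAutoencoder(32, 16), h1)
```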
Although the SDA algorithm has excellent performance for feature extraction, it
still has some drawbacks such as high computational and parameter-estimation
cost.
In order to reduce the training time and to speed up traditional SDA algorithms, a
modified version of SDA, i.e., marginalized stacked linear denoising autoencoder
(mSLDA) was proposed.
This algorithm adopts linear autoencoders and marginalizes the random corruption step in closed form. Although linear autoencoders may seem too simple to learn complex features, they are often sufficient to achieve competent performance when encountering high-dimensional data.
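A rough NumPy port of the closed-form marginalized linear denoising layer that underlies this family of methods is sketched below (the regularization constant and the tanh nonlinearity are assumptions):

```python
import numpy as np

def marginalized_denoising_layer(X, p=0.5, lam=1e-5):
    """Closed-form linear denoising layer, marginalizing over random feature
    corruption with probability p. X has shape (d, n): features in rows."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])           # append a constant bias feature
    q = np.full(d + 1, 1.0 - p)
    q[-1] = 1.0                                     # the bias feature is never corrupted

    S = Xb @ Xb.T                                   # scatter matrix
    Q = S * np.outer(q, q)                          # expected corrupted scatter, off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))             # ... and its diagonal
    P = S[:d, :] * q[np.newaxis, :]                 # expected clean-corrupted cross term

    W = P @ np.linalg.inv(Q + lam * np.eye(d + 1))  # reconstruction weights
    return np.tanh(W @ Xb)                          # (d, n) transformed features

# Layers can be stacked by feeding the output back in, mirroring the stacked SDA idea.
```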
7) Feature Alignment: Feature augmentation and feature reduction mainly focus on
the explicit features in a feature space.
In contrast, in addition to the explicit features, feature alignment also focuses on
some implicit features such as the statistic features and the spectral features.
Therefore, feature alignment can play various roles in the feature transformation
process. For example, the explicit features can be aligned to generate a new
feature representation, or the implicit features can be aligned to construct a
satisfactory feature transformation.
There are several kinds of features that can be aligned, including subspace features, spectral features, and statistic features.
Subspace feature alignment
Subspace feature alignment generally has the following steps.
1) Subspace generation: In this step, the instances are used to generate the
respective subspaces for the source and the target domains. The orthonormal
bases of the source- and the target-domain subspaces are then obtained, which
are denoted by MS and MT , respectively. These bases are used to learn the shift
between the subspaces.
2) Subspace alignment (SA): In the second step, a mapping that aligns the bases MS and MT of the subspaces is learned, and the features of the instances are projected onto the aligned subspaces to generate a new feature representation.
3) Learner training: Finally, the target learner is trained on the transformed
instances.
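A compact sketch of these steps using PCA bases (the subspace dimension and the choice of classifier are placeholder assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def subspace_alignment(Xs, ys, Xt, k=10):
    """Subspace alignment: learn PCA bases MS and MT, align MS toward MT, then
    train a classifier on the aligned source features."""
    MS = PCA(n_components=k).fit(Xs).components_.T   # (d, k) source subspace basis
    MT = PCA(n_components=k).fit(Xt).components_.T   # (d, k) target subspace basis

    M = MS.T @ MT                                    # alignment matrix between the bases
    Xs_aligned = Xs @ MS @ M                         # source projected into the aligned space
    Xt_proj = Xt @ MT                                # target projected onto its own basis

    clf = KNeighborsClassifier(n_neighbors=1).fit(Xs_aligned, ys)
    return clf.predict(Xt_proj)

# Usage (placeholder data): preds = subspace_alignment(Xs, ys, Xt)
```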
Another representative subspace feature alignment approach is the geodesic flow kernel (GFK).
GFK is closely related to geodesic flow subspaces (GFSs).
GFS generally takes the following steps to align features.
1) Subspace generation: GFS first generates two subspaces of the source and
the target domains by performing PCA, respectively.
2) Subspace interpolation: The two obtained subspaces can be viewed as two
points on the Grassmann Manifold. A finite number of the interpolated
subspaces are generated between these two subspaces based on the geometric
properties of the manifold.
3) Feature projection and stacking: The original features are transformed by
stacking the corresponding projections from all the obtained subspaces.
Despite the usefulness and superiority of GFS, there is a problem of how to determine the number of interpolated subspaces.
GFK resolves this problem by integrating an infinite number of subspaces located on the geodesic curve from the source subspace to the target one.
The key of GFK is to construct an infinite-dimensional feature space that incorporates the information of all the subspaces lying on the geodesic flow.
In order to compute the inner product in the resultant infinite-dimensional space,
the GFK is defined and derived.
Statistic feature alignment
Statistic feature alignment is another kind of feature alignment.
A representative approach is correlation alignment (CORAL). CORAL constructs the transformation matrix of the source features by aligning the second-order statistic features, i.e., the covariance matrices.
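CORAL itself is simple enough to sketch in a few lines; the regularization added to the covariances below is an assumption made to keep the matrix roots well defined:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def coral(Xs, Xt, eps=1.0):
    """Correlation alignment: re-color the source features so that their
    covariance matches the target covariance (whiten, then re-color)."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    A = np.real(fractional_matrix_power(Cs, -0.5) @ fractional_matrix_power(Ct, 0.5))
    return Xs @ A                                   # transformed source features

# A classifier trained on coral(Xs, Xt) with the source labels can then be
# applied directly to the target instances.
```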
Spectral feature alignment (SFA)
In traditional machine learning area, spectral clustering is a clustering technique
based on graph theory.
The key of this technique is to utilize the spectrum, i.e., eigenvalues, of the
similarity matrix to reduce the dimension of the features before clustering.
The similarity matrix is constructed to quantitatively assess the relative similarity
of each pair of data/vertices.
SFA generally contains the following five steps.
1) Feature selection: In this step, feature selection operations are performed to select the domain-independent (pivot) features.
There are three strategies to select domain-independent features.
These strategies are based on the occurrence frequency of words, the mutual
information between features and labels, and the mutual information between
features and domains, respectively.
2) Similarity matrix construction:
Once the domain-specific and the domain-independent features are
identified, a bipartite graph is constructed.
Each edge of this bipartite graph is assigned with a weight that measures the
co-occurrence relationship between a domain-specific word and a domain-
independent word.
Based on the bipartite graph, a similarity matrix is then constructed.
3) SFA: In this step, a spectral clustering algorithm is adapted and performed to align
domain-specific features.
Specifically, based on the eigenvectors of the graph Laplacian, a feature alignment mapping is constructed, and the domain-specific features are mapped into a low-dimensional feature space.
4) Feature stacking: The original features and the low-dimensional features are
stacked to produce the final feature representation.
5) Learner training: The target learner is trained on the labeled instances with the
final feature representation.
There are some other spectral transfer learning approaches, such as the cross-domain spectral classifier (CDSC). The general ideas and steps of this approach are presented as follows.
1) Similarity matrix construction: In the first step, two similarity matrices are constructed, corresponding to the whole instance set and to the target-domain instances, respectively.
2) SFA: An objective function is designed with respect to a graph-partition
indicator vector; a constraint matrix is constructed, which contains pair-wise
must-link information. Instead of seeking the discrete solution of the indicator
vector, the solution is relaxed to be continuous, and the eigen-system problem
corresponding to the objective function is solved to construct the aligned
spectral features.
3) Learner training: A traditional classifier is trained on the transformed
instances.
Model-based Interpretation
Transfer learning approaches can also be interpreted from the model perspective.
The main objective of a transfer learning model is to produce accurate prediction results on the target domain, for example, classification or clustering results.
A transfer learning model may consist of a few sub-modules such as classifiers,
extractors, or encoders. These submodules may play different roles, for example,
feature adaptation or pseudo-label generation.
Fig: Strategies and objectives of the transfer learning approaches from the model perspective.
A) Model Control Strategy:
From the perspective of the model, a natural thought is to directly add model-level regularizers to the learner's objective function.
In this way, the knowledge contained in the preobtained source models can be
transferred into the target model during the training process.
A general framework termed the domain adaptation machine (DAM) is designed for multisource transfer learning.
The goal of DAM is to construct a robust classifier for the target domain with the
help of some preobtained base classifiers that are, respectively, trained on
multiple source domains.
1) Consensus regularizer:
CRF is designed for multisource transfer learning with no labeled target-domain instances.
The framework constructs m_S classifiers, one corresponding to each source domain, and these classifiers are required to reach mutual consensus on the target domain.
2) Domain-dependent regularizer: Fast-DAM is a specific algorithm of DAM. In light
of the manifold assumption and the graph-based regularizer, fast-DAM designs a
domain-dependent regularizer.
3) Domain-dependent regularizer + Universum regularizer:
Univer-DAM is an extension of fast-DAM. Its objective function contains an additional regularizer, i.e., the Universum regularizer.
This regularizer usually utilizes an additional data set, termed the Universum, whose instances belong to neither the positive nor the negative class. Univer-DAM treats the source-domain instances as the Universum for the target domain.
B) Parameter Control Strategy
The parameter control strategy focuses on the parameters of models.
The parameter control strategy focuses on the parameters of models.
In the application of object categorization, the knowledge from known source
categories can be transferred into target categories via object attributes such as
shape and color.
The attribute priors, i.e., probabilistic distribution parameters of the image
features corresponding to each attribute, can be learned from the source domain
and then used to facilitate learning the target classifier.
The parameters of a model actually reflect the knowledge learned by the model.
Therefore, it is possible to transfer the knowledge at the parametric level.
1) Parameter Sharing:
An intuitive way of controlling the parameters is to directly share the parameters
of the source learner to the target learner.
Parameter sharing is widely employed especially in the network-based
approaches.
For example, if we have a neural network for the source task, we can freeze (or
say, share) most of its layers and only finetune the last few layers to produce a
target network.
In addition to network-based parameter sharing, matrix factorization-based
parameter sharing is also workable.
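As a hedged PyTorch sketch of network-based parameter sharing (the layer sizes here are made up for illustration):

```python
import torch.nn as nn

# A source network assumed to be already trained on the source task.
source_net = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),            # source-task output head
)

# Share (freeze) the feature-extraction layers and replace only the head.
for param in source_net[:4].parameters():
    param.requires_grad = False   # frozen layers keep the source knowledge

target_net = nn.Sequential(
    source_net[0], source_net[1], source_net[2], source_net[3],
    nn.Linear(32, 5),             # new head fine-tuned on the target task
)
# Only parameters with requires_grad=True (the new head) are updated during target training.
```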
2) Parameter Restriction:
Another parameter-control type strategy is to restrict the parameters.
Different from the parameter sharing strategy, which forces the models to share some parameters, the parameter restriction strategy only requires the parameters of the source and the target models to be similar.
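A minimal sketch of such a restriction is an extra penalty that pulls the target parameters toward the source parameters (an L2-style penalty chosen here for illustration, not a specific published method):

```python
import torch

def restricted_loss(task_loss, target_params, source_params, mu=0.1):
    """Add a penalty that keeps the target model's parameters close to the
    (fixed) source model's parameters instead of forcing them to be identical."""
    penalty = sum(
        torch.sum((p_t - p_s.detach()) ** 2)
        for p_t, p_s in zip(target_params, source_params)
    )
    return task_loss + mu * penalty

# Usage: loss = restricted_loss(criterion(outputs, labels),
#                               target_net.parameters(), source_net.parameters())
```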
C) Model Ensemble Strategy
In sentiment analysis applications related to product reviews, data or models from multiple product domains are available and can be used as the source domains.
Combining data or models directly into a single domain may not be successful
because the distributions of these domains are different from each other.
Model ensemble is another commonly used strategy. This strategy aims to
combine a number of weak classifiers to make the final predictions.
Some previously mentioned transfer learning approaches already adopted this
strategy. For example, TrAdaBoost and MsTrAdaBoost ensemble the weak
classifiers via voting and weighting, respectively.
D) Deep Learning Technique
Deep learning methods are particularly popular in the field of machine learning.
Many researchers utilize the deep learning techniques to construct transfer
learning models.
The SDA and the mSLDA approaches utilize the deep learning techniques.
The deep learning approaches introduced are divided into two types, i.e.,
nonadversarial (or say, traditional) ones and adversarial ones.
1) Traditional Deep Learning:
Autoencoders are often used in deep learning area.
In addition to SDA and mSLDA, there are some other reconstruction-based
transfer learning approaches.
TLDA adopts two autoencoders for the source and the target domains,
respectively.
These two autoencoders share the same parameters. The encoder and the decoder
both have two layers with activation functions.
There are several objectives of TLDA, which are listed as follows.
1) Reconstruction error minimization: The output of the decoder should be extremely close to the input of the encoder.
2) Distribution adaptation: The distribution difference between Q_S and Q_T should be minimized.
3) Regression error minimization: The output of the encoder on the labeled source-domain instances, that is, R_S, should be consistent with the corresponding label information Y_S.
2) Adversarial Deep Learning:
The thought of adversarial learning can be integrated into deep-learning-based
transfer learning approaches. As mentioned above, in the DAN framework, the
network and the kernel play a minimax game, which reflects the thought of
adversarial learning.
However, the DAN framework is a little different from the traditional GAN-based
methods in terms of the adversarial matching.
In the DAN framework, there are only a few parameters to be optimized in the max game, which makes it easier for the optimization to reach equilibrium.
The original GAN, which is inspired by the two-player game, is composed of two
models, a generator G and a discriminator D.
The generator produces counterfeits of the true data in order to confuse the discriminator and make it produce wrong detections.
The discriminator is fed with a mixture of the true data and the counterfeits, and it aims to detect whether a given instance is real or fake. These two models actually play a two-player minimax game.
Motivated by GAN, many transfer learning approaches are established based on the assumption that a good feature representation contains almost no discriminative information about the instances' original domains.
The domain-adversarial neural network (DANN) for domain adaptation assumes that there are no labeled target-domain instances to work with. Its architecture consists of a feature extractor, a label predictor, and a domain classifier.
The feature extractor acts like the generator, which aims to produce the domain-
independent feature representation for confusing the domain classifier.
The domain classifier plays a role like that of the discriminator: it attempts to detect whether the extracted features come from the source domain or the target domain.
Besides, the label predictor produces the label prediction of the instances, which is trained
on the extracted features of the labeled source-domain instances.
DANN can be trained by inserting a special gradient reversal layer (GRL).
After the whole system has been trained, the feature extractor has learned deep features of the instances, and the label predictor outputs the predicted labels of the unlabeled target-domain instances.
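A minimal PyTorch sketch of the gradient reversal layer that makes this adversarial training possible (the scaling factor lambd is a hyperparameter; the surrounding DANN architecture is not shown):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the
    backward pass, so the feature extractor learns to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # no gradient w.r.t. lambd

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# In a DANN-style model, the domain classifier would receive grad_reverse(features),
# while the label predictor receives the features directly.
```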
Applications
A number of representative transfer learning approaches have been introduced, which have been applied to solve a variety of text-related and image-related problems.
For example, MTrick and TriTL utilize the matrix factorization technique to
solve cross-domain text classification problems.
The deep-learning-based approaches such as DAN, DCORAL and DANN are
applied to solve image classification problems.
In addition to text-related and image-related problems, there are several transfer learning applications in specific areas such as medicine, bioinformatics, transportation, and recommender systems.