Feature Selection and Principal Component Analysis
2 Introduction to Feature Selection
It is not always a good thing to work with feature sets that have thousands of features or even more.
More features tend to make models more complex and difficult to interpret.
Besides this, more features can often lead to models over-fitting on the training data.
The ultimate objective is to select an optimal number of features, so that we can train and build models that generalize well on the data and prevent overfitting.
3 Feature Selection Strategies
Feature selection strategies can be divided into three main areas based on the type of strategy and the techniques employed.
Filter methods: These techniques select features purely based on metrics like correlation, mutual information and so on. They include threshold-based methods and statistical tests.
Wrapper methods: These techniques use a recursive approach to build multiple models from feature subsets and select the best subset of features.
Embedded methods: These techniques try to combine the benefits of the
other two methods by leveraging Machine Learning models themselves to
rank and score feature variables based on their importance.
4 Feature Selection: Threshold-Based
Methods
This is a filter-based feature selection strategy, where some form of cut-off or threshold is used to limit the total number of features during feature selection.
Thresholds can be of various forms. Some of them can be used during the feature
engineering process itself, where you can specify threshold parameters.
A simple example of using thresholds is variance-based thresholding, where features having low variance (below a user-specified threshold) are removed. This signifies that we want to remove features whose values are more or less constant across all the observations in our dataset.
5 Threshold-Based Methods (cont.)
We can apply this to the Pokémon dataset. First, we convert the Generation feature to a categorical feature, as shown in the sketch below.
Next, we want to remove those one-hot encoded features whose variance is less than 0.15.
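A minimal sketch of how this might look with pandas and scikit-learn; the pokemon.csv file name, the Generation column and the 0.15 threshold are assumptions based on the description above, not code from the original slides.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical Pokémon dataset with a Generation column (file name assumed for illustration)
pokemon = pd.read_csv('pokemon.csv')

# Convert the Generation feature to a categorical feature
pokemon['Generation'] = pokemon['Generation'].astype('category')

# One-hot encode the Generation feature
gen_onehot = pd.get_dummies(pokemon['Generation'], prefix='Gen')

# Remove one-hot encoded features whose variance is below the 0.15 threshold
vt = VarianceThreshold(threshold=0.15)
gen_reduced = vt.fit_transform(gen_onehot)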
6 Threshold-Based Methods (cont.)
To view the variances, as well as which features were finally selected by this algorithm, we can use the variances_ attribute and the get_support(...) method, respectively, as shown below.
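Continuing the sketch above, the fitted VarianceThreshold object vt exposes both of these directly:

# Variance of each one-hot encoded feature
print(vt.variances_)

# Boolean mask of the features that passed the variance threshold
print(vt.get_support())

# Names of the selected features (gen_onehot is the one-hot encoded DataFrame from the sketch above)
print(gen_onehot.columns[vt.get_support()])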
7 Feature Selection: Statistical Methods
This is a slightly more sophisticated filter-based feature selection method that selects features based on univariate statistical tests.
Several statistical tests can be used for regression and classification based models, including mutual information, ANOVA (analysis of variance) and chi-square tests.
Based on the scores obtained from these statistical tests, we can select the best features ranked by their scores.
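A minimal sketch using scikit-learn's SelectKBest; the breast cancer dataset, the chi-square scoring function and k=10 are assumptions chosen for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

# Example classification dataset (all features are non-negative, as required by chi2)
X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the highest chi-square scores with respect to the class labels
skb = SelectKBest(score_func=chi2, k=10)
X_selected = skb.fit_transform(X, y)

print(X.shape, '->', X_selected.shape)  # (569, 30) -> (569, 10)
print(skb.scores_)                      # chi-square score of each original feature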
8 Feature Selection: Recursive Feature
Elimination
We can also rank and score features with the help of a Machine Learning model estimator, recursively eliminating lower-scored features until we arrive at a specific feature subset size.
This strategy is known as Recursive Feature Elimination (RFE).
The basic idea is to start off with a specific Machine Learning estimator, like the Logistic Regression algorithm, together with the entire set of features and the corresponding response class variable.
RFE aims to assign weights to these features based on the model fit. Features with the smallest
weights are pruned out and then a model is fit again on the remaining features to obtain the new
weights or scores.
This process is recursively carried out multiple times and each time features with the lowest
scores/weights are eliminated, until the pruned feature subset contains the desired number of
features that the user wanted to select (this is taken as an input parameter at the start).
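A minimal sketch using scikit-learn's RFE with a Logistic Regression estimator; the dataset, the scaling step and the choice of 5 features to keep are assumptions chosen for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling helps the logistic regression converge

# Recursively eliminate the lowest-weighted features until 5 remain, refitting the model each round
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_scaled, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated in earlier rounds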
9 Feature Selection: Model-Based
Selection
Tree-based models like decision trees and ensemble models like random forests (ensembles of trees) can be utilized not just for modeling but also for feature selection.
These models can compute feature importances while being built, which can in turn be used for selecting the best features and discarding irrelevant features with lower scores.
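A minimal sketch using a random forest together with scikit-learn's SelectFromModel; the dataset and the 'median' importance threshold are assumptions chosen for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Fit a random forest and keep only the features whose importance reaches the median importance
forest = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(forest, threshold='median')
X_selected = selector.fit_transform(X, y)

print(selector.estimator_.feature_importances_)  # importance score computed for each feature
print(X.shape, '->', X_selected.shape)           # roughly half of the features are retained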
10 Dimensionality Reduction
Dealing with a lot of features can lead to issues such as model overfitting and overly complex models.
Dimensionality reduction is the process of reducing the total number of features in our
feature set using strategies like feature selection or feature extraction.
In feature extraction, the basic objective is to derive new features from the existing set of features, such that a higher-dimensional dataset with many features can be reduced to a lower-dimensional dataset of these newly created features.
A very popular technique for linear data transformation from higher to lower dimensions is Principal Component Analysis, also known as PCA.
11 Principal Component Analysis
Principal component analysis (PCA) refers to the process by which principal
components are computed, and the subsequent use of these components in
understanding the data.
PCA is an unsupervised approach, since it involves only a set of features X1,X2,...,Xp, and
no associated response Y.
Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (visualization of the observations or visualization of the variables).
12 What Are Principal Components?
Suppose that we wish to visualize n observations with measurements on a set of p
features, X1,X2,...,Xp, as part of an exploratory data analysis. We could do this by
examining two-dimensional scatterplots of the data, each of which contains the n
observations’ measurements on two of the features. However, there are p(p−1)/2 such
scatterplots; for example, with p=10 there are 45 plots!
We would like to find a low-dimensional representation of the data that captures as much
of the information as possible.
PCA provides a tool to do just this.
13 What Are Principal Components?
(cont.)
The idea is that each of the n observations lives in p-dimensional space,
but not all of these dimensions are equally interesting.
PCA seeks a small number of dimensions that are as interesting as possible,
where the concept of interesting is measured by the amount that the
observations vary along each dimension.
Each of the dimensions found by PCA is a linear combination of the p
features.
14 The manner of finding principal components
The first principal component of a set of features X1, X2, ..., Xp is the normalized linear combination of the features

$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$

that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})^T$.
Given an n x p data set X, how do we compute the first principal component? Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero). We then look for the linear combination of the sample feature values of the form

$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}$

that has the largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
15 The manner of finding principal
components (cont.)
The first principal component loading vector solves the optimization problem

$\max_{\phi_{11}, \ldots, \phi_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.$
The problem can be solved via an eigen decomposition, a standard technique
in linear algebra.
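A minimal numerical sketch (not from the original slides) of computing the first principal component loading vector via an eigen decomposition of the sample covariance matrix; the random data matrix X is assumed purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # n = 100 observations, p = 5 features

# Center each column so that the column means are zero
Xc = X - X.mean(axis=0)

# Sample covariance matrix (p x p)
cov = np.cov(Xc, rowvar=False)

# Eigen decomposition; eigh returns the eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# The first principal component loading vector is the eigenvector with the largest eigenvalue
phi1 = eigvecs[:, -1]
print(phi1)                     # unit-norm loading vector
print(np.sum(phi1 ** 2))        # ~1.0, satisfying the normalization constraint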
16 Ninety observations simulated in three dimensions (figure).
Left: the first two principal component directions span the plane that best fits the data. It
minimizes the sum of squared distances from each point to the plane.
Right: the first two principal component score vectors give the coordinates of the projection
of the 90 observations on to the plane. The variance in the plane is maximized.
17 Feature Extraction with Principal
Component Analysis
In any PCA transformation, the total number of PCs is always less than or equal to
the initial number of features.
The first principal component tries to capture the maximum variance of the original set of features. Each succeeding component tries to capture as much of the remaining variance as possible, subject to being orthogonal to the preceding components.
An important point to remember is that PCA is sensitive to feature scaling.
18 Feature Extraction with Principal Component Analysis (cont.)
Singular Value Decomposition
The process of singular value decomposition, also known as SVD, is another matrix decomposition or factorization process that lets us break down a matrix to obtain singular vectors and singular values.
Considering a matrix M having dimensions m x n, where m denotes the total rows and n denotes the total columns, the SVD of the matrix can be represented with the following equation:

M(m x n) = U(m x m) ⋅ S(m x n) ⋅ V^T(n x n)

where U holds the left singular vectors, S holds the singular values on its diagonal, and V^T holds the (transposed) right singular vectors.
19 Feature Extraction with Principal Component Analysis (cont.)
Considering we have a data matrix F(n x D), where we have n observations and D dimensions (features), we can depict the SVD of the feature matrix as F(n x D) = U S V^T, such that all the principal components are contained in the component V^T, which can be depicted as follows:

V^T(D x D) = [ PC1(1 x D) ; PC2(1 x D) ; ... ; PCD(1 x D) ]   (each principal component as a row)

The principal components are represented by {PC1, PC2, ..., PCD}, which are all one-dimensional vectors of dimensions (1 x D). For extracting the first d principal components, we can first transpose this matrix to obtain the following representation:

PC(D x D) = (V^T)^T = [ PC1(D x 1) | PC2(D x 1) | ... | PCD(D x 1) ]   (each principal component as a column)
20 Feature Extraction with Principal
Component Analysis (cont.)
Now we can extract out the first d principal components such that d ≤ D and the
reduced principal component set can be depicted as follows.
PC(D x d) = [ PC1(D x 1) | PC2(D x 1) | ... | PCd(D x 1) ]
Finally, to perform dimensionality reduction, we can get the reduced feature set using the
following mathematical transformation
F(n x d) = F(n x D) ⋅PC(D x d)
where the dot product between the original feature matrix and the reduced subset of principal
components gives us a reduced feature set of d features.
A very important point to remember here is that we might need to center the initial feature matrix by removing the mean, because by default PCA assumes that the data is centered around the origin.
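A minimal numpy sketch (not from the original slides) of this SVD-based reduction; the random feature matrix F with n = 100 observations, D = 5 features, and the choice d = 2 are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(100, 5))     # F(n x D) with n = 100, D = 5

# Center the feature matrix by removing the column means
Fc = F - F.mean(axis=0)

# SVD of the centered feature matrix: Fc = U S Vt; the principal components are the rows of Vt
U, S, Vt = np.linalg.svd(Fc, full_matrices=False)

# Take the first d principal components as columns: PC(D x d)
d = 2
PC = Vt[:d].T                     # shape (D, d) = (5, 2)

# Reduced feature set: F(n x d) = F(n x D) . PC(D x d)
F_reduced = Fc @ PC
print(F_reduced.shape)            # (100, 2)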
21 How Principal Component Analysis (PCA) Works
1. Standardize the Data: If the features of your dataset are on different scales, it’s essential to standardize them (subtract the mean and divide by the standard deviation).
2. Compute the Covariance Matrix: Calculate the covariance matrix for the standardized dataset.
3. Compute Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions of maximum variance, and the corresponding eigenvalues indicate the magnitude of variance along those directions.
4. Sort Eigenvectors by Eigenvalues: Sort the eigenvectors based on their corresponding eigenvalues in descending order.
5. Choose Principal Components: Select the top k eigenvectors (principal components), where k is the desired dimensionality of the reduced dataset.
6. Transform the Data: Multiply the original standardized data by the selected principal components to obtain the new, lower-dimensional representation of the data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Example Data
np.random.seed(42)
X = np.random.rand(100, 3)  # 100 samples with 3 features

# Step 1: Standardize the Data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Steps 2-5: PCA
pca = PCA()
X_pca = pca.fit_transform(X_std)

# Plot Explained Variance Ratio
explained_var_ratio = pca.explained_variance_ratio_
cumulative_var_ratio = np.cumsum(explained_var_ratio)
plt.plot(range(1, len(cumulative_var_ratio) + 1), cumulative_var_ratio, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Explained Variance Ratio vs. Number of Principal Components')
plt.show()
23 PCA Demo: Iris Data Set
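The demo itself is not reproduced here; below is a minimal sketch of what a PCA demo on the Iris data set might look like with scikit-learn, projecting the four standardized Iris features onto the first two principal components.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Iris data set (150 samples, 4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Standardize, then project onto the first two principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Scatter plot of the two-dimensional projection, colored by species
for label, name in enumerate(iris.target_names):
    plt.scatter(X_2d[y == label, 0], X_2d[y == label, 1], label=name)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.title('Iris data projected onto the first two principal components')
plt.show()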
24 End of Lesson