Feature Engineering
Sources:
https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
https://www.slideshare.net/0xdata/feature-engineering-83511751
Announcements
• Project evaluation: The teams whose results matched the baseline on the
scoreboard for Evaluation 1 received the full grade for the first project assignment, i.e.,
User05, User06, User08, User09, User14, User17, User22, User25, User26,
User29, User30.
• The other teams will be graded this week based on their scores for Evaluation 2.
• We have two invited talks next week on Tuesday (Geospatial and Time series data
analysis) and Thursday (Privacy and Transparency in Machine Learning).
!2
Feature Engineering
!3
Feature engineering cycle
• Data collection
• Removing duplicates
!4
Feature engineering cycle
!5
Feature engineering cycle
!6
Feature engineering is hard
• Usually requires domain knowledge about how features interact with each
other
!7
Key Elements of Feature Engineering
Target Transformation
Feature Extraction
Feature Encoding
!8
Target Transformation
• Use it when the target variable shows a skewed distribution; transforming it makes
the residuals closer to a “normal distribution” (bell curve)
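A minimal sketch of such a transformation, assuming a pandas DataFrame df with a non-negative, right-skewed target column y (both names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical example: a non-negative, right-skewed target.
df = pd.DataFrame({"y": [1, 2, 2, 3, 5, 8, 13, 100, 250]})

# Log-transform the target so its distribution (and the residuals of a
# model fit on it) is closer to a bell curve.
df["y_log"] = np.log1p(df["y"])     # log(1 + y) also handles y == 0

# ... fit a model on df["y_log"] ...

# Map predictions back to the original scale with the inverse transform.
preds_log = df["y_log"]             # stand-in for model predictions
preds = np.expm1(preds_log)
```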
!9
Target Transformation
!10
Key Elements of Feature Engineering
Target Transformation
Feature Extraction
Feature Encoding
!11
Imputation
• Human errors
• Privacy concerns
• What to do?
!12
Imputation
• Human errors
• Privacy concerns
• What to do?
!13
Imputation
• Numerical Imputation
• Assign zero
• Assign NA
• Categorical Imputation
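A minimal sketch of both imputation types, assuming a hypothetical DataFrame with a numeric age column and a categorical city column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({
    "age":  [29, np.nan, 41, np.nan, 35],
    "city": ["Seoul", None, "Busan", "Seoul", None],
})

# Numerical imputation: fill with a constant (0) or with a statistic
# such as the column median.
df["age_zero"]   = df["age"].fillna(0)
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical imputation: fill with the most frequent value, or with an
# explicit "NA" category so the absence itself stays visible.
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])
df["city_na"]   = df["city"].fillna("NA")
```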
!14
Outliers
Mistake?
Variance?
!15
Outlier detection
• Outlier: a data object that deviates significantly from the normal objects, as if it were
generated by a different mechanism
Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, …
• Outliers are interesting: they violate the mechanism that generates the normal data
Outlier detection vs. novelty detection: an object flagged as an outlier at an early stage may
later be merged into the model as a new normal pattern
• Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
!16
Types of Outliers
!17
Types of Outliers
■ Collective Outliers
■ A subset of data objects that collectively deviates significantly from the whole data
set
■ Requires background knowledge of the relationships among the data objects
!18
Finding Outliers
• Box plot
• Scatter plot
• Z-score
• IQR score
We will cover more advanced techniques for anomaly detection (after the mid-term)
!19
Finding Outliers
Box plot
Wikipedia Definition
!20
Finding Outliers
• Box plot
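A minimal sketch of drawing the box plot; the DataFrame below is a hypothetical stand-in for the housing data, with DIS as the column of interest:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical stand-in for the housing data; in practice boston_df would be
# the loaded dataset and DIS one of its columns.
boston_df = pd.DataFrame({"DIS": [1.2, 2.5, 3.1, 3.3, 4.0, 4.2, 5.1, 10.7, 11.3, 12.0]})

# A univariate box plot of one column; points beyond the whiskers are drawn
# individually and are the outlier candidates.
sns.boxplot(x=boston_df["DIS"])
plt.show()
```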
!21
Finding Outliers
• The plot above shows three points between 10 and 12; these are outliers, as they are
not included in the box of the other observations, i.e., nowhere near the quartiles.
• Here we analysed a univariate outlier, i.e., we used only the DIS column to check for
outliers.
!22
Finding Outliers
Scatter plot
Wikipedia Definition
• This definition suggests that a scatter plot is a collection of points showing the
values of two variables.
!23
Finding Outliers
• We can try to draw a scatter plot for two variables from our housing dataset, as sketched below.
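A minimal sketch, again with a hypothetical stand-in for the housing data; INDUS and TAX are just two illustrative columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the housing data.
boston_df = pd.DataFrame({
    "INDUS": [2.3, 3.1, 4.0, 5.2, 6.1, 7.0, 18.1, 19.6],
    "TAX":   [242, 222, 311, 279, 300, 330, 666, 711],
})

# Scatter plot of one column against another; isolated points far from the
# main cloud are outlier candidates.
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(boston_df["INDUS"], boston_df["TAX"])
ax.set_xlabel("INDUS")
ax.set_ylabel("TAX")
plt.show()
```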
!24
Finding Outliers
• Looking at the plot above, we can see that most data points lie in the bottom-left,
but there are points far from the population, such as those in the top-right
corner.
!25
Finding Outliers
Standard deviation
In statistics, the standard deviation (SD, also represented by the lower case Greek
letter sigma σ for the population standard deviation or the Latin letter s for the
sample standard deviation) is a measure of the amount of variation or dispersion of a
set of values.
Wikipedia Definition
!26
Finding Outliers
!27
Finding Outliers
Z-score
The Z-score is the signed number of standard deviations by which the value of an
observation or data point is above the mean value of what is being observed or
measured.
Wikipedia Definition
• The intuition behind the Z-score is to describe any data point by its relationship
with the mean and standard deviation of the group of data points. The Z-score
rescales the data to a distribution with mean 0 and standard deviation 1, i.e.,
the standard normal distribution.
!28
Finding Outliers
• Z-score: While calculating the Z-score we re-scale and center the data and look for data
points that are too far from zero. Data points that are way too far from zero are
treated as outliers. In most cases a threshold of 3 or −3 is used, i.e., if the Z-score
value is greater than 3 or less than −3, that data point is identified as an outlier.
• We will use the Z-score function defined in the scipy library to detect the outliers.
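A minimal sketch using scipy.stats.zscore; the DataFrame is a small hypothetical, all-numeric stand-in for the housing data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical numeric stand-in for the housing data.
boston_df = pd.DataFrame({
    "DIS": [1.2, 2.5, 3.1, 3.3, 4.0, 4.2, 5.1, 10.7, 11.3, 12.0],
    "TAX": [242, 222, 311, 279, 300, 330, 666, 711, 300, 280],
})

# Z-score every value, then flag entries more than 3 standard deviations
# away from their column mean.
z = np.abs(stats.zscore(boston_df))
rows, cols = np.where(z > 3)   # row indices and column indices of the flagged values
print(rows, cols)
```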
!29
Finding Outliers
• Looking at the code and the output above, it is difficult to judge raw Z-scores by eye,
so we define a threshold to identify outliers: the result is a list of
row numbers and a list of
column numbers of the flagged values.
!30
Finding Outliers
IQR score
The interquartile range (IQR), also called the midspread or middle 50%, or
technically H-spread, is a measure of statistical dispersion, being equal to the
difference between 75th and 25th percentiles, or between upper and lower quartiles,
IQR = Q3 − Q1.
Wikipedia Definition
!31
Finding Outliers
• Let’s find out how the box plot uses the IQR and how we can use it to find the list
of outliers, as we did with the Z-score calculation. First we will calculate the IQR.
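A minimal sketch of the IQR calculation and the resulting True/False mask, on the same hypothetical numeric stand-in data:

```python
import pandas as pd

# Hypothetical numeric stand-in for the housing data.
boston_df = pd.DataFrame({
    "DIS": [1.2, 2.5, 3.1, 3.3, 4.0, 4.2, 5.1, 10.7, 11.3, 12.0],
    "TAX": [242, 222, 311, 279, 300, 330, 666, 711, 300, 280],
})

# Interquartile range per column.
Q1 = boston_df.quantile(0.25)
Q3 = boston_df.quantile(0.75)
IQR = Q3 - Q1

# True marks values outside the usual 1.5 * IQR fences (potential outliers),
# False marks values considered valid.
is_outlier = (boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR))
print(is_outlier)
```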
!32
Finding Outliers
• Data points marked False are valid, whereas True indicates the presence of an
outlier.
!33
Finding Outliers
Percentiles
• You can treat a certain percentage of the values from the top or the
bottom of the distribution as outliers.
!34
Handling Outliers
• Correcting the value (when the outlier is a data-entry mistake)
• Removing the observation
• Z-score: drop rows whose |z| exceeds the chosen threshold (commonly 3)
• IQR score: drop rows with values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]
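A minimal removal sketch for both rules on the same hypothetical numeric stand-in data; the thresholds (3 and 1.5 × IQR) are the conventional choices from the previous slides, not fixed rules:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical numeric stand-in for the housing data.
boston_df = pd.DataFrame({
    "DIS": [1.2, 2.5, 3.1, 3.3, 4.0, 4.2, 5.1, 10.7, 11.3, 12.0],
    "TAX": [242, 222, 311, 279, 300, 330, 666, 711, 300, 280],
})

# Z-score rule: keep only rows where every value is within 3 standard
# deviations of its column mean.
z = np.abs(stats.zscore(boston_df))
df_z_filtered = boston_df[(z < 3).all(axis=1)]

# IQR rule: keep only rows with every value inside the 1.5 * IQR fences.
Q1, Q3 = boston_df.quantile(0.25), boston_df.quantile(0.75)
IQR = Q3 - Q1
outside = (boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR))
df_iqr_filtered = boston_df[~outside.any(axis=1)]
```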
!35
Binning
• Example
!36
Binning
• The main motivation for binning is to make the model more robust and
prevent overfitting; however, it has a cost in performance.
• The trade-off between performance and overfitting is the key point of the
binning process.
• Categorical binning: labels with low frequencies are likely to hurt the
robustness of statistical models. Assigning a general category (e.g., "Other")
to these less frequent values helps to keep the model robust, as in the sketch
below.
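A minimal sketch of numerical and categorical binning on hypothetical data:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "value":   [2, 45, 7, 85, 28, 63],
    "country": ["Spain", "Chile", "Australia", "Italy", "Brazil", "Spain"],
})

# Numerical binning: replace the raw value with a coarser category.
df["bin"] = pd.cut(df["value"], bins=[0, 30, 70, 100],
                   labels=["Low", "Mid", "High"])

# Categorical binning: group rare labels into a generic "Other" category
# so low-frequency levels do not destabilize the model.
counts = df["country"].value_counts()
rare = counts[counts < 2].index
df["country_binned"] = df["country"].replace(list(rare), "Other")
```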
!37
Log Transformation
!38
Log Transformation
• The data you apply a log transform to must contain only positive values; otherwise
you get errors or undefined results. You can also add 1 to your data before
transforming it, which keeps the argument of the logarithm positive when the data
contains zeros.
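A minimal sketch on hypothetical data showing both the +1 shift and a shift by the minimum for columns that contain negative values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

# log is undefined for non-positive values, so either add 1 (fine when the
# data is already non-negative) or shift by the minimum so every value is >= 1.
df["log_plus_1"]  = np.log(df["value"] + 1)                        # yields NaN for values below -1
df["log_shifted"] = np.log(df["value"] - df["value"].min() + 1)    # safe for negatives
```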
!39
Grouping
• Tidy data: each row represents an instance and each column represents a
feature.
• Datasets such as transactions rarely fit the definition of tidy data -> we use grouping.
• The key point of group-by operations is to decide the aggregation functions for the
features.
!40
Grouping
• Aggregating categorical columns:
• Make a Pivot table: This would be a good option if you aim to go beyond
binary flag columns and merge multiple features into aggregated features,
which are more informative.
!41
Grouping
• Sum
• Mean
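A minimal sketch on hypothetical transaction-style data, showing sum/mean aggregation and a pivot table over a categorical column:

```python
import pandas as pd

# Hypothetical transaction-style data: several rows per user.
df = pd.DataFrame({
    "user":   ["a", "a", "b", "b", "b"],
    "city":   ["Rome", "Rome", "Madrid", "Rome", "Madrid"],
    "amount": [10.0, 25.0, 5.0, 8.0, 12.0],
})

# Numeric aggregation: pick an aggregation function per feature.
agg = df.groupby("user").agg(total_amount=("amount", "sum"),
                             mean_amount=("amount", "mean"))

# Categorical aggregation as a pivot table: one column per category,
# counting how often each user appears with each city.
pivot = df.pivot_table(index="user", columns="city",
                       values="amount", aggfunc="count", fill_value=0)
```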
!42
Splitting
• Most of the time the dataset contains string columns that violate tidy
data principles.
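A minimal sketch, assuming a hypothetical name column that packs two fields into one string:

```python
import pandas as pd

# Hypothetical column that packs two pieces of information into one string.
df = pd.DataFrame({"name": ["Smith, John", "Doe, Jane", "Park, Minsu"]})

# Split the string column into tidy, single-purpose columns.
df[["last_name", "first_name"]] = df["name"].str.split(", ", expand=True)
```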
!43
Scaling
• In real life, it makes no sense to expect the age and income columns to have the
same range. But from the machine learning point of view, how can these two
columns be compared?
!44
Normalization
!45
Normalization
• Example:
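A minimal min-max normalization sketch on hypothetical data, rescaling a column into the [0, 1] range:

```python
import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

# Min-max normalization: (x - min) / (max - min).
df["normalized"] = (df["value"] - df["value"].min()) / (df["value"].max() - df["value"].min())
```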
!46
Standardization
• If the standard deviations of the features differ, their ranges differ as well.
Standardization reduces the effect of the outliers in the features.
!47
Standardization
• Example:
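A minimal standardization sketch on the same hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"value": [2, 45, -23, 85, 28, 2, 35, -12]})

# Z-score standardization: zero mean and unit standard deviation.
df["standardized"] = (df["value"] - df["value"].mean()) / df["value"].std()
```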
!48
Key Elements of Feature Engineering
Target Transformation
Feature Extraction
Feature Encoding
!49
Feature Encoding
!50
Feature Encoding
• Label Encoding
Interpret the categories as ordered integers (mostly wrong)
Python scikit-learn: LabelEncoder
• OK for tree-based methods
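A minimal sketch with scikit-learn's LabelEncoder on a hypothetical city column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Seoul", "Busan", "Seoul", "Daejeon"]})

# LabelEncoder maps each category to an integer; the implied ordering is
# arbitrary, which is why this is mostly safe only for tree-based models.
encoder = LabelEncoder()
df["city_encoded"] = encoder.fit_transform(df["city"])
```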
!51
One Hot Encoding
• This method spreads the values in a column over multiple flag columns and
assigns 0 or 1 to them. These binary values express the relationship between the
grouped and encoded columns.
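A minimal sketch using pandas get_dummies on the same hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Seoul", "Busan", "Seoul", "Daejeon"]})

# Each category becomes its own 0/1 flag column.
encoded = pd.get_dummies(df["city"], prefix="city")
df = df.join(encoded)
```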
!52
Frequency encoding
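A minimal sketch of frequency encoding on hypothetical data, replacing each category with its relative frequency:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Seoul", "Busan", "Seoul", "Seoul", "Daejeon"]})

# Replace each category with how often it occurs (here as a fraction of rows).
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
```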
!53
Target mean encoding
!54
Target mean encoding
• The weights are based on the frequency of the levels, i.e., if a category appears
only a few times in the dataset, then its encoded value will be close to the
overall mean instead of the mean of that level.
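A minimal sketch of smoothed target mean encoding on hypothetical data; the smoothing weight is a tunable parameter, not a value fixed by the slides:

```python
import pandas as pd

# Hypothetical binary-target data.
df = pd.DataFrame({
    "city":   ["Seoul", "Seoul", "Busan", "Busan", "Busan", "Daejeon"],
    "target": [1, 0, 1, 1, 0, 1],
})

global_mean = df["target"].mean()
per_level = df.groupby("city")["target"].agg(["mean", "count"])

# Smoothing: blend the per-category mean with the global mean; rare
# categories are pulled toward the global mean.
weight = 5  # tunable smoothing parameter
smoothed = (per_level["count"] * per_level["mean"] + weight * global_mean) / (per_level["count"] + weight)
df["city_target_enc"] = df["city"].map(smoothed)
```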
!55
Target mean encoding
Smoothing
!56
Target mean encoding
Smoothing
!57
Target mean encoding
leave-one-out schema
• To avoid overfitting we could use leave-one-out schema
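A minimal leave-one-out sketch on hypothetical data: each row is encoded with the target mean of its category computed over all other rows, so a row never sees its own label:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Seoul", "Seoul", "Busan", "Busan", "Busan", "Daejeon"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Leave-one-out: (group sum - own target) / (group count - 1).
grp = df.groupby("city")["target"]
sums, counts = grp.transform("sum"), grp.transform("count")
df["city_loo_enc"] = (sums - df["target"]) / (counts - 1)

# Categories with a single row give a 0/0 division; fall back to the global mean.
df["city_loo_enc"] = df["city_loo_enc"].fillna(df["target"].mean())
```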
!58
Target mean encoding
leave-one-out schema
• To avoid overfitting we could use leave-one-out schema
!59
Target mean encoding
leave-one-out schema
• To avoid overfitting we could use leave-one-out schema
!60
Target mean encoding
leave-one-out schema
• To avoid overfitting we could use leave-one-out schema
!61
Target mean encoding
leave-one-out schema
• To avoid overfitting we could use leave-one-out schema
!62
Weight of Evidence and Information Value
• Weight of evidence:
• Information Value:
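A minimal sketch on hypothetical data, using one common credit-scoring convention (per level, WoE = ln(% non-events / % events); IV = Σ (% non-events − % events) × WoE); other sign conventions exist:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with a binary target (1 = event).
df = pd.DataFrame({
    "grade":  ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "target": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
})

# Per-category counts of events and non-events.
grouped = df.groupby("grade")["target"].agg(events="sum", total="count")
grouped["non_events"] = grouped["total"] - grouped["events"]
pct_events = grouped["events"] / grouped["events"].sum()
pct_non_events = grouped["non_events"] / grouped["non_events"].sum()

# Weight of Evidence per level and the overall Information Value.
grouped["woe"] = np.log(pct_non_events / pct_events)
iv = ((pct_non_events - pct_events) * grouped["woe"]).sum()
print(grouped["woe"], iv)
```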
!63
Weight of Evidence and Information Value
!64
Weight of Evidence and Information Value
!65
More on Feature Engineering …
!66