Feature Engineering
Getting the most out of data for predictive models
Gabriel Moreira
@gspmoreira
Lead Data Scientist | DSc. student
Agenda
● Machine Learning Pipeline
● Data Munging
● Feature Engineering
○ Numerical features
○ Categorical features
○ Temporal and Spatial features
○ Textual features
● Feature Selection
[Diagram: the Machine Learning pipeline — Raw data → Features (ML-ready dataset) → Model → Task]
Here are some Feature Engineering techniques
for your Data Science toolbox...
Case Study
Outbrain Click Prediction - Kaggle competition
Can you predict which recommended content each user will click?
Dataset
● Sample of users’ page views and clicks during 14 days in June 2016
● 2 Billion page views
● 17 million click records
● 700 Million unique users
● 560 sites
I got 19th position out of about 1,000 competitors (top 2%), mostly due to Feature Engineering techniques.
Data Munging
First of all … take a closer look at your data
[Sample of the dataset with Temporal, Spatial, Categorical, and Target fields highlighted]
ML-Ready Dataset
[Table: instances as rows, fields (features) as columns — original data cleaned into an ML-ready dataset]
Aggregating
Necessary when the entity to model is an aggregation from the provided data.
[Tables: original event-level data vs. aggregated data with pivoted columns — # playbacks by device, play duration by device]
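A minimal sketch of this kind of aggregation with pandas (the event log and column names are hypothetical):

import pandas as pd

# Hypothetical event log: one row per playback event
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'device': ['mobile', 'mobile', 'desktop', 'tablet', 'tablet'],
    'play_duration': [120, 45, 300, 60, 90],
})

# Aggregate to one row per user, pivoting devices into columns:
# count of playbacks and total play duration by device
agg = events.pivot_table(index='user_id', columns='device',
                         values='play_duration',
                         aggfunc=['count', 'sum'], fill_value=0)
print(agg)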
Numerical Features
Numerical features
● Ignoring rows and/or columns with missing values is possible, but at the price of losing data which might be valuable
● Strategies: impute missing values with a statistic of the field (e.g. mean, median, or most frequent value)
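For instance, scikit-learn’s SimpleImputer can fill missing values with a column statistic; a minimal sketch (the input matrix is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]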
Binarization with scikit-learn:
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X = np.array([[ 2.,  0.,  3.],
...               [ 4.,  1.,  0.],
...               [ 0.,  5.,  1.]])  # example input (assumed, consistent with the output below)
>>> binarizer = preprocessing.Binarizer(threshold=1.0)
>>> binarizer.transform(X)  # 1 where value > 1.0, else 0
array([[ 1., 0., 1.],
       [ 1., 0., 0.],
       [ 0., 1., 0.]])
Binning
Most users (458,234,809 ≈ 5×10⁸) had only 1 pageview during the period.
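With such a long-tailed distribution, counts can be grouped into a few log-scale buckets; a sketch with pandas (the bin edges are illustrative, not the ones used in the competition):

import pandas as pd

pageviews = pd.Series([1, 1, 2, 5, 14, 120, 3500])

# Bin the long-tailed counts into a few roughly log-scale buckets
bins = [0, 1, 5, 20, 100, float('inf')]
labels = ['1', '2-5', '6-20', '21-100', '100+']
binned = pd.cut(pageviews, bins=bins, labels=labels)
print(binned.value_counts().sort_index())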
Scaling
● Popular techniques
○ MinMax Scaling
● Squeezes (or stretches) all values into the range [0, 1], adding robustness to very small standard deviations and preserving zeros in sparse data.
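A minimal MinMax scaling sketch with scikit-learn (input values are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, -1.0], [2.0, 0.0], [0.0, 1.0]])

# Rescale each feature (column) independently to the [0, 1] range
scaler = MinMaxScaler()
print(scaler.fit_transform(X))
# [[0.5 0. ]
#  [1.  0.5]
#  [0.  1. ]]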
Normalization
● Scales individual samples to have unit norm: x̂ = x / ‖x‖ (the normalized vector)
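A matching sketch with scikit-learn’s normalize (made-up input):

from sklearn.preprocessing import normalize

X = [[3.0, 4.0], [1.0, 0.0]]
# Divide each row by its L2 norm so every sample has unit length
print(normalize(X, norm='l2'))
# [[0.6 0.8]
#  [1.  0. ]]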
Interaction Features
● Example: degree-2 interaction features for a vector x = (x₁, x₂):
y = w₁x₁ + w₂x₂ + w₃x₁x₂ + w₄x₁² + w₅x₂²
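scikit-learn’s PolynomialFeatures generates exactly these terms; a minimal sketch (input values are made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # x = (x1, x2)

# Degree-2 polynomial expansion: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]]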
One-Hot Encoding (OHE)
● Examples:
Platform: [“desktop”, “tablet”, “mobile”]
Document_ID or User_ID: [121545, 64845, 121545]
● If the variable cannot be in multiple categories at once, then only one bit in the group can be on.
● Example (PySpark):
from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Map category strings to indices, then one-hot encode them
# (Spark 2.x API, where OneHotEncoder is a pure Transformer)
indexed = StringIndexer(inputCol="category",
                        outputCol="categoryIndex").fit(df).transform(df)
encoded = OneHotEncoder(inputCol="categoryIndex",
                        outputCol="categoryVec").transform(indexed)
Feature Hashing
● Deals with new and rare categorical values (e.g. new user-agents)
● Example (TensorFlow 1.x):
import tensorflow as tf  # TF 1.x contrib API

ad_id_hashed = tf.contrib.layers.sparse_column_with_hash_bucket(
    'ad_id', hash_bucket_size=250000, dtype=tf.int64, combiner="sum")
Bin-counting
● Instead of using the actual categorical value, use a global statistic of this category on historical data
● Strategies
○ Count
○ Average CTR
● Useful for both linear and non-linear algorithms (e.g. decision trees); see the sketch below
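A sketch of bin-counting with pandas (ad_id and clicked are hypothetical column names):

import pandas as pd

clicks = pd.DataFrame({
    'ad_id':   [10, 10, 10, 20, 20, 30],
    'clicked': [1, 0, 1, 0, 0, 1],
})

# Per-category statistics from historical data
stats = clicks.groupby('ad_id')['clicked'].agg(
    ad_views='count', ad_avg_ctr='mean').reset_index()

# Use the statistics as features instead of the raw category
features = clicks.merge(stats, on='ad_id')
print(features)

In practice these statistics should be computed on past data only, otherwise the target leaks into the feature.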
import math
import tensorflow as tf  # TF 1.x contrib API

def get_embedding_size(unique_val_count):
    # Rule of thumb: embedding dimension ~ 6 * (number of unique values)^(1/4)
    return int(math.floor(6 * unique_val_count ** 0.25))

ad_id_hashed_feature = tf.contrib.layers.sparse_column_with_hash_bucket(
    'ad_id', hash_bucket_size=250000, dtype=tf.int64, combiner="sum")
embedding_size = get_embedding_size(ad_id_hashed_feature.length)
ad_embedding_feature = tf.contrib.layers.embedding_column(
    ad_id_hashed_feature, dimension=embedding_size, combiner="sum")
Temporal Features
● Examples (see the sketch below):
○ Eg. date_X_days_before_holidays
○ Eg. first_saturday_of_the_month
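As an illustration, one of these flags could be computed with pandas like this (the exact logic is assumed, not taken from the deck):

import pandas as pd

dates = pd.to_datetime(pd.Series(['2016-06-04', '2016-06-11', '2016-07-02']))

# first_saturday_of_the_month: a Saturday (dayofweek == 5) within the first 7 days
first_saturday = (dates.dt.dayofweek == 5) & (dates.dt.day <= 7)
print(first_saturday.tolist())  # [True, False, True]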
Time differences
● Examples (sketch below):
○ user_interaction_date - published_doc_date
To model how recent the ad was when the user viewed it.
Hypothesis: user interest in a topic may decay over time.
○ last_user_interaction_date - user_interaction_date
To model how old a given user interaction was compared to the user’s last interaction.
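A minimal sketch of the first difference with pandas (dates are made up):

import pandas as pd

df = pd.DataFrame({
    'user_interaction_date': pd.to_datetime(['2016-06-20', '2016-06-25']),
    'published_doc_date':    pd.to_datetime(['2016-06-01', '2016-06-24']),
})

# Days elapsed between publishing and the user interaction
df['doc_age_days'] = (df['user_interaction_date']
                      - df['published_doc_date']).dt.days
print(df['doc_age_days'].tolist())  # [19, 1]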
Spatial Features
Spatial Variables
● Spatial variables encode a location in space, such as GPS coordinates (latitude/longitude), street addresses, zip codes, cities, and countries
● Derived features, e.g. distances between locations (see the sketch after the figure below)
[Figure] Beverage Containers Redemption Fraud Detection: # of containers redeemed (red circles) by store, and median household income by Census Tract
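One common derived spatial feature is the great-circle (haversine) distance between two coordinates; a minimal sketch:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in km
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. distance between a user's location and a store (New York → Los Angeles)
print(round(haversine_km(40.7128, -74.0060, 34.0522, -118.2437)))  # ~3936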
Textual data
Natural Language Processing
Cleaning
• Lowercasing
• Convert accented characters
• Removing non-alphanumeric
• Repairing

Tokenizing
• Encode punctuation marks
• Tokenize
• N-Grams
• Skip-grams
• Char-grams
• Affixes

Removing
• Stopwords
• Rare words
• Common words

Roots
• Spelling correction
• Chop
• Stem
• Lemmatize

Enrich
• Entity Insertion / Extraction
• Parse Trees
• Reading Level
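A minimal sketch of a few of the cleaning, tokenizing, and removing steps in plain Python (the stopword list is abbreviated):

import re
import unicodedata

STOPWORDS = {'the', 'a', 'is', 'of'}  # abbreviated stopword list

def preprocess(text):
    # Lowercase, strip accents, keep only alphanumeric tokens
    text = unicodedata.normalize('NFKD', text.lower())
    text = text.encode('ascii', 'ignore').decode('ascii')
    tokens = re.findall(r'[a-z0-9]+', text)
    # Remove stopwords
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess('The Naïve model is state-of-the-art!'))
# ['naive', 'model', 'state', 'art']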
Text vectorization
Represent each document as a feature vector in the vector space, where each
position represents a word (token) and the contained value is its relevance in the
document.
[TF-IDF sparse matrix example: rows are documents (D1, D2, …), columns are vocabulary terms (face, person, guide, lock, cat, dog, sleep, micro, pool, gym), and each cell holds the TF-IDF relevance of that term in that document]
The similarity between two document vectors is commonly measured as the cosine of the angle between them.
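A sketch of both steps with scikit-learn: vectorize documents with TfidfVectorizer and compare them with cosine similarity (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['the cat sleeps by the pool',
        'a dog sleeps at the gym',
        'micro pool at the gym']

# Each document becomes a sparse TF-IDF vector in the term space
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Cosine of the angle between document vectors as their similarity
print(cosine_similarity(tfidf[0], tfidf[1]))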
Deep Learning....
“...some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
– Pedro Domingos, “A Few Useful Things to Know about Machine Learning” (2012)
References
Slides: bit.ly/feature_eng
Blog: medium.com/unstructured