Feature Engineering
Getting the most out of data for predictive models
Gabriel Moreira
@gspmoreira
Lead Data Scientist | DSc. student
Agenda
● Machine Learning Pipeline
● Data Munging
● Feature Engineering
○ Numerical features
○ Categorical features
○ Temporal and Spatial features
○ Textual features
● Feature Selection
[Diagram: the Machine Learning pipeline — Raw data → Features (ML-ready dataset) → Model → Task]
Here are some Feature Engineering techniques
for your Data Science toolbox...
Case Study
Outbrain Click Prediction - Kaggle competition
Can you predict which recommended content each user will click?
Dataset
● Sample of users’ page views and clicks during 14 days in June 2016
● 2 Billion page views
● 17 million click records
● 700 Million unique users
● 560 sites
I got 19th position out of about 1,000 competitors (top 2%), mostly due to Feature Engineering techniques.
Data Munging
First of all … take a closer look at your data
[Sample of the dataset with Temporal, Spatial, Categorical, and Target fields highlighted]
ML-Ready Dataset
[Table: instances as rows, fields (features) as columns — original data cleaned into an ML-ready dataset]
Aggregating
Necessary when the entity to model is an aggregation from the provided data.
[Tables: original event-level data vs. aggregated data with pivoted columns — # playbacks by device, play duration by device]
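A minimal sketch of this kind of aggregation with pandas (the event log and column names are hypothetical):

import pandas as pd

# Hypothetical event log: one row per playback event
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'device': ['mobile', 'mobile', 'desktop', 'tablet', 'tablet'],
    'play_duration': [120, 45, 300, 60, 90],
})

# Aggregate to one row per user, pivoting devices into columns:
# count of playbacks and total play duration by device
agg = events.pivot_table(index='user_id', columns='device',
                         values='play_duration',
                         aggfunc=['count', 'sum'], fill_value=0)
print(agg)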
Numerical Features
Numerical features
● Ignoring rows and/or columns with missing values is possible, but at the price of losing data which might be valuable
● Strategies: impute missing values with a statistic of the field (e.g. mean, median, or most frequent value)
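For instance, scikit-learn’s SimpleImputer can fill missing values with a column statistic; a minimal sketch (the input matrix is made up):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]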
Binarization with scikit-learn:
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X = np.array([[ 2.,  0.,  3.],
...               [ 4.,  1.,  0.],
...               [ 0.,  5.,  1.]])  # example input (assumed, consistent with the output below)
>>> binarizer = preprocessing.Binarizer(threshold=1.0)
>>> binarizer.transform(X)  # 1 where value > 1.0, else 0
array([[ 1., 0., 1.],
       [ 1., 0., 0.],
       [ 0., 1., 0.]])
Binning
Most users (458,234,809 ≈ 5×10⁸) had only 1 pageview during the period.
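With such a long-tailed distribution, counts can be grouped into a few log-scale buckets; a sketch with pandas (the bin edges are illustrative, not the ones used in the competition):

import pandas as pd

pageviews = pd.Series([1, 1, 2, 5, 14, 120, 3500])

# Bin the long-tailed counts into a few roughly log-scale buckets
bins = [0, 1, 5, 20, 100, float('inf')]
labels = ['1', '2-5', '6-20', '21-100', '100+']
binned = pd.cut(pageviews, bins=bins, labels=labels)
print(binned.value_counts().sort_index())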
Scaling
● Popular techniques
○ MinMax Scaling
● Squeezes (or stretches) all values into the range [0, 1], adding robustness to very small standard deviations and preserving zeros in sparse data.
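A minimal MinMax scaling sketch with scikit-learn (input values are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, -1.0], [2.0, 0.0], [0.0, 1.0]])

# Rescale each feature (column) independently to the [0, 1] range
scaler = MinMaxScaler()
print(scaler.fit_transform(X))
# [[0.5 0. ]
#  [1.  0.5]
#  [0.  1. ]]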
Normalization
● Scales individual samples to have unit norm: x̂ = x / ‖x‖ (the normalized vector)
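A matching sketch with scikit-learn’s normalize (made-up input):

from sklearn.preprocessing import normalize

X = [[3.0, 4.0], [1.0, 0.0]]
# Divide each row by its L2 norm so every sample has unit length
print(normalize(X, norm='l2'))
# [[0.6 0.8]
#  [1.  0. ]]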
Interaction Features
● Example: degree-2 interaction features for a vector x = (x₁, x₂):
y = w₁x₁ + w₂x₂ + w₃x₁x₂ + w₄x₁² + w₅x₂²
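scikit-learn’s PolynomialFeatures generates exactly these terms; a minimal sketch (input values are made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # x = (x1, x2)

# Degree-2 polynomial expansion: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]]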
One-Hot Encoding (OHE)
● Examples:
Platform: [“desktop”, “tablet”, “mobile”]
Document_ID or User_ID: [121545, 64845, 121545]
● If the variable cannot be in multiple categories at once, then only one bit in the group can be on.
● Example (PySpark):
from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Map category strings to indices, then one-hot encode them
# (Spark 2.x API, where OneHotEncoder is a pure Transformer)
indexed = StringIndexer(inputCol="category",
                        outputCol="categoryIndex").fit(df).transform(df)
encoded = OneHotEncoder(inputCol="categoryIndex",
                        outputCol="categoryVec").transform(indexed)
Feature Hashing
● Deals with new and rare categorical values (e.g. new user-agents)
● Example (TensorFlow 1.x):
import tensorflow as tf  # TF 1.x contrib API

ad_id_hashed = tf.contrib.layers.sparse_column_with_hash_bucket(
    'ad_id', hash_bucket_size=250000, dtype=tf.int64, combiner="sum")
Bin-counting
● Instead of using the actual categorical value, use a global statistic of this category on historical data
● Strategies
○ Count
○ Average CTR
● Useful for both linear and non-linear algorithms (e.g. decision trees); see the sketch below
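A sketch of bin-counting with pandas (ad_id and clicked are hypothetical column names):

import pandas as pd

clicks = pd.DataFrame({
    'ad_id':   [10, 10, 10, 20, 20, 30],
    'clicked': [1, 0, 1, 0, 0, 1],
})

# Per-category statistics from historical data
stats = clicks.groupby('ad_id')['clicked'].agg(
    ad_views='count', ad_avg_ctr='mean').reset_index()

# Use the statistics as features instead of the raw category
features = clicks.merge(stats, on='ad_id')
print(features)

In practice these statistics should be computed on past data only, otherwise the target leaks into the feature.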
import math
import tensorflow as tf  # TF 1.x contrib API

def get_embedding_size(unique_val_count):
    # Rule of thumb: embedding dimension ~ 6 * (number of unique values)^(1/4)
    return int(math.floor(6 * unique_val_count ** 0.25))

ad_id_hashed_feature = tf.contrib.layers.sparse_column_with_hash_bucket(
    'ad_id', hash_bucket_size=250000, dtype=tf.int64, combiner="sum")
embedding_size = get_embedding_size(ad_id_hashed_feature.length)
ad_embedding_feature = tf.contrib.layers.embedding_column(
    ad_id_hashed_feature, dimension=embedding_size, combiner="sum")
Temporal Features
● Examples (see the sketch below):
○ Eg. date_X_days_before_holidays
○ Eg. first_saturday_of_the_month
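As an illustration, one of these flags could be computed with pandas like this (the exact logic is assumed, not taken from the deck):

import pandas as pd

dates = pd.to_datetime(pd.Series(['2016-06-04', '2016-06-11', '2016-07-02']))

# first_saturday_of_the_month: a Saturday (dayofweek == 5) within the first 7 days
first_saturday = (dates.dt.dayofweek == 5) & (dates.dt.day <= 7)
print(first_saturday.tolist())  # [True, False, True]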
Time differences
● Examples (sketch below):
○ user_interaction_date - published_doc_date
To model how recent the ad was when the user viewed it.
Hypothesis: user interest in a topic may decay over time.
○ last_user_interaction_date - user_interaction_date
To model how old a given user interaction was compared to the user’s last interaction.
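A minimal sketch of the first difference with pandas (dates are made up):

import pandas as pd

df = pd.DataFrame({
    'user_interaction_date': pd.to_datetime(['2016-06-20', '2016-06-25']),
    'published_doc_date':    pd.to_datetime(['2016-06-01', '2016-06-24']),
})

# Days elapsed between publishing and the user interaction
df['doc_age_days'] = (df['user_interaction_date']
                      - df['published_doc_date']).dt.days
print(df['doc_age_days'].tolist())  # [19, 1]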
Spatial Features
Spatial Variables
● Spatial variables encode a location in space, such as GPS coordinates (latitude/longitude), street addresses, zip codes, cities, and countries
● Derived features, e.g. distances between locations (see the sketch after the figure below)
[Figure] Beverage Containers Redemption Fraud Detection: # of containers redeemed (red circles) by store, and median household income by Census Tract
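One common derived spatial feature is the great-circle (haversine) distance between two coordinates; a minimal sketch:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in km
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. distance between a user's location and a store (New York → Los Angeles)
print(round(haversine_km(40.7128, -74.0060, 34.0522, -118.2437)))  # ~3936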
Textual data
Natural Language Processing
Cleaning
• Lowercasing
• Convert accented characters
• Removing non-alphanumeric
• Repairing

Tokenizing
• Encode punctuation marks
• Tokenize
• N-Grams
• Skip-grams
• Char-grams
• Affixes

Removing
• Stopwords
• Rare words
• Common words

Roots
• Spelling correction
• Chop
• Stem
• Lemmatize

Enrich
• Entity Insertion / Extraction
• Parse Trees
• Reading Level
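A minimal sketch of a few of the cleaning, tokenizing, and removing steps in plain Python (the stopword list is abbreviated):

import re
import unicodedata

STOPWORDS = {'the', 'a', 'is', 'of'}  # abbreviated stopword list

def preprocess(text):
    # Lowercase, strip accents, keep only alphanumeric tokens
    text = unicodedata.normalize('NFKD', text.lower())
    text = text.encode('ascii', 'ignore').decode('ascii')
    tokens = re.findall(r'[a-z0-9]+', text)
    # Remove stopwords
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess('The Naïve model is state-of-the-art!'))
# ['naive', 'model', 'state', 'art']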
Text vectorization
Represent each document as a feature vector in the vector space, where each
position represents a word (token) and the contained value is its relevance in the
document.
[TF-IDF sparse matrix example: rows are documents (D1, D2, …), columns are vocabulary terms (face, person, guide, lock, cat, dog, sleep, micro, pool, gym), and each cell holds the TF-IDF relevance of that term in that document]
The similarity between two document vectors is commonly measured as the cosine of the angle between them.
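A sketch of both steps with scikit-learn: vectorize documents with TfidfVectorizer and compare them with cosine similarity (the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['the cat sleeps by the pool',
        'a dog sleeps at the gym',
        'micro pool at the gym']

# Each document becomes a sparse TF-IDF vector in the term space
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Cosine of the angle between document vectors as their similarity
print(cosine_similarity(tfidf[0], tfidf[1]))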
Deep Learning....
“...some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
– Pedro Domingos, “A Few Useful Things to Know about Machine Learning” (2012)
References
Slides: bit.ly/feature_eng
Blog: medium.com/unstructured