Marketing Engineering and Analytics

Chapter 1 discusses the importance of data preprocessing and exploratory analysis in analytics, emphasizing the need to prepare customer signatures and derive new variables for effective data mining. It covers techniques for handling missing values, outliers, and the creation of derived variables to enhance model performance. The chapter highlights the balance between technical and business considerations in defining customer data and the challenges posed by data sparsity and variability.


Chapter 1 – Data Preprocessing & Exploratory Analysis

• “The world’s most valuable resource is no longer oil, but data”

1.1. Why data preprocessing?


• Technology is increasingly more powerful and accessible
• Increased pace of Technology Adoption.
• Data Mining virtuous cycle
o 1. Identify opportunities
o 2. Transform Data
o 3. Act on the information
o 4. Measure the results
• Predictive vs Descriptive Techniques

• Overall assumptions/requirements of analytic models


o [Table: assumptions/requirements of each technique – K-Means, Decision Trees, Regression, Neural Networks, K-NNs]
1.2. Preparing the Customer signatures

• What are signatures?


o This chapter focuses on finding the right individual unit (e.g., customers, credit
cards, households, etc.) in data, by gathering the traces they leave when
interacting with an organization and its IT/IS.
o In most cases the observation signatures, also known as analytic-based tables (ABTs), are the data in data mining/analytics. In other words, they are the data in the right format to be used by most methods. They define, for example, whether the task will be of a predictive or descriptive nature.
• What is a customer?
o The question seems simple, but the answer depends on the business goal.
Thus, extracting and treating data has as much technical (objective) as business
(subjective) decisions.
o Consider, for example, three types of roles that one (a so-called customer) may
have:
▪ Payer;
▪ Decider;
▪ User.
• Finding “Customers” in Data
o Signature tables, or analytic-based tables.
o Designing signatures.
• Process of creating signatures
o Copying
o Pivoting
o Aggregation
o Table lookup
o Summarization of values from data
o Derivation of new variables
o 1. Irregular and rare transactions
▪ When transactions occur infrequently and irregularly, one must choose
between an extensive aggregation period or accepting many zeroes.
o 2. Creating transaction summary
▪ Besides time series data, it is useful to summarize the lifetime
transaction:
• Time since the first transaction;
• Time since most recent transaction;
• Proportion of transactions at each location;
• Proportion of transactions by time of day;
• Proportion of transactions by channel;
• Largest transaction;
• Smallest transaction;
• Average transaction.
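
As an illustration, here is a minimal pandas sketch of how such lifetime summary variables might be derived from a raw transaction table; the column names customer_id, date, amount, and channel, and the reference date, are assumptions made for the example:

```python
import pandas as pd

# Hypothetical transaction table: one row per transaction
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2023-01-05", "2023-06-10", "2023-02-01", "2023-03-15", "2023-04-20"]),
    "amount": [50.0, 120.0, 20.0, 35.0, 80.0],
    "channel": ["web", "store", "web", "web", "store"],
})

today = pd.Timestamp("2023-07-01")  # assumed reference date

# One row per customer: the start of a customer signature (ABT)
signature = tx.groupby("customer_id").agg(
    first_tx=("date", "min"),
    last_tx=("date", "max"),
    n_tx=("amount", "size"),
    largest_tx=("amount", "max"),
    smallest_tx=("amount", "min"),
    avg_tx=("amount", "mean"),
)
signature["days_since_first_tx"] = (today - signature["first_tx"]).dt.days
signature["days_since_last_tx"] = (today - signature["last_tx"]).dt.days

# Proportion of transactions by channel (pivoting + aggregation)
channel_share = pd.crosstab(tx["customer_id"], tx["channel"], normalize="index")
signature = signature.join(channel_share.add_prefix("share_"))
print(signature)
```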
• Types of measurement scales
o Nonmetric scales
▪ Nominal
▪ Ordinal
o Metric scales
▪ Interval
▪ Ratio (0 means absence; ratios and divisions are meaningful)

1.3. Deriving New Variables


• Derived Variables
o Creating new (derived) variables is arguably the most human and creatively demanding process in marketing analytics.
o These derived variables are not originally present in the dataset and depend entirely on the analyst’s insights.
o Techniques use existing variables, so if one can add new and relevant ones, techniques will inevitably perform better.
• Single Variables
o Creating derived variables includes standard transformations of a single variable
– e.g., mean correcting, rescaling, standardization, and changes in
representation.
o This type of transformation is mostly technical rather than creative.
• Single-variable transformations
o Mean correcting (centering) variable;
o Standardizing (rescaling or normalizing) variables;
o The main objective is to put variables measured in incompatible units onto a
common scale so that a one-unit difference for a given variable has the same
effect as the same difference in another.
o This is especially important for techniques, such as clustering and memory-
based reasoning.
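
A minimal sketch of these single-variable transformations, assuming a small numeric array of illustrative values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])   # e.g., yearly purchases (assumed values)

x_centered = x - x.mean()                          # mean correcting (centering)
x_standardized = (x - x.mean()) / x.std(ddof=1)    # z-scores: mean 0, standard deviation 1
x_rescaled = (x - x.min()) / (x.max() - x.min())   # min-max rescaling to [0, 1]
# All three put variables measured in incompatible units onto a common scale.
```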
• Turning numeric variables into percentiles
o Converting numeric values into percentiles has many of the same advantages as standardization. It translates data onto a new scale that works for any numeric measure, and percentiles do not require knowledge about the underlying distribution.
• Turning Counts into Rates
o Many databases contain counts. When these tallies are for events that occur over time, it makes sense to convert the counts to rates by dividing by some fixed time unit, to get, for example, daily calls or withdrawals per month. This allows customers with different tenures to be compared.
o Note that a customer with 10 purchases in one month since acquisition has the
same rate as one who bought 20 times in two months. Fundamentally, the new
variable represents that.
o A word of caution: When combining data from multiple sources, you must be
sure that they contain data from the same timeframe.
• Replacing Categorical Variables with Numeric Ones
o Numeric variables are sometimes preferable to categorical ones. Certain
modelling techniques —including regression, neural networks, and most
implementations of clustering —prefer numbers and cannot readily accept
categorical inputs.
o A common mistake is to replace categorical values with arbitrary numbers,
which leads to spurious information that algorithms have no way to ignore.
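
As an illustration of the point above, the sketch below contrasts arbitrary numeric codes with dummy (0/1) columns; the region column and its values are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "South", "Centre"]})

# Problematic: arbitrary numeric codes imply a spurious order and distance
df["region_code"] = df["region"].map({"North": 1, "South": 2, "Centre": 3})

# Usually preferable: one dummy (0/1) column per category
dummies = pd.get_dummies(df["region"], prefix="region")
df = pd.concat([df, dummies], axis=1)
print(df)
```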
• Binning/Discretization
o This process consists of dividing the range of a numeric variable into a set of bins/modalities, replacing its (original) value by the bin or modality. This is also extremely useful for highly skewed variables, as outliers are all placed together in one bin (e.g., recency). Dealing with outliers is discussed further later in this chapter.
o The three major approaches to creating bins are:
▪ 1. Equal width binning
▪ 2. Equal weight binning
▪ 3. Supervised binning
o They differ on how the ranges are set. The histogram presents an example of a
distribution of months since the last purchase (recency). This variable is often
used in descriptive or predictive techniques as it reflects an interesting
firmographic characteristic of customers.
• Binning/Discretization – Equal-width and equal-weight binning
o The upper histogram uses five bins of equal width (equal range); the lower one uses five equally sized (equal-frequency) bins. Which one is more advisable?
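
A minimal pandas sketch of equal-width versus equal-weight binning on a simulated, right-skewed recency variable; the distribution and number of bins are assumptions for the example:

```python
import numpy as np
import pandas as pd

# Simulated "months since last purchase" (recency): right-skewed
recency = pd.Series(np.random.default_rng(0).exponential(scale=6, size=1000))

equal_width = pd.cut(recency, bins=5)    # 1. equal-width bins (same range per bin)
equal_freq = pd.qcut(recency, q=5)       # 2. equal-weight bins (same number of customers per bin)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
# Supervised binning would instead pick thresholds that best separate a target variable.
```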

• Binning/Discretization – Supervised binning


o A more sophisticated approach would be to use supervised methods to produce
the five bins’ optimal thresholds (in the case of explanatory analysis).
• Spreading the Histogram
o Some transformations can be employed to correct variables’ distributions. Look
at the example below (monetary: amount spent by customers in the last 18
months).

• Correcting right-skewed distributions

o One can use the square root or a logarithm to correct right-skewed variables. The square root compresses the distribution gently, whereas log(x) and ln(x) apply a stronger correction.
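
A short sketch of these corrections on an assumed right-skewed spend variable:

```python
import numpy as np

monetary = np.array([10.0, 25.0, 40.0, 90.0, 250.0, 1200.0])  # right-skewed spend (assumed values)

soft = np.sqrt(monetary)      # gentle correction
strong = np.log(monetary)     # stronger correction; use np.log1p if zeros are possible
```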

• Combining variables
o One of the most common ways of creating derived variables is by combining two
or more existing ones to expose information that is not present originally.
o Quite commonly, variables are combined by dividing one by the other, as in the credit payments/income ratio or spending on a specific type of product/total spending, but products, sums, differences, squared differences, and even more imaginative combinations also prove useful.
• Classic Combinations – BMI as an example
o Some derived variables have become so well known that it is easy to forget that
someone once had to invent them. Insurance companies track loss ratios, banks
track effort rates, investors pick stocks based on price/earnings ratios, etc.
o A good example is the body mass index (BMI) – weight divided by height squared. The histogram below depicts the association between type II diabetes and BMI.

• Combining Highly Correlated Variables


o In business data, having many strongly correlated variables is relatively common. Although for some types of techniques (e.g., principal components analysis or decision trees) this does not affect performance, for other techniques (e.g., clustering algorithms and linear regression) it does.
o The Degree of Correlation
▪ Another interesting thing to look at with correlated variables is whether
the degree or direction of correlation is different in different
circumstances, or at different levels of a hierarchy.
• Extracting features from time series
o Most customer (or other units) signature tables include time series data (e.g.,
calls sent and received, quarterly spending, etc.). A difficulty with the time series
of quarterly spending is that, at the customer level, it is very sparse. Any one
customer has very few quarters with a purchase. Thus, it does not make sense
to consider whether a customer’s spending is increasing or decreasing over time.
o Thus, how can one extract assertive variables from time series data?

▪ Seasonality
• In the previous slide it is noticeable that sales in the 4th quarter are stronger than in any other. While this pattern is obvious (and plausible) for humans, it is not noticeable by most techniques without derived variables:
o At the very least, one could create a variable indicating whether it is the 4th quarter or not;
o Even better would be to create a categorical variable with the quarter.
• The ideal would be creating a variable with the expected effect
on sales due to seasonality (which is not straightforward) –
stationary series.
• One could represent the quarters with the average difference from the trendline (almost all below, except the 4th quarters): Q1 = -10.633; Q2 = -37.312; Q3 = 38.923; Q4 = 82.837. A small sketch of this kind of derivation appears at the end of this subsection.

▪ Geography
• Location – Arguably an important characteristic of virtually
every business. The key point is discovering what aspects of
location are relevant for a specific problem (similar to the
handset example at the beginning of this chapter).
• Examples are geographic coordinates, temperature, average
income, education, age, etc.
• Using geographic IS
o One reason for geocoding is to be able to place things on a map, using software such as ArcGIS or QGIS (Quantum GIS). Representing data in the form of maps can be extremely useful to gain business insights.
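
Returning to the seasonality point above, a minimal pandas sketch of deriving a quarter variable, a 4th-quarter indicator, and a crude seasonal effect (here measured against the overall mean rather than a fitted trendline); the dates and values are invented for the example:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-31", "2021-06-30", "2021-09-30", "2021-12-31",
                            "2022-03-31", "2022-06-30", "2022-09-30", "2022-12-31"]),
    "spend": [120, 95, 150, 260, 130, 100, 160, 290],
})

sales["quarter"] = sales["date"].dt.quarter           # categorical quarter variable
sales["is_q4"] = (sales["quarter"] == 4).astype(int)  # simple 4th-quarter indicator

# Average deviation of each quarter from the overall mean (a rough seasonal effect)
seasonal_effect = sales.groupby("quarter")["spend"].mean() - sales["spend"].mean()
sales["seasonal_effect"] = sales["quarter"].map(seasonal_effect)
print(sales)
```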

1.4. Missing Values


• Missing Data
o Missing data may be defined as the case where valid values on one or more
variables are not available for analysis. An important reason for this is the fact
that algorithms for data analysis were originally designed for data matrices with
no missing values.
o Missing values are arguably one of the worst nightmares data analysts and
scientists need to handle in the pre-processing stage. Missing values have many
natures and sources. Unfortunately, they also have the power to “ruin” virtually
every analysis as their presence violates almost every method’s assumptions.
o Thus, when one is building a signature table, knowing how to tackle (the very
likely to exist) missing values is an important issue. Usually, but not always,
missing values present higher challenges in numerical data than in categorical
ones.
o The first step is always to identify and recognize missing values as such in the original data source(s), where they may have been hidden or encoded in non-obvious ways. Data sources in this context may be other datasets from the relational database or questions in surveys.
o The need to focus on the reasons for missing data comes from the fact that the
researcher must understand the processes leading to the missing data in order
to select the appropriate course of action.
o Unknown or Non-Existent?
▪ A null or missing value can result from either of two quite different
situations:
• A value exists, but it is unknown because it has not been
captured. Customers do have dates of birth even if they choose
not to reveal them;
• The variable (e.g., question) simply does not apply. The
“Spending in the last six months” field is not defined for
customers who have less than six months of tenure.
▪ This distinction is important because although imputing the unknown
age of a customer might make sense, imputing the non-existent
spending of a non-customer does not. For categorical variables, these
two situations can be distinguished using different codes such as
“Unknown” and “Not Applicable.”
• Types of Missing Values
o Before any decision about handling missing values is made, it is critical to
characterize missing values as one of the following:
▪ 1. Missing completely at random (MCAR) (exam question)
• The missingness does not depend on X or on Y – “Someone simply did not answer.”
▪ 2. Missing (should be “conditionally”) at random (MAR)
• The missingness depends on X but not on Y – “Men (gender X) may be more likely to decline to answer some questions (Y) than women, and not because of the answers themselves.”
▪ 3. Missing not at random (MNAR)
• The missingness depends on Y itself – “Individuals with very high incomes are more likely to decline to answer questions about their income, not because they are men but because they have a very high (or very low) income.”
o What not to do with missing values
▪ 1. Do not throw records away;
▪ 2. Don’t Replace with a “special” numeric value;
▪ 3. Don’t Replace with average, median, or mode.
o What to do (at least consider) with missing values
▪ So, what can one do about missing values? There are several
alternatives.
▪ The goal is always to preserve the information in the non-missing fields
so it can contribute to the model/analysis. When this requires replacing
the missing values with an imputed value, the imputed values should be
chosen to do the least harm.
• 1. Consider doing nothing (some techniques handle missing
values very well);
• 2. Consider multiple models;
• 3. Consider imputation;
• 4. Remember that imputed values should never be surprising.
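
As an illustration of these recommendations, a minimal sketch that keeps a missing-value indicator and uses a neighbour-based (model-based) imputation rather than a global mean/median or a “special” code; the columns, values, and choice of KNNImputer are assumptions for the example:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [34, np.nan, 51, 28, 45],
    "income": [2100, 1800, 3900, 1500, 3200],
    "spend_last_6m": [120.0, 80.0, np.nan, 60.0, 150.0],
})

# 1. Keep track of where values were missing (often informative by itself)
df["age_was_missing"] = df["age"].isna().astype(int)

# 2. Model-based imputation: estimate each missing value from similar records,
#    so the imputed value is "unsurprising" given the other fields
imputer = KNNImputer(n_neighbors=2)
df[["age", "income", "spend_last_6m"]] = imputer.fit_transform(df[["age", "income", "spend_last_6m"]])
print(df)
```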

1.5. Outliers and Sparse Data


• What is an Outlier?
o In statistics, an outlier is an observation point that is distant from other
observations.
o An outlier may be due to variability in the measurement, or it may indicate experimental error, in which case it is sometimes excluded from the data set. Outliers are extreme cases in one or more variables with a great impact on the interpretation of results. They may come from:
▪ Unusual but correct situations (the Bill Gates effect),
▪ Incorrect measurements,
▪ Errors in data collection;
▪ Lack of, or wrong, code for missing data.
o Outlier – Leverage effect

o “If an outlier is impactful enough to noticeably change the mean, then it will probably be worth studying it using some of the techniques we will get to know.”
• Remedies for outliers
o There are several remedies for coping with outliers. These inevitably vary with:
▪ 1. The type of the outlier;
▪ 2. The data in the dataset;
▪ 3. The analytic methods to be employed in the analysis stage;
▪ 4. The distribution of the respective variable;
▪ 5. The “philosophical” approach, i.e., the framing of the problem.
o Remedies for outliers – Automatic limitation / or thresholding
▪ As the name describes, one defines a floor and a ceiling; data points falling outside them are deleted or otherwise addressed (e.g., 16 <= age <= 100; 0 € <= monthly income <= 10,500 €).

o Remedies for outliers – Statistical Criterion


▪ Define, variable by variable, the potential minimum and maximum allowed values based on the distribution and context.

o Remedies for outliers – Manual limitation / thresholding


▪ Tukey’s boxplot can be used to identify outliers in a Gaussian variable.
This method should be used with caution.

o Remedies for outliers – 68 / 95 / 99.7 rule


▪ If the variable follows a Normal distribution, then one can use standard
deviations to define outliers’ thresholds.
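
A minimal sketch of three of these remedies (the 3-standard-deviation rule, Tukey’s boxplot rule, and manual thresholding) on an assumed income variable:

```python
import pandas as pd

income = pd.Series([900, 1200, 1500, 1800, 2100, 2500, 3000, 25000.0])  # one extreme value

# 68 / 95 / 99.7 rule (only sensible if the variable is roughly Normal)
mu, sigma = income.mean(), income.std()
z_outliers = income[(income < mu - 3 * sigma) | (income > mu + 3 * sigma)]

# Tukey's boxplot rule (IQR-based)
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
tukey_outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

# Manual limitation / thresholding: a domain-defined floor and ceiling
capped = income.clip(lower=0, upper=10_500)
print(z_outliers, tukey_outliers, capped, sep="\n\n")
```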
• Multidimensional Outliers
o The problem of identifying outliers is much more complex in the case of multivariate outliers. Multivariate outliers are characterized by having admissible values in individual variables, but not in two or more jointly. Sadly, these are usually the most interesting.

• Is more data always good?


o The dose makes the poison: The issue with too much data is emphasized,
particularly regarding an excess of variables rather than observations.
o Variable overload: Excess variables pose a significant challenge to algorithms
and human approaches, causing complications.
o Sparse input data: Too many variables often result in sparse data, with numerous
zeros or missing values, creating difficulties for many techniques.
o Overfitting risk: The abundance of variables increases the likelihood of
overfitting, where models memorize instead of learning underlying patterns,
especially with small nuances in observations' distribution across the space.
o Computing time impact: While excess observations used to be a concern for
computing time, it's no longer a significant issue.
o Context of predictive modelling/analysis: The problem of too much data is
primarily discussed in the context of predictive modelling and analysis.
o Dimension reduction methods: Strategies to address challenges posed by a high
number of dimensions include reducing their number.
o Variable selection: One approach involves selecting variables based on their
higher explanatory power regarding the target variable.
o Principal components analysis (PCA): Another method is PCA, which combines
original variables to create new (composite) ones. These new variables aim to
condense the most variance possible while minimizing the overall number of
dimensions.
• Problems with too many variables
o Having many variables brings as many problems as benefits.
o While variables are necessary to find patterns in data, an excess of them may also cause:
▪ 1. Risk of correlation among input variables -> Multicollinearity;
▪ 2. Risk of overfitting -> Models will memorize not learn the patterns;
▪ 3. Sparse data -> Too many zeros…
• Handling Sparse Data
o What is the remedy for having too many variables (≈ sparse data)?
o Some options exist:
▪ Feature selection;
▪ Directed variable selection methods;
▪ Principal Components Analysis;
▪ Variable Clustering.
• Types of Variable Reduction Techniques
o Variable reduction techniques: Various methods are available to reduce the
number of variables, and their choice depends on factors such as the use of the
target variable and the derivation of new variables.
o Incorporating the target variable: Many feature selection methods involve using
the target variable to identify the best input variables. While there is a risk of
target leakage, this concern is mitigated by employing a validation set. In data
mining, the focus is often on prediction rather than explaining effects.
o Original vs. derived variables: Choosing between original and new derived
variables depends on the goals. Using a subset of original variables enhances
understandability. However, methods that generate new variables can be
effective, capturing most of the information from the original variables.
• Exhaustive Feature Selection
o One possible way to create the best model, given a set of input variables, is to
exhaustively try all combinations. Doing so would definitely create the best
model but it is virtually impossible.

• Selection of Features
o The most popular way to select which of many input variables to use is by using
sequential selection methods. That is, one variable at a time is considered for,
either inclusion in or exclusion from the model.
o Again, several methods exist:
▪ Forward Selection;
▪ Stepwise Selection;
▪ Backward Selection.
o Hence, at least one measure of performance is needed.
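
As an illustration, a forward-selection sketch using scikit-learn’s SequentialFeatureSelector with cross-validated performance as the required measure; the estimator, data, and number of features to keep are assumptions for the example:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Forward selection: add one variable at a time, keeping the one that most
# improves cross-validated performance
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",   # use "backward" for backward elimination
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected input variables
```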

• Principal Components
o The (first) principal component is a slight variation of the best-fit line. It minimizes the sum of the squares of the perpendicular distances from the line to the data points, not only the vertical distances as linear regression does.
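
A minimal PCA sketch with scikit-learn, on placeholder data, showing the standardize-then-project pattern and the share of variance condensed by each component:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(200, 10))   # placeholder input table

# Standardize first so variables on larger scales do not dominate the components
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)             # keep 3 composite (derived) variables
scores = pca.fit_transform(X_std)     # the new variables, one column per component
print(pca.explained_variance_ratio_)  # share of variance condensed by each component
```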

• Variable Clustering
o Variable clustering concept: Variable clustering extends beyond previous feature
selection techniques by introducing the idea that input variables have inherent
structures among them.
o Analogous to hierarchical analysis: The structure among variables operates
similarly to hierarchical analysis for observations, allowing the selection of a
specific number of variables to model.
o Cluster-specific principal components analysis: In variable clustering, for each
selected cluster of variables, a principal components analysis is executed. The
variables within that cluster are then replaced by the principal components,
contributing to a more streamlined representation of the data.
Chapter 2 – Customer Segmentation
Recap
• Some relevant variables often used in
Segmentation
o Recency – day since last visit/purchase
o Frequency – number of transactions per
customer
o Monetary Value – total value of sales
(different from profit)
o Average Purchase – an average of the
purchase per visit
o Average Time Between Transactions –
Transaction Interval
o Standard Deviation of Transactional Interval
o Customer Stability Index – Standard
Deviation of Transactional Interval / Average
Time Between Transactions
o Relative Spend on Each Product
o Normalized Relative Spend (NRS) – the ratio between the share (%) that product A represents in the spending of a given customer and the share it represents on average across the database
o Average Number of Different Products Purchased per Transaction
o …
o Firmographic and demographic variables

2.1. A-priori Segmentation

2.1.1. Cohort Analysis

• What is Cohort Analysis?


o A cohort is a group of users who share a common characteristic.
o For example, all users with the same acquisition date belong to the same cohort.
o The Cohort Analysis report lets you isolate and analyse cohort behaviour.
▪ Examine individual cohorts to gauge response to short-term marketing
efforts like single-day email campaigns.
▪ See how the behaviour and performance of individual groups of users
changes day to day, week to week, and month to month, relative to
when you acquired those users.
2.1.2. Quantile-based Segmentation

• Quantile-based Segments
o A percentile is a measure used in statistics indicating the value below which a
given percentage of observations in a group of observations fall.
o For example, the 20th percentile is the value (or score) below which 20 percent
of the observations may be found.
o The 25th percentile is also known as the first quartile (Q1), the 50th percentile as
the median or second quartile (Q2), and the 75th percentile as the third quartile
(Q3).
o In general, percentiles and quartiles are specific types of quantiles.
2.1.3. RFM

• Based on the following principles:


o Customers who have purchased more recently are more likely to purchase again;
o Customers who have made more purchases are more likely to purchase again;
o Customers who have made larger purchases are more likely to purchase again.
• Has been in active use in Direct Marketing for more than 50 years;
• It can be used only for customer files that contain purchase history;
• There are two methods:
o Exact Quintiles;
o Hard coding;
• RFM (exact quintiles)
o We sort the customer signature table according to recency (R) and divide into
five quintiles (five equally sized groups) or three terciles (three equally sized
segments);
o Do the same for the frequency and monetary variables (FM);
o Result: 125 cells of equal size (5*5*5) or 27 cells of equal size (3*3*3).
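
A minimal pandas sketch of the exact-quintile approach; the toy values, and the convention that low recency gets the highest score, are assumptions for the example:

```python
import pandas as pd

# Assumed customer signature table with recency, frequency, and monetary columns
rfm = pd.DataFrame({
    "recency":   [5, 40, 12, 90, 3, 25, 60, 7, 15, 33],
    "frequency": [12, 2, 8, 1, 20, 5, 3, 15, 7, 4],
    "monetary":  [340, 50, 210, 20, 800, 120, 75, 560, 180, 95],
})

# Exact quintiles: five equally sized groups per dimension, labelled 1-5.
# Low recency is good, so its labels are reversed.
rfm["R"] = pd.qcut(rfm["recency"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Up to 125 (5*5*5) cells
rfm["RFM_cell"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm)
```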

• RFM (hard coding)


o Hard coding
▪ Categories are divided by exact values (0-3 months; 4-6 months; 7-9
months; etc.);
▪ Less complex in terms of programming, but the categories tend to change over time;
▪ Cell sizes can be very different from cell to cell.
o Its popularity comes from its simplicity, low cost and capacity to classify
customers based on their behaviour;
o Opportunity to carry out tests in small, representative groups of each cell;
o A more sophisticated modelling is almost always better, but is it worth it? Not
always.

2.2. Cluster Analysis

• Cluster Analysis is a basic conceptual activity of human beings;


• A fundamental process, common to many sciences, essential to the development of
scientific theories;
• The possibility of reducing the infinite complexity of reality to sets of similar objects or phenomena is one of the most powerful tools in the service of mankind.
• Cluster analysis is a generic name for a variety of methods that are used to group entities;
• Objective: to form groups of objects that are similar to each other;
• From a data collection about a group of entities, seeks to organize them into
homogeneous groups, assessing a “frame” of similarities/differences between units.

• Classification (not within the scope of this chapter):


o Starts with a pre-classified training set, that is, the method has a set of data
which contains not only the variables to use in classification but also the class to
which each of the records belongs;
o Attempts to develop a model capable of predicting how a new record will be
classified.
• Clustering:
o There is no pre-classified data;
o We search for groups of records (clusters) that are similar to one another;
o Underlying is the expectation that similar customers in terms of the variables
used will behave in similar ways.
Variables to Use

• Cluster Analysis – Variables to Use

• Deciding which variables to use:


o The type of problem determines the variables to choose;
o If the purpose is to group objects, the choice of variables with discrimination
ability is crucial;
o The quality of any cluster analysis is, first of all, conditioned by the variables
used.
o The choice of variables should reflect a theoretical context and reasoning;
o This process is carried out based on a set of variables that we know to be good
discriminators for the problem at hand;
o First of all, the quality of the cluster analysis reflects the discrimination ability of
the variables we decided to use in our study.

• Similarity Criterion - Distances


o Similarity criterion:
▪ The analysis of similarity relations has been dominated by metrics based
on Euclidean Spaces;
▪ Objects as points in a multidimensional space, in a way that the
observed dissimilarities between the objects correspond to distances
between the respective points;
▪ Thus, the use of clustering methods most of the time means the use of similarity measures that respect these metrics:


▪ Summary of the most popular distances

2.2.1. Hierarchical Methods

• Hierarchical Clustering
o Geometrical view of Cluster Analysis
▪ Geometrically, the concept of hierarchical cluster analysis is
straightforward. Consider the following hypothetical data:

[Table: hypothetical data – education (years) and income (thousand $) for subjects S1–S6]


▪ As it is known, each observation can be represented as a point in a p-
dimensional space. In this case, p=2. Let’s suppose that we are
interested in forming three homogeneous groups. An examination of
the observations projected in the two-dimensional space is given in the
next slide.
▪ An examination of the projected subjects suggests that S1 and S2 will
form one group, S3 and S4 form another; whereas S5 and S6 form the
third one.

▪ Dissimilarity Matrix
• Dissimilarity Matrix for the hypothetical data using the Squared
Euclidean Distance.

• The question is then how can one use the similarities given for
forming the groups or clusters? The answer to this question lies
in the two main types of analytic clustering techniques,
hierarchical and non-hierarchical.
▪ As can be seen, cluster analysis groups observations such that the observations in each group are similar with respect to the clustering variables.
▪ The graphical procedures for identifying clusters may not be feasible
when we have many observations or when we have more than three
variables or dimensions. What is needed in such cases is an analytical
technique for identifying groups or clusters of points in a given
dimensional space.
o Hierarchical Clustering Techniques
▪ From the table cluster analysis, subjects S1 and S2 are similar to each
other, as are subjects S3 and S4, since the Squared Euclidean distance
between these two pairs of points is the same –two. Either of these two
pairs could be selected as the first pair to be formed. The tie is broken
randomly. Nevertheless, let us choose subjects S1 and S2 and merge
these two individuals into one cluster. We now have five clusters: cluster
1 comprised of S1 and S2; cluster 2 comprised of S3; cluster 3 comprised
of S4; cluster 4 comprised of S5; and cluster 5 comprised of S6. The next
step would be to develop another similarity matrix, given by the
previous 5 clusters.
▪ The algorithm finishes when all observations are in the same cluster
(which is not a good solution as we can't differentiate any customer
from another!).
▪ The question that arises is the following: since cluster 1 is comprised of S1 and S2, we must define a rule for determining the distance between
this cluster with every other subject included in our data set. The answer
to this question is what differentiates between the various hierarchical
clustering algorithms.
▪ In this course the following hierarchical methods are briefly introduced:
• CENTROID METHOD;
o In the Centroid Method each group is represented by
an Average Subject which is the centroid of that group.
For example, the first cluster is represented by the
centroid of the Subjects S1 and S2. In other words,
Cluster 1 has an average education of 5.5 years and an
average income of 5.5 thousand dollars.
o The next table gives the similarity between the clusters, obtained using the Squared Euclidean Distance:

o As it can be seen, S3 and S4 have the smallest distance


and, therefore, are most similar. Hence, we can group
these two subjects into a new cluster. Again, this new
cluster will be represented by the centroid of its
observations.
o The next table gives the similarity between the clusters, obtained using the Squared Euclidean Distance:
o As it can be seen, S5 and S6 have the smallest distance
and, therefore, are most similar. Hence, we can group
these two subjects into a new cluster. The similarity is
as follows:

o As it can be seen from the previous table, the clusters


comprised of Subjects (S3 & S4) and (S5 & S6) have the smallest distance. Therefore, these two clusters are combined to form a new cluster comprised of Subjects (S3 & S4 & S5 & S6). The other cluster consists of Subjects S1 and S2. Naturally, the next step would be to group all the subjects into one cluster. Note what this would mean in practical terms…
o It is now obvious that the hierarchical cluster algorithm
forms clusters hierarchically. In other words, the
number of clusters at each stage is one less than the
previous one. If there are n observations, then at step 1, step 2, step 3, …, step n-1 of the hierarchical process the number of clusters will be, respectively, n-1, n-2, n-3, …, 1. In the case of the Centroid method, each cluster
is represented by the centroid of that cluster for
computing the distances between the clusters.
o Given the iterative process of hierarchical clustering, it
is very frequent to plot the formation path of the
observations in what is called a dendrogram or tree. In
these graphic representations, the observations are
listed on the horizontal axis and the Squared Euclidean
Distance between the centroids on the vertical one.
o Note that in the case where we have a high number of
observations the dendrogram may not be very useful.
• NEAREST-NEIGHBOUR OR SINGLE-LINKAGE METHOD;
o Consider the similarity matrix without any groups of
subjects yet formed.

o In the Centroid Method, the distance between clusters was obtained by computing the Squared Euclidean Distance between the centroids of the respective clusters. In the Single-Linkage Method, the distance between two clusters is given by the minimum of the distances between their members, for example, between cluster 1 (S1 & S2) and Subject S3.
o In this case we have Min(181; 145)=145.
o The process then continues iteratively until only one cluster is formed.

• FARTHEST-NEIGHBOUR OR COMPLETE-LINKAGE METHOD;


o The Complete Method is the exact opposite of the
Single Method.
o The distance between any two clusters is given by the
maximum of the distances between all possible pairs of
observations in the two clusters. Once again, consider
the initial similarity matrix given for this hypothetical
data:

o The distance between the two clusters is represented


by the maximum of the distance between cluster 1 (S1
& S2) and Subject S3.
o In this case we have Max(181; 145)=181.
o The process then continues iteratively until only one cluster is formed.

• AVERAGE-LINKAGE METHOD;
o In the Average Method the distance between two
clusters is obtained by taking the average distance
between all pairs of subjects in the two clusters. Once
again, consider the initial similarity matrix given for this
hypothetical data:

o The distance between the two clusters is represented


by the average of the distance between cluster 1 (S1 &
S2) and Subject S3.
o In this case we have Average(181; 145)=163.
o The process then continues iteratively until only one cluster is formed.

• WARD’S METHOD.
o Ward’s method does not compute distances
between clusters. Rather, it forms clusters by
maximizing within-cluster homogeneity. The within-
group sum of squares is used as a measure of
homogeneity.
o Clusters are formed at each step in such a way that the
resulting cluster solution has the fewest within-cluster
sums of squares. This measure is also known as Error
Sum of Squares (ESS). The first two iterations of Ward’s
method are presented in the next slide.
o The ESS is computed as the sum, over all clusters, of the squared deviations of each observation from its cluster centroid: ESS = Σk Σi∈Ck (xi − x̄k)².
o The process then continues iteratively until only one cluster is formed.

o Hierarchical Methods Overview


▪ Centroid: less sensitive to outliers than other methods; may present some limitations when the clusters have very different sizes, since individuals tend to be agglomerated into the larger clusters;
▪ Single-Linkage: tends to produce elongated clusters, which may involve
very different individuals in the same cluster;
▪ Complete-Linkage: particularly sensitive to outliers as it tends to
produce clusters with similar diameters;
▪ Average-Linkage: tends to combine clusters with small and similar
variances;
▪ Ward’s Method: tends to combine clusters with a small and similar number of observations and is also very sensitive to outliers.

2.2.2. K-Means Algorithm

• Non-Hierarchical Cluster Analysis (K-Means) Algorithm


o The goal is to minimize intra-group variance (sum of squared error): SSE = Σi=1..K Σx∈Ci dist(x, mi)²
o x is a data point in cluster Ci and mi is the representative point (centroid) for


cluster Ci
o One easy way to reduce SSE is to increase K (number of clusters)
o A good clustering with smaller K can have a lower SSE than a poor clustering
with higher K
o K-means is a partitional clustering algorithm
o Let the set of data points (or instances) D be
▪ {x1, x2, …, xn},
▪ where xi = (xi1, xi2, …, xir) is a vector in a real-valued space, and r is the
number of attributes (dimensions) in the data.
o The k-means algorithm partitions the given data into k clusters.
▪ Each cluster has a cluster centre, called centroid.
▪ k is specified by the user
▪ k << n.
o Classifies the data into K groups, by satisfying the following requirements:
▪ each group contains at least one point;
▪ each point belongs to exactly one cluster.
o Given k, the partition method creates an initial partition (typically randomly);
o Next, uses an iterative relocation technique that tries to improve the partition,
moving objects from one group to another;
o Generically, the criterion for good partitioning is that objects belonging to the
same cluster should be close or related to each other.
o Steps:
1. Choose the seeds;
2. Each individual is associated with the nearest seed;
3. Calculate the centroids of the formed clusters;
4. Go back to step 2;
5. End when the centroids no longer move (cease to be re-centred).
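
A minimal scikit-learn sketch of these steps; the data is a placeholder, and n_init re-runs the seeding several times, which relates to the seed-sensitivity issue discussed below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(300, 4))   # placeholder signature variables

# Standardize first: k-means relies on Euclidean distances
X_std = StandardScaler().fit_transform(X)

# n_init repeats the algorithm with different random seeds and keeps the best run
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X_std)

print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # SSE (within-cluster sum of squared errors)
```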
• K-Means Algorithm (Geometric Approach)
o Initial Data

o Outliers Identification

o Outliers Removal

o Iterative Process
o Final Solution
• Non-Hierarchical Cluster Analysis (K-Means)
o K-means algorithm (strengths)
▪ Simple: easy to understand and to implement
▪ Efficient: Time complexity O(tkn),
• Where n is the number of data points
• k is the number of clusters
• t is the number of iterations
▪ Since both k and t are small, k-means is considered a linear algorithm.
▪ K-means is the most popular clustering algorithm.
▪ Note: it terminates at a local optimum if SSE is used. The global optimum is hard to find due to its complexity.
o K-means algorithm (weaknesses)
▪ Very sensitive to the existence of outliers
▪ Very sensitive to the initial positions of the seeds
▪ Partitioning methods work well with spherical-shaped clusters
▪ Partitioning methods are not the most suitable for finding clusters with
complex shapes and different densities
▪ The need to set from the start the number of clusters to create

▪ The algorithm is sensitive to initial seeds. Possible remedies:
▪ Use multiple forms of initialization
▪ Re-initialize several times
▪ Use more than one method
▪ Use a relatively large number of clusters and then regroup them by the choice of centroids

▪ Has difficulties in dealing with clusters of different sizes and densities.

▪ Each individual either belongs or does not belong to the cluster, having
no notion of probability of belonging, in other words, there is no
consideration of the quality of the representation of a particular
individual in a given cluster.
o K-means algorithm the number of clusters

▪ This is always a difficult problem to solve, and there are no recipes to fix
this.
▪ One way to minimize the problem is to create various classifications with
different K and choose the best.
▪ Use a hierarchical method in order to choose the number of clusters
based on the dendrogram.
▪ The choice should be guided by three fundamental criteria:
• Intra-cluster variance
• Evaluation of the profile of the cluster (subjective)
• Operational considerations
▪ Regarding the first criterion, the analysis is simple and not too
subjective, since we know that the lower the intra-cluster variability the
greater the cohesion of the cluster, a highly desirable feature in this type
of analysis. However, as k increases, variability decreases;
▪ Regarding the second criterion, the question is not as simple in the
sense that it requires much more subjective assessments, which relate
to the interpretation of the obtained clusters;
▪ The third criterion is relatively simple in the sense that these issues are
imposed on the analyst.
▪ Test the results by varying k (the number of clusters);
▪ This procedure allows a series of analyses that can inform the choice of the number of clusters;
▪ Compare the R-squared of the different solutions (as done in the practical classes with Excel); a small sketch of this comparison appears at the end of this subsection.
▪ Operational considerations are related to the business environment and
usually affect the decisions of the analyst:
• A number of clusters small enough to allow developing a specific strategy for each;
• Each cluster large enough (in number of individuals) to make a specific strategy worthwhile;
• A good way to accommodate these considerations is to use a high initial k and then proceed to group clusters.
▪ This evaluation is carried out by comparing the mean values for each
variable in each cluster with the mean values of the population for each
variable;
▪ In this case, it is particularly relevant to take into account the most
important differences within the different clusters and the mean
population.
▪ That is why profiling is so important.
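
As referenced above, a small sketch of comparing solutions with different k through the SSE (inertia) and an R-squared-style ratio; the data is a placeholder and the range of k is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(300, 4))   # placeholder standardized data
sst = ((X - X.mean(axis=0)) ** 2).sum()              # total sum of squares

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    r2 = 1 - km.inertia_ / sst                       # share of variance "explained" by the clusters
    print(k, round(km.inertia_, 1), round(r2, 3))
# Look for the k after which the gains in R2 (or drops in SSE) become marginal,
# then weigh it against profile interpretability and operational considerations.
```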

2.2.3. Profiling

• Cluster Analysis – Profiling


o Profiling (size of the clusters)

o Profiling (comparing averages)

o Profiling (comparing profiles)


o Profiles (leverage)
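
A minimal pandas sketch of these profiling comparisons (cluster sizes, cluster averages versus the population averages); the table and cluster labels are invented for the example:

```python
import pandas as pd

# Assumed table: clustering variables plus a "cluster" label per customer
df = pd.DataFrame({
    "cluster":   [0, 0, 1, 1, 2, 2],
    "recency":   [5, 8, 40, 55, 12, 10],
    "frequency": [12, 9, 2, 1, 6, 7],
    "monetary":  [400, 350, 60, 30, 150, 180],
})

sizes = df["cluster"].value_counts()                  # size of each cluster
cluster_means = df.groupby("cluster").mean()          # average profile per cluster
population_means = df.drop(columns="cluster").mean()  # overall averages

# Leverage-style comparison: how far each cluster sits from the population mean
profile = cluster_means / population_means - 1
print(sizes, profile, sep="\n\n")
```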

• Migration
Chapter 3 – Predictive Modelling
3.1. Introduction
• Predictive Models

• Learning
o Inductive VS Deductive

▪ Inductive: This cat is black. That cat is black. A third cat is black.
Therefore, all cats are black.
• (In inductive reasoning, the conclusion is reached by
extrapolating from specific cases to general rules)
▪ Deductive: All men are mortal. Joe is a man. Therefore, Joe is
mortal.
• (In deductive reasoning, a conclusion is reached by
applying general rules to specific instances)
• Types of Models
o Model Development
▪ Typically, mathematical equations that characterize the relationship
between inputs and outputs (that map the inputs to the output);
▪ The core issue of modeling is to formulate these equations;
▪ The best way to model is to formulate equations that define how
outputs can be calculated from inputs.
o Knowledge-based techniques (deterministic models)
▪ How long does it take a stone dropped from h meters to hit the ground (on Earth)?
• t = √(2h / g) seconds, with g ≈ 9.8 m/s²
▪ Most of the problems are nowhere near as well understood as gravity
o Non-deterministic
▪ When faced with a regression problem, we can have a good idea of how
inputs and outputs are related, but not as good as to define a
deterministic model;
▪ A rock falling on another planet (different gravitational acceleration).

• Assumption-based models

• Non-parametric (flexible) models (data-driven)


• Overfitting

3.2. Overfitting and Data Partition


• Overfitting
o The objective is to fit a "model" to a set of training data, to make reliable
predictions on unseen test data.
▪ In overfitting, a model describes random error or noise instead of the
underlying relationship.
▪ Overfitting occurs when a model is excessively complex, such as having
too many parameters relative to the number of observations.
▪ A model that has been overfitted has poor predictive performance, as it
overreacts to minor fluctuations in the training data.

o The objective is not to learn about the training set but about the “unseen”
instances of the problem! How to prepare for the unknown?
▪ Have a “spare” dataset!!!!
▪ To ensure that the knowledge extracted by the tool is generalizable to
the universe of interest;
▪ If the model is being evaluated by the number of mistakes it makes, then
it will try to make the least mistakes possible.
o Overfitting occurs when a model begins to "memorize" training data rather than "learning" to generalize from a trend.
▪ As an extreme example, if the number of parameters is the same as or greater than the number of observations, a model or learning process can perfectly predict the training data simply by memorizing it in its entirety, but such a model will typically fail drastically when making predictions about new or unseen data, since it has not learned to generalize at all.

• Underfitting
o The objective is to fit a “model” to a set of training data, to make reliable
predictions on unseen test data.
▪ Underfitting occurs when a model or machine learning algorithm cannot
capture the underlying trend of the data.
▪ Underfitting would occur, for example, when fitting a linear model to
non-linear data. Such a model would have poor predictive performance.
• Data Partition
o Hold-out
▪ Training and validation

o Training set
▪ The bigger, the better the classifier obtained
o Validation set
▪ The bigger, the better the estimation of the optimal training
o Testing set
▪ The bigger, the better the estimation of the performance of the classifier
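
A minimal hold-out sketch with scikit-learn; the 60/20/20 proportions are an assumption for the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 60% training, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```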

• Approach

• Data Partition: Hold-out


o Hold-out
▪ Training, validation, and test

• Data Partition: Cross-validation


o k-fold cross-validation
▪ The original sample is randomly partitioned into k equal-sized
subsamples.
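
A minimal k-fold cross-validation sketch with scikit-learn; k = 5 and the classifier choice are assumptions for the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 5 equal-sized folds: each fold is used once for validation, the rest for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores, scores.mean())
```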

3.3. k-Nearest Neighbour


• k Nearest
• k Nearest Centroid
o The closest centroid
▪ Measure the Euclidean distance to all centroids and choose the closest
one.

▪ Define k and find the k-nearest

▪ Assign the class of the majority

o Instance-based classification
▪ Simplest form of learning;
▪ Training instances are searched for instances that most closely resemble
new instances;
▪ The instances themselves represent the knowledge;
▪ Also called instance-based learning;
▪ Similarity function defines what’s “learned”.
o Requires three things
▪ The set of stored records;
▪ A distance metric to compute the distance between records;
▪ The value of k, the number of nearest neighbours to retrieve.
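
A minimal scikit-learn sketch covering the three requirements above (stored records, a distance metric, and k); the data and k = 5 are assumptions for the example:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=6, random_state=0)

knn = make_pipeline(
    StandardScaler(),                                    # distances need comparable scales
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)
knn.fit(X, y)                  # "training" here essentially means storing the records
print(knn.predict(X[:3]))      # class of the majority of the 5 nearest neighbours
```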

o Assumptions

▪ Notes: Multicollinearity means that there is a high degree of linear


correlation amongst independent variables. In heteroscedasticity, the
variance of the dependent variable is non-constant over different values
of the independent variables.
• Similarity Measures
o Selecting a measure of distance
▪ Euclidean Distance: dij = √( Σk (xik − xjk)² )
▪ Squared-Euclidean Distance: dij = Σk (xik − xjk)²
▪ City-Block (Manhattan) Distance: dij = Σk |xik − xjk|
▪ Minkowski Distance: dij = ( Σk |xik − xjk|^m )^(1/m)
▪ Mahalanobis Distance: dij = √( (xi − xj)' S⁻¹ (xi − xj) ), where S is the covariance matrix
(the sums run over the variables k = 1, …, p)
o Where dij is the specific distance between subjects i and j, xik is the value of the
kth variable for the ith subject, xjk is the value of the kth variable for the jth
subject, and p is the number of variables.

3.4. Decision Trees


• Decision Trees
o Decision Trees
▪ Decision trees are typically considered to be classification and regression tools;
▪ One of their most important advantages relates to the simplicity of the interpretation of their results;
▪ In some problems we are just interested
in achieving the best precision possible;
▪ In others we are more interested in
understanding the results and the way
the model is producing the estimates;
▪ Thus, the result of a classification tree can
easily be expressed in English or SQL.
o Classification Trees
▪ Example
o Regression Trees
▪ Example

o Advantages and Disadvantages


▪ Advantages
• Easy to understand;
• Easy to implement;
• Are non-parametric models.
▪ Disadvantages
• Usually have poorer performance than other methods;
• Usually require categorical target variables;
• Decision boundaries are perpendicular to the axes.
o Assumptions

▪ Notes: Multicollinearity means that there is a high degree of linear


correlation amongst independent variables. In heteroscedasticity, the
variance of the dependent variable is non-constant over different values
of the independent variables.

• Avoid Overfitting
o An induced tree may overfit the training data
▪ Too many branches, some may reflect anomalies due to noise or outliers
▪ Poor accuracy for unseen samples
o Two approaches to avoid overfitting
▪ Prepruning:
• Do not split a node if this would result in the goodness measure
falling below a threshold
• Difficult to choose an appropriate threshold
▪ Postpruning:
• Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best-pruned tree”
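
A sketch of both approaches with scikit-learn decision trees; the depth, leaf-size, and ccp_alpha values are assumptions for the example, not recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop splitting early via depth / minimum leaf-size thresholds
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0).fit(X_train, y_train)

# Postpruning: grow the tree, then prune back with cost-complexity pruning (ccp_alpha),
# using held-out data to pick the best-pruned tree
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```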

3.5. Linear Regression


• Linear Regression – not lectured, hence not subject to questions in the exam
o Regression Analysis
▪ The objective of regression analysis is to use the independent variables
whose values are known to predict the single dependent value selected
by the researcher.
▪ It is used to analyse the impact of a set of explanatory variables on the
target variable
▪ Only one variable is explained by the model, the remainder being
considered explanatory.
▪ Each independent variable is weighted by the regression analysis
procedure to ensure maximal prediction from the set of independent
variables.

▪ The weights (β) denote the relative contribution of the independent


variables to the dependent variable and facilitate interpretation as to
the influence of each variable in making the prediction.
▪ Simple Regression Analysis

▪ Setting a baseline: prediction without an independent variable
• Baseline: predicted number of credit cards = average number of credit cards (the mean of the dependent variable);
• Regression then tries to find a better prediction than the mean.
▪ Simple regression: predicted number of credit cards = b0 + b1 × X1, where b1 is the change in the number of credit cards associated with a unit change in X1, and X1 is the value of the independent variable.
▪ When estimating a regression equation, it is usually beneficial to include a constant (b0), which is termed the intercept.

▪ Multiple Regression Analysis

o Assumptions
▪ The assumptions to be examined are in four areas:
• Linearity of the phenomenon measured.
• Constant variance of the error terms.
• Independence of the error terms.
• Normality of the error term distribution.
o Assumptions

▪ Notes: Multicollinearity means that there is a high degree of linear


correlation amongst independent variables. In heteroscedasticity, the
variance of the dependent variable is non-constant over different values
of the independent variables.
3.6. Neural Networks
• Neural Networks
o Neural networks (the “artificial” is usually dropped) are a powerful technique for prediction, estimation, and classification problems. Their applications include detecting fraudulent insurance claims, modelling attrition, and recognizing satellite images, among many others.
o They are inspired by the biological neuron.

o Combination Function
▪ The combination function typically uses a set of weights assigned to
each of the inputs.
▪ A typical combination function is the weighted sum, where each input
is multiplied by its weight and these products are added together. The
weighted sum is the default in most data mining tools.

o Transfer Functions
▪ The choice of transfer function determines how closely the artificial
neuron mimics the behaviour of a biological neuron, which exhibits all-
or-nothing responses for its inputs.
▪ To mimic the biological process, a step function, which has the value 1
when the weighted sum of the inputs is above some threshold, and 0
otherwise, is appropriate.
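
A minimal sketch of a single artificial neuron combining a weighted-sum combination function with a step (or sigmoid) transfer function; the weights and inputs are invented for the example:

```python
import numpy as np

def neuron(inputs, weights, bias, transfer="step"):
    """One artificial neuron: weighted-sum combination + a transfer function."""
    z = np.dot(inputs, weights) + bias          # combination function (weighted sum)
    if transfer == "step":                      # all-or-nothing, like a biological neuron
        return 1.0 if z > 0 else 0.0
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid: a smooth, differentiable alternative

print(neuron(np.array([0.5, 0.2]), np.array([0.8, -0.4]), bias=-0.1))
```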
o Multilayer networks
▪ A multilayer network has an input layer, where data enters the network, and a second layer, known as the hidden layer, comprised of artificial neurons, each of which receives multiple inputs from the input layer. The artificial neurons summarize their inputs and pass the results to the output layer, where they are combined again. Networks with this architecture are called multilayer perceptrons (MLPs).
▪ If the input data presents imbalanced scales, standardizing data is a
good idea, as discussed in the next chapter.

o Estimation and error


▪ The process of training is just adjusting weights, usually through back
propagation:
• The network gets a training example and, using the existing
weights in the network, it calculates the output or outputs.
Existing weights are first randomly set.
• Back propagation then calculates the error by taking the
difference between the calculated result and the actual target
value.
• The error is fed back through the network and the weights are
adjusted to minimize the error — hence the name back
propagation because the errors are sent back through the
network.
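
As an illustration, a small multilayer perceptron trained by backpropagation with scikit-learn; the hidden-layer size and data are assumptions, and inputs are standardized as suggested above:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# One hidden layer of artificial neurons; weights start random and are then
# adjusted by backpropagation to reduce the prediction error
mlp = make_pipeline(
    StandardScaler(),                       # standardize imbalanced input scales
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=0),
)
mlp.fit(X, y)
print(mlp.score(X, y))
```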
o Assumptions

▪ Notes: Multicollinearity means that there is a high degree of linear


correlation amongst independent variables. In heteroscedasticity, the
variance of the dependent variable is non-constant over different values
of the independent variables.

• Neural Networks (backpropagation)

3.7. Model Assessment


• Model Assessment – Confusion Matrix

• Assessment Metrics (binary target)
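
A minimal sketch of the confusion matrix and the usual binary metrics derived from it; the toy labels are invented for the example:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN), also called sensitivity
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```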


• Assessment Metrics (metric target)
o If we are performing regression instead of classification
▪ How to evaluate the models?
▪ Since predictors return a continuous value rather than a categorical
label, it is difficult to say exactly whether the predicted value is correct
▪ What matters is how far off the predicted value y'i is from the actual known value yi
▪ Loss functions measure the error between the predicted value (y'i) and the target (yi)
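
A short sketch of common loss functions for a metric target; the toy values are invented for the example:

```python
import numpy as np

y_true = np.array([3.0, 5.5, 2.0, 8.0])    # actual values
y_pred = np.array([2.5, 6.0, 2.5, 7.0])    # predicted values

mae = np.mean(np.abs(y_pred - y_true))     # mean absolute error
mse = np.mean((y_pred - y_true) ** 2)      # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error
print(mae, mse, rmse)
```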
