Data Science
Structured Data
Structured data is organized in a predefined format, typically as rows and columns, which makes it easier to store, query, and analyze.
Data Pre-processing
Data in the real world is often dirty; that is, it is in need of being cleaned up before it can be
used for a desired purpose. This is often called data pre-processing. What makes data “dirty”?
Here are some of the factors that indicate that data is not clean or ready to process:
Incomplete. When some of the attribute values are lacking, certain attributes of interest
are lacking, or attributes contain only aggregate data.
Noisy. When data contains errors or outliers. For example, some of the data points in a
dataset may contain extreme values that can severely affect the dataset’s range.
Inconsistent. Data contains discrepancies in codes or names. For example, if the
“Name” column for registration records of employees contains values other than
alphabetical letters, or if records do not start with a capital letter, discrepancies are
present.
The figure below shows the most important tasks involved in data pre-processing.
Fig: Forms of data pre-processing (N.H. Son, Data Cleaning and Data Pre-processing).
Data Cleaning:
Since there are several reasons why data could be “dirty,” there are just as many ways to
“clean” it. For this discussion, we will look at three key methods that describe ways in which
data may be “cleaned,” or better organized, or scrubbed of potentially incorrect, incomplete, or
duplicated information.
i) Data Munging:
Often, the data is not in a format that is easy to work with. For example, it may be stored
or presented in a way that is hard to process. Thus, we need to convert it to something more
suitable for a computer to understand.
Consider the following text recipe.
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
This can be turned into a table (Table 2.2).
This table conveys the same information as the text, but it is more “analysis friendly.” Of
course, the real question is – How did that sentence get turned into the table? A not-so-
encouraging answer is “using whatever means necessary”! I know that is not what you want to
hear because it does not sound systematic. Unfortunately,
often there is no better or systematic method for wrangling. Not surprisingly, there are people
who are hired to do specifically just this – wrangle ill-formatted data into something more
manageable.
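For illustration, the target "analysis friendly" representation of the recipe could be built directly with pandas. Since, as noted above, there is no single systematic method for this step, the values here are extracted by hand, and the column names and units are illustrative assumptions:

import pandas as pd

recipe = "Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix."

# Hand-extracted structure (column names and units are assumptions)
table = pd.DataFrame({
    "Ingredient": ["tomato", "garlic", "salt"],
    "Quantity":   [2, 3, 1],
    "Unit":       ["whole (diced)", "clove", "pinch"],
})
print(table)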
Handling Missing Data
Sometimes data may be in the right format, but some of the values are missing.
Other times data may be missing due to problems with the process of collecting data, or an
equipment malfunction. Or, comprehensiveness may not have been considered important at the
time of collection. For instance, when we started collecting that customer data, it was limited
to a certain city or region, and so the area code for a phone number was not necessary to collect.
Well, we may be in trouble once we decide to expand beyond that city or region, because now
we will have numbers from all kinds of area codes.
So, what to do when we encounter missing data? There is no single good answer. We need to
find a suitable strategy based on the situation. Strategies to combat missing data include
ignoring that record, using a global constant to fill in all missing values, imputation,
inference-based solutions (Bayesian formula or a decision tree), etc. We will revisit some of
these inference techniques later in the book in chapters on machine learning and data mining.
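A small pandas sketch of these strategies; the column names and the sentinel constant are illustrative assumptions:

import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Ana", "Ben", "Cal"],
                   "area_code": ["212", None, "415"],
                   "age": [34, np.nan, 29]})

dropped  = df.dropna()                                        # ignore records with missing values
constant = df.fillna({"area_code": "000"})                    # fill a column with a global constant
imputed  = df.assign(age=df["age"].fillna(df["age"].mean()))  # mean imputation
print(dropped, constant, imputed, sep="\n\n")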
Data Integration
To be as efficient and effective for various data analyses as possible, data from various sources
commonly needs to be integrated. The following steps describe how to integrate multiple
databases or files.
1. Combine data from multiple sources into a coherent storage place (e.g., a single file or
a database).
2. Engage in schema integration, or the combining of metadata from different sources.
3. Detect and resolve data value conflicts.
For example:
a. A conflict may arise when, for instance, different sources provide different attributes and values
for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for
example, metric vs. British units.
4. Address redundant data in data integration. Redundant data is commonly generated in
the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example, annual
revenue.
c. Correlation analysis may detect instances of redundant data.
Data Transformation
Data must be transformed so it is consistent and readable (by a system). The following five
processes may be used for data transformation. For the time being, do not worry if these seem
too abstract. We will revisit some of them in the next section as we work through an example
of data pre-processing.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Scale values to fall within a small, specified range (a sketch follows this list). Some of the
techniques that are used for accomplishing normalization (but we will not be covering
them here) are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction.
a. New attributes constructed from the given ones.
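A minimal sketch of the three normalization techniques named above, applied to a short list of made-up values:

import numpy as np

x = np.array([200.0, 300.0, 450.0, 600.0, 1250.0])

min_max = (x - x.min()) / (x.max() - x.min())      # min-max: scaled to [0, 1]
z_score = (x - x.mean()) / x.std()                 # z-score: mean 0, unit variance
j = np.ceil(np.log10(np.abs(x).max()))             # decimal scaling: divide by 10^j
decimal_scaled = x / (10 ** j)

print(min_max)
print(z_score)
print(decimal_scaled)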
Data Reduction
Data reduction is a key process in which a reduced representation of a dataset that produces the
same or similar analytical results is obtained. One example of a large dataset that could warrant
reduction is a data cube. Data cubes are multidimensional sets of data that can be stored in a
spreadsheet. But do not let the name fool you. A data cube could be
in two, three, or a higher dimension. Each dimension typically represents an attribute of
interest. Now, consider that you are trying to make a decision using this multidimensional data.
Sure, each of its attributes (dimensions) provides some information, but perhaps not
all of them are equally useful for a given situation. In fact, often we could reduce information
from all those dimensions to something much smaller and manageable
without losing much.
This leads us to two of the most common techniques used for data reduction.
1. Data Cube Aggregation. The lowest level of a data cube is the aggregated data for an
individual entity of interest. To do this, use the smallest representation that is sufficient
to address the given task. In other words, we reduce the data to its more meaningful
size and structure for the task at hand.
2. Dimensionality Reduction. In contrast with the data cube aggregation method, where
the data reduction is driven by the task at hand, the dimensionality reduction
method works with respect to the nature of the data. Here, a dimension or a column in
your data spreadsheet is referred to as a “feature,” and the goal of the process is to
identify which features to remove or collapse to a combined feature. This requires
identifying redundancy.
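As an illustration, principal component analysis (PCA) is one common way to collapse correlated features into a smaller set of combined features. A hedged scikit-learn sketch on random, purely illustrative data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 rows, 10 "features"

pca = PCA(n_components=3)               # collapse 10 columns into 3 combined features
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # how much variance each component keeps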
Data Analysis and Data Analytics:
These two terms – data analysis and data analytics – are often used interchangeably and could
be confusing. Data analysis refers to hands-on data exploration and evaluation. Data analytics
is a broader term and includes data analysis as [a] necessary subcomponent. Analytics defines
the science behind the analysis.
The science means understanding the cognitive processes an analyst uses to understand
problems and explore data in meaningful ways.
One way to understand the difference between analysis and analytics is to think in terms of
past and future. Analysis looks backwards, providing marketers with a historical view of what
has happened. Analytics, on the other hand, models the future or predicts a result.
Analytics makes extensive use of mathematics and statistics and the use of descriptive
techniques and predictive models to gain valuable knowledge from data. These insights from
data are used to recommend action or to guide decision-making in a business context.
Thus, analytics is not so much concerned with individual analysis or analysis steps, but with
the entire methodology.
Descriptive Analysis:
Descriptive analysis is about: “What is happening now based on incoming data.” It is a method
for quantitatively describing the main features of a collection of data. Here are a few key points
about descriptive analysis:
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data, such as census data.
• Description and interpretation processes are different steps
Descriptive analysis can be useful in the sales cycle, for example, to categorize customers by
their likely product preferences and purchasing patterns. Another example is the Census Data
Set, where descriptive analysis is applied to a whole population.
Frequency Distribution:
Of course, data needs to be displayed. Once some data has been collected, it is useful to plot a
graph showing how many times each score occurs. This is known as a frequency distribution.
Frequency distributions come in different shapes and sizes. The following are some of the ways
in which statisticians can present numerical findings.
Histogram. Histograms plot values of observations on the horizontal axis, with a bar showing
how many times each value occurred in the dataset.
Normal Distribution. In an ideal world, data would be distributed symmetrically around the
center of all scores. Thus, if we drew a vertical line through the center of a distribution, both
sides should look the same. This so-called normal distribution is characterized by a bell-shaped
curve, an example of which is shown in Figure 3.4.
There are two ways in which a distribution can deviate from normal:
• Lack of symmetry (called skew)
• Pointiness (called kurtosis)
As shown in Figure 3.5, a skewed distribution can be either positively skewed (Figure 3.5a)
or negatively skewed (Figure 3.5b).
Kurtosis, on the other hand, refers to the degree to which scores cluster in the tails of a
distribution and how “pointy” the distribution is: a relatively flat distribution is called platykurtic,
while a pointed, heavy-tailed one is called leptokurtic, as shown in the figure below.
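Skew and kurtosis can also be checked numerically. A small SciPy sketch on illustrative random samples (an exponential sample is positively skewed and leptokurtic):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
normal_like = rng.normal(size=10_000)
skewed      = rng.exponential(size=10_000)

print(skew(normal_like), kurtosis(normal_like))   # both close to 0 for a normal sample
print(skew(skewed), kurtosis(skewed))             # positive skew, positive excess kurtosis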
Diagnostic Analytics
Correlations
Correlation is a statistical analysis that is used to measure and describe the strength and
direction of the relationship between two variables. Strength indicates how closely two
variables are related to each other, and direction indicates how one variable would change its
value as the value of the other variable changes.
Correlation is a simple statistical measure that examines how two variables change together
over time. Take, for example, “umbrella” and “rain.” If someone who grew up in a place where
it never rained saw rain for the first time, this person would observe that, whenever it rains,
people use umbrellas. They may also notice that, on dry days, folks do not carry umbrellas. By
definition, “rain” and “umbrella” are said
to be correlated! More specifically, this relationship is strong and positive.
An important statistic, the Pearson’s r correlation, is widely used to measure the degree of the
relationship between linear related variables. When examining the stock market, for example,
the Pearson’s r correlation can measure the degree to which two commodities are
related. The following formula is used to calculate the Pearson's r correlation:

r = [N Σxy − (Σx)(Σy)] / √( [N Σx² − (Σx)²] [N Σy² − (Σy)²] )

where
r = Pearson's r correlation coefficient,
N = number of values in each dataset,
Σxy = sum of the products of paired scores,
Σx = sum of x scores,
Σy = sum of y scores,
Σx² = sum of squared x scores, and
Σy² = sum of squared y scores.
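The formula can be evaluated directly. A short NumPy sketch on two illustrative lists of paired scores:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.8, 6.1, 7.9, 10.2])
N = len(x)

r = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (N * np.sum(x**2) - np.sum(x)**2) * (N * np.sum(y**2) - np.sum(y)**2))

print(r)
print(np.corrcoef(x, y)[0, 1])   # same value from NumPy's built-in correlation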
Predictive Analytics:
Predictive analytics provides companies with actionable insights based on data. Such
information includes estimates about the likelihood of a future outcome. It is important to
remember that no statistical algorithm can “predict” the future with 100% certainty because the
foundation of predictive analytics is based on probabilities. Companies use these
statistics to forecast what might happen. Some of the software most commonly used by data
science professionals for predictive analytics are SAS predictive analytics, IBM predictive
analytics, RapidMiner, and others.
As Figure 3.11 suggests, predictive analytics is done in stages.
1. First, once the data collection is complete, it needs to go through the process of cleaning.
2. Cleaned data can help us obtain hindsight in relationships between different variables.
Plotting the data (e.g., on a scatterplot) is a good place to look for hindsight.
3. Next, we need to confirm the existence of such relationships in the data. This is where
regression comes into play. From the regression equation, we can confirm the pattern of
distribution inside the data. In other words, we obtain insight from hindsight.
4. Finally, based on the identified patterns, or insight, we can predict the future, i.e.,
foresight.
Prescriptive Analytics:
• Prescriptive analytics10 is the area of business analytics dedicated to finding the best
course of action for a given situation. This may start by first analyzing the situation
(using descriptive analysis), but then moves toward finding connections among various
parameters/ variables, and their relation to each other to address a specific problem,
more likely that of prediction.
• Prescriptive analytics can also suggest options for taking advantage of a future
opportunity or mitigate a future risk and illustrate the implications of each.
• In practice, prescriptive analytics can continually and automatically process new data
to improve the accuracy of
predictions and provide advantageous decision options.
• Specific techniques used in prescriptive analytics include optimization, simulation,
game theory and decision-analysis methods.
Exploratory Analysis :
• Exploratory analysis is an approach to analyzing datasets to find previously unknown
relationships. Often such analysis involves using various data visualization approaches.
• Exploratory data analysis is an approach that postpones the usual assumptions about
what kind of model the data follows with the more direct approach of allowing the data
itself to reveal its underlying structure in the form of a model.
• Thus, exploratory analysis is not a mere collection of techniques; rather, it offers a
philosophy as to how to dissect a dataset; what to look for; how to look; and how to
interpret the outcomes.
Mechanistic Analysis :
• Mechanistic analysis involves understanding the exact changes in variables that lead to
changes in other variables for individual objects. For instance, we may want to know
how the number of free doughnuts per employee per day affects employee productivity.
• Perhaps by giving them one extra doughnut we gain a 5% productivity boost, but two
extra doughnuts could end up making them lazy (and diabetic)
• Such relationships are often explored using regression
• In statistical modeling, regression analysis is a process for estimating the relationships
among variables
• Beyond estimating a relationship, regression analysis is a way of predicting an outcome
variable from one predictor variable (simple linear regression) or several predictor
variables (multiple linear regression).
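A hedged scikit-learn sketch of a simple linear regression for the doughnut example; all numbers are invented purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

doughnuts    = np.array([[0], [1], [2], [3], [4]])    # predictor: doughnuts per employee per day
productivity = np.array([100, 105, 104, 98, 90])      # outcome: made-up productivity index

model = LinearRegression().fit(doughnuts, productivity)
print(model.coef_, model.intercept_)                  # estimated relationship
print(model.predict([[2]]))                           # predicted productivity at 2 doughnuts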
Unit-II
Extracting Meaning from Data
William Cukierski
Will went to Cornell for a BA in physics and to Rutgers to get his PhD in
biomedical engineering. He focused on cancer research, studying pathology images. While
working on writing his dissertation, he got more and more involved in Kaggle competitions (more
about Kaggle in a bit), finishing very near the top in multiple competitions, and now works for
Kaggle.
After giving us some background in data science competitions and crowdsourcing, Will explains
how his company works for the participants in the platform as well as for the larger community.
Feature selection is the process of constructing a subset of the data or functions of the data to be
the predictors or variables for your models and algorithms.
Background: Data Science Competitions
There is a history in the machine learning community of data science competitions—where
individuals or teams compete over a period of several weeks or months to design a prediction
algorithm. What it predicts depends on the particular dataset, but some examples include whether
or not a given person will get in a car crash, or like a particular film. A training set is provided, an
evaluation metric determined up front, and some set of rules is provided about, for example, how
often competitors can submit their predictions, whether or not teams can
merge into larger teams, and so on.
Examples of machine learning competitions include the annual Knowledge Discovery and Data
Mining (KDD) competition, the onetime million-dollar Netflix prize (a competition that lasted two
years), and, as we’ll learn a little later, Kaggle itself. Some remarks about data science competitions
are warranted. First, data science competitions are part of the data science ecosystem—one of the
cultural forces at play in the current data science landscape, and so aspiring data scientists ought
to be aware of them.
Second, creating these competitions puts one in a position to codify data science, or define its
scope. By thinking about the challenges that they’ve issued, it provides a set of examples for us to
explore the central question of this book: what is data science? This is not to say that we will
unquestionably accept such a definition, but we can at least use it as a starting point: what attributes
of the existing competitions capture data science, and what aspects of data science are missing?
Finally, competitors in the various competitions get ranked, and so one metric of a “top” data
scientist could be their standings in these competitions. But notice that many top data scientists,
especially women, and including the authors of this book, don’t compete. In fact, there are few
women at the top, and we think this phenomenon needs to be explicitly thought through when we
expect top ranking to act as a proxy for data science talent.
Background: Crowdsourcing
There are two kinds of crowdsourcing models. First, we have the distributive
crowdsourcing model, like Wikipedia, which is for relatively simplistic but large-scale
contributions. On Wikipedia, the online encyclopedia, anyone in the world can contribute to the
content, and there is a system of regulation and quality control set up by volunteers.
The net effect is a fairly high-quality compendium of all of human
knowledge (more or less).
Then, there’s the singular, focused, difficult problems that Kaggle, DARPA, InnoCentive, and
other companies specialize in. These companies issue a challenge to the public, but generally only
a set of people with highly specialized skills compete. There is usually a cash prize, and glory or
the respect of your community, associated with winning.
Feature Selection
The idea of feature selection is identifying the subset of data or transformed
data that you want to put into your model. Prior to working at Kaggle, Will placed highly in
competitions (which is how he got the job), so he knows firsthand what it takes to build effective
predictive models. Feature selection is not only useful for
winning competitions—it’s an important part of building statistical models and algorithms in
general. Just because you have data doesn’t mean it all has to go into the model.
For example, it’s possible you have many redundancies or correlated variables in your raw data,
and so you don’t want to include all those variables in your model. Similarly you might want to
construct new variables by transforming the variables with a logarithm, say, or turning a
continuous variable into a binary variable, before feeding them into the model.
Why? We are getting bigger and bigger datasets, but that’s not always helpful. If the number of
features is larger than the number of observations, or if we have a sparsity problem, then large isn’t
necessarily good. And if the huge data just makes it hard to manipulate because of computational
reasons (e.g., it can’t all fit on one computer, so the data needs to be sharded across multiple
machines) without improving our signal, then that’s a net negative.
To improve the performance of your predictive models, you want to
improve your feature selection process.
User Retention
Let’s give an example for you to keep in mind before we dig into some possible methods. Suppose
you have an app that you designed, let’s call it Chasing Dragons (shown in Figure 7-2), and users
pay a monthly subscription fee to use it. The more users you have, the more money you make.
Suppose you realize that only 10% of new users ever come back after the first month. So you have
two options to increase your revenue: find a way to increase the retention rate of existing users, or
acquire new users. Generally it costs less to keep an existing customer around than to market and
advertise to new users. But setting aside that particular cost-benefit analysis of acquisition or
retention, let’s choose to focus on your user retention situation by building a model that predicts
whether or not a new user will come back next month based on their behavior this month. You
could build such a model in order to understand your retention situation, but let’s focus instead on
building an algorithm that is highly accurate at predicting. You might want to use this model to
give a free month to users who you predict need the extra incentive to stick around.
A good, crude, simple model you could start out with would be logistic regression, which you first
saw back in Chapter 4. This would give you the probability the user returns their second month
conditional on their activities in the first month. (There is a rich set of statistical literature called
Survival Analysis that could also work well, but that’s not necessary in this case—the modeling
part isn’t what we want to focus on here; it’s the data.) You record each user’s behavior for the
first 30 days after sign-up. You could log every action the user took with timestamps: user clicked
the button that said “level 6” at 5:22 a.m., user slew a dragon at 5:23 a.m., user got 22 points at
5:24 a.m., user was shown an ad for deodorant at 5:25 a.m. This would be the data collection phase.
Any action the user could take gets recorded.
Notice that some users might have thousands of such actions, and other users might have only a
few. These would all be stored in timestamped event logs. You’d then need to process these logs
down to a dataset with rows and columns, where each row was a user and each column was a
feature. At this point, you shouldn’t be selective; you’re in the feature generation phase. So your
data science team (game designers, software engineers, statisticians, and marketing folks) might
sit down and brainstorm features. Here are some examples:
• Number of days the user visited in the first month
• Amount of time until second visit
• Number of points on day j for j = 1, ..., 30 (this would be 30 separate
features)
• Total number of points in first month (sum of the other features)
• Did user fill out Chasing Dragons profile (binary 1 or 0)
• Age and gender of user
• Screen size of device
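Based on the description above, a minimal sketch of such a logistic regression with scikit-learn might look like this; the feature names follow the brainstormed list, and all values are synthetic and illustrative:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "days_visited_first_month": [20, 2, 15, 1, 9, 25],
    "total_points_first_month": [900, 30, 400, 10, 220, 1200],
    "filled_out_profile":       [1, 0, 1, 0, 0, 1],
    "returned_next_month":      [1, 0, 1, 0, 0, 1],     # outcome
})

X = df.drop(columns="returned_next_month")
y = df["returned_next_month"]

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])    # probability each user returns in month two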
Filters
Filters order possible features with respect to a ranking based on a metric or statistic, such as
correlation with the outcome variable. This is sometimes good on a first pass over the space of
features, because they then take account of the predictive power of individual features. However,
the problem with filters is that you get correlated features. In other words, the filter doesn’t care
about redundancy. And by treating the features as independent, you’re not taking into account
possible interactions.
This isn’t always bad and it isn’t always good, as Isabelle Guyon explains.
On the one hand, two redundant features can be more powerful when they are both used; and on
the other hand, something that appears useless alone could actually help when combined with
another possibly useless-looking feature that an interaction would capture.
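A minimal sketch of a filter, assuming a small synthetic feature table: each feature is ranked by the absolute value of its correlation with the outcome, which, exactly as described above, ignores redundancy between features:

import pandas as pd

df = pd.DataFrame({
    "days_visited": [20, 2, 15, 1, 9, 25],
    "points":       [900, 30, 400, 10, 220, 1200],
    "screen_size":  [5.5, 6.1, 5.0, 6.4, 5.8, 5.5],
    "returned":     [1, 0, 1, 0, 0, 1],
})

ranking = df.drop(columns="returned").corrwith(df["returned"]).abs().sort_values(ascending=False)
print(ranking)   # features ordered by individual predictive power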
Wrappers
Wrapper feature selection tries to find subsets of features, of some fixed size, that will do the trick.
However, as anyone who has studied combinations and permutations knows, the number of
possible size-k subsets of n things, written “n choose k” (the binomial coefficient), grows
exponentially. So there’s a nasty opportunity for overfitting by doing this.
There are two aspects to wrappers that you need to consider:
1) selecting an algorithm to use to select features and
2) deciding on a selection criterion or filter to decide that your set of features is “good.”
Selecting an algorithm
Let’s first talk about a set of algorithms that fall under the category of stepwise regression, a
method for feature selection that involves selecting features according to some selection criterion
by either adding or subtracting features to a regression model in a systematic way. There
are three primary methods of stepwise regression: forward selection, backward elimination, and a
combined approach (forward and backward).
Forward selection
In forward selection you start with a regression model with no features, and gradually add one
feature at a time according to which feature improves the model the most based on a selection
criterion. This looks like this: build all possible regression models with a single predictor. Pick the
best. Now try all possible models that include that best predictor and a second predictor. Pick the
best of those. You keep adding one feature at a time, and you stop when your selection criterion
no longer improves, but instead gets worse.
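scikit-learn's SequentialFeatureSelector implements a cross-validated version of this procedure, which is a close analogue of the stepwise selection described here. A hedged sketch on random, illustrative data:

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] * 2.0 - X[:, 3] + rng.normal(scale=0.1, size=200)   # depends on features 0 and 3

selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=2,
                                     direction="forward")        # "backward" also supported
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features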
Backward elimination
In backward elimination you start with a regression model that includes all the features, and you
gradually remove one feature at a time according to the feature whose removal makes the biggest
improvement in the selection criterion. You stop removing features when removing the feature
makes the selection criterion get worse.
Combined approach
Most subset methods are capturing some flavor of minimum-redundancy-maximum-relevance. So,
for example, you could have a greedy algorithm that starts with the best feature, takes a few more
highly ranked, removes the worst, and so on. This is a hybrid approach with a filter method.
Decision trees have an intuitive appeal because, outside the context of data science, in our everyday
lives we can think of breaking big decisions down into a series of questions. See the decision
tree in Figure 7-3 about a college student facing the very important decision
of how to spend their time.
This decision is actually dependent on a bunch of factors: whether or not there are any parties or
deadlines, how lazy the student is feeling, and what they care about most (parties). The
interpretability of decision trees is one of the best features about them.
In the context of a data problem, a decision tree is a classification algorithm. For the Chasing
Dragons example, you want to classify users as “Yes, going to come back next month” or “No,
not going to come back next month.” This isn’t really a decision in the colloquial sense, so don’t
let that throw you. You know that the class of any given user is dependent on many factors (number
of dragons the user slew, their age, how many hours they already played the game). And you want
to break it down based on the data you’ve collected. But how do you construct decision trees from
data and what mathematical properties can you expect them to have?
Ultimately you want a tree that is something like Figure 7-4.
Entropy
To quantify what is the most “informative” feature, we define entropy, effectively a measure of
how mixed up something is, for X as follows:

H(X) = −p(X=1) log₂ p(X=1) − p(X=0) log₂ p(X=0)

In particular, if either option has probability zero, the entropy is 0. Moreover, because
p(X=1) = 1 − p(X=0), the entropy is symmetric about 0.5 and maximized at 0.5, which we can
easily confirm using a bit of calculus. Figure 7-5 shows a picture of that.
Mathematically, we kind of get this. But what does it mean in words, and why are we calling it
entropy? Earlier, we said that entropy is a measurement of how mixed up something is.
So, for example, if X denotes the event of a baby being born a boy, we’d expect it to be true or
false with probability close to 1/2, which corresponds to high entropy, i.e., the bag of babies from
which we are selecting a baby is highly mixed.
But if X denotes the event of a rainfall in a desert, then it’s low entropy. In other words, the bag of
day-long weather events is not highly mixed in deserts.
Using this concept of entropy, we will be thinking of X as the target of our model. So, X could be
the event that someone buys something on our site. We’d like to know which attribute of the user
will tell us the most information about this event X. We will define the information
gain, denoted IG(X, a), for a given attribute a, as the entropy we lose if we know the value of that
attribute:

IG(X, a) = H(X) − H(X | a)
To compute this we need to define H(X | a). We can do this in two steps.
For any actual value a₀ of the attribute a we can compute the specific conditional entropy
H(X | a = a₀) as you might expect:

H(X | a = a₀) = −p(X=1 | a=a₀) log₂ p(X=1 | a=a₀) − p(X=0 | a=a₀) log₂ p(X=0 | a=a₀)

and then we can put it all together, for all possible values of a, to get
the conditional entropy H(X | a):

H(X | a) = Σᵢ p(a = aᵢ) · H(X | a = aᵢ)
In words, the conditional entropy asks: how mixed is our bag really if we know the value of
attribute a? And then information gain can be described as: how much information do we learn
about X (or how much entropy do we lose) once we know a? Going back to how we use the concept
of entropy to build decision trees: it helps us decide what feature to split our tree on, or in other
words, what’s the most informative question to ask?
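A small sketch that evaluates these formulas on a toy binary dataset; the values are made up purely for illustration:

import numpy as np
import pandas as pd

def entropy(p1):
    """Entropy of a binary variable with p(X=1) = p1."""
    if p1 in (0.0, 1.0):
        return 0.0
    p0 = 1.0 - p1
    return -p1 * np.log2(p1) - p0 * np.log2(p0)

df = pd.DataFrame({"a": ["yes", "yes", "no", "no", "no", "yes"],
                   "X": [1, 1, 0, 0, 1, 1]})

h_x = entropy(df["X"].mean())
h_x_given_a = sum(len(g) / len(df) * entropy(g["X"].mean())
                  for _, g in df.groupby("a"))
print("H(X) =", h_x, " H(X|a) =", h_x_given_a, " IG(X,a) =", h_x - h_x_given_a)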
The Decision Tree Algorithm:
You build your decision tree iteratively, starting at the root. You need an algorithm to decide which
attribute to split on; e.g., which node should be the next one to identify. You choose that attribute
in order to maximize information gain, because you’re getting the most bang for your buck that
way. You keep going until all the points at the end are in the same class or you end up with no
features left. In this case, you take the majority vote.
Often people “prune the tree” afterwards to avoid overfitting. This just means cutting it off below
a certain depth. After all, by design, the algorithm gets weaker and weaker as you build the tree,
and it’s well known that if you build the entire tree, it’s often less accurate (with new data) than if
you prune it.
This is an example of an embedded feature selection algorithm. (Why embedded?) You don’t need
to use a filter here because the information gain method is doing your feature selection for you.
Suppose you have your Chasing Dragons dataset. Your outcome variable
is Return: a binary variable that captures whether or not the user returns next month, and you have
tons of predictors. You can use the R library rpart and its rpart function to fit such a tree; an
analogous Python sketch with scikit-learn is shown below.
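A hedged sketch; the DataFrame name and columns are illustrative assumptions standing in for the Chasing Dragons data, and criterion="entropy" mirrors the information-gain splitting described above:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

chasing_dragons = pd.DataFrame({
    "days_visited": [20, 2, 15, 1, 9, 25],
    "points":       [900, 30, 400, 10, 220, 1200],
    "Return":       [1, 0, 1, 0, 0, 1],
})

X = chasing_dragons.drop(columns="Return")
y = chasing_dragons["Return"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(tree.predict(X))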
But how do you do this? How do you actually find U and V? In reality, as you will see next, you’re
not first minimizing the squared error and then minimizing the size of the entries of the matrices
U and V. You’re actually doing both at the same time. So your goal is to find U and V by solving
the optimization problem described earlier. This optimization doesn’t have a nice closed formula
like ordinary least squares with one set of coefficients. Instead, you need an iterative algorithm
like gradient descent. As long as your problem is convex you’ll converge OK (i.e., you won’t find
yourself at a local, but not global, minimum), and you can force your problem to be convex using
regularization.
Here’s the algorithm:
• Pick a random V.
• Optimize U while V is fixed.
• Optimize V while U is fixed.
• Keep doing the preceding two steps until you’re not changing very
much at all. To be precise, you choose an ϵ and if your coefficients
are each changing by less than ϵ, then you declare your algorithm
“converged.”
Theorem with no proof: The preceding algorithm will converge if your prior
is large enough. If you enlarge your prior, you make the optimization easier because you’re
artificially creating a more convex function—on the other hand,
if your prior is huge, then all your coefficients will be zero anyway, so that doesn’t really get you
anywhere. So actually you might not want to enlarge your prior. Optimizing your prior is
philosophically screwed because how is it a prior if you’re back-fitting it to do what you want it
to do? Plus you’re mixing metaphors here to some extent by searching for a close approximation
of X at the same time you are minimizing coefficients. The more you care about coefficients, the
less
you care about X. But in actuality, you only want to care about X.
Fix V and Update U
The way you do this optimization is user by user. So for user i, you want to find:

u_i = argmin over u_i of Σ_j (p_{i,j} − u_i · v_j)², summing over the items j that user i has rated,

where v_j is fixed. In other words, you just care about this user for now.
But wait a minute, this is the same as linear least squares, and has a closed form solution! In other
words, set:

u_i = (V_{*,i} V_{*,i}ᵀ)⁻¹ V_{*,i} P_{*,i},

where V_{*,i} is the subset of V for which you have preferences coming from user i, and P_{*,i} is the
corresponding vector of user i's preferences. Taking the
inverse is easy because it’s d×d, which is small. And there aren’t that many preferences per user,
so solving this many times is really not that hard. Overall you’ve got a doable update for U.
When you fix U and optimize V, it’s analogous—you only ever have to consider the users that
rated that movie, which may be pretty large for popular movies but on average isn’t; but even so,
you’re only ever inverting a d×d matrix.
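A NumPy sketch of these alternating updates, simplified to a fully observed ratings matrix (the version described above only sums over each user's observed preferences); all dimensions and values are illustrative, and lam plays the role of the regularization "prior":

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d, lam = 6, 5, 2, 0.1
X = rng.integers(1, 6, size=(n_users, n_items)).astype(float)   # fake ratings

U = rng.normal(size=(n_users, d))
V = rng.normal(size=(n_items, d))                                # pick a random V

for _ in range(50):
    # optimize U while V is fixed (regularized least squares)
    U = np.linalg.solve(V.T @ V + lam * np.eye(d), V.T @ X.T).T
    # optimize V while U is fixed
    V = np.linalg.solve(U.T @ U + lam * np.eye(d), U.T @ X).T

print(np.abs(X - U @ V.T).mean())   # average reconstruction error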
UNIT–III: DATA VISUALIZATION (8 periods)
A Brief matplotlib API primer, Plotting with Pandas and Seaborn – Line plots,
Bar plots, Histograms and density plots, Scatter plots, Facet grids and Categorical
data; Other Python visualization tools.
Unit – III
DATA VISUALIZATION
A brief matplotlib API primer:
matplotlib.pyplot is a collection of functions that make Matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a
figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc.
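For example, a minimal pyplot session following this pattern (the data is illustrative):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
plt.figure()                               # create a figure
plt.plot(x, [1, 4, 9, 16], label="squares")
plt.plot(x, [1, 2, 3, 4], label="identity")
plt.xlabel("x")                            # decorate the plot with labels
plt.ylabel("y")
plt.title("A simple pyplot figure")
plt.legend()
plt.show()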
Types of Plots:
Pandas offers tools for cleaning and processing your data. It is the most popular
Python library used for data analysis. In pandas, a data table is called
a DataFrame.
Example 1:
import pandas as pd

# Create a DataFrame from a dictionary of columns (the values are illustrative)
df = pd.DataFrame({"Name": ["Tom", "Nick", "Juli"],
                   "Age": [20, 21, 19]})
print(df)
output:
Example 2: load CSV data from the system and set a Seaborn style before plotting:
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")   # file name is illustrative
sns.set(style="white")
Line plot:
Lineplot is the most popular plot for drawing a relationship between x and y, with
the possibility of several semantic groupings.
Syntax : sns.lineplot(x=None, y=None)
Example 1: a basic line plot. Example 2: use the hue parameter to group the lines by a categorical
variable. Both are sketched below.
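A hedged sketch of both examples, using seaborn's built-in fmri dataset (the same dataset appears in the scatter plot examples later in this unit):

import seaborn as sns
import matplotlib.pyplot as plt

fmri = sns.load_dataset("fmri")

# Example 1: basic line plot
sns.lineplot(x="timepoint", y="signal", data=fmri)
plt.show()

# Example 2: group the lines with hue
sns.lineplot(x="timepoint", y="signal", hue="region", data=fmri)
plt.show()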
Bar plot:
A bar plot shows a numeric estimate (by default, the mean) for each level of a categorical variable.
Example:
data = pandas.read_csv("nba.csv")
Output:
Histograms and density plots:
# import libraries
import seaborn as sns
import matplotlib.pyplot as plt
# (distplot is deprecated in newer seaborn; histplot/kdeplot are the current equivalents)

df = sns.load_dataset('diamonds')
sns.distplot(a=df.carat)                               # histogram with a density curve
plt.show()

sns.distplot(a=df.carat, hist=False, color='purple')   # density curve only
plt.show()

df2 = sns.load_dataset('iris')
sns.distplot(a=df2.petal_length, color='green', hist=False)
plt.show()
Scatter Plot
Scatterplot can be used with several semantic groupings which can help to
understand well in a graph. They can plot two-dimensional graphics that can
be enhanced by mapping up to three additional variables while using the
semantics of hue, size, and style parameters. All the parameter control visual
semantic which are used to identify the different subsets. Using redundant
semantics can be helpful for making graphics more accessible.
import seaborn
fmri = seaborn.load_dataset("fmri")
seaborn.scatterplot(x="timepoint", y="signal", data=fmri)
Output:
Grouping data points on the basis of category, here as region and event:
seaborn.scatterplot(x="timepoint", y="signal", hue="region",
                    style="event", data=fmri)
Output:
2. Adding the hue attributes.
It will produce data points with different colors. Hue can be used to group the data
by a categorical variable and show how the plotted values depend on it.
Syntax: seaborn.scatterplot( x, y, data, hue)
Code:
tip = seaborn.load_dataset("tips")
seaborn.scatterplot(x='day', y='tip', data=tip, hue='time')
output:
In the above example, we can see how the tip on each day is related to
whether it was lunchtime or dinner time. The blue color represents
Dinner and the orange color represents Lunch.
Let's check for hue = "day":
Code:
seaborn.scatterplot(x='day', y='tip', data=tip, hue='day')
output:
4. Adding the palette attribute.
Using the palette we can generate points with different colors from a chosen colormap; the
palette controls the colors used for the hue levels of the scatter plot.
Syntax:
seaborn.scatterplot(x, y, data, palette="palette_name")
seaborn.FacetGrid() :
FacetGrid class helps in visualizing distribution of one variable as well
as the relationship between multiple variables separately within subsets of
your dataset using multiple panels.
A FacetGrid can be drawn with up to three dimensions: row, col,
and hue. The first two have obvious correspondence with the resulting array
of axes; think of the hue variable as a third dimension along a depth axis,
where different levels are plotted with different colors.
FacetGrid object takes a dataframe as input and the names of the
variables that will form the row, column, or hue dimensions of the grid.
The variables should be categorical and the data at each level of the variable
will be used for a facet along that axis.
Example:
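A minimal FacetGrid sketch using seaborn's built-in tips dataset; the choices of col="time" and hue="smoker" are illustrative assumptions:

import seaborn
import matplotlib.pyplot as plt

tips = seaborn.load_dataset("tips")

# one panel (facet) per value of "time", colored by "smoker"
g = seaborn.FacetGrid(tips, col="time", hue="smoker")
g.map(seaborn.scatterplot, "total_bill", "tip")
g.add_legend()
plt.show()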
Other Python visualization tools:
3. Seaborn
Seaborn is a Python data visualization library that is based on Matplotlib and
closely integrated with the NumPy and pandas data structures. Seaborn has
various dataset-oriented plotting functions that operate on data frames and
arrays that have whole datasets within them. Then it internally performs the
necessary statistical aggregation and mapping functions to create informative
plots that the user desires. It is a high-level interface for creating beautiful
and informative statistical graphics that are integral to exploring and
understanding data. The Seaborn data graphics can include bar charts, pie
charts, histograms, scatterplots, error charts, etc. Seaborn also has various
tools for choosing color palettes that can reveal patterns in the data.
4. GGplot
GGplot for Python (the plotnine/ggplot package) is based on the Grammar of Graphics, the same
idea behind R's ggplot2: a plot is built up in layers by mapping variables in a data frame to visual
aesthetics such as position, color, and size.
• The mean of a sample is the summary statistic computed with the previous formula, x̄ = (1/n) Σ xᵢ.
• An average is one of several summary statistics we might choose to describe a central tendency.
Sometimes the mean is a good description of a set of values. For example, apples are all pretty much
the same size (at least the ones sold in supermarkets). So if I buy 6 apples and the total weight is 3 pounds,
it would be a reasonable summary to say they are about a half pound each.
But pumpkins are more diverse. Suppose I grow several varieties in my garden, and one day I harvest
three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one
Atlantic Giant pumpkin that weighs 591 pounds. The mean of this sample is 100 pounds, but if I told
you “The average pumpkin in my garden is 100 pounds,” that would be misleading. In this example,
there is no meaningful average because there is no typical pumpkin.
Variance
If there is no single number that summarizes pumpkin weights, we can do a little better with
numbers: mean and variance.
Variance is a summary statistic intended to describe the variability or spread of a distribution. The
variance of a set of values is

S² = (1/n) Σᵢ (xᵢ − x̄)²

The term (xᵢ − x̄) is called the “deviation from the mean,” so variance is the mean squared deviation. The
square root of variance, S, is the standard deviation.
Pandas data structures provide methods to compute mean, variance and standard deviation:
mean = live.prglngth.mean()
var = live.prglngth.var()
std = live.prglngth.std()
For all live births, the mean pregnancy length is 38.6 weeks, the standard deviation is 2.7 weeks, which
means we should expect deviations of 2-3 weeks to be common.
Variance of pregnancy length is 7.3, which is hard to interpret, especially since the units are weeks², or
“square weeks.” Variance is useful in some calculations, but it is not a good summary statistic.
4.2.1 Reporting Results:
There are several ways to describe the difference in pregnancy length (if there is one) between first
babies and others. How should we report these results?
The answer depends on the question. A scientist might be interested in any (real) effect, no matter how
small. A doctor might only care about effects that are clinically significant; that is, differences that
affect treatment decisions. A pregnant woman might be interested in results that are relevant to her,
like the probability of delivering early or late.
4.3 Probability mass function
Another way to represent a distribution is a probability mass function (PMF), which maps from each
value to its probability.
A probability is a frequency expressed as a fraction of the sample size, n. To get from frequencies to
probabilities, we divide through by n, which is called normalization.
Given a Hist, we can make a dictionary that maps from each value to its probability:
n = hist.Total()
d = {}
for x, freq in hist.Items():
    d[x] = freq / n
Or use the Pmf class provided by thinkstats2. Like Hist, the Pmf constructor can take a list, pandas Series,
dictionary, Hist, or another Pmf object.
Here’s an example with a simple list:
>>> import thinkstats2
>>> pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])
>>> pmf
Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})
The Pmf is normalized so total probability is 1.
Pmf and Hist objects are similar in many ways; in fact, they inherit many of their methods from a
common parent class. For example, the methods Values and Items work the same way for both. The
biggest difference is that a Hist maps from values to integer counters; a Pmf maps from values to
floating-point probabilities.
To look up the probability associated with a value, use Prob:
>>> pmf.Prob(2)
0.4
The bracket operator is equivalent:
>>> pmf[2]
0.4
We can modify an existing Pmf by incrementing the probability associated with a value:
>>> pmf.Incr(2, 0.2)
>>> pmf.Prob(2)
0.6
Or we can multiply a probability by a factor:
>>> pmf.Mult(2, 0.5)
>>> pmf.Prob(2)
0.3
If we modify a Pmf, the result may not be normalized; that is, the probabilities may no longer add up
to 1. To check, you can call Total, which returns the sum of the probabilities:
>>> pmf.Total()
0.9
To renormalize, call Normalize:
>>> pmf.Normalize()
>>> pmf.Total()
1.0
Pmf objects provide a Copy method so we can make and modify a copy without affecting the original.
My notation in this section might seem inconsistent, but there is a system: we use Pmf for the name
of the class, pmf for an instance of the class, and PMF for the mathematical concept of a probability
mass function.
Patterns and relationships. Once we have an idea what is going on, a good next step is to design a
visualization that makes the patterns we have identified as clear as possible.

weeks = range(35, 46)
diffs = []
for week in weeks:
    p1 = first_pmf.Prob(week)
    p2 = other_pmf.Prob(week)
    diff = 100 * (p1 - p2)
    diffs.append(diff)
thinkplot.Bar(weeks, diffs)

In this code, weeks is the range of weeks; diffs is the difference between the two PMFs in percentage points.
Figure 4.2 shows the result as a bar chart. This figure makes the pattern clearer: first babies are less likely
to be born in week 39, and somewhat more likely to be born in weeks 41 and 42. We used the same dataset
to identify an apparent difference and then chose a visualization that makes the difference apparent.
If you calculate the mean class size of the above dataset, it is going to be (10 + 20 + 30)/3 = 20.
Actual mean class size = 20
• But, if you survey each student about their class size, a student in class 1 is going to say “I have 9 other
classmates in the class,” a student in class 2 is going to say that he/she has 19 other classmates, and so
on.
Let's take their responses into consideration and calculate the mean. This mean is also called the
observed mean.
Observed mean = (10*10 + 20*20 + 30*30)/(10 + 20 + 30)
Observed mean = (100 + 400 + 900)/60
Observed mean = 23.33
• I feel the observed mean calculation is like any weighted average calculation that we do in most of the
statistical exercises.
You can also calculate the observed mean using a PMF for the above dataset:
PMF = [(10: 10/60), (20: 20/60), (30: 30/60)]
PMF = [(10: 0.167), (20: 0.333), (30: 0.500)]
So the mean of the class size is: 10*0.167 + 20*0.333 + 30*0.500 = 23.33
Observed mean = 23.33
As you can see, the observed mean is higher than the actual mean. This is what class size paradox
teaches us.
By default, the rows and columns are numbered starting at zero, but we can provide column
names:
>>> import numpy as np
>>> import pandas
>>> array = np.random.randn(4, 2)   # e.g., a 4x2 array of random values
>>> columns = ['A', 'B']
>>> df = pandas.DataFrame(array, columns=columns)
>>> df
We can also provide row names. The set of row names is called the index; the row names
themselves are called labels.
>>> index = ['a', 'b', 'c', 'd']
>>> df = pandas.DataFrame(array, columns=columns, index=index)
>>> df
4.8 CDFs:
● The CDF is the function that maps from value to its percentile rank.
● The CDF is a function of x, where x is any value that might appear in the distribution.
● To evaluate CDF(x) for a particular value of x, we compute the fraction of values in the distribution
less than or equal to x.
Let us consider the following example: a function that takes a sequence, t, and a value, x:
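A minimal sketch of such a function; the name EvalCdf follows the thinkstats2 naming style and is an assumption here:

def EvalCdf(t, x):
    # fraction of values in t that are less than or equal to x
    count = 0
    for value in t:
        if value <= x:
            count += 1
    return count / len(t)

# Example: for t = [1, 2, 2, 3, 5], EvalCdf(t, 2) is 0.6
print(EvalCdf([1, 2, 2, 3, 5], 2))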
Step 1: Arrange Data in Ascending Order Sort the dataset in ascending order. This ensures that the data
points are organized for percentile calculation.
Step 2: Identify the Desired Percentile Determine which percentile you want to calculate. Common
percentiles include the median (50th percentile), quartiles (25th and 75th percentiles), and various
other percentiles like the 90th or 95th percentile.
Step 3: Calculate the Position Use the formula position = (percentile / 100) * (n + 1) to find the position
of the desired percentile within the dataset, where n is the total number of data points.
Step 4: Find the Data Value If the position is a whole number, the data value at that position is the
desired percentile. If the position is not a whole number, calculate the weighted average of the data
values at the floor and ceiling positions.
Code:
import numpy as np

data = [18, 22, 25, 27, 38, 33, 37, 41, 45, 50]
sorted_data = np.sort(data)
n = len(sorted_data)

percentile_25 = 25
percentile_75 = 75

# position = (percentile / 100) * (n + 1), using 1-based positions
position_25 = (percentile_25 / 100) * (n + 1)
position_75 = (percentile_75 / 100) * (n + 1)

if position_25.is_integer():
    percentile_value_25 = sorted_data[int(position_25) - 1]
else:
    floor_position_25 = int(np.floor(position_25))
    ceil_position_25 = int(np.ceil(position_25))
    percentile_value_25 = (sorted_data[floor_position_25 - 1]
        + (position_25 - floor_position_25)
        * (sorted_data[ceil_position_25 - 1] - sorted_data[floor_position_25 - 1]))

if position_75.is_integer():
    percentile_value_75 = sorted_data[int(position_75) - 1]
else:
    floor_position_75 = int(np.floor(position_75))
    ceil_position_75 = int(np.ceil(position_75))
    percentile_value_75 = (sorted_data[floor_position_75 - 1]
        + (position_75 - floor_position_75)
        * (sorted_data[ceil_position_75 - 1] - sorted_data[floor_position_75 - 1]))

print("25th percentile:", percentile_value_25)
print("75th percentile:", percentile_value_75)
output:
25th percentile: 24.25
75th percentile: 42.0
In this example, we calculate the 25th and 75th percentiles of the dataset using Python's NumPy library
for sorting and mathematical functions.
Percentiles are crucial statistical measures in data science for understanding data distributions and
making informed decisions. By calculating percentiles, you gain insights into the relative position of
data points within a dataset, helping you analyze the spread and variability of the data.
Q: What is a percentile in statistics?
A: A percentile is the value below which a given percentage of the observations in a dataset fall.
You can calculate percentiles in Python using the numpy library. Here's an example:
import numpy as np

data = [12, 15, 17, 20, 23, 25, 27, 30, 32, 35]
percentile_value = 25   # For the 25th percentile
result = np.percentile(data, percentile_value)
print(f"The {percentile_value}th percentile is: {result}")
output:
The 25th percentile is: 17.75
Q: How do you interpret the 25th and 75th percentiles, also known as the first and third quartiles?
A: The 25th percentile (Q1) is the value below which 25% of the data falls. The 75th percentile (Q3) is the
value below which 75% of the data falls. The interquartile range (IQR) is the difference between the
third and first quartiles (IQR = Q3 - Q1), which provides a measure of the spread of the middle 50%
of the data.
Q: How do you handle outliers when calculating percentiles?
Outliers can significantly affect percentile calculations. One approach to handle outliers is to use the
"trimmed" dataset (removing the extreme values) to calculate percentiles. Another approach isto use
robust estimators like the median, which is less sensitive to outliers compared to the mean.
Time series analysis – Importing and cleaning, Plotting, Moving averages, Missing values, Serial
correlation, Autocorrelation; Predictive modeling – Overview, Evaluating predictive models,
Building predictive model solutions, Sentiment analysis
import csv
with open("E:\customers.csv",'r') as custfile:
rows=csv.reader(custfile,delimiter=',')
for r in rows:
print(r)
   Person Name  Country   Product  Purchase Price
0          Jon    Japan  Computer            $800
1         Bill       US   Printer            $450
2        Maria   Brazil    Laptop            $150
3         Rita       UK  Computer          $1,200
4         Jack    Spain   Printer            $150
5          Ron    China  Computer          $1,200
5.2.2.1 Cleaning Empty Cells : Empty cells can potentially give the wrong result when we analyze
data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Consider the following data set
Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
After executing the above code rows 18, 22 and 28 have been removed
By default, the dropna() method returns a new Data Frame, and will not change the original.
Example
Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
After running the above code the following output gives as sample result.
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we
could just insert "45" in row 7:
Example
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
Example
Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())
The dataset contains the following fields:
Amount: Quantity purchased in grams
Quality: High, medium, or low quality, as reported by the purchaser
Date: Date of report, presumed to be shortly after date of purchase
Ppg: Price per gram in dollars
State.name: String state name
Lat: Approximate latitude of the transaction, based on city name
Lon: Approximate longitude of the transaction
Each transaction is an event in time, so we could treat this dataset as a time series. But
the events are not equally spaced in time; the number of transactions reported each day
varies from 0 to several hundred.
In order to demonstrate these methods, let's divide the dataset into groups by reported
quality, and then transform each group into an equally spaced series by computing the
mean daily price per gram.
def GroupByQualityAndDay(transactions):
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)
    return dailies
groupby is a DataFrame method that returns a GroupBy object, groups; used in a for
loop, it iterates the names of the groups and the DataFrames that represent them. Since
the values of quality are low, medium, and high, we get three groups with those names.
The loop iterates through the groups and calls GroupByDay, which computes the daily
average price and returns a new DataFrame:
def GroupByDay(transactions, func=np.mean):
The parameter, transactions, is a DataFrame that contains columns date and ppg. We select these
two columns, then group by date. The result, grouped, is a map from each date to a DataFrame that
contains prices reported on that date. aggregate is a GroupBy method that iterates through the
groups and applies a function to each column of the group; in this case there is only one column,
ppg. So the result of aggregate is a DataFrame with one row for each date and one column, ppg.
Dates in these DataFrames are stored as NumPy datetime64 objects, which are represented as 64-
bit integers in nanoseconds. For some of the analyses coming up, it will be convenient to work
with time in more human-friendly units, like years.
So GroupByDay adds a column named date by copying the index, then adds years, which contains
the number of years since the first transaction as a floating-point number.
The resulting DataFrame has columns ppg, date, and years.
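Based on this description, GroupByDay might be sketched as follows; treat this as an approximation rather than the book's exact code, and note that it assumes the date column holds datetime values:

import numpy as np

def GroupByDay(transactions, func=np.mean):
    # select date and ppg, group by date, and aggregate (daily mean by default)
    grouped = transactions[["date", "ppg"]].groupby("date")
    daily = grouped.aggregate(func)

    # add date back as a column, plus years since the first transaction
    daily["date"] = daily.index
    start = daily["date"].iloc[0]
    one_year = np.timedelta64(1, "Y")
    daily["years"] = (daily["date"] - start) / one_year
    return daily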
5.3 Plotting the Data
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library, to visualize the diagram on the screen.
Example
Import pyplot from Matplotlib and visualize our DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
The above code results the following graph
thinkplot.PrePlot(rows=3)
for i, (name, daily) in enumerate(dailies.items()):
    thinkplot.SubPlot(i+1)
    title = 'price per gram ($)' if i == 0 else ''
    thinkplot.Config(ylim=[0, 20], title=title)
    thinkplot.Scatter(daily.index, daily.ppg, s=10, label=name)
    if i == 2:
        pyplot.xticks(rotation=30)
    else:
        thinkplot.Config(xticks=[])
PrePlot with rows=3 means that we are planning to make three subplots laid out in three rows. The
loop iterates through the DataFrames and creates a scatter plot for each, as shown in Fig 5.1.
It is common to plot time series with line segments between the points, but in this case there are
many data points and prices are highly variable, so adding lines would not help. Since the labels
on the x-axis are dates, we use pyplot.xticks to rotate the “ticks” 30 degrees, making them more
readable.
Figure 5-1 Time series of daily price per gram for high, medium, and low quality
cannabis
One apparent feature in these plots is a gap around November 2013. It’s possible that data
collection was not active during this time, or the data might not be available.
One of the simplest moving averages is the rolling mean, which computes the mean of the values
in each window. For example, if the window size is 3, the rolling mean computes the mean of
values 0 through 2, 1 through 3, 2 through 4, etc.
pandas provides rolling_mean, which takes a Series and a window size and returns a new Series
(in current pandas, the equivalent is pd.Series(series).rolling(3).mean()):
>>> series = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> pandas.rolling_mean(series, 3)
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])
The first two values are nan; the next value is the mean of the first three elements, 0, 1, and 2. The
next value is the mean of 1, 2, and 3. And so on.
Before we can apply rolling_mean to the cannabis data, we have to deal with missing values. There
are a few days in the observed interval with no reported transactions for one or more quality
categories, and a period in 2013 when data collection was not active. In the DataFrames we have
used so far, these dates are absent; the index skips days with no data. For the analysis that follows,
we need to represent this missing data explicitly. We can do that by “reindexing” the DataFrame:
dates = pandas.date_range(daily.index.min(), daily.index.max())
reindexed = daily.reindex(dates)
The first line computes a date range that includes every day from the beginning to the end of the
observed interval. The second line creates a new DataFrame with all of the data from daily, but
including rows for all dates, filled with nan.
The rolling mean is then computed with a window size of 30 (e.g., roll_mean = pandas.rolling_mean(reindexed.ppg, 30)), so each value in roll_mean is the mean of 30 values from reindexed.ppg.
Figure 5-2 Daily price and a rolling mean (left) and exponentially-weighted
moving average (right)
The datasets where information is collected along with timestamps in an orderly fashion are
denoted as time-series data. If you have missing values in time series data, you can obviously try
any of the above-discussed methods. But there are a few specific methods also which can be used
here.
forward_filled=df.fillna(method='ffill')
print(forward_filled)
backward_filled=df.fillna(method='bfill')
print(backward_filled)
I hope you are able to spot the difference in both cases with the above images.
Linear Interpolation
Time series data has a lot of variations. The above methods of imputing using backfill and forward
fill aren't the best possible solution. Linear interpolation to the rescue!
Here, the values are filled with incrementing or decrementing values. It is a kind of imputation
technique, which tries to plot a linear relationship between data points. It uses the non-null values
available to compute the missing points.
# Linear interpolation; limit_direction="both" also fills gaps at the start and end
interpolated = df.interpolate(limit_direction="both")
print(interpolated)
Compare these values with the backward- and forward-filled results and check for yourself which
works better. These are some basic ways of handling missing values in time-series data.
In some cases none of the above works well, yet you still need to do an analysis. Then you should
opt for algorithms that support missing values. KNN (K nearest neighbors) is one such algorithm:
it fills a missing value using the mean (or, for categorical variables, the majority vote) of the
values of the K nearest neighbors. Random forests are also fairly robust to categorical data with
missing values, and many decision-tree-based algorithms such as XGBoost and CatBoost support
data with missing values directly. A sketch of the KNN approach is shown below.
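As a minimal sketch of the KNN idea, here using scikit-learn's KNNImputer (an assumed choice for illustration; it is not referenced in the text):
import numpy as np
from sklearn.impute import KNNImputer

# Small feature matrix with one missing entry
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each missing entry is replaced by the mean of that column over the
# k nearest rows, with distances measured on the columns that are present.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)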
If we apply SerialCorr to the raw price data with lag 1, we find serial correlation 0.48 for the high
quality category, 0.16 for medium and 0.10 for low. In any time series with a long-term trend, we
expect to see strong serial correlations; for example, if prices are falling, we expect to see values
above the mean in the first half of the series and values below the mean in the second half.
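SerialCorr is a helper used throughout this section; a minimal sketch, assuming it computes the Pearson correlation between the series and a copy of itself shifted by the given lag:
import numpy as np

def SerialCorr(series, lag=1):
    # Correlation between the series and itself shifted by `lag` steps.
    # Assumes missing values have already been filled or dropped.
    xs = series.values[lag:]
    ys = series.values[:-lag]
    return np.corrcoef(xs, ys)[0, 1]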
It is more interesting to ask whether the correlation persists if you subtract away the trend. For
example, we can compute the residual of the EWMA and then compute its serial correlation:
# pandas.ewma was the older interface; current pandas uses the .ewm() method
ewma = reindexed.ppg.ewm(span=30).mean()
resid = reindexed.ppg - ewma
corr = SerialCorr(resid, 1)
With lag=1, the serial correlations for the de-trended data are -0.022 for high quality, -0.015 for
medium, and 0.036 for low. These values are small, indicating that there is little or no one-day
serial correlation in this series.
5.7 Autocorrelation
Autocorrelation is a mathematical representation of the degree of similarity between a given time
series and a lagged version of itself over successive time intervals, as shown in Fig 5.3. It is
conceptually similar to the correlation between two different time series, but autocorrelation uses
the same time series twice: once in its original form and once lagged by one or more time periods.
● Autocorrelation measures the relationship between a variable's current value and its past
values.
● An autocorrelation of +1 represents a perfect positive correlation, while an autocorrelation
of -1 represents a perfect negative correlation.
● Technical analysts can use autocorrelation to measure how much influence past prices for
a security have on its future price.
The autocorrelation function is a function that maps from lag to the serial correlation with the
given lag. “Autocorrelation” is another name for serial correlation, used more often when the lag
is not 1.
StatsModels, which we used for linear regression in “StatsModels”, also provides functions for
time series analysis, including acf, which computes the autocorrelation function:
import statsmodels.tsa.stattools as smtsa
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
acf computes serial correlations with lags from 0 through nlags. The unbiased flag tells acf to
correct the estimates for the sample size (in recent versions of StatsModels this flag is named
adjusted). The result is an array of correlations. If we select daily
prices for high quality, and extract correlations for lags 1, 7, 30, and 365, we can confirm that acf
and SerialCorr yield approximately the same results:
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
1.000, -0.029, 0.020, 0.014, 0.044
With lag=0, acf computes the correlation of the series with itself, which is always 1.
Figure 5.3 Autocorrelation function for daily prices (left), and daily prices with a
simulated weekly seasonality (right)
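The right panel of Figure 5.3 shows daily prices with a simulated weekly seasonality. The exact construction used for the figure is not given in the text; one plausible sketch is to add a small random premium on one day of the week and recompute the autocorrelation:
import numpy as np

# Simulate a Friday price premium, then recompute the autocorrelation function
fridays = (reindexed.index.dayofweek == 4)
fake = reindexed.ppg.copy()
fake[fridays] += np.random.uniform(0, 2, fridays.sum())
acf_fake = smtsa.acf(fake.ffill().dropna(), nlags=365)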
5.8 Predictive modeling
Predictive modeling is a statistical technique using machine learning and data mining to predict
and forecast likely future outcomes with the aid of historical and existing data. It works by
analyzing current and historical data and projecting what it learns on a model generated to forecast
likely outcomes. Predictive modeling can be used to predict just about anything, from TV ratings
and a customer’s next purchase to credit risks and corporate earnings.
A predictive model is not fixed; it is validated or revised regularly to incorporate changes in the
underlying data. In other words, it’s not a one-and-done prediction. Predictive models make
assumptions based on what has happened in the past and what is happening now. If incoming, new
data shows changes in what is happening now, the impact on the likely future outcome must be
recalculated, too.
Predictive analytics tools use a variety of vetted models and algorithms that can be applied to a
wide range of use cases.
Classification Problems
A classification problem is about predicting what category something falls into. An example of a
classification problem is analyzing medical data to determine if a patient is in a high risk group for
a certain disease or not. Metrics that can be used for evaluating a classification model include:
Percent correct classification (PCC): measures overall accuracy; every error has the same weight.
Confusion matrix: also measures accuracy, but distinguishes between types of errors, i.e. false
positives, false negatives, and correct predictions.
Both of these metrics are good to use when every data entry needs to be scored. For example, if
every customer who visits a website needs to be shown customized content based on their browsing
behavior, every visitor will need to be categorized.
The following additional measures are used in classification problems:
Area Under the ROC Curve (AUC-ROC): one of the most widely used evaluation metrics. It is
popular because it measures how well the model ranks positive predictions above negative ones,
and the ROC curve is insensitive to changes in the proportion of responders.
Lift and Gain charts: both charts measure the effectiveness of a model by calculating the ratio
between the results obtained with and without the predictive model. In other words, these metrics
examine whether using the predictive model has any positive effect. A short scikit-learn sketch of
some of these metrics follows.
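A minimal sketch of computing some of these classification metrics with scikit-learn (the labels and scores here are hypothetical, purely for illustration):
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual classes
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                     # predicted classes
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]    # predicted probabilities

print(accuracy_score(y_true, y_pred))     # PCC / overall accuracy
print(confusion_matrix(y_true, y_pred))   # counts of TN, FP, FN, TP
print(roc_auc_score(y_true, y_score))     # area under the ROC curve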
Regression Problems
A regression problem is about predicting a quantity. A simple example of a regression problem is
prediction of the selling price of a real estate property based on its attributes (location, square
meters available, condition, etc.).
To evaluate how good your regression model is, you can use the following metrics:
R-squared: indicates how much of the variation in the target variable is explained by the model.
R-squared does not take into consideration any biases that might be present in the data. Therefore,
a good model might have a low R-squared value, and a model that does not fit the data might still
have a high R-squared value.
Average error: the mean of the (signed) differences between the predicted values and the actual values.
Mean Square Error (MSE): the mean of the squared differences; because errors are squared, large
errors and outliers are penalized heavily.
Median error: the median of the differences between the predicted and the actual values.
Average absolute error: the mean of the absolute differences between predicted and actual values;
using absolute values prevents positive and negative errors from cancelling out.
Median absolute error: the median of the absolute differences between prediction and actual
observation; because it uses the median, it is largely unaffected by big outliers. A short scikit-learn
sketch of these metrics follows.
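A minimal sketch of computing these regression metrics with scikit-learn (the values are hypothetical, purely for illustration):
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, median_absolute_error)

y_true = [200.0, 150.0, 320.0, 275.0]     # actual values
y_pred = [210.0, 140.0, 300.0, 280.0]     # model predictions

print(r2_score(y_true, y_pred))               # R-squared
print(mean_squared_error(y_true, y_pred))     # MSE
print(mean_absolute_error(y_true, y_pred))    # average absolute error
print(median_absolute_error(y_true, y_pred))  # median absolute error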
Building a predictive analytics model typically involves the following steps:
1. Scope and define the predictive analytics model you want to build. In this step we determine
what business processes will be analyzed and what the desired business outcomes are, such as the
adoption of a product by a certain segment of customers.
2. Explore and profile your data. Predictive analytics is data-intensive. In this step we need to
determine the needed data, where it’s stored, whether it’s readily accessible, and its current state.
3. Gather, cleanse and integrate the data. Once we know where the necessary data is located,
we may need to clean it. Build the model from a consistent and comprehensive set of
information that is ready to be analyzed.
4. Build the predictive model. Establish the hypothesis and then build the test model. The goal is
to include, and rule out, different variables and factors and then test the model using historical data
to see if the results produced by the model prove the hypothesis.
5. Incorporate analytics into business processes. To make the model valuable, we need to
integrate it into the business process so it can be used to help achieve the outcome.
6. Monitor the model and measure the business results. We live and market in a dynamic
environment, where buying, competition and other factors change. You will need to monitor the
model and measure how effective it is at continuing to produce the desired outcome. It may be
necessary to make adjustments and fine tune the model as conditions evolve.
Businesses can use insights from sentiment analysis to improve their products, fine-tune marketing
messages, correct misconceptions, and identify positive influencers. Social media has
revolutionized the way people make decisions about products and services. In markets like travel,
hospitality, and consumer electronics, customer reviews are now considered to be at least as
important as evaluations by professional reviewers.