Data Science

Unit-I introduces data science as an interdisciplinary field focused on extracting knowledge from large datasets, emphasizing the importance of data in decision-making for organizations. It covers essential skills for data scientists, tools used in the field, types of data, data collection methods, and the significance of data preprocessing. The document also differentiates between data analysis and data analytics, highlighting their roles in understanding and interpreting data.


Unit-I

UNIT–I: INTRODUCTION (09 Periods)


Definition of data science, Skills for data science, Tools for data science, Data types, Data
collections, Data preprocessing, Data analysis and data analytics, Descriptive analysis,
Diagnostic analytics, Predictive analytics, Prescriptive analytics, Exploratory analysis,
Mechanistic analysis.
Data Science:
Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets
that are typically very large. The field encompasses preparing data for analysis, analyzing it,
and presenting findings to inform high-level decisions in an organization. As such, it
incorporates skills from computer science, mathematics, statistics, information visualization,
graphic design, and business.
Why Data Science?
Data is everywhere and is one of the most important assets of any organization; it helps a
business flourish by enabling decisions based on facts, statistics, and trends. This growing
reliance on data gave rise to data science, a multidisciplinary IT field, and data scientist roles
are among the most in-demand jobs of the 21st century. Data science, and in essence data
analysis, helps us get answers to questions from data: it plays an important role by helping us to
discover useful information from the data, answer questions, and even predict the future or
the unknown. It uses scientific approaches, procedures, algorithms, and frameworks to
extract knowledge and insight from a huge amount of data.
Introduction to Data Science
Data science brings together ideas, data analysis, machine learning, and related methods to
understand and analyze real-world phenomena through data. It is an extension of data analysis
fields such as data mining, statistics, and predictive analytics. It is a broad field that draws on
methods and concepts from other fields like information science, statistics, mathematics, and
computer science. Some of the techniques utilized in data science include machine learning,
visualization, pattern recognition, probability modeling, data engineering, signal processing, etc.

Skills for Data Science:


Willing to Experiment. A data scientist needs to have the drive, intuition, and curiosity not
only to solve problems as they are presented, but also to identify and articulate problems on
her own. Intellectual curiosity and the ability to experiment require an amalgamation of
analytical and creative thinking.
Proficiency in Mathematical Reasoning. Mathematical and statistical knowledge is the
second critical skill for a potential applicant seeking a job in data science. We are not suggesting
that you need a Ph.D. in mathematics or statistics, but you do need to have a strong grasp of
the basic statistical methods and how to employ them. Employers are seeking applicants who
can demonstrate their ability in reasoning, logic, interpreting data, and developing strategies to
perform analysis.
Data Literacy. Data literacy is the ability to extract meaningful information from a dataset,
and any modern business has a collection of data that needs to be interpreted. A skilled data
scientist plays an intrinsic role for businesses through an ability to assess a dataset for relevance
and suitability for the purpose of interpretation, to perform analysis, and create meaningful
visualizations to tell valuable data stories.

Tools for Data Science:


Some of the most used tools in data science are Python, R, and SQL.
Python is a scripting language. This means that programs written in Python do not need to be
compiled as a whole, as you would do with a program in C or Java; instead, a Python program
runs line by line. The language (its syntax and structure) also provides a very easy learning
curve for the beginner, while giving very powerful tools to advanced programmers. Let us see
this with an example. If you want to write the classic "Hello, World" program in Java, here is
how it goes:
Step 1: Write the code and save as HelloWorld.java.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
Step 2: Compile the code.
% javac HelloWorld.java
Step 3: Run the program.
% java HelloWorld
In contrast, here is how you do the same in Python:
Step 1: Write the code and save as hello.py
print("Hello, World")
Step 2: Run the program.
% python hello.py
Data Types:
Data can be categorized into two groups:
• Structured data
• Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.

Structured Data
Structured data is organized and easier to work with.

How to Structure Data?


We can use an array or a database table to structure or present data.
Example of an array: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Data Collections
There are many places online to look for sets or collections of data. Here are some of those
sources:
Open Data
The idea behind open data is that some data should be freely available in a public domain that
can be used by anyone as they wish, without restrictions from copyright, patents, or other
mechanisms of control.
Social Media Data
Social media has become a gold mine for collecting data to analyze for research or marketing
purposes. This is facilitated by the Application Programming Interface (API) that social
media companies provide to researchers and developers. Think of the API as a set of rules and
methods for requesting and sending data. For various data-related needs (e.g., retrieving a
user's profile picture), one could send API requests to a particular social media service. This is
typically a programmatic call that results in the service sending a response in a structured data
format, such as XML or JSON.
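As an illustration only (not tied to any real platform), here is a minimal sketch of such an API call in Python using the requests library; the endpoint URL, parameter, and response fields are hypothetical, and a real social media API would also require authentication.
import requests

# Hypothetical endpoint and parameter; a real API would need an access token in the headers
url = "https://api.example.com/v1/users/profile_picture"
params = {"user_id": "12345"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()    # stop if the service returned an error status code

data = response.json()         # the structured (JSON) response parsed into a Python dict
print(data)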
Multimodal Data
We are living in a world where more and more devices exist – from lightbulbs to cars – and
are getting connected to the Internet, creating an emerging trend of the Internet of Things (IoT).
These devices are generating and using much data, but not all of which are “traditional”
types (numbers, text). When dealing with such contexts, we may need to collect and explore
multimodal (different forms) and multimedia (different media) data such as images, music
and other sounds, gestures, body posture, and the use of space.
Data Storage and Presentation
Depending on its nature, data is stored in various formats. CSV (Comma-Separated Values) is
the most common import and export format for spreadsheets and databases. There is no "CSV
standard," so the format is operationally defined by the many applications that read and write
it. For example, Depression.csv is a dataset available for download from UF Health (University
of Florida Health) Biostatistics. The dataset represents the effectiveness of different treatment
procedures on separate individuals with clinical depression.
Ex: treat, before, after, diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
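A minimal sketch of loading this CSV into Python with pandas, assuming the file above is saved locally as Depression.csv:
import pandas as pd

# Load the depression dataset; pandas reads the header row (treat, before, after, diff) automatically
df = pd.read_csv("Depression.csv")

print(df.head())                              # first few rows
print(df.groupby("treat")["diff"].mean())     # average before/after difference per treatment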
TSV (Tab-Separated Values) files are used for raw data and can be imported into and
exported from spreadsheet software. Tab-separated values files are essentially text files, and
the raw data can be viewed by text editors, though such files are often used when moving raw
data between spreadsheets.
Ex:
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St
XML (eXtensible Markup Language)
XML (eXtensible Markup Language) was designed to be both human- and machine-readable,
and can thus be used to store and transport data.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="information science" cover="hardcover">
<title lang="en">Social Information Seeking</title>
<author>Chirag Shah</author>
<year>2017</year>
<price>62.58</price>
</book>
<book category="data science" cover="paperback">
<title lang="en">Hands-On Introduction to Data Science</title>
<author>Chirag Shah</author>
<year>2019</year>
<price>50.00</price>
</book>
</bookstore>
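A minimal sketch of reading this XML in Python with the standard library's xml.etree.ElementTree, assuming the snippet above is saved as bookstore.xml:
import xml.etree.ElementTree as ET

tree = ET.parse("bookstore.xml")   # assumes the XML above is saved as bookstore.xml
root = tree.getroot()

for book in root.findall("book"):
    title = book.find("title").text
    year = book.find("year").text
    price = float(book.find("price").text)
    print(f"{title} ({year}): ${price:.2f}")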

JSON (JavaScript Object Notation)


When exchanging data between a browser and a server, the data can be sent only as text. JSON
is text, and we can convert any JavaScript object into JSON, and send JSON to the server. We
can also convert any JSON received from the server into JavaScript objects. This way we can
work with the data as JavaScript objects, with no complicated parsing and translations.
Sending data: If the data is stored in a JavaScript object, we can convert the object
into JSON, and send it to a server. Below is an example:
<!DOCTYPE html>
<html>
<body>
<p id="demo"></p>
<script>
var obj = {"name": "John", "age": 25, "state": "New Jersey"};
var obj_JSON = JSON.stringify(obj);
window.location = "json_Demo.php?x=" + obj_JSON;
</script>
</body>
</html>
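The same round trip can be done in Python with the standard json module; this is a small illustrative sketch, not part of the original browser example:
import json

# Serialize a Python dict to a JSON string (analogous to JSON.stringify)
obj = {"name": "John", "age": 25, "state": "New Jersey"}
obj_json = json.dumps(obj)
print(obj_json)

# Parse a JSON string back into a Python dict (analogous to JSON.parse)
parsed = json.loads(obj_json)
print(parsed["name"], parsed["age"])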

Data Pre-processing
Data in the real world is often dirty; that is, it is in need of being cleaned up before it can be
used for a desired purpose. This is often called data pre-processing. What makes data “dirty”?
Here are some of the factors that indicate that data is not clean or ready to process:
Incomplete. When some of the attribute values are lacking, certain attributes of interest
are lacking, or attributes contain only aggregate data.
Noisy. When data contains errors or outliers. For example, some of the data points in a
dataset may contain extreme values that can severely affect the dataset’s range.
Inconsistent. Data contains discrepancies in codes or names. For example, if the
“Name” column for registration records of employees contains values other than
alphabetical letters, or if records do not start with a capital letter, discrepancies are
present.
The figure below shows the most important tasks involved in data pre-processing.

Fig: Forms of data pre-processing (N.H. Son, Data Cleaning and Data Pre-processing).
Data Cleaning:
Since there are several reasons why data could be “dirty,” there are just as many ways to
“clean” it. For this discussion, we will look at three key methods that describe ways in which
data may be "cleaned," or better organized, or scrubbed of potentially incorrect, incomplete, or
duplicated information.
i) Data Munging:
Often, the data is not in a format that is easy to work with. For example, it may be stored or
presented in a way that is hard to process. Thus, we need to convert it to something more
suitable for a computer to understand.
Consider the following text recipe.
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
This can be turned into a table, for example:

Ingredient    Quantity    Preparation
tomato        2           diced
garlic        3 cloves    -
salt          1 pinch     -

This table conveys the same information as the text, but it is more "analysis friendly." Of
course, the real question is – How did that sentence get turned into the table? A not-so-
encouraging answer is “using whatever means necessary”! I know that is not what you want to
hear because it does not sound systematic. Unfortunately,
often there is no better or systematic method for wrangling. Not surprisingly, there are people
who are hired to do specifically just this – wrangle ill-formatted data into something more
manageable.
Handling Missing Data
Sometimes data may be in the right format, but some of the values are missing.
Other times data may be missing due to problems with the process of collecting data, or an
equipment malfunction. Or, comprehensiveness may not have been considered important at the
time of collection. For instance, when we started collecting that customer data, it was limited
to a certain city or region, and so the area code for a phone number was not necessary to collect.
Well, we may be in trouble once we decide to expand beyond that city or region, because now
we will have numbers from all kinds of area codes.
So, what to do when we encounter missing data? There is no single good answer. We need to
find a suitable strategy based on the situation. Strategies to combat missing data include
ignoring that record, using a global constant to fill in all missing values, imputation,
inference-based solutions (a Bayesian formula or a decision tree), etc. We will revisit some of
these inference techniques later in the chapters on machine learning and data mining.
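A minimal sketch of two of these strategies in pandas; the small customer table and its column names are made up for illustration:
import pandas as pd
import numpy as np

customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Cho", "Dev"],
    "age": [34, np.nan, 29, np.nan],
    "area_code": ["352", None, "352", "908"],
})

# Strategy 1: ignore (drop) any record with a missing value
dropped = customers.dropna()

# Strategy 2: simple imputation - fill missing ages with the mean age,
# and missing area codes with a global constant
imputed = customers.assign(
    age=customers["age"].fillna(customers["age"].mean()),
    area_code=customers["area_code"].fillna("unknown"),
)
print(dropped)
print(imputed)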
Data Integration
To be as efficient and effective for various data analyses as possible, data from various sources
commonly needs to be integrated. The following steps describe how to integrate multiple
databases or files; a small code sketch follows the list.
1. Combine data from multiple sources into a coherent storage place (e.g., a single file or
a database).
2. Engage in schema integration, or the combining of metadata from different sources.
3. Detect and resolve data value conflicts.
For example:
a. A conflict may arise, such as the presence of different attributes and values
from various sources for the same real-world entity.
b. Reasons for this conflict could be different representations or different scales; for
example, metric vs. British units.
4. Address redundant data in data integration. Redundant data is commonly generated in
the process of integrating multiple databases. For example:
a. The same attribute may have different names in different databases.
b. One attribute may be a “derived” attribute in another table; for example, annual
revenue.
c. Correlation analysis may detect instances of redundant data.
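A minimal sketch of steps 1 and 4a in pandas; the two source tables, their column names, and the values are hypothetical:
import pandas as pd

# Two sources describing the same customers, with the key attribute under different names
crm = pd.DataFrame({"cust_id": [1, 2], "annual_revenue": [120000, 80000]})
billing = pd.DataFrame({"customer_id": [1, 2], "city": ["Tirupati", "Chennai"]})

# Schema integration: rename so the key attribute matches, then combine into one table
billing = billing.rename(columns={"customer_id": "cust_id"})
combined = crm.merge(billing, on="cust_id", how="outer")
print(combined)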
Data Transformation
Data must be transformed so it is consistent and readable (by a system). The following five
processes may be used for data transformation. For the time being, do not worry if these seem
too abstract. We will revisit some of them in the next section as we work through an example
of data pre-processing.
1. Smoothing: Remove noise from data.
2. Aggregation: Summarization, data cube construction.
3. Generalization: Concept hierarchy climbing.
4. Normalization: Scale values to fall within a small, specified range (a short sketch follows
this list). Some of the techniques that are used for accomplishing normalization (but we will
not be covering them in detail here) are:
a. Min–max normalization.
b. Z-score normalization.
c. Normalization by decimal scaling.
5. Attribute or feature construction.
a. New attributes constructed from the given ones.
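A minimal sketch of min-max and z-score normalization on a small made-up column, using plain pandas (scikit-learn provides equivalent scalers):
import pandas as pd

scores = pd.Series([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

# Min-max normalization: rescale values into the range [0, 1]
min_max = (scores - scores.min()) / (scores.max() - scores.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (scores - scores.mean()) / scores.std()

print(min_max.round(2).tolist())
print(z_score.round(2).tolist())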
Data Reduction
Data reduction is a key process in which a reduced representation of a dataset that produces the
same or similar analytical results is obtained. One example of a large dataset that could warrant
reduction is a data cube. Data cubes are multidimensional sets of data that can be stored in a
spreadsheet. But do not let the name fool you. A data cube could be
in two, three, or a higher dimension. Each dimension typically represents an attribute of
interest. Now, consider that you are trying to make a decision using this multidimensional data.
Sure, each of its attributes (dimensions) provides some information, but perhaps not
all of them are equally useful for a given situation. In fact, often we could reduce information
from all those dimensions to something much smaller and manageable
without losing much.
This leads us to two of the most common techniques used for data reduction.
1. Data Cube Aggregation. The lowest level of a data cube is the aggregated data for an
individual entity of interest. To do this, use the smallest representation that is sufficient
to address the given task. In other words, we reduce the data to its more meaningful
size and structure for the task at hand.
2. Dimensionality Reduction. In contrast with the data cube aggregation method, where
the data reduction was done with the task in mind, the dimensionality reduction
method works with respect to the nature of the data. Here, a dimension or a column in
your data spreadsheet is referred to as a “feature,” and the goal of the process is to
identify which features to remove or collapse to a combined feature. This requires
identifying redundancy.
Data Analysis and Data Analytics:
These two terms – data analysis and data analytics – are often used interchangeably and could
be confusing. Data analysis refers to hands-on data exploration and evaluation. Data analytics
is a broader term and includes data analysis as a necessary subcomponent. Analytics defines
the science behind the analysis. The science means understanding the cognitive processes an
analyst uses to understand problems and explore data in meaningful ways.
One way to understand the difference between analysis and analytics is to think in terms of
past and future. Analysis looks backwards, providing marketers with a historical view of what
has happened. Analytics, on the other hand, models the future or predicts a result.
Analytics makes extensive use of mathematics and statistics and the use of descriptive
techniques and predictive models to gain valuable knowledge from data. These insights from
data are used to recommend action or to guide decision-making in a business context.
Thus, analytics is not so much concerned with individual analysis or analysis steps, but with
the entire methodology.
Descriptive Analysis:
Descriptive analysis is about: “What is happening now based on incoming data.” It is a method
for quantitatively describing the main features of a collection of data. Here are a few key points
about descriptive analysis:
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data, such as census data.
• Description and interpretation processes are different steps
Descriptive analysis can be useful in the sales cycle, for example, to categorize customers by
their likely product preferences and purchasing patterns. Another example is the Census Data
Set, where descriptive analysis is applied to a whole population.
Frequency Distribution:
Of course, data needs to be displayed. Once some data has been collected, it is useful to plot a
graph showing how many times each score occurs. This is known as a frequency distribution.
Frequency distributions come in different shapes and sizes. The following are some of the ways
in which statisticians can present numerical findings.
Histogram. Histograms plot values of observations on the horizontal axis, with a bar showing
how many times each value occurred in the dataset.
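A minimal sketch of a histogram (a plotted frequency distribution) in Python with matplotlib, on made-up scores:
import matplotlib.pyplot as plt

scores = [80, 85, 85, 90, 90, 90, 95, 95, 100, 105, 110, 115, 120, 125]

# Each bar shows how many observations fall into that bin of values
plt.hist(scores, bins=5, edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Frequency distribution of scores")
plt.show()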

Normal Distribution. In an ideal world, data would be distributed symmetrically around the
center of all scores. Thus, if we drew a vertical line through the center of a distribution, both
sides should look the same. This so-called normal distribution is characterized by a bell-shaped
curve, an example of which is shown in Figure 3.4.
There are two ways in which a distribution can deviate from normal:
• Lack of symmetry (called skew)
• Pointiness (called kurtosis)
As shown in Figure 3.5, a skewed distribution can be either positively skewed (Figure 3.5a)
or negatively skewed (Figure 3.5b).
Kurtosis, on the other hand, refers to the degree to which scores cluster in the tails of a
distribution and how "pointy" the distribution is: a flat distribution is called platykurtic, while
a pointed one is called leptokurtic, as shown in the figure below.
Diagnostic Analytics

Diagnostic analytics are used for discovery, or to determine why something happened.
Sometimes this type of analytics, when done hands-on with a small dataset, is also known as
causal analysis, since it involves at least one cause (usually more than one) and one effect. It
allows a look at past performance to determine what happened and why. The result of the
analysis is often referred to as an analytic dashboard.
For example, for a social media marketing campaign, you can use descriptive analytics to
assess the number of posts, mentions, followers, fans, page views, reviews, or pins, etc.
There can be thousands of online mentions that can be distilled into a single view to see what
worked and what did not work in your past campaigns.
There are various types of techniques available for diagnostic or causal analytics. Among them,
one of the most frequently used is correlation.

Correlations
Correlation is a statistical analysis that is used to measure and describe the strength and
direction of the relationship between two variables. Strength indicates how closely two
variables are related to each other, and direction indicates how one variable would change its
value as the value of the other variable changes.
Correlation is a simple statistical measure that examines how two variables change together
over time. Take, for example, “umbrella” and “rain.” If someone who grew up in a place where
it never rained saw rain for the first time, this person would observe that, whenever it rains,
people use umbrellas. They may also notice that, on dry days, folks do not carry umbrellas. By
definition, “rain” and “umbrella” are said
to be correlated! More specifically, this relationship is strong and positive.
An important statistic, the Pearson’s r correlation, is widely used to measure the degree of the
relationship between linear related variables. When examining the stock market, for example,
the Pearson’s r correlation can measure the degree to which two commodities are
related. The following formula is used to calculate the Pearson's r correlation:

r = (N Σxy − (Σx)(Σy)) / sqrt( [N Σx² − (Σx)²] × [N Σy² − (Σy)²] )

where
r = Pearson’s r correlation coefficient,
N = number of values in each dataset,
Σxy = sum of the products of paired scores,
Σx = sum of x scores,
Σy = sum of y scores,
Σx² = sum of squared x scores, and
Σy² = sum of squared y scores.
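A minimal sketch of computing Pearson's r in Python; the two small series are made up, and scipy.stats.pearsonr is used to check the result of the formula above:
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)   # e.g., rainfall amounts
y = np.array([2, 4, 5, 4, 6], dtype=float)   # e.g., umbrellas carried

# Direct application of the formula
n = len(x)
r_formula = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
)

# Library call for comparison
r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_formula, 3), round(float(r_scipy), 3))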

Predictive Analytics:

Predictive analytics provides companies with actionable insights based on data. Such
information includes estimates about the likelihood of a future outcome. It is important to
remember that no statistical algorithm can “predict” the future with 100% certainty because the
foundation of predictive analytics is based on probabilities. Companies use these
statistics to forecast what might happen. Some of the software most commonly used by data
science professionals for predictive analytics are SAS predictive analytics, IBM predictive
analytics, RapidMiner, and others.
As Figure 3.11 suggests, predictive analytics is done in stages.
1. First, once the data collection is complete, it needs to go through the process of cleaning.
2. Cleaned data can help us obtain hindsight in relationships between different variables.
Plotting the data (e.g., on a scatterplot) is a good place to look for hindsight.
3. Next, we need to confirm the existence of such relationships in the data. This is where
regression comes into play (see the sketch after this list). From the regression equation, we can
confirm the pattern of distribution inside the data. In other words, we obtain insight from hindsight.
4. Finally, based on the identified patterns, or insight, we can predict the future, i.e.,
foresight.
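A minimal sketch of these stages in Python on made-up data: plot for hindsight, fit a simple linear regression for insight, and extrapolate for foresight (numpy.polyfit stands in here for a full regression library):
import numpy as np
import matplotlib.pyplot as plt

# Hindsight: plot the cleaned data to look for a relationship (made-up ad spend vs. sales)
ad_spend = np.array([10, 20, 30, 40, 50], dtype=float)
sales = np.array([25, 41, 58, 74, 92], dtype=float)
plt.scatter(ad_spend, sales)
plt.xlabel("Ad spend")
plt.ylabel("Sales")

# Insight: fit a simple linear regression (degree-1 polynomial)
slope, intercept = np.polyfit(ad_spend, sales, deg=1)

# Foresight: predict the outcome for a future, unseen value
future_spend = 60
predicted_sales = slope * future_spend + intercept
print(round(slope, 2), round(intercept, 2), round(predicted_sales, 1))
plt.show()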

Prescriptive Analytics:
• Prescriptive analytics is the area of business analytics dedicated to finding the best
course of action for a given situation. This may start by first analyzing the situation
(using descriptive analysis), but then moves toward finding connections among various
parameters/ variables, and their relation to each other to address a specific problem,
more likely that of prediction.
• Prescriptive analytics can also suggest options for taking advantage of a future
opportunity or mitigate a future risk and illustrate the implications of each.
• In practice, prescriptive analytics can continually and automatically process new data
to improve the accuracy of
predictions and provide advantageous decision options.
• Specific techniques used in prescriptive analytics include optimization, simulation,
game theory and decision-analysis methods.
Exploratory Analysis :
• Exploratory analysis is an approach to analyzing datasets to find previously unknown
relationships. Often such analysis involves using various data visualization approaches.
• Exploratory data analysis is an approach that postpones the usual assumptions about
what kind of model the data follows with the more direct approach of allowing the data
itself to reveal its underlying structure in the form of a model.
• Thus, exploratory analysis is not a mere collection of techniques; rather, it offers a
philosophy as to how to dissect a dataset; what to look for; how to look; and how to
interpret the outcomes.
Mechanistic Analysis :
• Mechanistic analysis involves understanding the exact changes in variables that lead to
changes in other variables for individual objects. For instance, we may want to know
how the number of free doughnuts per employee per day affects employee productivity.
• Perhaps by giving them one extra doughnut we gain a 5% productivity boost, but two
extra doughnuts could end up making them lazy (and diabetic).
• Such relationships are often explored using regression.
• In statistical modeling, regression analysis is a process for estimating the relationships
among variables.
• Beyond estimating a relationship, regression analysis is a way of predicting an outcome
variable from one predictor variable (simple linear regression) or several predictor
variables (multiple linear regression).
Unit-II
Extracting Meaning from Data
William Cukierski. Will went to Cornell for a BA in physics and to Rutgers to get his PhD in
biomedical engineering. He focused on cancer research, studying pathology images. While
working on writing his dissertation, he got more and more involved in Kaggle competitions (more
about Kaggle in a bit), finishing very near the top in multiple competitions, and now works for
Kaggle.
After giving us some background in data science competitions and crowdsourcing, Will explains
how his company works for the participants in the platform as well as for the larger community.

Will then focuses on feature extraction and feature selection. Briefly,


feature extraction refers to taking the raw dump of data you have and curating it more carefully,
to avoid the “garbage in, garbage out” scenario you get if you just feed raw data into an algorithm
without enough forethought.

Feature selection is the process of constructing a subset of the data or functions of the data to be
the predictors or variables for your models and algorithms.
Background: Data Science Competitions
There is a history in the machine learning community of data science competitions—where
individuals or teams compete over a period of several weeks or months to design a prediction
algorithm. What it predicts depends on the particular dataset, but some examples include whether
or not a given person will get in a car crash, or like a particular film. A training set is provided, an
evaluation metric determined up front, and some set of rules is provided about, for example, how
often competitors can submit their predictions, whether or not teams can
merge into larger teams, and so on.

Examples of machine learning competitions include the annual Knowledge Discovery and Data
Mining (KDD) competition, the onetime million-dollar Netflix prize (a competition that lasted two
years), and, as we'll learn a little later, Kaggle itself. Some remarks about data science competitions
are warranted. First, data science competitions are part of the data science ecosystem—one of the
cultural forces at play in the current data science landscape, and so aspiring data scientists ought
to be aware of them.
Second, creating these competitions puts one in a position to codify data science, or define its
scope. By thinking about the challenges that they’ve issued, it provides a set of examples for us to
explore the central question of this book: what is data science? This is not to say that we will
unquestionably accept such a definition, but we can at least use it as a starting point: what attributes
of the existing competitions capture data science, and what aspects of data science are missing?
Finally, competitors in the various competitions get ranked, and so one metric of a "top" data
scientist could be their standings in these competitions. But notice that many top data scientists,
especially women, and including the authors of this book, don’t compete. In fact, there are few
women at the top, and we think this phenomenon needs to be explicitly thought through when we
expect top ranking to act as a proxy for data science talent.
Background: Crowdsourcing
There are two kinds of crowdsourcing models. First, we have the distributive
crowdsourcing model, like Wikipedia, which is for relatively simplistic but large-scale
contributions. On Wikipedia, the online encyclopedia, anyone in the world can contribute to the
content, and there is a system of regulation and quality control set up by volunteers.
The net effect is a fairly high-quality compendium of all of human
knowledge (more or less).
Then, there’s the singular, focused, difficult problems that Kaggle, DARPA, InnoCentive, and
other companies specialize in. These companies issue a challenge to the public, but generally only
a set of people with highly specialized skills compete. There is usually a cash prize, and glory or
the respect of your community, associated with winning.

Feature Selection
The idea of feature selection is identifying the subset of data or transformed
data that you want to put into your model. Prior to working at Kaggle, Will placed highly in
competitions (which is how he got the job), so he knows firsthand what it takes to build effective
predictive models. Feature selection is not only useful for
winning competitions—it’s an important part of building statistical models and algorithms in
general. Just because you have data doesn’t mean it all has to go into the model.
For example, it’s possible you have many redundancies or correlated variables in your raw data,
and so you don’t want to include all those variables in your model. Similarly you might want to
construct new variables by transforming the variables with a logarithm, say, or turning a
continuous variable into a binary variable, before feeding them into the model.
Why? We are getting bigger and bigger datasets, but that’s not always helpful. If the number of
features is larger than the number of observations, or if we have a sparsity problem, then large isn’t
necessarily good. And if the huge data just makes it hard to manipulate because of computational
reasons (e.g., it can’t all fit on one computer, so the data needs to be sharded across multiple
machines) without improving our signal, then that’s a net negative.
To improve the performance of your predictive models, you want to
improve your feature selection process.
User Retention
Let’s give an example for you to keep in mind before we dig into some possible methods. Suppose
you have an app that you designed, let’s call it Chasing Dragons (shown in Figure 7-2), and users
pay a monthly subscription fee to use it. The more users you have, the more money you make.
Suppose you realize that only 10% of new users ever come back after the first month. So you have
two options to increase your revenue: find a way to increase the retention rate of existing users, or
acquire new users. Generally it costs less to keep an existing customer around than to market and
advertise to new users. But setting aside that particular cost-benefit analysis of acquisition or
retention, let’s choose to focus on your user retention situation by building a model that predicts
whether or not a new user will come back next month based on their behavior this month. You
could build such a model in order to understand your retention situation, but let’s focus instead on
building an algorithm that is highly accurate at predicting. You might want to use this model to
give a free month to users who you predict need the extra incentive to stick around.

A good, crude, simple model you could start out with would be logistic regression, which you first
saw back in Chapter 4. This would give you the probability the user returns their second month
conditional on their activities in the first month. (There is a rich set of statistical literature called
Survival Analysis that could also work well, but that’s not necessary in this case—the modeling
part isn't what we want to focus on here; it's the data.) You record each user's behavior for the
first 30 days after sign-up. You could log every action the user took with timestamps: user clicked
the button that said “level 6” at 5:22 a.m., user slew a dragon at 5:23 a.m., user got 22 points at
5:24 a.m., user was shown an ad for deodorant at 5:25 a.m. This would be the data collection phase.
Any action the user could take gets recorded.

Notice that some users might have thousands of such actions, and other users might have only a
few. These would all be stored in timestamped event logs. You’d then need to process these logs
down to a dataset with rows and columns, where each row was a user and each column was a
feature. At this point, you shouldn’t be selective; you’re in the feature generation phase. So your
data science team (game designers, software engineers, statisticians, and marketing folks) might
sit down and brainstorm features. Here are some examples:
• Number of days the user visited in the first month
• Amount of time until second visit
• Number of points on day j for j=1, . . .,30 (this would be 30 separate
features)
• Total number of points in first month (sum of the other features)
• Did user fill out Chasing Dragons profile (binary 1 or 0)
• Age and gender of user
• Screen size of device
Filters
Filters order possible features with respect to a ranking based on a metric or statistic, such as
correlation with the outcome variable. This is sometimes good on a first pass over the space of
features, because they take account of the predictive power of individual features. However,
the problem with filters is that you can end up with correlated features. In other words, the filter doesn't care
about redundancy. And by treating the features as independent, you’re not taking into account
possible interactions.
This isn’t always bad and it isn’t always good, as Isabelle Guyon explains.
On the one hand, two redundant features can be more powerful when they are both used; and on
the other hand, something that appears useless alone could actually help when combined with
another possibly useless-looking feature that an interaction would capture.

Wrappers
Wrapper feature selection tries to find subsets of features, of some fixed size, that will do the trick.
However, as anyone who has studied combinations and permutations knows, the number of
possible size-k subsets of n things, the binomial coefficient "n choose k," grows exponentially. So there's a nasty opportunity
for overfitting by doing this.
There are two aspects to wrappers that you need to consider:
1) selecting an algorithm to use to select features and
2) deciding on a selection criterion or filter to decide that your set of features is “good.”
Selecting an algorithm
Let’s first talk about a set of algorithms that fall under the category of stepwise regression, a
method for feature selection that involves selecting features according to some selection criterion
by either adding or subtracting features to a regression model in a systematic way. There
are three primary methods of stepwise regression: forward selection, backward elimination, and a
combined approach (forward and backward).

Forward selection
In forward selection you start with a regression model with no features, and gradually add one
feature at a time according to which feature improves the model the most based on a selection
criterion. This looks like this: build all possible regression models with a single predictor. Pick the
best. Now try all possible models that include that best predictor and a second predictor. Pick the
best of those. You keep adding one feature at a time, and you stop when your selection criterion
no longer improves, but instead gets worse.
Backward elimination
In backward elimination you start with a regression model that includes all the features, and you
gradually remove one feature at a time according to the feature whose removal makes the biggest
improvement in the selection criterion. You stop removing features when removing the feature
makes the selection criterion get worse.
Combined approach
Most subset methods are capturing some flavor of minimum-redundancy-maximum-relevance. So,
for example, you could have a greedy algorithm that starts with the best feature, takes a few more
highly ranked, removes the worst, and so on. This is a hybrid approach with a filter method.

Embedded Methods: Decision Trees

Decision trees have an intuitive appeal because, outside the context of data science, in our
everyday lives we can think of breaking big decisions down into a series of questions. See the decision
tree in Figure 7-3 about a college student facing the very important decision
of how to spend their time.
This decision is actually dependent on a bunch of factors: whether or not there are any parties or
deadlines, how lazy the student is feeling, and what they care about most (parties). The
interpretability of decision trees is one of the best features about them.
In the context of a data problem, a decision tree is a classification algorithm. For the Chasing
Dragons example, you want to classify users as “Yes, going to come back next month” or “No,
not going to come back next month.” This isn’t really a decision in the colloquial sense, so don’t
let that throw you. You know that the class of any given user is dependent on many factors (number
of dragons the user slew, their age, how many hours they already played the game). And you want
to break it down based on the data you’ve collected. But how do you construct decision trees from
data and what mathematical properties can you expect them to have?
Ultimately you want a tree that is something like Figure 7-4.
Entropy
To quantify what is the most "informative" feature, we define entropy (effectively a measure of
how mixed up something is) for a binary variable X as follows:

H(X) = −p(X=1) log2 p(X=1) − p(X=0) log2 p(X=0)

In particular, if either option has probability zero, the entropy is 0. Moreover, because
p(X=1) = 1 − p(X=0), the entropy is symmetric about 0.5 and maximized at 0.5, which we can
easily confirm using a bit of calculus. Figure 7-5 shows a picture of that.

Mathematically, we kind of get this. But what does it mean in words, and why are we calling it
entropy? Earlier, we said that entropy is a measurement of how mixed up something is.
So, for example, if X denotes the event of a baby being born a boy, we’d expect it to be true or
false with probability close to 1/2, which corresponds to high entropy, i.e., the bag of babies from
which we are selecting a baby is highly mixed.
But if X denotes the event of a rainfall in a desert, then it’s low entropy. In other words, the bag of
day-long weather events is not highly mixed in deserts.
Using this concept of entropy, we will be thinking of X as the target of our model. So, X could be
the event that someone buys something on our site. We’d like to know which attribute of the user
will tell us the most information about this event X. We will define the information gain,
denoted IG(X, a), for a given attribute a, as the entropy we lose if we know the value of that
attribute:

IG(X, a) = H(X) − H(X | a)

To compute this we need to define H(X | a). We can do this in two steps. For any actual value
a0 of the attribute a we can compute the specific conditional entropy H(X | a=a0) as you might
expect:

H(X | a=a0) = −p(X=1 | a=a0) log2 p(X=1 | a=a0) − p(X=0 | a=a0) log2 p(X=0 | a=a0)

and then we can put it all together, over all possible values ai of a, to get the conditional
entropy H(X | a):

H(X | a) = Σ_ai p(a=ai) · H(X | a=ai)
In words, the conditional entropy asks: how mixed is our bag really if we know the value of
attribute a? And then information gain can be described as: how much information do we learn
about X (or how much entropy do we lose) once we know a? Going back to how we use the concept
of entropy to build decision trees: it helps us decide what feature to split our tree on, or in other
words, what’s the most informative question to ask?
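A small illustrative sketch of these formulas in Python; the tiny dataset is made up, with x as the binary target and a as one binary attribute:
import numpy as np

def entropy(p1):
    """Entropy of a binary variable with p(X=1) = p1."""
    if p1 in (0.0, 1.0):
        return 0.0
    p0 = 1.0 - p1
    return -p1 * np.log2(p1) - p0 * np.log2(p0)

# Made-up binary target x and binary attribute a for eight records
x = np.array([1, 1, 0, 0, 1, 0, 1, 1])
a = np.array([1, 1, 1, 0, 0, 0, 1, 0])

h_x = entropy(x.mean())

# Conditional entropy H(X | a): weighted average of entropy within each value of a
h_x_given_a = sum(np.mean(a == v) * entropy(x[a == v].mean()) for v in np.unique(a))

info_gain = h_x - h_x_given_a
print(round(h_x, 3), round(h_x_given_a, 3), round(info_gain, 3))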
The Decision Tree Algorithm:
You build your decision tree iteratively, starting at the root. You need an algorithm to decide which
attribute to split on; e.g., which node should be the next one to identify. You choose that attribute
in order to maximize information gain, because you’re getting the most bang for your buck that
way. You keep going until all the points at the end are in the same class or you end up with no
features left. In this case, you take the majority vote.
Often people “prune the tree” afterwards to avoid overfitting. This just means cutting it off below
a certain depth. After all, by design, the algorithm gets weaker and weaker as you build the tree,
and it’s well known that if you build the entire tree, it’s often less accurate (with new data) than if
you prune it.
This is an example of an embedded feature selection algorithm. (Why embedded?) You don’t need
to use a filter here because the information gain method is doing your feature selection for you.
Suppose you have your Chasing Dragons dataset. Your outcome variable
is Return: a binary variable that captures whether or not the user returns next month, and you have
tons of predictors. You can use the R library rpart and the function rpart, and the code would look
like
this:

# Classification Tree with rpart


library(rpart)
# grow tree
model1 <- rpart(Return ~ profile + num_dragons +
num_friends_invited + gender + age +
num_days, method="class", data=chasingdragons)
printcp(model1) # display the results
plotcp(model1) # visualize cross-validation results
summary(model1) # detailed summary of thresholds picked to transform to binary
# plot tree
plot(model1, uniform=TRUE,
main="Classification Tree for Chasing Dragons")
text(model1, use.n=TRUE, all=TRUE, cex=.8)

Handling Continuous Variables in Decision Trees:


Packages that already implement decision trees can handle continuous variables for you. So you
can provide continuous features, and it will determine an optimal threshold for turning the
continuous variable into a binary predictor. But if you are building a decision tree algorithm
yourself, then in the case of continuous variables, you need to determine the correct threshold of a
value so that it can be thought of as a binary variable. So you could partition a user’s number of
dragon slays into “less than 10” and “at least 10,” and you’d be getting back to the
binary variable case. In this case, it takes some extra work to decide on the information gain
because it depends on the threshold as well as the feature.
In fact, you could think of the decision of where the threshold should live as a separate submodel.
It's possible to optimize this choice by maximizing the entropy on individual attributes, but
that’s not clearly the best way to deal with continuous variables. Indeed, this kind of question can
be as complicated as feature selection itself—instead of a single threshold, you might want to
create bins of the value of your attribute, for example. What to do? It will always depend on the
situation.
Random Forests :
Let’s turn to another algorithm for feature selection. Random forests generalize decision trees with
bagging, otherwise known as bootstrap aggregating.
Before we get into the weeds of the random forest algorithm, let’s review bootstrapping. A
bootstrap sample is a sample with replacement, which means we might sample the same data point
more than once. We usually take the sample size to be 80% of the size of the entire (training)
dataset, but of course this parameter can be adjusted depending on circumstances. This is
technically a third hyperparameter of our random forest algorithm.
Now to the algorithm. To construct a random forest, you construct N
decision trees as follows (a short code sketch follows these steps):
1. For each tree, take a bootstrap sample of your data, and for each
node you randomly select F features, say 5 out of the 100 total
features.
2. Then you use your entropy-information-gain engine as described
in the previous section to decide which among those features you
will split your tree on at each stage.
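A hedged sketch of fitting a random forest in Python with scikit-learn; the Chasing Dragons feature matrix here is randomly generated, and n_estimators and max_features play the roles of N and F from the steps above:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up stand-in for the Chasing Dragons features: 500 users, 10 features
X = rng.normal(size=(500, 10))
# Made-up target: did the user return next month?
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# N trees, each grown on a bootstrap sample, considering F features at each split
forest = RandomForestClassifier(n_estimators=100, max_features=5, bootstrap=True, random_state=0)
forest.fit(X, y)

print(forest.feature_importances_.round(2))   # which features mattered most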
The Dimensionality Problem
So now let’s think about overdimensionality—i.e., the idea that you might have tens of thousands
of items. We typically use both Singular Value Decomposition (SVD) and Principal Component
Analysis (PCA) to tackle this, and we’ll show you how shortly. To understand how this works
before we dive into the math, let’s think about how we reduce dimensions and create “latent
features” internally every day. For example, people invent concepts like “coolness,” but we can’t
directly measure how cool someone is. Other people exhibit different patterns of behavior, which
we internally map or reduce to our one dimension of “coolness.” So coolness is an example of a
latent feature in that it’s unobserved and not measurable directly, and we could think of it as
reducing dimensions because perhaps it’s a combination of many “features” we’ve observed about
the person and implicitly weighted in our mind.
Two things are happening here: the dimensionality is reduced to a single feature, and that
feature is latent. But in this algorithm, we don't decide which latent factors to care about.
Instead we let the machines do the work of figuring out what the important latent features are.
“Important” in this context means they explain the variance in the answers to the various
questions—in other words, they model the answers efficiently.
Our goal is to build a model that has a representation in a low dimensional
subspace that gathers “taste information” to generate recommendations. So we’re saying here that
taste is latent but can be approximated by putting together all the observed information we do have
about the user. Also consider that most of the time, the rating questions are binary (yes/no). To
deal with this, Hunch created a separate variable for every question. They also found that
comparison questions may be better at revealing preferences.
Singular Value Decomposition (SVD) :
So let’s get into the math now starting with singular value decomposition.
Given an m×n matrix X of rank k, it is a theorem from linear algebra that we can always
decompose it into the product of three matrices as follows:

X = U S Vᵀ

where U is m×k, S is k×k, and Vᵀ is k×n (so V is n×k); the columns of U and V are pairwise
orthogonal, and S is diagonal. Note that the standard statement of SVD is slightly more involved
and has U and V both square unitary matrices, with the middle "diagonal" matrix rectangular.
We'll be using this
form, because we’re going to be taking approximations to X of increasingly smaller rank. You can
find the proof of the existence of this form as a step in the proof of existence of the general form
here.
Let’s apply the preceding matrix decomposition to our situation. X is our original dataset, which
has users’ ratings of items. We have m users, n items, and k would be the rank of X, and
consequently would also be an upper bound on the number d of latent variables we decide to care
about—note we choose d whereas m,n, and k are defined through our training dataset. So just like
in k-NN, where k is a tuning parameter (different k entirely—not trying to confuse you!), in this
case, d is the
tuning parameter.
Each row of U corresponds to a user, whereas V has a row for each item. The values along the
diagonal of the square matrix S are called the “singular values.” They measure the importance of
each latent variable—the most important latent variable has the biggest singular value.
Important Properties of SVD
Because the columns of U and V are orthogonal to each other, you can order the columns by
singular values via a base change operation. That way, if you put the columns in decreasing order
of their corresponding singular values (which you do), then the dimensions are ordered by
importance from highest to lowest. You can take a lower-rank approximation of X by throwing away
part of S. In other words, replace S by a submatrix taken from the upper-left corner of S.
Of course, if you cut off part of S you’d have to simultaneously cut off part of U and part of V, but
this is OK because you’re cutting off the least important vectors. This is essentially how you
choose the number of latent variables d—you no longer have the original matrix X anymore, only
an approximation of it, because d is typically much smaller than k, but it’s still pretty close to X.
This is what people mean when they talk about “compression,” if you’ve ever heard that term
thrown around. There is often an important interpretation to the values in the matrices U and V. For
example, you can see, by using SVD, that the “most important” latent feature is often something
like whether someone is a man or a woman. How would you actually use this for recommendation?
You’d take X, fill in all of the empty cells with the average rating for that item (you don’t want to
fill it in with 0 because that might mean something in the rating system, and SVD can’t handle
missing values), and then compute the SVD. Now that you’ve decomposed it this way, it means
that you’ve captured latent features that you can use to compare users if you want to. But that’s
not what you want—you want a prediction. If you multiply out U, S, and Vᵀ together, you get
an approximation to X—or a prediction, X^—so you can predict a rating by simply looking up
the entry for the appropriate user/item pair in the matrix X^.
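A minimal sketch of this procedure in Python with numpy; the ratings matrix is tiny and made up, and 0 marks a missing rating that is filled with the item's average before decomposing:
import numpy as np

# Tiny made-up ratings matrix: 4 users x 3 items, 0 = missing rating
X = np.array([
    [5, 0, 1],
    [4, 1, 0],
    [0, 1, 5],
    [1, 2, 4],
], dtype=float)

# Fill missing entries with the average observed rating for that item (column)
filled = X.copy()
for j in range(X.shape[1]):
    col = X[:, j]
    filled[col == 0, j] = col[col > 0].mean()

# SVD, then keep only the d most important latent features (largest singular values)
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
d = 2
X_hat = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]   # low-rank approximation = predicted ratings

print(np.round(X_hat, 2))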
Principal Component Analysis (PCA)
Let's look at another approach for predicting preferences. With this approach, you're still
looking for U and V as before, but you don't need S anymore, so you're just searching for U
and V such that:

X ≈ U · Vᵀ

Your optimization problem is that you want to minimize the discrepancy between the actual X
and your approximation to X via U and V, measured via the squared error:

argmin over U, V of Σ_{i,j} (x_{i,j} − u_i · v_j)²
Here you denote by u_i the row of U corresponding to user i, and similarly you denote by v_j
the row of V corresponding to item j. As usual, items can include metadata information (so the
age vectors of all the users will be a row in V).
Then the dot product u_i · v_j is the predicted preference of user i for item j, and you want that to
be as close as possible to the actual preference xi, j. So, you want to find the best choices of U and
V that minimize the squared differences between prediction and observation on everything you
actually know, and the idea is that if it’s really good on stuff you know, it will also be good on
stuff you’re guessing. This should sound familiar to you—it’s mean squared error, like we used
for linear regression. Now you get to choose a parameter, namely the number d defined as
how many latent features you want to use. The matrix U will have a row for each user and a column
for each latent feature, and the matrix V will have a row for each item and a column for each latent
feature.
How do you choose d? It’s typically about 100, because it’s more than
20 (as we told you, through the course of developing the product, we
found that we had a pretty good grasp on someone if we ask them 20
questions) and it’s as much as you care to add before it’s computationally
too much work.
Alternating Least Squares

But how do you do this? How do you actually find U and V? In reality, as you will see next, you’re
not first minimizing the squared error and then minimizing the size of the entries of the matrices
U and V. You're actually doing both at the same time. So your goal is to find U and V by solving
the optimization problem described earlier. This optimization doesn’t have a nice closed formula
like ordinary least squares with one set of coefficients. Instead, you need an iterative algorithm
like gradient descent. As long as your problem is convex you’ll converge OK (i.e., you won’t find
yourself at a local, but not global, minimum), and you can force your problem to be convex using
regularization.
Here’s the algorithm:
• Pick a random V.
• Optimize U while V is fixed.
• Optimize V while U is fixed.
• Keep doing the preceding two steps until you’re not changing very
much at all. To be precise, you choose an ϵ and if your coefficients
are each changing by less than ϵ, then you declare your algorithm
“converged.”
Theorem with no proof: The preceding algorithm will converge if your prior
is large enough. If you enlarge your prior, you make the optimization easier because you're
artificially creating a more convex function—on the other hand,
if your prior is huge, then all your coefficients will be zero anyway, so that doesn’t really get you
anywhere. So actually you might not want to enlarge your prior. Optimizing your prior is
philosophically screwed because how is it a prior if you’re back-fitting it to do what you want it
to do? Plus you’re mixing metaphors here to some extent by searching for a close approximation
of X at the same time you are minimizing coefficients. The more you care about coefficients, the
less
you care about X. But in actuality, you only want to care about X.
Fix V and Update U
The way you do this optimization is user by user. So for user i, you want to find:

argmin over u_i of Σ_j (x_{i,j} − u_i · v_j)², summed over the items j that user i has rated,

where v_j is fixed. In other words, you just care about this user for now. But wait a minute,
this is the same as linear least squares, and has a closed-form solution! In other words, set:

u_i = (V_{*,i}ᵀ V_{*,i})⁻¹ V_{*,i}ᵀ x_{i,*}

where V_{*,i} is the subset of V (the rows for the items rated by user i) for which you have
preferences coming from user i, and x_{i,*} is the vector of those ratings. Taking the
inverse is easy because it’s d×d, which is small. And there aren’t that many preferences per user,
so solving this many times is really not that hard. Overall you’ve got a doable update for U.
When you fix U and optimize V, it’s analogous—you only ever have to consider the users that
rated that movie, which may be pretty large for popular movies but on average isn’t; but even so,
you’re only ever inverting a d×d matrix.
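A hedged sketch of alternating least squares in Python with numpy, following the steps above; the ratings matrix is made up, and a small ridge term lam stands in for the regularizing "prior":
import numpy as np

rng = np.random.default_rng(0)

# Made-up ratings matrix: 6 users x 4 items, 0 = unrated
X = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
    [2, 1, 3, 0],
], dtype=float)
rated = X > 0

m, n = X.shape
d = 2                  # number of latent features
lam = 0.1              # regularization ("prior") strength
U = rng.normal(scale=0.1, size=(m, d))
V = rng.normal(scale=0.1, size=(n, d))

for _ in range(50):    # alternate until the factors stop changing much
    # Fix V, update each user's row of U by regularized least squares
    for i in range(m):
        Vi = V[rated[i]]                    # rows of V for the items user i rated
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(d), Vi.T @ X[i, rated[i]])
    # Fix U, update each item's row of V in the same way
    for j in range(n):
        Uj = U[rated[:, j]]                 # rows of U for the users who rated item j
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(d), Uj.T @ X[rated[:, j], j])

print(np.round(U @ V.T, 2))                 # predicted ratings X^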
UNIT–III: DATA VISUALIZATION (8 periods)
A Brief matplotlib API primer, Plotting with Pandas and Seaborn – Line plots,
Bar plots, Histograms and density plots, Scatter plots, Facet grids and Categorical
data; Other Python visualization tools.

Unit – III
DATA VISUALIZATION
A brief matplotlib api primer:
matplotlib.pyplot is a collection of functions that make Matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a
figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc.
Types of Plots:

S.No Function and Description


1 bar: Make a bar plot.
2 barh: Make a horizontal bar plot.
3 boxplot: Make a box and whisker plot.
4 hist: Plot a histogram.
5 hist2d: Make a 2D histogram plot.
6 pie: Plot a pie chart.
7 plot: Plot lines and/or markers to the Axes.
8 scatter: Make a scatter plot of x vs y.
9 polar: Make a polar plot.
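A minimal sketch using two of these pyplot functions on made-up data:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(x, y, marker="o")    # line plot with markers
ax1.set_title("plot")

ax2.bar(x, y)                 # bar plot of the same data
ax2.set_title("bar")

plt.tight_layout()
plt.show()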

Plotting with Pandas and Seaborn:

Data visualization is the presentation of data in pictorial format. It is extremely important for
data analysis, and Python has a fantastic ecosystem of data-centric packages for it. Visualization
helps us understand data, however complex it is, by summarizing and presenting a huge amount
of data in a simple, easy-to-understand format, and it helps communicate information clearly
and effectively.
Pandas:

Pandas offer tools for cleaning and process your data. It is the most popular
Python library that is used for data analysis. In pandas, a data table is called
a dataframe.

Example1:

import pandas as pd

# initialise data of lists.
data = {'Name': ['Mohe', 'Karnal', 'Yrik', 'jack'], 'Age': [30, 21, 29, 28]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
df
output:

Example2: load the CSV data from the system and display it through pandas.

# import module
import pandas

# load the csv
data = pandas.read_csv("nba.csv")
data.head()

output:

Seaborn:

Seaborn is an amazing visualization library for statistical graphics plotting in
Python. It is built on top of the matplotlib library and is closely integrated
with the data structures from pandas.

Basic plot using seaborn:

# Importing libraries
import numpy as np
import seaborn as sns

sns.set(style="white")
rs = np.random.RandomState(10)
d = rs.normal(size=50)
sns.distplot(d, kde=True, color="g")


Output:
Seaborn: statistical data visualization

Seaborn helps visualize statistical relationships. To understand how variables in a
dataset are related to one another, and how that relationship depends on other
variables, we perform statistical analysis. This statistical analysis helps to
visualize trends and identify various patterns in the dataset.

Line plot:

Lineplot is the most popular plot for drawing a relationship between x and y, with
the possibility of several semantic groupings.

Syntax: sns.lineplot(x=None, y=None)

Parameters: x, y: Input data variables; must be numeric. Can pass data
directly or reference columns in data.
Example1:

import seaborn as sns
import pandas

data = pandas.read_csv("nba.csv")
sns.lineplot(data['Age'], data['Weight'])

Output:
Example 2: Use the hue parameter for plotting the graph.

# import module
import seaborn as sns
import pandas

# read the csv data
data = pandas.read_csv("nba.csv")

# plot
sns.lineplot(data['Age'], data['Weight'], hue=data["Position"])

Output:

Bar plot:

Barplot represents an estimate of central tendency for a numeric variable
with the height of each rectangle and provides some indication of the
uncertainty around that estimate using error bars.

Syntax: seaborn.barplot(x=None, y=None, hue=None, data=None)

Parameters:
• x, y: These parameters take names of variables in data or vector data; inputs
  for plotting long-form data.
• hue: (optional) This parameter takes a column name for colour encoding.
• data: (optional) This parameter takes a DataFrame, array, or list of arrays as
  the dataset for plotting. If x and y are absent, this is interpreted as wide-form.
  Otherwise it is expected to be long-form.

Returns: Returns the Axes object with the plot drawn onto it.

Example1:

# import modules
import pandas
import seaborn

seaborn.set(style='whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x=data["Age"])

Output:
Example2:

# import modules
import pandas
import seaborn

seaborn.set(style='whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x="Age", y="Weight", data=data)

Output:

Histograms and Density Plots

The histogram is a graphical representation that organizes a group of data
points into specified ranges. A density plot is the continuous, smoothed
version of the histogram estimated from the data. It is estimated through
Kernel Density Estimation: a kernel (continuous curve) is drawn at every
individual data point, and all these curves are then added together to make a
single smoothed density estimate. A histogram falls short when we want to
compare the data distribution of a single variable over multiple categories;
in that case a density plot is more useful for visualizing the data.
Approach:

• Import the necessary libraries.
• Create or import a dataset from the seaborn library.
• Select the column for which we have to make a plot.
• For making the plot we use the distplot() function provided by the seaborn
  library, which plots the histogram and density plot together when we pass it the
  dataset column.
• We can also make the histogram and density plot individually using the
  distplot() function according to our needs.
• For creating the histogram individually we have to pass kde=False as a
  parameter to the distplot() function.
• For creating the density plot individually we have to pass hist=False as a
  parameter to the distplot() function.
• After making the plot we have to visualize it, so we use the show() function
  provided by the matplotlib.pyplot library.


Example 1: Plotting the Histogram using seaborn library

# importing necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# importing diamond dataset from the library
df = sns.load_dataset('diamonds')

# plotting histogram for carat using distplot()
sns.distplot(a=df.carat, kde=False)

# visualizing plot using matplotlib.pyplot library
plt.show()

Output:

Example 2: Plotting the Density using seaborn library on the default setting.

# importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# importing diamond dataset from the library
df = sns.load_dataset('diamonds')

# plotting density plot for carat using distplot()
sns.distplot(a=df.carat, hist=False)

# visualizing plot using matplotlib.pyplot library
plt.show()

Output:

Example 3: Plotting Histogram and Density Plot together

# importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# importing diamond dataset from the library
df = sns.load_dataset('diamonds')

# plotting histogram and density plot for carat using distplot()
sns.distplot(a=df.carat)

# visualizing plot using matplotlib.pyplot library
plt.show()

Output:

Example 4: Plotting Histogram and Density Plot together by setting bins and color.

# importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# importing diamond dataset from the library
df = sns.load_dataset('diamonds')

# plotting histogram and density plot for carat
# using distplot() by setting bins and color
sns.distplot(a=df.carat, bins=40, color='purple',
             hist_kws={"edgecolor": 'black'})

# visualizing plot using matplotlib.pyplot library
plt.show()

Output:

Example 5: Plotting Histogram and Density Plot together using Iris dataset.

# importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# importing iris dataset from the library
df2 = sns.load_dataset('iris')

# plotting histogram and density plot for petal length
# using distplot() by setting color
sns.distplot(a=df2.petal_length, color='green',
             hist_kws={"edgecolor": 'black'})

# visualizing plot using matplotlib.pyplot library
plt.show()

Output:

Scatter Plot
Scatterplot can be used with several semantic groupings, which help to make a
graph easier to understand. It plots two-dimensional graphics that can be
enhanced by mapping up to three additional variables using the semantics of
hue, size, and style parameters. All these parameters control visual semantics
which are used to identify the different subsets. Using redundant semantics can
be helpful for making graphics more accessible.

Syntax: seaborn.scatterplot(x=None, y=None, hue=None, style=None, size=None, data=None,
palette=None, hue_order=None, hue_norm=None, sizes=None, size_order=None, size_norm=None,
markers=True, style_order=None, x_bins=None, y_bins=None, units=None, estimator=None,
ci=95, n_boot=1000, alpha='auto', x_jitter=None, y_jitter=None, legend='brief',
ax=None, **kwargs)

Parameters:
x, y: Input data variables that should be numeric.
data: Dataframe where each column is a variable and each row is an observation.
size: Grouping variable that will produce points with different sizes.
style: Grouping variable that will produce points with different markers.
palette: Colors to use for the different levels of the hue variable.
markers: Object determining how to draw the markers for different levels.
alpha: Proportional opacity of the points.

Returns: This method returns the Axes object with the plot drawn onto it.
Sample code:

import seaborn

seaborn.set(style='whitegrid')
fmri = seaborn.load_dataset("fmri")
seaborn.scatterplot(x="timepoint", y="signal", data=fmri)

Output:

Grouping data points on the basis of category, here as region and event.

import seaborn

seaborn.set(style='whitegrid')
fmri = seaborn.load_dataset("fmri")
seaborn.scatterplot(x="timepoint", y="signal",
                    hue="region", style="event", data=fmri)
Output:

Grouping variables in Seaborn Scatter Plot with different attributes

(The examples below use the seaborn "tips" dataset, loaded as tip = seaborn.load_dataset("tips").)

1. Adding the marker attributes

The circle is used to represent the data point, and the default marker here is a
blue circle. In the output above we see the default marker, but we can customize
this blue circle with the marker attribute.

Code:
seaborn.scatterplot(x='day', y='tip', data=tip, marker='+')

output:
2. Adding the hue attributes.
It will produce data points with different colors. Hue can be used to group
the data by a variable and show how the plotted values depend on that grouping.

Syntax: seaborn.scatterplot(x, y, data, hue)

Code:
seaborn.scatterplot(x='day', y='tip', data=tip, hue='time')
output:

In the above example, we can see how the tip on a given day is related to
whether it was lunchtime or dinner time. The blue color represents Dinner
and the orange color represents Lunch.
Let's check for hue = "day":

Code:

seaborn.scatterplot(x='day', y='tip', data=tip, hue='day')


output:

3. Adding the style attributes.

Using style we can group the points by a variable so that each group is drawn
with a different marker.

Syntax:
seaborn.scatterplot(x, y, data, style)

code:

seaborn.scatterplot(x='day', y='tip', data=tip, hue="time", style="time")

output:
4. Adding the palette attributes.
Using the palette we can draw the points with different colors. In the example
below, the palette determines the colormap used for the scatter plot.

Syntax:
seaborn.scatterplot(x, y, data, palette="color_name")

code:

seaborn.scatterplot(x='day', y='tip', data=tip, hue='time', palette='pastel')

output:

5. Adding size attributes.

Using size we can produce points with different sizes, grouped by the given
variable.

Syntax:
seaborn.scatterplot(x, y, data, size)

code:

seaborn.scatterplot(x='day', y='tip', data=tip ,hue='size', size = "size")


Output:

6. Adding legend attributes.

Using the legend parameter we can turn the legend on (legend='full') or off
(legend=False).
If the legend is "brief", numeric hue and size variables will be represented
with a sample of evenly spaced values.
If the legend is "full", every group will get an entry in the legend. If False,
no legend data is added and no legend is drawn.

Syntax: seaborn.scatterplot(x, y, data, legend="brief")


Code:

seaborn.scatterplot(x='day', y='tip', data=tip, hue='day',
                    sizes=(30, 200), legend='brief')


output:
7. Adding alpha attributes.
Using alpha we can control the proportional opacity of the points; we can
decrease or increase the opacity.
Syntax: seaborn.scatterplot(x, y, data, alpha=0.2)
Code:
seaborn.scatterplot(x='day', y='tip', data=tip, alpha=0.1)

output:

Facet grids and Categorical data

seaborn.FacetGrid():
The FacetGrid class helps in visualizing the distribution of one variable, as well
as the relationship between multiple variables separately, within subsets of
your dataset using multiple panels.
A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The
first two have an obvious correspondence with the resulting array of axes;
think of the hue variable as a third dimension along a depth axis, where
different levels are plotted with different colors.
A FacetGrid object takes a dataframe as input along with the names of the
variables that will form the row, column, or hue dimensions of the grid.
The variables should be categorical, and the data at each level of the variable
will be used for a facet along that axis.
Example1:

# importing packages
import seaborn
import matplotlib.pyplot as plt

# loading of a dataframe from seaborn
df = seaborn.load_dataset('tips')

# Form a facetgrid using columns with a hue
graph = seaborn.FacetGrid(df, col="sex", hue="day")

# map the above form facetgrid with some attributes
graph.map(plt.scatter, "total_bill", "tip", edgecolor="w").add_legend()

# show the object
plt.show()

output:

Example2:

# importing packages
import seaborn
import matplotlib.pyplot as plt

# loading of a dataframe from seaborn
df = seaborn.load_dataset('tips')

# Form a facetgrid using rows and columns
graph = seaborn.FacetGrid(df, row='smoker', col='time')

# map the above form facetgrid with some attributes
graph.map(plt.hist, 'total_bill', bins=15, color='orange')

# show the object
plt.show()

output:

Example3:

# importing packages
import seaborn
import matplotlib.pyplot as plt

# loading of a dataframe from seaborn
df = seaborn.load_dataset('tips')

# Form a facetgrid using columns with a hue
graph = seaborn.FacetGrid(df, col='time', hue='smoker')

# map the above form facetgrid with some attributes
graph.map(seaborn.regplot, "total_bill", "tip").add_legend()

# show the object
plt.show()

output:

Other Python visualization tools:


1. Matplotlib
Matplotlib is a data visualization library and 2-D plotting library of Python. It
was initially released in 2003 and it is the most popular and widely-used
plotting library in the Python community. It comes with an interactive
environment across multiple platforms. Matplotlib can be used in Python
scripts, the Python and IPython shells, the Jupyter notebook, web application
servers, etc. It can be used to embed plots into applications using various GUI
toolkits like Tkinter, GTK+, wxPython, Qt, etc. So you can use Matplotlib to
create plots, bar charts, pie charts, histograms, scatterplots, error charts,
power spectra, stemplots, and whatever other visualization charts you want!
The Pyplot module also provides a MATLAB-like interface that is just as
versatile and useful as MATLAB while being free and open source.
2. Plotly
Plotly is a free open-source graphing library that can be used to form data
visualizations. Plotly (plotly.py) is built on top of the Plotly JavaScript library
(plotly.js) and can be used to create web-based data visualizations that can
be displayed in Jupyter notebooks or web applications using Dash or saved as
individual HTML files. Plotly provides more than 40 unique chart types like
scatter plots, histograms, line charts, bar charts, pie charts, error bars, box
plots, multiple axes, sparklines, dendrograms, 3-D charts, etc. Plotly also
provides contour plots, which are not that common in other data visualization
libraries. In addition to all this, Plotly can be used offline with no internet
connection.
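
As a small illustration (assuming the plotly package is installed; the data values here are made up),
Plotly Express can build an interactive chart in a couple of lines:

import plotly.express as px

# a simple interactive scatter plot, rendered in a notebook or browser
fig = px.scatter(x=[1, 2, 3, 4], y=[10, 11, 12, 13], title="Plotly Express example")
fig.show()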

3. Seaborn
Seaborn is a Python data visualization library that is based on Matplotlib and
closely integrated with the NumPy and pandas data structures. Seaborn has
various dataset-oriented plotting functions that operate on data frames and
arrays that have whole datasets within them. Then it internally performs the
necessary statistical aggregation and mapping functions to create informative
plots that the user desires. It is a high-level interface for creating beautiful
and informative statistical graphics that are integral to exploring and
understanding data. The Seaborn data graphics can include bar charts, pie
charts, histograms, scatterplots, error charts, etc. Seaborn also has various
tools for choosing color palettes that can reveal patterns in the data.

4. GGplot

Ggplot is a Python data visualization library that is based on the


implementation of ggplot2 which is created for the programming language
R. Ggplot can create data visualizations such as bar charts, pie charts,
histograms, scatterplots, error charts, etc. using high-level API. It also allows
you to add different types of data visualization components or layers in a
single visualization. Once ggplot has been told which variables to map to
which aesthetics in the plot, it does the rest of the work so that the user can
focus on interpreting the visualizations and take less time in creating them.
But this also means that it is not possible to create highly customized
graphics in ggplot. Ggplot is also deeply connected with pandas so it is
best to keep the data in DataFrames.
5. Altair
Altair is a statistical data visualization library in Python. It is based on Vega
and Vega-Lite which are a sort of declarative language for creating, saving,
and sharing data visualization designs that are also interactive. Altair can be
used to create beautiful data visualizations of plots such as bar charts, pie
charts, histograms, scatterplots, error charts, power spectra, stemplots, etc.
using a minimal amount of coding. Altair has dependencies which include
python 3.6, entrypoints, jsonschema, NumPy, Pandas, and Toolz which are
automatically installed with the Altair installation commands. You can open
Jupyter Notebook or JupyterLab and execute any of the code to obtain that
data visualizations in Altair. Currently, the source for Altair is available on
GitHub.
6. Bokeh
Bokeh is a data visualization library that provides detailed graphics with a high
level of interactivity across various datasets, whether they are large or small.
Bokeh is based on The Grammar of Graphics like ggplot but it is native to
Python while ggplot is based on ggplot2 from R. Data visualization experts
can create various interactive plots for modern web browsers using bokeh
which can be used in interactive web applications, HTML documents, or JSON
objects. Bokeh has 3 levels that can be used for creating visualizations. The
first level focuses only on creating the data plots quickly, the second level
controls the basic building blocks of the plot while the third level provides full
autonomy for creating the charts with no pre-set defaults. This level is suited
to the data analysts and IT professionals that are well versed in the
technical side of creating data visualizations.
7. Pygal
Pygal is a Python data visualization library that is made for creating sexy
charts! (According to their website!) While Pygal is similar to Plotly or Bokeh
in that it creates data visualization charts that can be embedded into web
pages and accessed using a web browser, a primary difference is that it can
output charts in the form of SVG’s or Scalable Vector Graphics. These SVG’s
ensure that you can observe your charts clearly without losing any of the
quality even if you scale them. However, SVG’s are only useful with smaller
datasets as too many data points are difficult to render and the charts can
become sluggish.
8. Geoplotlib
Most of the data visualization libraries don’t provide much support for
creating maps or using geographical data and that is why geoplotlib is
such an important Python library. It supports the creation of geographical
maps in particular with many different types of maps available such as
dot-density maps, choropleths, symbol maps, etc. One thing to keep in
mind is that it requires NumPy and pyglet as prerequisites before installation,
but that is not a big disadvantage, especially if you want to create
geographical maps, for which geoplotlib is one of the few excellent options out
there.
In conclusion, all these Python Libraries for Data Visualization are
great options for creating beautiful and informative data visualizations.
Each of these has its strong points and advantages so you can select the
one that is perfect for your data visualization or project. For example,
Matplotlib is extremely popular and well suited to general 2-D plots while
Geoplotlib is uniquely suited to geographical visualizations. So go on and
choose your library to create a stunning visualization in Python!
UNIT IV
STATISTICAL THINKING
Distributions – Representing and plotting histograms, Outliers, Summarizing distributions, Variance,
Reporting results; Probability mass function – Plotting PMFs, Other visualizations, The class size
paradox, Data frame indexing; Cumulative distribution functions - Limits of PMFs, Representing
CDFs, Percentile based statistics, Random numbers, Comparing percentile ranks; Modeling
distributions - Exponential distribution, Normal distribution, Lognormal distribution.
4.1 Representing Histograms:
The Hist constructor can take a sequence, dictionary, pandas Series, or another Hist. We can
instantiate a Hist object using the following command:
>>> import thinkstats2
>>> hist = thinkstats2.Hist([1, 2, 2, 3, 5])
>>> hist
Hist({1: 1, 2: 2, 3: 1, 5: 1})
Hist objects provide Freq, which takes a value and returns its frequency:
>>> hist.Freq(2)
2
The bracket operator does the same thing:
>>> hist[2]
2
If you look up a value that has never appeared, the frequency is 0:
>>> hist.Freq(4)
0
Values returns an unsorted list of the values in the Hist:
>>> hist.Values()
[1, 5, 3, 2]
To loop through the values in order, we can use the built-in function sorted:
for val in sorted(hist.Values()):
    print(val, hist.Freq(val))
Or we can use Items to iterate through value-frequency pairs:
for val, freq in hist.Items():
    print(val, freq)
4.1.1 Plotting Histogram:
A module called thinkplot.py provides functions for plotting Hists and other objects defined in
thinkstats2.py. It is based on pyplot, which is part of the matplotlib package.
To plot hist with thinkplot:
>>> import thinkplot
>>> thinkplot.Hist(hist)
>>> thinkplot.Show(xlabel='value', ylabel='frequency')
4.1.2 Outliers
Looking at histograms, it is easy to identify the most common values and the shape of the
distribution, but rare values are not always visible.
Outliers are extreme values that might be errors in measurement and recording, or might be
accurate reports of rare events.
Hist provides methods Largest and Smallest, which take an integer n and return the n largest or
smallest values from the histogram:
for weeks, freq in hist.Smallest(10):
    print(weeks, freq)
In the list of pregnancy lengths for live births, the 10 lowest values are [0, 4, 9, 13, 17, 18, 19, 20, 21,
22]. Values below 10 weeks are certainly errors; the most likely explanation is that the outcome was not
coded correctly. Values higher than 30 weeks are probably legitimate. Between 10 and 30 weeks, it is
hard to be sure; some values are probably errors, but some represent premature babies. On the other end
of the range, the highest values are:
weeks count
43 148
44 46
45 10
46 1
47 1
48 7
50 2
Most doctors recommend induced labor if a pregnancy exceeds 42 weeks, so some of the longer values
are surprising. In particular, 50 weeks seems medically unlikely.
The best way to handle outliers depends on "domain knowledge"; that is, information about where the
data come from and what they mean. It also depends on what analysis we are planning to perform.
In this example, the motivating question is whether first babies tend to be early (or late). Since we are
usually interested in full-term pregnancies, this analysis will focus on pregnancies longer than
27 weeks.
First Babies:
Now compare the distribution of pregnancy lengths for first babies and others. Divide the
DataFrame of live births using birthord, and compute their histograms:
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
first_hist = thinkstats2.Hist(firsts.prglngth)
other_hist = thinkstats2.Hist(others.prglngth)
Then plot their histograms on the same axis:
width = 0.45
thinkplot.PrePlot(2)
thinkplot.Hist(first_hist, align='right', width=width)
thinkplot.Hist(other_hist, align='left', width=width)
thinkplot.Show(xlabel='weeks', ylabel='frequency')
thinkplot.PrePlot takes the number of histograms we are planning to plot; it uses this information to
choose an appropriate collection of colors.
thinkplot.Hist normally uses align='center' so that each bar is centered over its value.
For this figure, we use align='right' and align='left' to place corresponding bars on either side of the value.
With width=0.45, the total width of the two bars is 0.9, leaving some space between each pair.
Finally, we adjust the axis to show only data between 27 and 46 weeks. Figure 4.1 shows the
corresponding result.
Figure 4.1 - Histogram of pregnancy lengths.
Histograms are useful because they make the most frequent values immediately apparent. But they are
not the best choice for comparing two distributions. In this example, there are fewer "first babies"
than "others," so some of the apparent differences in the histograms are due to sample sizes.
4.2 Summarizing Distributions:
A histogram is a complete description of the distribution of a sample; that is, given a histogram, we
could reconstruct the values in the sample (although not their order).
If the details of the distribution are important, it might be necessary to present a histogram. But often
we want to summarize the distribution with a few descriptive statistics.
Some of the characteristics we might want to report are:

Central tendency
    Do the values tend to cluster around a particular point?
Modes
    Is there more than one cluster?
Spread
    How much variability is there in the values?
Tails
    How quickly do the probabilities drop off as we move away from the modes?
Outliers
    Are there extreme values far from the modes?
Statistics designed to answer these questions are called summary statistics. By far the most
common summary statistic is the mean, which is meant to describe the central
tendency of the distribution.
The words mean and average are sometimes used interchangeably, but we make this distinction: if we
have a sample of n values, xi, the mean, x̄, is the sum of the values divided by the number of values:

    x̄ = (1/n) Σ xi

• The mean of a sample is the summary statistic computed with the previous formula.
• An average is one of several summary statistics we might choose to describe a central tendency.
Sometimes the mean is a good description of a set of values. For example, apples are all pretty much
the same size (at least the ones sold in supermarkets). So if I buy 6 apples and the total weight is
3 pounds, it would be a reasonable summary to say they are about a half pound each.

But pumpkins are more diverse. Suppose I grow several varieties in my garden, and one day I harvest
three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one
Atlantic Giant pumpkin that weighs 591 pounds. The mean of this sample is 100 pounds, but if I told
you "The average pumpkin in my garden is 100 pounds," that would be misleading. In this example,
there is no meaningful average because there is no typical pumpkin.
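
A quick check of the pumpkin example with NumPy (the weights are taken from the text above):

import numpy as np

pumpkins = [1, 1, 1, 3, 3, 591]   # weights in pounds
print(np.mean(pumpkins))          # 100.0 -- the mean, but not a "typical" pumpkin
print(np.std(pumpkins))           # the spread is enormous relative to the mean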

Variance
If there is no single number that summarizes pumpkin weights, we can do a little better with two
numbers: mean and variance.
Variance is a summary statistic intended to describe the variability or spread of a distribution. The
variance of a set of values is

    S² = (1/n) Σ (xi − x̄)²

The term xi − x̄ is called the "deviation from the mean," so variance is the mean squared deviation. The
square root of variance, S, is the standard deviation.
Pandas data structures provide methods to compute mean, variance and standard deviation:
mean = live.prglngth.mean()
var = live.prglngth.var()
std = live.prglngth.std()

For all live births, the mean pregnancy length is 38.6 weeks and the standard deviation is 2.7 weeks,
which means we should expect deviations of 2-3 weeks to be common.
Variance of pregnancy length is 7.3, which is hard to interpret, especially since the units are weeks, or
"square weeks." Variance is useful in some calculations, but it is not a good summary statistic.
4.2.1 Reporting Results:
There are several ways to describe the difference in pregnancy length (if there is one) between first
babies and others. How should we report these results?
The answer depends on the question. A scientist might be interested in any (real) effect, no matter how
small. A doctor might only care about effects that are clinically significant; that is, differences that
affect treatment decisions. A pregnant woman might be interested in results that are relevant to her,
like the probability of delivering early or late.
4.3 Probability mass function
Another way to represent a distribution is a probability mass function (PMF), which maps from each
value to its probability.
A probability is a frequency expressed as a fraction of the sample size, n. To get from frequencies to
probabilities, we divide through by n, which is called normalization.
Given a Hist, we can make a dictionary that maps from each value to its probability:
n = hist.Total()
d = {}
for x, freq in hist.Items():
    d[x] = freq / n

Or use the Pmf class provided by thinkstats2. Like Hist, the Pmf constructor can take a list, pandas
Series, dictionary, Hist, or another Pmf object.
Here's an example with a simple list:
>>> import thinkstats2
>>> pmf = thinkstats2.Pmf([1, 2, 2, 3, 5])
>>> pmf
Pmf({1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2})

The Pmf is normalized so total probability is 1.
Pmf and Hist objects are similar in many ways; in fact, they inherit many of their methods from a
common parent class. For example, the methods Values and Items work the same way for both. The
biggest difference is that a Hist maps from values to integer counters; a Pmf maps from values to
floating-point probabilities.
To look up the probability associated with a value, use Prob:
>>> pmf.Prob(2)
0.4
The bracket operator is equivalent:
>>> pmf[2]
0.4
We can modify an existing Pmf by incrementing the probability associated with a value:
>>> pmf.Incr(2, 0.2)
>>> pmf.Prob(2)
0.6
Or we can multiply a probability by a factor:
>>> pmf.Mult(2, 0.5)
>>> pmf.Prob(2)
0.3
If we modify a Pmf, the result may not be normalized; that is, the probabilities may no longer add up
to 1. To check, you can call Total, which returns the sum of the probabilities:
>>> pmf.Total()
0.9
To renormalize, call Normalize:
>>> pmf.Normalize()
>>> pmf.Total()
1.0
Pmf objects provide a Copy method so we can make and modify a copy without affecting the original.

My notation in this section might seem inconsistent, but there is a system: we use Pmf for the name
of the class, pmf for an instance of the class, and PMF for the mathematical concept of a probability
mass function.

4.4 Plotting PMFs:

thinkplot provides two ways to plot Pmfs:
To plot a Pmf as a bar graph, we can use thinkplot.Hist. Bar graphs are most useful if the
number of values in the Pmf is small.
To plot a Pmf as a step function, we can use thinkplot.Pmf. This option is most useful if there are a
large number of values and the Pmf is smooth. This function also works with Hist objects.
Here's the code that generates Figure 4.2:
thinkplot.PrePlot(2, cols=2)
thinkplot.Hist(first_pmf, align='right', width=width)
thinkplot.Hist(other_pmf, align='left', width=width)
thinkplot.Config(xlabel='weeks', ylabel='probability', axis=[27, 46, 0, 0.6])
thinkplot.PrePlot(2)
thinkplot.SubPlot(2)
thinkplot.Pmfs([first_pmf, other_pmf])
thinkplot.Show(xlabel='weeks', axis=[27, 46, 0, 0.6])
By plotting the PMF instead of the histogram, we can compare the two distributions without being
misled by the difference in sample size.
Figure 4.2 shows PMFs of pregnancy length for first babies and others using bar graphs (left) and
step functions (right).
Figure 4.2. PMF of pregnancy lengths for first babies and others, using bar graphs and step functions

4.4.1 Other Visualizations

Once we have an idea of what is going on and have identified patterns and relationships, a good next
step is to design a visualization that makes those patterns as clear as possible.
weeks = range(35, 46)
diffs = []
for week in weeks:
    p1 = first_pmf.Prob(week)
    p2 = other_pmf.Prob(week)
    diff = 100 * (p1 - p2)
    diffs.append(diff)
thinkplot.Bar(weeks, diffs)
In this code, weeks is the range of weeks; diffs is the difference between the two PMFs in percentage
points. Figure 4.2 shows the result as a bar chart. This figure makes the pattern clearer: first babies are
less likely to be born in week 39, and somewhat more likely to be born in weeks 41 and 42. We used the
same dataset to identify an apparent difference and then chose a visualization that makes the difference
apparent.

4.5 THE CLASS SIZE PARADOX

The class size paradox tells us that students tend to experience a greater average class size than the
true average class size, because most of those experiences come from the classes with the most
students. Confusing?
Let's examine this paradox with an example below.
We will calculate the actual class size and the observed class size:

If you calculate the mean class size of the above dataset, it is going to be (10 + 20 + 30)/3 = 20
Actual class size = 20
• But if you survey each student about their class size, a student in class 1 is going to say "I have 9
other classmates in the class", a student in class 2 is going to say that he/she has 19 other classmates,
and so on.
Let's take their responses into consideration and calculate the mean. This mean is also called the
observed mean.
Observed mean = ((10 * 10) + (20 * 20) + (30 * 30)) / (10 + 20 + 30)
Observed mean = (100 + 400 + 900) / 60
Observed mean = 23.33
• The observed mean calculation is like any weighted average calculation that we do in most
statistical exercises.
You can also calculate the observed mean using a PMF for the above dataset:
PMF = [(10: 10/60), (20: 20/60), (30: 30/60)]
PMF = [(10: 0.167), (20: 0.333), (30: 0.500)]
So the mean of the class size is: 10*0.167 + 20*0.333 + 30*0.500 = 23.33
Observed mean = 23.33
As you can see, the observed mean is higher than the actual mean. This is what the class size paradox
teaches us.
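
The same computation in a minimal Python sketch (class sizes taken from the example above):

sizes = [10, 20, 30]                                    # actual class sizes

actual_mean = sum(sizes) / len(sizes)                   # average over classes = 20.0
observed_mean = sum(s * s for s in sizes) / sum(sizes)  # average over students = 1400/60

print(actual_mean, round(observed_mean, 2))             # 20.0 23.33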

4.6 DATA FRAME INDEXING

In "DataFrames" we read a pandas DataFrame and used it to select and modify data columns. Now we
look at row selection. To start, we create a NumPy array of random numbers and use it to initialize a
DataFrame:
>>> import numpy as np
>>> import pandas
>>> array = np.random.randn(4, 2)
>>> df = pandas.DataFrame(array)
>>> df

By default, the rows and columns are numbered starting at zero, but we can provide column
names:
>>> columns = ['A', 'B']
>>> df = pandas.DataFrame(array, columns=columns)
>>> df

We can also provide row names. The set of row names is called the index; the row names
themselves are called labels.
>>> index = ['a', 'b', 'c', 'd']
>>> df = pandas.DataFrame(array, columns=columns, index=index)
>>> df
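
To complete the row-selection example, here is a short sketch using loc and iloc (the values are
random, so the numbers printed will differ from run to run):

import numpy as np
import pandas

array = np.random.randn(4, 2)
columns = ['A', 'B']
index = ['a', 'b', 'c', 'd']
df = pandas.DataFrame(array, columns=columns, index=index)

print(df.loc['a'])       # select a row by its label
print(df.iloc[0])        # select the same row by its integer position
print(df.loc['a':'c'])   # label-based slicing includes both endpoints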

4.7 CUMULATIVE DISTRIBUTION FUNCTIONS

4.7.1 The Limits of PMFs:
● PMFs work well if the number of values is small. But as the number of values increases, the
probability associated with each value gets smaller and the effect of random noise increases.
● The problems can be mitigated by binning the data; that is, dividing the range of values into non-
overlapping intervals and counting the number of values in each bin.
● Binning can be useful, but it is tricky to get the size of the bins right. If they are big enough to
smooth out noise, they might also smooth out useful information.
● It is hard to tell which features are meaningful. Also, it is hard to see overall patterns in some cases.
4.7.2 Percentiles:
Example:
● If we consider a standardized test, we probably got the results in the form of a raw score and a
percentile rank. In this context, the percentile rank is the fraction of people who scored lower than you
(or the same). So if we are "in the 90th percentile," we did as well as or better than 90% of the people
who took the exam.
Computation of the percentile rank of a value in the sequence scores:
def PercentileRank(scores, your_score):
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank
● As an example, if the scores in the sequence were 55, 66, 77, 88 and 99, and we got the value 88,
then the percentile rank would be 100 * 4 / 5, which is 80.
● Given a percentile rank, if we want to find the corresponding value, one option is to sort the
values and search for the one we want:
def Percentile(scores, percentile_rank):
    scores.sort()
    for score in scores:
        if PercentileRank(scores, score) >= percentile_rank:
            return score
● The result of this calculation is a percentile. For example, the 50th percentile is the value with
percentile rank 50. In the distribution of exam scores, the 50th percentile is 77.
● To summarize, PercentileRank takes a value and computes its percentile rank in a set of values;
Percentile takes a percentile rank and computes the corresponding value.

4.8 CDFs:
● The CDF is the function that maps from a value to its percentile rank.
● The CDF is a function of x, where x is any value that might appear in the distribution.
● To evaluate CDF(x) for a particular value of x, we compute the fraction of values in the distribution
less than or equal to x.
Let us consider the following example: a function that takes a sequence, t, and a value, x:

def EvalCdf(t, x):
    count = 0.0
    for value in t:
        if value <= x:
            count += 1
    prob = count / len(t)
    return prob
● This function is almost identical to PercentileRank, except that the result is a probability in the range
0-1 rather than a percentile rank in the range 0-100.
As an example, suppose we collect a sample with the values [1, 2, 2, 3, 5]. Here are some values from
its CDF:
CDF(0) = 0
CDF(1) = 0.2
CDF(2) = 0.6
CDF(3) = 0.8
CDF(4) = 0.8
CDF(5) = 1
● We can evaluate the CDF for any value of x, not just values that appear in the sample. If x is less than
the smallest value in the sample, CDF(x) is 0. If x is greater than the largest value, CDF(x) is 1.
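
To check these numbers, a short usage sketch of the EvalCdf function above (repeated here so the
snippet runs on its own):

def EvalCdf(t, x):
    count = 0.0
    for value in t:
        if value <= x:
            count += 1
    return count / len(t)

sample = [1, 2, 2, 3, 5]
for x in range(6):
    print(x, EvalCdf(sample, x))   # prints 0.0, 0.2, 0.6, 0.8, 0.8, 1.0 for x = 0..5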
Representing CDFs:
● Prob(x)
Given a value x, computes the probability p = CDF(x). The bracket operator is equivalent to Prob.
● Value(p)
Given a probability p, computes the corresponding value, x; that is, the inverse CDF of p.
➔ The Cdf constructor can take as an argument a list of values, a pandas Series, a Hist, Pmf, or another
Cdf.
➔ The following code makes a Cdf for the distribution of pregnancy lengths in the NSFG:
live, firsts, others = first.MakeFrames()
cdf = thinkstats2.Cdf(live.prglngth, label='prglngth')
❖ thinkplot provides a function named Cdf that plots Cdfs as lines:
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='weeks', ylabel='CDF')
Percentile based statistics:
Introduction to Percentiles
Percentiles are statistical measures that divide a dataset into specific segments, indicating the relative
position of a particular value within the distribution. They help us understand how data points are
distributed in relation to each other and provide valuable insights into the dataset's overall
characteristics. In data science, percentiles are commonly used for various purposes such as analyzing
data distributions, outlier detection, and understanding data variability.
Calculating Percentiles
To calculate percentiles, follow these steps:

Step 1: Arrange Data in Ascending Order. Sort the dataset in ascending order. This ensures that the
data points are organized for percentile calculation.
Step 2: Identify the Desired Percentile Determine which percentile you want to calculate. Common
percentiles include the median (50th percentile), quartiles (25th and 75th percentiles), and various
other percentiles like the 90th or 95th percentile.
Step 3: Calculate the Position Use the formula position = (percentile / 100) * (n + 1) to find the position
of the desired percentile within the dataset, where n is the total number of data points.
Step 4: Find the Data Value If the position is a whole number, the data value at that position is the
desired percentile. If the position is not a whole number, calculate the weighted average of the data
values at the floor and ceiling positions.
Code:

import numpy as np

data = [18, 22, 25, 27, 38, 33, 37, 41, 45, 50]

# Step 1: Sort data
sorted_data = np.sort(data)

# Step 2: Calculate positions
percentile_25 = 25
position_25 = (percentile_25 / 100) * (len(sorted_data) + 1)

percentile_75 = 75
position_75 = (percentile_75 / 100) * (len(sorted_data) + 1)

# Steps 3 and 4: Calculate percentiles
if position_25.is_integer():
    percentile_value_25 = sorted_data[int(position_25) - 1]
else:
    floor_position_25 = int(np.floor(position_25))
    ceil_position_25 = int(np.ceil(position_25))
    percentile_value_25 = (sorted_data[floor_position_25] +
        (position_25 - floor_position_25) *
        (sorted_data[ceil_position_25] - sorted_data[floor_position_25]))

if position_75.is_integer():
    percentile_value_75 = sorted_data[int(position_75) - 1]
else:
    floor_position_75 = int(np.floor(position_75))
    ceil_position_75 = int(np.ceil(position_75))
    percentile_value_75 = (sorted_data[floor_position_75] +
        (position_75 - floor_position_75) *
        (sorted_data[ceil_position_75] - sorted_data[floor_position_75]))

print(f"25th percentile: {percentile_value_25}")
print(f"75th percentile: {percentile_value_75}")

output:

25th percentile: 26.5


75th percentile: 46.25

In this example, we calculate the 25th and 75th percentiles of the dataset using Python's NumPy library
for sorting and mathematical functions.
Percentiles are crucial statistical measures in data science for understanding data distributions and
making informed decisions. By calculating percentiles, you gain insights into the relative position of
data points within a dataset, helping you analyze the spread and variability of the data.
Q: What is a percentile in statistics?

A: A percentile is a measure used in statistics to indicate a particular position within a dataset. It


represents the value below which a given percentage of observations fall. For example, the 25th
percentile (also known as the first quartile) is the value below which 25% of the data points lie.
Q: How do you calculate the kth percentile?
A: To calculate the kth percentile, follow these steps: Arrange the data in ascending order. Calculate
the desired position using the formula: Position = (k / 100) * (N + 1), where N is the total number of
data points. If the position is an integer, the kth percentile is the value at that position. If the position
is not an integer, the kth percentile is the average of the values at the floor(position) and ceil(position)
positions.
Q: How do you calculate percentiles in Python?

A: You can calculate percentiles in Python using the numpy library. Here's an example:
import numpy as np

data = [12, 15, 17, 20, 23, 25, 27, 30, 32, 35]
percentile_value = 25

# For the 25th percentile
result = np.percentile(data, percentile_value)
print(f"The {percentile_value}th percentile is: {result}")

output:

The 25th percentile is: 17.75

Q: How do you find the median using percentiles?

A: The median is the 50th percentile. You can find it using the same numpy library:
import numpy as np

data = [12, 15, 17, 20, 23, 25, 27, 30, 32, 35]
median = np.percentile(data, 50)
print(f"The median is: {median}")

output:
The median is: 24.0
Q: What is the interquartile range (IQR)?
A: The interquartile range is a measure of statistical dispersion, based on the difference between the
third quartile (Q3, 75th percentile) and the first quartile (Q1, 25th percentile). It gives an idea of how
spread out the middle 50% of the data is.
Q: How do you calculate the IQR in Python?
A: You can calculate the interquartile range using the numpy library:
import numpy as np

data = [12, 15, 17, 20, 23, 25, 27, 30, 32, 35]
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print(f"The interquartile range (IQR) is: {iqr}")
output:
The interquartile range (IQR) is: 11.5
Important Interview Questions and Answers on Data Science - Statistics Percentiles
Q: What is a percentile?
A percentile is a measure used in statistics to describe the relative standing of a particular value
within a dataset. It indicates the percentage of values that are less than or equal to the given value.

Q: How do you calculate the nth percentile?


To calculate the nth percentile, you can follow these steps:
1. Arrange the data in ascending order.
2. Compute the index (position) of the desired percentile using the formula: index = (percentile /
100) * (N + 1), where N is the total number of data points.
3. If the index is an integer, the nth percentile is the value at that index. If the index is not an integer,
take the weighted average of the values at the integer part of the index and the next higher index.

Q: What is the median? How is it related to the 50th percentile?


The median is the value that separates a dataset into two equal halves when the data is ordered. It'salso
the 50th percentile, as it's the value below which 50% of the data falls.

Q: How do you interpret the 25th and 75th percentiles, also known as the first and third quartiles?
The 25th percentile (Q1) is the value below which 25% of the data falls. The 75th percentile (Q3) is the
value below which 75% of the data falls. The interquartile range (IQR) is the difference between the
third and first quartiles (IQR = Q3 - Q1), which provides a measure of the spread of the middle 50%
of the data.
Q: How do you handle outliers when calculating percentiles?
Outliers can significantly affect percentile calculations. One approach to handle outliers is to use the
"trimmed" dataset (removing the extreme values) to calculate percentiles. Another approach isto use
robust estimators like the median, which is less sensitive to outliers compared to the mean.

Example Code: Calculating Percentiles in Python

import numpy as np

# Sample dataset
data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40]

# Calculate the 25th and 75th percentiles (first and third quartiles)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
print("Q1 (25th percentile):", q1)
print("Q3 (75th percentile):", q3)

output:
Q1 (25th percentile): 18.5
Q3 (75th percentile): 29.5
In this example, the numpy library's percentile function is used to calculate the desired percentiles.

4.8 Random Numbers:

Example:
● Suppose we choose a random sample from the population of live births and look up the percentile
rank of their birth weights in the CDF, which is shown in Fig 4.3.
● To compute the CDF of birth weights, use the following code:
weights = live.totalwgt_lb
cdf = thinkstats2.Cdf(weights, label='totalwgt_lb')
● Then we generate a sample and compute the percentile rank of each value in the sample.
sample = np.random.choice(weights, 100, replace=True)
ranks = [cdf.PercentileRank(x) for x in sample]
● sample is a random sample of 100 birth weights, chosen with replacement; that is, the same
value could be chosen more than once. ranks is a list of percentile ranks.
● Finally we make and plot the Cdf of the percentile ranks.
rank_cdf = thinkstats2.Cdf(ranks)
thinkplot.Cdf(rank_cdf)
thinkplot.Show(xlabel='percentile rank', ylabel='CDF')

Fig 4.3: CDF of Percentile Rank


The CDF is approximately a straight line, which means that the distribution is uniform
4.9 Comparing Percentile Ranks
● Percentile ranks are useful for comparing measurements across different groups. Example:
people who compete in foot races are usually grouped by age and gender. To compare people in
different age groups, you can convert race times to percentile ranks.
● A few years ago Sam ran the James Joyce Ramble 10K in Dedham MA; Sam finished in
42:44, which was 97th in a field of 1633. Sam beat or tied 1537 runners out of 1633, so Sam's
percentile rank in the field was 94%.
● More generally, given position and field size, we can compute percentile rank:
def PositionToPercentile(position, field_size):
    beat = field_size - position + 1
    percentile = 100.0 * beat / field_size
    return percentile
➔ In Sam's age group, denoted M4049 for "males between 40 and 49 years of age", Sam came in 26th
out of 256. So Sam's percentile rank in the age group was 90%.
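
Applying the function to the race example above (the definition is repeated here so the snippet runs
on its own):

def PositionToPercentile(position, field_size):
    beat = field_size - position + 1
    percentile = 100.0 * beat / field_size
    return percentile

print(round(PositionToPercentile(97, 1633), 1))   # overall field: about 94.1
print(round(PositionToPercentile(26, 256), 1))    # M4049 age group: about 90.2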
Modeling distributions:
Exponential distribution:
The exponential distribution represents the time until a continuous, random event occurs. In the
context of reliability engineering, this distribution is employed to model the lifespan of a device or
system before it fails. This information aids in maintenance planning and ensuring uninterrupted
operation.
The time intervals between successive earthquakes in a certain region can be accurately modeled by
an exponential distribution. This is especially true when these events occur randomly over time, but
the probability of them happening in a particular time frame is constant.
Normal distribution:
The normal distribution, characterized by its bell-shaped curve, is prevalent in various natural
phenomena. For instance, IQ scores in a population tend to follow a normal distribution. This allows
psychologists and educators to understand the distribution of intelligence levels and make informed
decisions regarding education programs and interventions.
Heights of adult males in a given population often exhibit a normal distribution. In such a scenario,
most men tend to cluster around the average height, with fewer individuals being exceptionally tall or
short. This means that the majority fall within one standard deviation of the mean, while a smaller
percentage deviates further from the average.
Lognormal distribution:
The log normal distribution describes a random variable whose logarithm is normally distributed. In
finance, this distribution is applied to model the prices of financial assets, such as stocks.
Understanding the log normal distribution is crucial for making informed investment decisions.
The distribution of wealth among individuals in an economy often follows a lognormal
distribution. This means that when the logarithm of wealth is considered, the resulting values
tend to cluster around a central point, reflecting the skewed nature of wealth distribution in
many societies.
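
As a small sketch of these three families (the parameter values below are chosen only for illustration),
NumPy can draw samples from each of them:

import numpy as np

rng = np.random.default_rng(0)

# exponential: time until a random event, e.g. waiting time between earthquakes
waits = rng.exponential(scale=2.0, size=10_000)

# normal: values cluster symmetrically around the mean, e.g. adult male heights
heights = rng.normal(loc=175, scale=7, size=10_000)

# lognormal: the logarithm of the variable is normally distributed, e.g. wealth
wealth = rng.lognormal(mean=10, sigma=1, size=10_000)

print(round(waits.mean(), 2))          # close to the scale parameter, 2.0
print(round(heights.std(), 2))         # close to 7
print(round(np.log(wealth).std(), 2))  # close to sigma = 1, since log(wealth) is normal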
UNIT V
Time Series Analysis

Time series analysis – Importing and cleaning, Plotting, Moving averages, Missing values, Serial
correlation, Autocorrelation; Predictive modeling – Overview, Evaluating predictive models,
Building predictive model solutions, Sentiment analysis

5.1 Introduction to Time series analysis


A time series is a sequence of measurements from a system that varies in time. One famous
example is the “hockey stick graph” that shows global average temperature over time.
Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over a
set period of time rather than just recording the data points intermittently or randomly. However,
this type of analysis is not merely the act of collecting data over time.
Time is a crucial variable in time series analysis, because it shows how the data adjusts over the
course of the data points as well as the final results. It provides an additional source of information
and a set order of dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis
can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers
and can account for seasonal variance. Additionally, time series data can be used for forecasting—
predicting future data based on historical data.

Time series analysis examples


Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading algorithms.
Likewise, time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow’s weather report to future years of climate change.
Examples of time series analysis in action include:
• Weather data
• Rainfall measurements
• Temperature readings
• Heart rate monitoring (EKG)
• Brain monitoring (EEG)
• Quarterly sales
• Stock prices
• Automated stock trading
• Industry forecasts
• Interest rates

Different Models used in time series analysis include:


• Classification: Identifies and assigns categories to the data.
• Curve fitting: Plots the data along a curve to study the relationships of variables within the
data.
• Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or seasonal
variation.
• Explanative analysis: Attempts to understand the data and the relationships within it, as
well as cause and effect.
• Exploratory analysis: Highlights the main characteristics of the time series data, usually in
a visual format.
• Forecasting: Predicts future data. This type is based on historical trends. It uses the
historical data as a model for future data, predicting scenarios that could happen along
future plot points.
• Intervention analysis: Studies how an event can change the data.
• Segmentation: Splits the data into segments to show the underlying properties of the source
information.
Time series data can be classified into two main categories:
• Stock time series data means measuring attributes at a certain point in time, like a static
snapshot of the information as it was.
• Flow time series data means measuring the activity of the attributes over a certain period,
which is generally part of the total whole and makes up a portion of the results.

Important Considerations for Time Series Analysis


While time series data is data collected over time, there are different types of data that describe
how and when that time data was recorded. For example:
• Time series data is data that is recorded over consistent intervals of time.
• Cross-sectional data consists of several variables recorded at the same time.
• Pooled data is a combination of both time series data and cross-sectional data.

5.2 Importing and cleaning


When running python programs, we need to use datasets for data analysis. Python has various
modules which help us in importing the external data in various file formats to a python program.
This example shows how to import data of various formats to a python program.

5.2.1 Importing Data


Import csv file
The csv module enables us to read each of the row in the file using a comma as a delimiter. We
first open the file in read only mode and then assign the delimiter. Finally use a for loop to read
each row from the csv file.

import csv

with open("E:\customers.csv", 'r') as custfile:
    rows = csv.reader(custfile, delimiter=',')
    for r in rows:
        print(r)

Running the above code gives us the following result


['customerID', 'gender', 'Contract', 'PaperlessBilling', 'Churn']
['7590-VHVEG', 'Female', 'Month-to-month', 'Yes', 'No']
['5575-GNVDE', 'Male', 'One year', 'No', 'No']
['3668-QPYBK', 'Male', 'Month-to-month', 'Yes', 'Yes']
['7795-CFOCW', 'Male', 'One year', 'No', 'No']

5.2.1.1 Steps to Import a CSV File into Python using Pandas


Step 1: Capture the File Path
Firstly, capture the full path where the CSV file is stored.
For example, let’s suppose that a CSV file is stored under the following path:
C:\Users\Ron\Desktop\Clients.csv
Step 2: Apply the Python code
Type/copy the following code into Python, while making the necessary changes to the path.
Here is the code for our example
import pandas as pd
df = pd.read_csv (r'C:\Users\Ron\Desktop\Clients.csv')
print (df)
Step 3: Run the Code
Finally, run the Python code and we get:

   Person Name  Country  Product   Purchase Price
0  Jon          Japan    Computer  $800
1  Bill         US       Printer   $450
2  Maria        Brazil   Laptop    $150
3  Rita         UK       Computer  $1,200
4  Jack         Spain    Printer   $150
5  Ron          China    Computer  $1,200

5.2.2 Cleaning Data


Data cleaning means fixing bad data in the data set.
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates

5.2.2.1 Cleaning Empty Cells : Empty cells can potentially give the wrong result when we analyze
data.

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Consider the following data set

Duration Date Pulse Maxpulse Calories


0 60 '2020/12/01' 110 130 409.1
1 60 '2020/12/02' 117 145 479.0
2 60 '2020/12/03' 103 135 340.0
3 45 '2020/12/04' 109 175 282.4
4 45 '2020/12/05' 117 148 406.0
5 60 '2020/12/06' 102 127 300.0
6 60 '2020/12/07' 110 136 374.0
7 450 '2020/12/08' 104 134 253.3
8 30 '2020/12/09' 109 133 195.1
9 60 '2020/12/10' 98 124 269.0
10 60 '2020/12/11' 103 147 329.3
11 60 '2020/12/12' 100 120 250.7
12 60 '2020/12/12' 100 120 250.7
13 60 '2020/12/13' 106 128 345.3
14 60 '2020/12/14' 104 132 379.3
15 60 '2020/12/15' 98 123 275.0
16 60 '2020/12/16' 98 120 215.2
17 60 '2020/12/17' 100 120 300.0
18 45 '2020/12/18' 90 112 NaN
19 60 '2020/12/19' 103 123 323.0
20 45 '2020/12/20' 97 125 243.0
21 60 '2020/12/21' 108 131 364.2
22 45 NaN 100 119 282.0
23 60 '2020/12/23' 130 101 300.0
24 45 '2020/12/24' 105 132 246.0
25 60 '2020/12/25' 102 126 334.5
26 60 2020/12/26 100 120 250.0
27 60 '2020/12/27' 92 118 241.0
28 60 '2020/12/28' 103 132 NaN
29 60 '2020/12/29' 100 132 280.0
30 60 '2020/12/30' 102 129 380.3
31 60 '2020/12/31' 92 115 243.0

Example
Return a new Data Frame with no empty cells:
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())

After executing the above code, rows 18, 22 and 28 have been removed.
By default, the dropna() method returns a new DataFrame and does not change the original.
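If we want to change the original DataFrame in place, or keep the rows and fill the empty cells with a value of our choice, both are supported; a minimal sketch, assuming the same data.csv (the fill value 130 is our own choice, not part of the original data set):

import pandas as pd

df = pd.read_csv('data.csv')

# Remove the rows with empty cells in the original DataFrame itself
df.dropna(inplace=True)

# Alternatively, keep all rows and fill the empty Calories cells with a chosen value
df2 = pd.read_csv('data.csv')
df2['Calories'] = df2['Calories'].fillna(130)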

5.2.2.2 Data of Wrong Format


Cells with data of wrong format can make it difficult, or even impossible, to analyze data.
To fix it, there are two options: remove the rows, or convert all cells in the columns into the same
format.
Convert Into a Correct Format
In our DataFrame, rows 22 and 26 have the wrong format in the 'Date' column. To convert all cells in the
'Date' column into dates, Pandas provides the to_datetime() method.

Example
Convert to date:
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())

After running the above code, the following sample output is produced.

Duration Date Pulse Maxpulse Calories


0 60 2020-12-01 110 130 409.1
1 60 2020-12-02 117 145 479.0
2 60 2020-12-03 103 135 340.0
3 45 2020-12-04 109 175 282.4
4 45 2020-12-05 117 148 406.0
5 60 2020-12-06 102 127 300.0
6 60 2020-12-07 110 136 374.0
7 450 2020-12-08 104 134 253.3
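After the conversion, the empty Date in row 22 becomes NaT (Not a Time), which pandas treats as a missing value; a short follow-up sketch, assuming we simply want to remove that row:

# Remove rows whose 'Date' could not be converted
df.dropna(subset=['Date'], inplace=True)
print(df.to_string())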

5.2.2.3 Fixing Wrong Data


Wrong Data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if
someone registered "199" instead of "1.99".
Sometimes the wrong data can be identified by looking at the data set, because we have an
expectation of what it should be.
If you take a look at our data set, you can see that in row 7 the duration is 450, while for all the other
rows the duration is between 30 and 60.
It doesn't have to be wrong, but considering that this is the data set of someone's workout sessions,
we can conclude that this person did not work out for 450 minutes.

Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of "450", and we
could just insert "45" in row 7:

Example
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
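Editing single cells like this works for small data sets, but for larger data sets it is more practical to replace values by a rule; a minimal sketch, assuming we treat any Duration above 120 minutes as a typo and cap it at 120 (the threshold is our own choice for illustration):

for x in df.index:
    if df.loc[x, 'Duration'] > 120:
        df.loc[x, 'Duration'] = 120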

5.2.2.4 Removing Duplicates


Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
In the above test data set, we can assume that row 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row:

Example
Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())
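Once the duplicates are discovered, they can be removed with the drop_duplicates() method; a minimal sketch:

# Remove duplicate rows from the original DataFrame
df.drop_duplicates(inplace=True)
print(df.duplicated())   # now returns False for every remaining row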

Cleaning the data using Pandas and NumPy


The following code reads a dataset of reported cannabis transactions (mj-clean.csv) into a pandas DataFrame:
transactions = pandas.read_csv('mj-clean.csv', parse_dates=[5])
parse_dates tells read_csv to interpret values in column 5 as dates and convert them
to NumPy datetime64 objects.
The DataFrame has a row for each reported transaction and the following columns:
City: string city name
State: two-letter state abbreviation
Price: price paid in dollars
Amount: quantity purchased in grams
Quality: high, medium, or low quality, as reported by the purchaser
Date: date of report, presumed to be shortly after date of purchase
Ppg: price per gram in dollars
State.name: string state name
Lat: approximate latitude of the transaction, based on city name
Lon: approximate longitude of the transaction

Each transaction is an event in time, so we could treat this dataset as a time series. But
the events are not equally spaced in time; the number of transactions reported each day
varies from 0 to several hundred.
In order to demonstrate these methods, let's divide the dataset into groups by reported
quality, and then transform each group into an equally spaced series by computing the
mean daily price per gram.

def GroupByQualityAndDay(transactions):
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)
    return dailies

groupby is a DataFrame method that returns a GroupBy object, groups; used in a for
loop, it iterates the names of the groups and the DataFrames that represent them. Since
the values of quality are low, medium, and high, we get three groups with those names.
The loop iterates through the groups and calls GroupByDay, which computes the daily
average price and returns a new DataFrame:
def GroupByDay(transactions, func=np.mean):
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(func)
    daily['date'] = daily.index
    start = daily.date[0]
    one_year = np.timedelta64(1, 'Y')
    daily['years'] = (daily.date - start) / one_year
    return daily

The parameter, transactions, is a DataFrame that contains columns date and ppg. We select these
two columns, then group by date. The result, grouped, is a map from each date to a DataFrame that
contains prices reported on that date. aggregate is a GroupBy method that iterates through the
groups and applies a function to each column of the group; in this case there is only one column,
ppg. So the result of aggregate is a DataFrame with one row for each date and one column, ppg.

Dates in these DataFrames are stored as NumPy datetime64 objects, which are represented as 64-
bit integers in nanoseconds. For some of the analyses coming up, it will be convenient to work
with time in more human-friendly units, like years.
So GroupByDay adds a column named date by copying the index, then adds years, which contains
the number of years since the first transaction as a floating-point number.
The resulting DataFrame has columns ppg, date, and years.
5.3 Plotting the Data
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library, to visualize the diagram on the screen.
Example
Import pyplot from Matplotlib and visualize our DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
Running the above code displays a line plot of all the numeric columns in the DataFrame.
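Besides plotting every numeric column at once, plot() can also draw a specific chart type for chosen columns; a small sketch, again assuming data.csv:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

# Scatter plot of workout duration against calories burned
df.plot(kind='scatter', x='Duration', y='Calories')
plt.show()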

Plotting the time series data


The result from GroupByQualityAndDay is a map from each quality to a DataFrame of daily
prices. Here’s the code to plot the three time series:

thinkplot.PrePlot(rows=3)
for i, (name, daily) in enumerate(dailies.items()):
    thinkplot.SubPlot(i + 1)
    title = 'price per gram ($)' if i == 0 else ''
    thinkplot.Config(ylim=[0, 20], title=title)
    thinkplot.Scatter(daily.index, daily.ppg, s=10, label=name)
    if i == 2:
        pyplot.xticks(rotation=30)
    else:
        thinkplot.Config(xticks=[])

PrePlot with rows=3 means that we are planning to make three subplots laid out in three rows. The
loop iterates through the DataFrames and creates a scatter plot for each, as shown in Fig 5.1.
It is common to plot time series with line segments between the points, but in this case there are
many data points and prices are highly variable, so adding lines would not help. Since the labels
on the x-axis are dates, we use pyplot.xticks to rotate the “ticks” 30 degrees, making them more
readable.

Figure 5-1 Time series of daily price per gram for high, medium, and low quality
cannabis
One apparent feature in these plots is a gap around November 2013. It’s possible that data
collection was not active during this time, or the data might not be available.

5.4 Moving averages


Most time series analysis is based on the modeling assumption that the observed series is the sum
of three components:
Trend: a smooth function that captures persistent changes
Seasonality: periodic variation, possibly including daily, weekly, monthly, or yearly cycles
Noise: random variation around the long-term trend

Regression is one way to extract the trend from a series, as we saw in the previous section. But if
the trend is not a simple function, a good alternative is a moving average. A moving average divides
the series into overlapping regions, called windows, and computes the average of the values in each
window, as shown in Fig 5.2.

One of the simplest moving averages is the rolling mean, which computes the mean of the values
in each window. For example, if the window size is 3, the rolling mean computes the mean of
values 0 through 2, 1 through 3, 2 through 4, etc.
pandas provides rolling-window operations through the rolling() method, which takes a window size;
calling mean() on the result gives a new Series (older pandas versions exposed this as
pandas.rolling_mean).
>>> series = pandas.Series(np.arange(10))
>>> series.values
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> series.rolling(3).mean().values
array([nan, nan, 1., 2., 3., 4., 5., 6., 7., 8.])
The first two values are nan; the next value is the mean of the first three elements, 0, 1, and 2. The
next value is the mean of 1, 2, and 3. And so on.
Before we can apply rolling_mean to the cannabis data, we have to deal with missing values. There
are a few days in the observed interval with no reported transactions for one or more quality
categories, and a period in 2013 when data collection was not active. In the DataFrames we have
used so far, these dates are absent; the index skips days with no data. For the analysis that follows,
we need to represent this missing data explicitly. We can do that by “reindexing” the DataFrame:
dates = pandas.date_range(daily.index.min(), daily.index.max())
reindexed = daily.reindex(dates)

The first line computes a date range that includes every day from the beginning to the end of the
observed interval. The second line creates a new DataFrame with all of the data from daily, but
including rows for all dates, filled with nan.

Now we can plot the rolling mean like this:


roll_mean = reindexed.ppg.rolling(30).mean()   # older pandas: pandas.rolling_mean(reindexed.ppg, 30)
thinkplot.Plot(roll_mean.index, roll_mean)

The window size is 30, so each value in roll_mean is the mean of 30 values from reindexed.ppg.
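The right panel of Figure 5-2 shows an exponentially-weighted moving average (EWMA), which gives more weight to recent values and produces no gap at the start of the series; a minimal sketch using the ewm() method (older pandas versions exposed this as pandas.ewma):

# Exponentially-weighted moving average with a span of 30 days
ewma = reindexed.ppg.ewm(span=30).mean()
thinkplot.Plot(ewma.index, ewma)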

Figure 5-2 Daily price and a rolling mean (left) and exponentially-weighted
moving average (right)

5.5 Missing values:

Datasets where information is collected along with timestamps in an orderly fashion are called
time-series data. If you have missing values in time-series data, you can try any of the previously
discussed methods, but there are also a few methods specific to time series which can be used here.

To get an idea, I’ll create a simple dummy dataset.

import numpy as np
import pandas as pd

time = pd.date_range("1/01/2021", periods=10, freq="W")
df = pd.DataFrame(index=time)
df["Units sold"] = [5.0, 4.0, np.nan, np.nan, 1.0, np.nan, 3.0, 6.0, np.nan, 2.0]
print(df)

Let’s move on to the methods

Forward-fill missing values


The last valid value before the gap is carried forward to fill the missing value; 'ffill' stands for
'forward fill'. It is very easy to implement: you just have to pass the "method" parameter as "ffill"
in the fillna() function.

forward_filled=df.fillna(method='ffill')
print(forward_filled)

Backward-fill missing values


Here, we use the next valid value after the gap to fill the missing value; 'bfill' stands for
'backward fill'. You need to pass 'bfill' as the method parameter.

backward_filled=df.fillna(method='bfill')
print(backward_filled)

Compare the two outputs to spot the difference between the two methods.

Linear Interpolation
Time series data has a lot of variation, and imputing with backward fill or forward fill is not always
the best possible solution. Linear interpolation to the rescue!

Here, missing values are filled with linearly incrementing or decrementing values. It is an imputation
technique that fits a linear relationship between data points and uses the available non-null values
to compute the missing points.

interpolated=df.interpolate(limit_direction="both")
print(interpolated)
Compare these values to the backward-fill and forward-fill results and check for yourself which works better.

These are some basic ways of handling missing values in time-series data

Algorithms robust to missing values

There are some cases where none of the above works well, yet you still need to do an analysis. Then
you should opt for algorithms that support missing values. KNN (K-nearest neighbours) imputation is
one such approach: it fills a missing value using the values of the K most similar rows, for example
their mean or majority value. Random forests are also fairly robust to data with missing values, and
many decision-tree-based algorithms such as XGBoost and CatBoost support data with missing values.
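As an illustration of neighbour-based imputation, scikit-learn provides a KNNImputer; a minimal sketch, assuming scikit-learn is installed (the small two-column frame below is invented purely for illustration, since KNN needs at least one other feature to find neighbours):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# A small two-column example so that the nearest neighbours are meaningful
data = pd.DataFrame({'Units sold': [5.0, 4.0, np.nan, 6.0, 2.0],
                     'Price': [10, 12, 11, 9, 15]})

# Fill each missing value using the average of the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(filled)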

5.6 Serial correlation


Serial correlation occurs in a time series when a variable and a lagged version of itself (for instance
a variable at times T and at T-1) are observed to be correlated with one another over periods of
time. Repeating patterns often show serial correlation when the level of a variable affects its future
level. In finance, this correlation is used by technical analysts to determine how well the past price
of a security predicts the future price.
To compute serial correlation, we can shift the time series by an interval called a lag, and then
compute the correlation of the shifted series with the original:
def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    corr = thinkstats2.Corr(xs, ys)
    return corr
After the shift, the first lag values are nan, so use a slice to remove them before computing Corr.

If we apply SerialCorr to the raw price data with lag 1, we find serial correlation 0.48 for the high
quality category, 0.16 for medium and 0.10 for low. In any time series with a long-term trend, we
expect to see strong serial correlations; for example, if prices are falling, we expect to see values
above the mean in the first half of the series and values below the mean in the second half.
It is more useful to check whether the correlation persists if you subtract away the trend. For example,
we can compute the residual of the exponentially-weighted moving average (EWMA) and then compute its
serial correlation:
ewma = reindexed.ppg.ewm(span=30).mean()   # older pandas: pandas.ewma(reindexed.ppg, span=30)
resid = reindexed.ppg - ewma
corr = SerialCorr(resid, 1)

With lag=1, the serial correlations for the de-trended data are -0.022 for high quality, -0.015 for
medium, and 0.036 for low. These values are small, indicating that there is little or no one-day
serial correlation in this series.
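pandas also offers a built-in shortcut for the same computation; a small sketch, assuming resid is the de-trended series computed above:

# Pearson correlation between resid and resid shifted by one day
corr = resid.autocorr(lag=1)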

5.7 Autocorrelation
Autocorrelation is a mathematical representation of the degree of similarity between a given time
series and a lagged version of itself over successive time intervals shown in Fig 5.3. It's
conceptually similar to the correlation between two different time series, but autocorrelation uses
the same time series twice: once in its original form and once lagged one or more time periods.
● Autocorrelation represents the degree of similarity between a given time series and a lagged
version of itself over successive time intervals.
● Autocorrelation measures the relationship between a variable's current value and its past
values.
● An autocorrelation of +1 represents a perfect positive correlation, while an autocorrelation
of negative 1 represents a perfect negative correlation.
● Technical analysts can use autocorrelation to measure how much influence past prices for
a security have on its future price.
The autocorrelation function is a function that maps from lag to the serial correlation with the
given lag. “Autocorrelation” is another name for serial correlation, used more often when the lag
is not 1.
StatsModels, which we used for linear regression in “StatsModels”, also provides functions for
time series analysis, including acf, which computes the autocorrelation function:
import statsmodels.tsa.stattools as smtsa
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)

acf computes serial correlations with lags from 0 through nlags. The unbiased flag (called adjusted in newer StatsModels releases) tells acf to
correct the estimates for the sample size. The result is an array of correlations. If we select daily
prices for high quality, and extract correlations for lags 1, 7, 30, and 365, we can confirm that acf
and SerialCorr yield approximately the same results:
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
1.000, -0.029, 0.020, 0.014, 0.044
With lag=0, acf computes the correlation of the series with itself, which is always 1.

Figure 5.3 Autocorrelation function for daily prices (left), and daily prices with a
simulated weekly seasonality (right)
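StatsModels can also draw the autocorrelation function directly, which gives a quick way to produce a plot similar to Figure 5.3; a minimal sketch, assuming the same de-trended residuals:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# plot_acf does not accept missing values, so drop them first
plot_acf(filled.resid.dropna(), lags=365)
plt.show()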
5.8 Predictive modeling
Predictive modeling is a statistical technique using machine learning and data mining to predict
and forecast likely future outcomes with the aid of historical and existing data. It works by
analyzing current and historical data and projecting what it learns on a model generated to forecast
likely outcomes. Predictive modeling can be used to predict just about anything, from TV ratings
and a customer’s next purchase to credit risks and corporate earnings.

A predictive model is not fixed; it is validated or revised regularly to incorporate changes in the
underlying data. In other words, it’s not a one-and-done prediction. Predictive models make
assumptions based on what has happened in the past and what is happening now. If incoming, new
data shows changes in what is happening now, the impact on the likely future outcome must be
recalculated, too.
Predictive analytics tools use a variety of vetted models and algorithms that can be applied to a
wide spread of use cases.

The top five predictive analytics models are:


1. Classification model: Considered the simplest model, it categorizes data for simple and direct
query response. An example use case would be to answer the question “Is this a fraudulent
transaction?”
2. Clustering model: This model nests data together by common attributes. It works by
grouping things or people with shared characteristics or behaviors and plans strategies for
each group at a larger scale. An example is in determining credit risk for a loan applicant
based on what other people in the same or a similar situation did in the past.
3. Forecast model: This is a very popular model, and it works on anything with a numerical
value based on learning from historical data. For example, in answering how much lettuce a
restaurant should order next week or how many calls a customer support agent should be able
to handle per day or week, the system looks back to historical data.
4. Outliers model: This model works by analyzing abnormal or outlying data points. For
example, a bank might use an outlier model to identify fraud by asking whether a transaction
is outside of the customer’s normal buying habits or whether an expense in a given category
is normal or not. For example, a $1,000 credit card charge for a washer and dryer in the
cardholder’s preferred big box store would not be alarming, but $1,000 spent on designer
clothing in a location where the customer has never charged other items might be indicative
of a breached account.
5. Time series model: This model evaluates a sequence of data points based on time. For
example, the number of stroke patients admitted to the hospital in the last four months is used
to predict how many patients the hospital might expect to admit next week, next month or the
rest of the year. A single metric measured and compared over time is thus more meaningful
than a simple average.

5.8.1 Evaluating predictive models


There are two categories of problems that a predictive model can solve, depending on the business
question: classification and regression problems. Classification involves predicting which category
a sample falls into, while regression involves predicting a quantity. These two categories are the
starting point for a data science team when choosing the right metrics and then determining a good
working model.

Classification Problems
A classification problem is about predicting what category something falls into. An example of a
classification problem is analyzing medical data to determine if a patient is in a high risk group for
a certain disease or not. Metrics that can be used for evaluating a classification model include:
Percent correct classification (PCC): measures overall accuracy. Every error has the same weight.

Confusion matrix: also measures accuracy but distinguishes between error types, i.e. false positives,
false negatives and correct predictions.

Both of these metrics are good to use when every data entry needs to be scored. For example, if
every customer who visits a website needs to be shown customized content based on their browsing
behavior, every visitor will need to be categorized.
The following are further measures used in classification problems:
Area Under the ROC Curve (AUC-ROC): one of the most widely used metrics for evaluation. It is popular
because it measures how well the model ranks positive predictions higher than negative ones. Also,
the ROC curve is independent of changes in the proportion of responders.

Lift and Gain charts: both charts measure the effectiveness of a model by calculating the ratio
between the results obtained with and without the predictive model. In other words, these metrics
examine whether using the predictive model has any positive effect.
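All of these classification metrics are available in common analytics libraries; a minimal sketch using scikit-learn, where the labels, predictions, and scores are invented purely for illustration and would normally come from a trained classifier:

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # actual classes
y_pred = [0, 1, 1, 1, 0, 0]                 # predicted classes
y_scores = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]   # predicted probabilities of class 1

print(accuracy_score(y_true, y_pred))       # percent correct classification
print(confusion_matrix(y_true, y_pred))     # false positives/negatives vs. correct
print(roc_auc_score(y_true, y_scores))      # area under the ROC curve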

Regression Problems
A regression problem is about predicting a quantity. A simple example of a regression problem is
prediction of the selling price of a real estate property based on its attributes (location, square
meters available, condition, etc.).
To evaluate how good your regression model is, you can use the following metrics:
R-squared: indicates the proportion of the variance in the target variable that the model explains.
R-squared does not take into consideration any biases that might be present in the data. Therefore, a
good model might have a low R-squared value, or a model that does not fit the data might have a high
R-squared value.
Average error: the numerical difference between the predicted value and the actual value.

Mean Square Error (MSE): squares the errors before averaging, so large errors are penalized heavily;
it is therefore sensitive to outliers in the data.
Median error: the median of all differences between the predicted and the actual values.
Average absolute error: similar to the average error, only you use the absolute value of the
differences so that positive and negative errors do not cancel out; all individual differences have
equal weight, so big outliers can affect the final evaluation of the model.
Median absolute error: the median of the absolute differences between prediction and actual
observation; because the median is used, large outliers have little effect on the final evaluation of
the model.
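The regression metrics can be computed the same way; a small sketch with scikit-learn, using invented selling prices and predictions purely for illustration:

from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, median_absolute_error)

y_true = [250000, 310000, 180000, 420000]   # actual selling prices
y_pred = [245000, 330000, 175000, 400000]   # model predictions

print(r2_score(y_true, y_pred))              # R-squared
print(mean_squared_error(y_true, y_pred))    # mean square error
print(mean_absolute_error(y_true, y_pred))   # average absolute error
print(median_absolute_error(y_true, y_pred)) # median absolute error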

5.9 Building predictive model solutions


Predictive analytics uses a variety of statistical, data mining, and game-theory techniques to analyze
current and historical facts and make predictions about future events. Predictive analytics enables
organizations to understand who their customers and prospects are, how to up-sell and cross-sell
products and services, and how to anticipate customer behavior.

Six Steps to Use and Develop Predictive Models


In addition to data and statistical expertise, predictive model builders and users need strong
knowledge of an organization's business operations and the industry in which the organization
competes; they are usually highly skilled analytics personnel. The following six steps will help to
develop and use predictive models.

1. Scope and define the predictive analytics model you want to build. In this step, determine what
business processes will be analyzed and what the desired business outcomes are, such as the adoption
of a product by a certain segment of customers.
2. Explore and profile your data. Predictive analytics is data-intensive. In this step we need to
determine the needed data, where it’s stored, whether it’s readily accessible, and its current state.
3. Gather, cleanse and integrate the data. Once we know where the necessary data is located, we may
need to clean it. Build the model from a consistent and comprehensive set of information that is
ready to be analyzed.
4. Build the predictive model. Establish the hypothesis and then build the test model. The goal is
to include, and rule out, different variables and factors and then test the model using historical data
to see if the results produced by the model prove the hypothesis.
5. Incorporate analytics into business processes. To make the model valuable, we need to
integrate it into the business process so it can be used to help achieve the outcome.
6. Monitor the model and measure the business results. We live and market in a dynamic
environment, where buying, competition and other factors change. You will need to monitor the
model and measure how effective it is at continuing to produce the desired outcome. It may be
necessary to make adjustments and fine tune the model as conditions evolve.

5.10 Sentiment analysis


Sentiment analysis is the process of using natural language processing, text analysis, and statistics
to analyze customer sentiment. The best businesses understand the sentiment of their customers—
what people are saying, how they’re saying it, and what they mean. Customer sentiment can be
found in tweets, comments, reviews, or other places where people mention the brand. Sentiment
Analysis is the domain of understanding these emotions with software, and it’s a must-understand
for developers and business leaders in a modern workplace.

Businesses can use insights from sentiment analysis to improve their products, fine-tune marketing
messages, correct misconceptions, and identify positive influencers. Social media has
revolutionized the way people make decisions about products and services. In markets like travel,
hospitality, and consumer electronics, customer reviews are now considered to be at least as
important as evaluations by professional reviewers.
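As a simple illustration of the idea, a rule-based scorer such as NLTK's VADER can assign a polarity score to a piece of customer text; a minimal sketch, assuming NLTK is installed and using an invented example review:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')        # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
review = "The hotel staff were friendly, but the room was not clean."
print(sia.polarity_scores(review))    # scores for neg, neu, pos and compound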
