Foundation of Data Science
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.
Facets of data
In data science and big data you’ll come across many different types of data, and each of them tends to require
different tools and techniques. The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Let’s explore all these interesting data types.
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it's often easy to store structured data in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
Unstructured data
Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
Machine-generated data
Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention.
Machine-generated data is becoming a major data resource and will continue to do so.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
The data science process
1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw
form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different
kinds of errors in the data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It is now
that you attempt to gain the insights or make the predictions stated in your project charter. Now is the
time to bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still need
to convince the business that your findings will indeed change the business process as expected. This
is where you can shine in your influencer role. The importance of this step is more apparent in projects
on a strategic and tactical level. Certain projects require you to perform the business process over and
over again, so automating the project will save time.
Retrieving data
The next step in data science is to retrieve the required data. Sometimes you need to go into the field
and design a data collection process yourself, but most of the time you won’t be involved in this step.
Many companies will have already collected and stored the data for you, and what they don’t have can
often be bought from third parties.
More and more organizations are making even high-quality data freely available for public and
commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a database. The objective
now is acquiring all the data you need.
Most companies have a program for maintaining key data, so much of the cleaning work may already
be done. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
Data warehouses and data marts are home to preprocessed data, whereas data lakes contain data in its natural or raw format.
Finding data even within your own company can sometimes be a challenge. As companies grow, their data becomes scattered around many places, and the data may be dispersed as people change positions and leave the company.
Getting access to data is another difficult task. Organizations understand the value and sensitivity of
data and often have policies in place so everyone has access to what they need and nothing more.
These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries.
External Data
If data isn’t available inside your organization, look outside your organizations. Companies provide
data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter,
LinkedIn, and Facebook.
More and more governments and organizations share their data for free with the world.
A list of open data providers that should get you started.
Cleansing data
Data cleansing is a sub process of the data science process that focuses on removing errors in your data so
your data becomes a true and consistent representation of the processes it originates from.
The first type is the interpretation error, such as when you take the value in your data for granted, like
saying that a person’s age is greater than 300 years.
The second type of error points to inconsistencies between data sources or against your company’s
standardized values.
An example of this class of errors is putting “Female” in one table and “F” in another when they represent
the same thing: that the person is female.
Sometimes you'll use more advanced methods, such as simple modeling, to find and identify data errors; diagnostic plots can be especially insightful. For example, in the figure we use a measure to identify data points that seem out of place. We do a regression to get acquainted with the data and detect the influence of individual observations on the regression line.
Data collected by machines or computers isn't free from errors. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
Detecting data errors when the variables you study don’t have many classes can be done by tabulating
the data with counts.
When you have a variable that can take only two values: “Good” and “Bad”, you can create a
frequency table and see if those are truly the only two values present. In the table, the values "Godo" and "Bade" indicate that something went wrong in at least 16 cases.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
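As a minimal sketch (assuming the values live in a hypothetical pandas Series called quality), tabulating the counts makes mistyped categories such as "Godo" and "Bade" stand out immediately:

import pandas as pd

# Hypothetical column containing a few mistyped categories
quality = pd.Series(["Good", "Bad", "Godo", "Good", "Bade", "Bad", "Good"])
print(quality.value_counts())  # frequency table: "Godo" and "Bade" show up as extra classes

# Fix the typos with simple replacement rules
quality = quality.replace({"Godo": "Good", "Bade": "Bad"})
print(quality.value_counts())  # now only "Good" and "Bad" remain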
Redundant Whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
A stray whitespace causes a mismatch between strings such as "FR " and "FR", and observations that cannot be matched may be dropped.
If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will remove the leading and trailing
whitespaces. For instance, in Python you can use the strip() function to remove leading and trailing
spaces.
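A minimal sketch of this fix, assuming a hypothetical pandas Series of country codes; plain Python's strip() and pandas' str.strip() both remove leading and trailing whitespace:

import pandas as pd

print(" FR ".strip())  # plain Python: returns 'FR'

codes = pd.Series(["FR ", " FR", "FR"])
print(codes.str.strip().unique())  # all three entries become 'FR' and now match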
Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest way to
find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper
side when a normal distribution is expected.
Dealing with Missing Values
Missing values aren’t necessarily wrong, but you still need to handle them separately; certain modeling
techniques can’t handle missing values. They might be an indicator that something went wrong in your data
collection or that an error happened in the ETL process. Common techniques data scientists use include omitting the observations, setting the missing value to a default or to the mean, and imputing an estimated value.
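A brief, hedged sketch of these options using pandas, on a small hypothetical DataFrame (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [40000, 52000, np.nan]})

print(df.dropna())           # option 1: omit observations with missing values
print(df.fillna(0))          # option 2: set missing values to a default value
print(df.fillna(df.mean()))  # option 3: impute each column's mean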
Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
Joining Tables
Joining tables allows you to combine the information of one observation found in one table with the
information that you find in another table. The focus is on enriching a single observation.
Let’s say that the first table contains information about the purchases of a customer and the other table
contains information about the region where your customer lives.
Joining the tables allows you to combine the information so that you can use it for your model, as
shown in figure.
Appending Tables
Appending or stacking tables is effectively adding observations from one table to another table.
One table contains the observations from the month January and the second table contains
observations from the month February. The result of appending these tables is a larger one with the
observations from January as well as February.
Figure. Appending data from tables is a common operation but requires an equal structure in the tables being appended.
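A minimal pandas sketch of both operations, with hypothetical customer and month tables:

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2], "amount": [30, 45]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})
enriched = purchases.merge(regions, on="customer_id")  # joining: enrich each observation
print(enriched)

january = pd.DataFrame({"customer_id": [1, 2], "amount": [30, 45]})
february = pd.DataFrame({"customer_id": [3, 4], "amount": [20, 55]})
stacked = pd.concat([january, february], ignore_index=True)  # appending: stack observations
print(stacked)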
Transforming data
Certain models require their data to be in a certain shape, so you transform your data so that it takes a suitable form for data modeling.
Relationships between an input variable and an output variable aren't always linear. Take, for instance, a relationship of the form y = ae^(bx). Taking the logarithm of y turns this into ln y = ln a + bx, which is linear in x and simplifies the estimation problem dramatically. Other times you might want to combine two variables into a new variable.
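A small sketch with synthetic data (the values of a and b are illustrative) showing how the log transform turns the exponential relationship into a straight-line fit that np.polyfit can estimate:

import numpy as np

x = np.linspace(0, 5, 50)
y = 2 * np.exp(0.5 * x)  # synthetic data following y = a * e^(b*x) with a = 2, b = 0.5

# ln y = ln a + b*x, so a straight-line fit on (x, ln y) recovers b and ln a
b, log_a = np.polyfit(x, np.log(y), 1)
print(b, np.exp(log_a))  # approximately 0.5 and 2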
Reducing the Number of Variables
Having too many variables in your model makes the model difficult to handle, and certain techniques
don’t perform well when you overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
Data scientists use special methods to reduce the number of variables but retain the maximum amount
of data.
Figure shows how reducing the number of variables makes it easier to understand the key values. It also
shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% +
component2 = 22.8%). These variables, called “component1” and “component2,” are both combinations of
the original variables. They’re the principal components of the underlying data structure
Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.
In this case you’ll make separate columns for the classes stored in one variable and indicate it with 1 if
the class is present and 0 otherwise.
An example is turning one column named Weekdays into the columns Monday through Sunday. You
use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
Turning variables into dummies is a technique that’s used in modeling and is popular with, but not
exclusive to, economists.
Figure. Turning variables into dummies is a data transformation that breaks a variable that has multiple
classes into multiple variables, each having only two possible values: 0 or 1
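A minimal sketch using pandas' get_dummies on a hypothetical weekday column:

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})
dummies = pd.get_dummies(df["weekday"])  # one 0/1 column per class
print(dummies)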
During exploratory data analysis you take a deep dive into the data (see figure below). Information
becomes much easier to grasp when shown in a picture, therefore you mainly use graphical techniques to
gain an understanding of your data and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.
The visualization techniques you use in this phase range from simple line graphs or histograms, as shown in the figure below, to more complex diagrams such as Sankey and network graphs.
Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make it easier and, let's admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of
exploratory analysis. Even building simple models can be a part of this step.
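As a simple, hedged illustration of this phase, the sketch below draws a histogram of synthetic data with Matplotlib; in a real project you would plot your own variables:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)  # synthetic data for illustration

plt.hist(values, bins=30)  # histogram to inspect the distribution
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()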
Building a model is an iterative process. The way you build your model depends on whether you go with
classic statistics or the somewhat more recent machine learning school, and the type of technique you want to
use. Either way, most models consist of the following main steps:
Selection of a modeling technique and variables to enter in the model
Execution of the model
Diagnosis and model comparison
Model execution
Once you’ve chosen a model you’ll need to implement it in code.
Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular techniques.
Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. As you can see in the following code, it's fairly easy to use linear regression with StatsModels or Scikit-learn.
Doing this yourself would require much more effort even for the simple techniques. The following
listing shows the execution of a linear prediction model.
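That listing is not reproduced here; the following is a minimal sketch in the same spirit, fitting an ordinary least squares model with StatsModels on synthetic data (all variable names are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + rng.normal(0, 1, 100)  # synthetic linear relationship with noise

X = sm.add_constant(x)        # add an intercept term
results = sm.OLS(y, X).fit()  # fit the linear model by ordinary least squares
print(results.params)         # estimated intercept and slope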
Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and average these squared errors over all predictions.
Above figure compares the performance of two models to predict the order size from the price. The first
model is size = 3 * price and the second model is size = 10.
To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
Once the model is trained, we predict the values for the remaining 20% of observations and compare these predictions with the true values using an error measure.
Then we choose the model with the lowest error. In this example we chose model 1 because it has the
lowest total error.
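A hedged, NumPy-only sketch of this holdout procedure, with synthetic price/size data and the two candidate models described above:

import numpy as np

rng = np.random.default_rng(0)
price = rng.uniform(1, 10, 1000)
size = 3 * price + rng.normal(0, 1, 1000)  # synthetic "true" process

idx = rng.permutation(1000)  # random 80/20 holdout split
test = idx[800:]

pred1 = 3 * price[test]           # model 1: size = 3 * price
pred2 = np.full(len(test), 10.0)  # model 2: size = 10

mse1 = np.mean((size[test] - pred1) ** 2)
mse2 = np.mean((size[test] - pred2) ** 2)
print(mse1, mse2)  # choose the model with the lower error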
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.
Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.
This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s
sufficient that you implement only the model scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the
data science process is where your soft skills will be most useful, and yes, they’re extremely important.
Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied
to specific scenarios, such as:
Forecasting: Estimating sales, predicting server loads or server downtime
Risk and probability: Choosing the best customers for targeted mailings, determining the probable
break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
Recommendations: Determining which products are likely to be sold together, generating
recommendations
Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
Grouping: Separating customers or events into clusters of related items, analyzing and predicting
affinities
Building a mining model is part of a larger process that includes everything from asking questions about the
data and creating a model to answer those questions, to deploying the model into a working environment. This
process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.
The first step in the data mining process is to clearly define the problem, and consider ways that data can be
utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by
which the model will be evaluated, and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
What are you looking for? What types of relationships are you trying to find?
Does the problem you are trying to solve reflect the policies or processes of the business?
Do you want to make predictions from the data mining model, or just look for interesting patterns and
associations?
Which outcome or attribute do you want to try to predict?
What kind of data do you have and what kind of information is in each column? If there are multiple
tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to
make the data usable?
How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of
the business?
Preparing Data
The second step in the data mining process is to consolidate and clean the data that was identified in
the Defining the Problem step.
Data can be scattered across a company and stored in different formats, or may contain inconsistencies
such as incorrect or missing entries.
Data cleaning is not just about removing bad data or interpolating missing values, but about finding
hidden correlations in the data, identifying sources of data that are the most accurate, and determining
which columns are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.
Building Models
The mining structure is linked to the source of data, but does not actually contain any data until you process it.
When you process the mining structure, SQL Server Analysis Services generates aggregates and other
statistical information that can be used for analysis. This information can be used by any mining model that is
based on the structure.
Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed
by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad
hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data
consolidations.
Although a data warehouse and a traditional database share some similarities, they are not the same thing.
The main difference is that in a database, data is collected for multiple transactional purposes. However, in a
data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data,
while warehouses store data to be accessed for big analytical queries.
Data Warehousing integrates data and information collected from various sources into one comprehensive
database. For example, a data warehouse might combine customer information from an organization’s point-
of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential
information about employees, salary information, etc. Businesses use such components of a data warehouse to
analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in
vast volumes of data and devising innovative strategies for increased sales and profits.
Data Mart
A data mart is a subset of a data warehouse built to serve a particular department, region, or business unit.
Every department of a business has a central repository or data mart to store data. Data from the data mart is periodically stored in the operational data store (ODS). The ODS then sends the data to the enterprise data warehouse (EDW), where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
Setting the research goal—Defining the what, the why, and the how of your project in a project
charter.
Retrieving data—Finding and getting access to data needed in your project. This data is either found
within the company or retrieved from a third party.
Data preparation—Checking and remediating data errors, enriching the data with data from other data
sources, and transforming it into a suitable format for your models.
Data exploration—Diving deeper into your data using descriptive statistics and visual techniques.
Data modeling—Using machine learning and statistical techniques to achieve your project goal.
Presentation and automation—Presenting your results to the stakeholders and industrializing your
analysis process for repetitive reuse and integration with other tools.
Unit – II
DESCRIBING DATA
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing Data with
Averages - Describing Variability - Normal Distributions and Standard (z) Scores
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
The weights can be described not only as quantitative data but also as observations for a quantitative
variable, since the various weights take on different numerical values.
By the same token, the replies can be described as observations for a qualitative variable, since the
replies to the Facebook profile question take on different values of either Yes or No.
Given this perspective, any single observation can be described as a constant, since it takes on only
one value.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous variables have an infinite
number of potential values between any two points. Generally, you measure them using a scale.
Examples of continuous variables include weight, height, length, time, and temperature.
Other examples include durations, such as the reaction times of grade school children to a fire alarm, and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
Independent and Dependent Variables
Independent Variable
In an experiment, an independent variable is the treatment manipulated by the investigator.
Independent variables (IVs) are the ones that you include in the model to explain or predict changes in
the dependent variable.
Independent indicates that they stand alone and other variables in the model do not influence them.
Independent variables are also known as predictors, factors, treatment variables, explanatory variables,
input variables, x-variables, and right-hand variables—because they appear on the right side of the
equals sign in a regression equation.
It is a variable that stands alone and isn't changed by the other variables you are trying to measure.
For example, someone's age might be an independent variable: other factors (such as what they eat, how much they go to school, or how much television they watch) are not going to change a person's age.
The impartial creation of distinct groups, which differ only in terms of the independent variable, has a most
desirable consequence. Once the data have been collected, any difference between the groups can be
interpreted as being caused by the independent variable.
Dependent Variable
When a variable is believed to have been influenced by the independent variable, it is called a dependent
variable. In an experimental setting, the dependent variable is measured, counted, or recorded by the
investigator.
The dependent variable (DV) is what you want to use the model to explain or predict. The values of
this variable depend on other variables.
It’s also known as the response variable, outcome variable, and left-hand variable. Graphs place
dependent variables on the vertical, or Y, axis.
A dependent variable is exactly what it sounds like: it is something that depends on other factors. For example, a blood sugar reading depends on what food you ate and when you ate it.
Unlike the independent variable, the dependent variable isn’t manipulated by the investigator. Instead, it
represents an outcome: the data produced by the experiment.
Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable.
Sometimes a confounding variable occurs because it’s impossible to assign subjects randomly to different
conditions.
Grouped Data
When observations are sorted into classes of more than one value, according to their frequency of occurrence, the result is referred to as a frequency distribution for grouped data (shown in Table 2.2).
The general structure of this frequency distribution is that the data are grouped into class intervals with 10 possible values each.
The frequency ( f ) column shows the frequency of observations in
each class and, at the bottom, the total number of observations in all classes.
GUIDELINES
OUTLIERS
An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest
of the neighboring co-existing values in a data graph or dataset you're working with.
Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
RELATIVE FREQUENCY DISTRIBUTIONS
Relative frequency distributions show the frequency of each
class as a part or fraction of the total frequency for the entire
distribution.
This type of distribution is especially helpful when you must
compare two or more distributions based on different total
numbers of observations.
The conversion to relative frequencies allows a direct comparison of the shapes of two distributions without having to adjust for the different total numbers of observations.
Percentages or Proportions
Some people prefer to deal with percentages rather than proportions because percentages usually lack
decimal points. A proportion always varies between 0 and 1, whereas a percentage always varies between
0 percent and 100 percent.
To convert the relative frequencies, multiply each proportion by 100; that is, move the decimal point two
places to the right.
Cumulative Percentages
As has been suggested, if relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages
To obtain this cumulative percentage, the cumulative frequency of the class should be divided by the total
frequency of the entire distribution.
Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative
percentages are referred to as percentile ranks.
The percentile rank of a score indicates the percentage of scores in the entire distribution with similar or
smaller values than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights
constitute 80 percent of the entire distribution.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency distribution. And
data can often be described even more vividly by converting frequency distributions into graphs.
Figure: Histogram
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons may
be constructed directly from frequency distributions.
Stem and Leaf Displays
Another technique for summarizing quantitative data is a stem and leaf display. Stem and leaf displays are
ideal for summarizing distributions, such as that for weight data, without destroying the identities of
individual observations.
For example
Enter each raw score into the stem and leaf display. As suggested by the shaded coding in Table 2.9, the first
raw score of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of 193 reappears as a leaf of 3 on
a stem of 19, and the third raw score of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each raw
score reappears as a leaf on its appropriate stem.
TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape. Below figure shows some of the more typical shapes for
smoothed frequency polygons (which ignore the inevitable irregularities of real data).
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
As with histograms, equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data. Likewise, equal segments along
the vertical axis reflect increases in frequency. The body of the bar graph consists of a series of bars
whose heights reflect the frequencies for the various words or classes.
A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not some
impossible intermediate value, such as 40 percent Yes and 60 percent No.
Gaps are placed between adjacent bars of bar graphs to emphasize the discontinuous nature of
qualitative data.
MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point of view.
Popular sayings include “Numbers don’t lie, but statisticians do” and “There are three kinds of lies—lies, damned lies, and statistics.”
Describing Data with Averages
MODE
The mode reflects the value of the most frequently occurring score.
In other words
A mode is defined as the value that has the highest frequency in a given set of values. It is the value that appears the greatest number of times.
Example:
In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it has appeared in the set twice.
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
When there are two modes in a data set, then the set is called bimodal
For example, the mode of Set A = {2,2,2,3,4,4,5,5,5} is 2 and 5, because both 2 and 5 are repeated three times in the given set.
When there are three modes in a data set, then the set is called trimodal
For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8
When there are four or more modes in a data set, then the set is called multimodal
Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the
mode of the given set of data.
It can be seen that 2 wickets were taken by the bowler frequently in different matches. Hence, the mode of the
given data is 2.
MEDIAN
The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves.
Example 1:
4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution:
n = 15
When we put those numbers in order we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
Since n is odd, the median is the middle, (15 + 1)/2 = 8th, value: 24.
Example 2:
Find the median of the following:
9,7,2,11,18,12,6,4
Solution
n = 8
When we put those numbers in order we have:
2, 4, 6, 7, 9, 11, 12, 18
Since n is even, the median is the average of the 4th and 5th values: (7 + 9)/2 = 8.
MEAN
The mean is found by adding all scores and then dividing by the number of scores.
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers by the total
number of numbers.
Types of means
Sample mean
Population mean
Sample Mean
The sample mean is a measure of central tendency. It is the arithmetic average computed from a sample of values taken from the population: the sum of all sample values divided by the number of observations in the sample, x̄ = Σx / n.
Population Mean
The population mean is calculated as the sum of all values in the given population divided by the total number of values in the population: μ = ΣX / N.
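A quick sketch using Python's standard statistics module, reusing the small data set from the mode example above:

import statistics

data = [2, 4, 5, 5, 6, 7]
print(statistics.mean(data))    # arithmetic mean, about 4.83
print(statistics.median(data))  # middle value of the ordered data, 5.0
print(statistics.mode(data))    # most frequent value, 5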
Describing Variability
RANGE
The range is the difference between the largest and smallest scores.
The range in statistics for a given data set is the difference between the highest and lowest values. For
example, if the given data set is {2,5,8,10,3}, then the range will be 10 – 2 = 8.
Example 1: Find the range of the given observations: 32, 41, 28, 54, 35, 26, 23, 33, 38, 40. The largest value is 54 and the smallest is 23, so the range = 54 − 23 = 31.
VARIANCE
Variance is a measure of how data points differ from the mean. A variance is a measure of how far a set of
data (numbers) are spread out from their mean (average) value.
Formula
Variance = (Standard deviation)² = σ² = Σ(x − μ)² / n
that is, the squared deviations from the mean are summed and then divided by the total number of scores.
Example
X = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7
Solution
Mean = Σx / n
n = 10
Σx = 5 + 8 + 6 + 10 + 12 + 9 + 11 + 10 + 12 + 7 = 90
Mean: μ = 90 / 10 = 9
Deviation from mean
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2
(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4
Σ(x − μ)² = 16 + 1 + 9 + 1 + 9 + 0 + 4 + 1 + 9 + 4 = 54
σ² = Σ(x − μ)² / n
=54/10
= 5.4
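The same calculation can be checked with NumPy (np.var divides by n, matching the population formula used above):

import numpy as np

x = np.array([5, 8, 6, 10, 12, 9, 11, 10, 12, 7])
print(np.mean(x))  # 9.0
print(np.var(x))   # 5.4, the population variance
print(np.std(x))   # about 2.32, the square root of the variance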
STANDARD DEVIATION
The standard deviation is the square root of the mean of all squared deviations from the mean; that is,
Standard deviation = √variance
Standard Deviation: A rough measure of the average (or standard) amount by which scores deviate on either side of their mean.
“The sum of squares equals the sum of all squared deviation scores.” You can reconstruct this formula by
remembering the following three steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)2, to eliminate negative signs.
3. Sum all squared deviation scores, Σ (X − μ)2.
Sum of Squares Formulas for Sample
Sample notation can be substituted for population notation in the above two formulas without causing any essential changes; for a sample, the sum of squares is SS = Σ(X − X̄)².
DEGREES OF FREEDOM (df)
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more
mathematical restrictions, in a sample being used to estimate a population characteristic.
Degrees of freedom are the number of independent values that can vary in a statistical analysis. These values are free to vary, although they do impose restrictions on other values if the data set is to comply with the estimated parameters.
Degrees of Freedom (df ) The number of values free to vary, given one or more mathematical
restrictions.
Formula
Degree of freedom df = n-1
Example
Consider a data set that consists of five positive integers, where the sum of the five integers must be a multiple of 6. The first four values are chosen freely as 3, 8, 5, and 4.
The sum of these four values is 20, so the fifth integer must be chosen to make the sum divisible by 6. Here the fifth element is 10 (giving a total of 30), so only the first four values were free to vary and df = 5 − 1 = 4.
In fact, we can use degrees of freedom (df = n − 1) in the denominator to rewrite the formulas for the sample variance and standard deviation:
s² = Σ(X − X̄)² / (n − 1) = SS / df
s = √s²
The interquartile range (IQR), is simply the range for the middle 50 percent of the scores. More specifically,
the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th
percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent)
have been trimmed from the original set of scores. Since most distributions are spread more widely in their
extremities than their middle, the IQR tends to be less than half the size of the range.
Simply, The IQR describes the middle 50% of values when ordered from lowest to highest. To find the
interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These
values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
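A minimal NumPy sketch (the scores are illustrative); np.percentile gives Q1 and Q3, and their difference is the IQR:

import numpy as np

scores = np.array([2, 5, 8, 10, 3, 7, 12, 4, 9, 6])
q1, q3 = np.percentile(scores, [25, 75])
print(q1, q3, q3 - q1)  # the IQR is the distance between Q3 and Q1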
Normal Distributions and Standard (z) Scores
THE NORMAL CURVE
The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean,
so the right side of the center is a mirror image of the left side.
Different Normal Curves
As a theoretical exercise, it is instructive to note the various types of normal curves that are produced
by an arbitrary change in the value of either the mean (μ) or the standard deviation (σ).
Obvious differences in appearance among normal curves are less important than you might suspect.
Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way
once any distance from the mean is expressed in standard deviation units.
z SCORES
A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how
many standard deviations a score is above or below the mean of its distribution.
A z score can be defined as a measure of the number of standard deviations by which a score is below or
above the mean of a distribution. In other words, it is used to determine the distance of a score from the mean.
If the z score is positive it indicates that the score is above the mean. If it is negative then the score will be
below the mean. However, if the z score is 0 it denotes that the data point is the same as the mean.
To obtain a z score, express any original score, whether measured in inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this deviation into standard deviation units (by dividing by its standard deviation):
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation, respectively, for the normal distribution of the original scores. Since identical units of measurement appear in both the numerator and denominator of the ratio for z, the original units of measurement cancel each other and the z score emerges as a unit-free or standardized number, often referred to as a standard score.
Converting to z Scores
Example
Suppose on a GRE test a score of 1100 is obtained. The mean score for the GRE test is 1026 and the population standard deviation is 209. In order to find how well a person scored with respect to the score of an average test taker, the z score has to be determined:
z = (1100 − 1026) / 209 = 74 / 209 ≈ 0.35
so the score is about 0.35 standard deviations above the mean.
Although there is an infinite number of different normal curves, each with its own mean and standard
deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.
Figure: the standard normal curve, with mean = 0 and standard deviation = 1.
Given a z score of zero or more, columns B and C indicate how the z score splits the area in the upper half of
the normal curve. As suggested by the shading in the top legend, column B indicates the proportion of area
between the mean and the z score, and column C indicates the proportion of area beyond the z score, in the
upper tail of the standard normal curve.
FINDING PROPORTIONS
Finding Proportions for One Score
Sketch a normal curve and shade in the target area.
Plan your solution according to the normal table.
Convert X to z (a code sketch of this procedure follows these steps).
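As a sketch of this procedure in code (using SciPy's normal distribution instead of Table A, and the GRE figures from the earlier example):

from scipy.stats import norm

X, mu, sigma = 1100, 1026, 209
z = (X - mu) / sigma    # convert X to z
print(z)                # about 0.35
print(norm.cdf(z))      # proportion of area below X (about 0.64)
print(1 - norm.cdf(z))  # proportion of area beyond X, in the upper tail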
FINDING SCORES
So far, we have concentrated on normal curve problems for which Table A must be consulted to find
the unknown proportion (of area) associated with some known score or pair of known scores
Now we will concentrate on the opposite type of normal curve problem for which Table A must be
consulted to find the unknown score or scores associated with some known proportion.
This type of problem requires that we reverse our use of Table A by entering proportions in columns B, C, B′, or C′ and finding z scores listed in columns A or A′.
It’s often helpful to visualize the target score as splitting the total area into two sectors—one to the left of
(below) the target score and one to the right of (above) the target score
When converting z scores to original scores, you will probably find it more efficient to use the following equation: X = μ + (z)(σ)
Finding Two Scores
Sketch a normal curve. On either side of the mean, draw two lines representing the two target scores,
as in figure
Points to Remember
1. range = largest value – smallest value in a list
2. class interval = range / desired no of classes
3. relative frequency = frequency (f) / Σf
4. Cumulative frequency - add to the frequency of each class the sum of the frequencies of all classes ranked below it.
5. Cumulative percentage = (cumulative f / total f) × 100
6. Histograms
7. Construction of frequency polygon
8. Stem and leaf display
9. Mode - The value of the most frequent score.
10. For odd no of terms Median = {(n+1)/2}th term / observation. For even no of terms Median
= 1/2[(n/2)th term + {(n/2)+1}th term ]
11. Mean = sum of all scores / number of scores
Variance σ² = Σ(x − μ)² / n
Variance = (Standard deviation)² = σ²
12. Range (X) = Max (X) – Min (X)
13. Degree of freedom df = n-1
14. Types of normal curve
15. z – score
16. Standard normal curve; mean = 0, standard deviation = 1
17. Finding proportion
Two scores
Unit – III
DESCRIBING RELATIONSHIPS
Correlation – Scatter plots – correlation coefficient for quantitative data – computational formula for correlation
coefficient – Regression – regression line – least squares regression line – Standard error of estimate –
interpretation of r2 – multiple regression equations – regression towards the mean
Correlation
Correlation refers to a process for establishing the relationships between two variables. You learned a way to
get a general idea about whether or not two variables are related, is to plot them on a “scatter plot”. While there
are many measures of association for variables which are measured at the ordinal or higher level of
measurement, correlation is the most commonly used approach.
Types of Correlation
Positive Correlation – when the values of the two variables move in the same direction so that an
increase/decrease in the value of one variable is followed by an increase/decrease in the value of the
other variable.
Negative Correlation – when the values of the two variables move in the opposite direction so that an
increase/decrease in the value of one variable is followed by decrease/increase in the value of the other
variable.
No Correlation – when there is no linear dependence or no relation between the two variables.
SCATTERPLOTS
A scatter plot is a graph containing a cluster of dots that represents all pairs of scores. In other words
Scatter plots are the graphs that present the relationship between two variables in a data-set. It represents data
points on a two-dimensional plane or on a Cartesian system.
The first step is to note the tilt or slope, if any, of a dot cluster.
A dot cluster that has a slope from the lower left to the upper right, as in panel A of below figure reflects a
positive relationship.
A dot cluster that has a slope from the upper left to the lower right, as in panel B of below figure reflects a
negative relationship.
A dot cluster that lacks any apparent slope, as in panel C of below figure reflects little or no relationship.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between
two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and, therefore, reflects a linear
relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in
below figure, and therefore reflects a curvilinear relationship.
A CORRELATION COEFFICIENT FOR QUANTITATIVE DATA : r
The correlation coefficient, r, is a summary measure that describes the extent of the statistical
relationship between two interval or ratio level variables.
Properties of r
The correlation coefficient is scaled so that it is always between -1 and +1.
When r is close to 0 this means that there is little relationship between the variables and the farther away
from 0 r is, in either the positive or negative direction, the greater the relationship between the two
variables.
The sign of r indicates the type of linear relationship, whether positive or negative.
The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign
indicates a negative relationship
r = SPxy / √(SSx × SSy)
where the two sum of squares terms in the denominator are defined as SSx = Σ(X − X̄)² and SSy = Σ(Y − Ȳ)², and the sum of products term in the numerator is SPxy = Σ(X − X̄)(Y − Ȳ).
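A quick NumPy check, using the x and y values from the regression example later in this unit (np.corrcoef returns the correlation matrix; the off-diagonal entry is r):

import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(r)                     # close to +1, a strong positive relationship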
REGRESSION
A regression is a statistical technique that relates a dependent variable to one or more independent
(explanatory) variables. A regression model is able to show whether changes observed in the dependent variable
are associated with changes in one or more of the explanatory variables.
Regression captures the correlation between variables observed in a data set, and quantifies whether
those correlations are statistically significant or not.
A Regression Line
A regression line is a line that best describes the behaviour of a set of data. In other words, it's a line that best fits the trend of the given data.
The purpose of the line is to describe the interrelation of a
dependent variable (Y variable) with one or many
independent variables (X variable). By using the equation
obtained from the regression line an analyst can forecast
future behaviours of the dependent variable by inputting
different values for the independent ones.
Types of regression
The two basic types of regression are
Simple linear regression
Simple linear regression uses one independent variable to
explain or predict the outcome of the dependent variable Y
Multiple linear regression
Multiple linear regression uses two or more independent variables to predict the outcome.
Predictive Errors
Prediction error refers to the difference between the predicted values made by some model and the
actual values.
Formula
b = [N Σ(xy) − Σx Σy] / [N Σ(x²) − (Σx)²]
a = [Σy − b Σx] / N
Example
"x" "y"
2 4
3 5
5 7
7 10
9 15
Here N = 5, Σx = 26, Σy = 41, Σxy = 263, and Σx² = 168, so:
b = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = (1315 − 1066) / (840 − 676)
  = 249 / 164
b = 1.518
Step 5: y’ = bx+a
y’ = 1.518x + 0.305
x    y     y′ = 1.518x + 0.305    error (y′ − y)
2    4      3.34                  −0.66
3    5      4.86                  −0.14
5    7      7.89                   0.89
7    10    10.93                   0.93
9    15    13.97                  −1.03
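The same slope and intercept can be checked with np.polyfit, which fits the least squares line directly:

import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])

b, a = np.polyfit(x, y, 1)  # slope b and intercept a of the least squares line
print(b, a)                 # about 1.518 and 0.305, matching the hand calculation
print(b * x + a)            # predicted values y'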
STANDARD ERROR OF ESTIMATE
The standard error of estimate is a rough measure of the average amount of predictive error, that is, the typical distance of the observed Y values from the regression line.
Example
Calculate the standard error of estimate for the given X and Y values. X = 1, 2, 3, 4, 5; Y = 2, 4, 5, 4, 5
Solution
Create five columns labeled x, y, y′, y − y′, and (y − y′)², with N = 5.
b = [N Σ(xy) − Σx Σy] / [N Σ(x²) − (Σx)²]
  = (5 × 66 − 15 × 20) / (5 × 55 − 15²)
  = (330 − 300) / (275 − 225)
b = 30 / 50 = 0.6
a = (Σy − b Σx) / N
  = (20 − 0.6 × 15) / 5
  = (20 − 9) / 5
a = 11 / 5 = 2.2
The regression line is therefore y′ = 0.6x + 2.2. The predicted values are 2.8, 3.4, 4.0, 4.6, 5.2, the residuals y − y′ are −0.8, 0.6, 1.0, −0.6, −0.2, and Σ(y − y′)² = 2.4.
Standard error of estimate (s y|x) = √[Σ(y − y′)² / (n − 2)] = √(2.4 / 3) = 0.894
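A short NumPy check of this result, computing the residuals about the fitted line and then the standard error of estimate:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

y_pred = 0.6 * x + 2.2  # regression line from the example
residuals = y - y_pred
see = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
print(see)              # about 0.894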
INTERPRETATION OF r²
R-Squared (R² or the coefficient of determination) is a statistical measure in a regression model that
determines the proportion of variance in the dependent variable that can be explained by the independent
variable. In other words, r-squared shows how well the data fit the regression model (the goodness of fit).
R-squared can take any values between 0 to 1. Although the statistical measure provides some useful
insights regarding the regression model, the user should not rely only on the measure in the assessment of a
statistical model.
In addition, it does not indicate the correctness of the regression model. Therefore, the user should
always draw conclusions about the model by analyzing r-squared together with the other variables in a
statistical model.
The most common interpretation of r-squared is how well the regression model explains observed data.
Example:
A researcher decides to study students’ performance from a school over a period of time. He observed that as
the lectures proceed to operate online, the performance of students started to decline as well. The parameters for
the dependent variable “decrease in performance” are various independent variables like “lack of attention,
more internet addiction, neglecting studies” and much more.
REGRESSION TOWARD THE MEAN
Regression toward the mean refers to a tendency for scores, particularly extreme scores, to shrink toward the mean when measured again.
Example
A military commander has two units return, one with 20% casualties and another with 50% casualties. He praises the first and berates the second. The next time, the two units return with the opposite results. From this experience, he “learns” that praise weakens performance and berating increases performance. In fact, both results simply reflect regression toward the mean: unusually good or bad outcomes tend to be followed by more ordinary ones, regardless of any praise or blame.
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean logic – fancy
indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data
– missing data – Hierarchical indexing – combining datasets – aggregation and grouping – pivot tables
NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data
buffers. NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient
storage and data operations as the arrays grow larger in size.
Example
import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
Array Indexing:
Accessing Single Elements
In a one-dimensional array, you can access the ith value (counting from zero) by specifying the
desired index in square brackets, just as with Python lists
To index from the end of the array, you can use negative indices
In a multidimensional array, you access items using a comma-separated tuple of indices
Unlike Python lists, NumPy arrays have a fixed type. This means, for example, that if you attempt to
insert a floating-point value to an integer array, the value will be silently truncated.
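A short sketch of these rules (assuming numpy has been imported as np, as in the example above):

x1 = np.random.randint(10, size=6)       # one-dimensional array
print(x1[0], x1[-1])                     # first element and last element (negative index)

x2 = np.random.randint(10, size=(3, 4))  # two-dimensional array
print(x2[0, 0])                          # row 0, column 0

x1[0] = 3.14159                          # assigning a float into an integer array...
print(x1[0])                             # ...silently truncates it to 3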
x[start:stop:step]
start – the index where the slice starts
stop – the index where the slice ends (this index itself is not included)
step – the stride between successive indices
These default to start=0, stop=size of dimension, step=1.
Example
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
When the step value is negative, the defaults for start and stop are swapped. This becomes a convenient way to reverse an array.
x[::-1] # all elements, reversed
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
x2[:2, :3] # two rows, three columns
array([[12, 5, 2],
[ 7, 6, 8]])
Reshaping of Arrays
The most flexible way of doing this is with the reshape() method. For example, if you want to put the
numbers 1 through 9 in a 3×3 grid, you can do the following
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]
Concatenation of arrays
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid]) # concatenate along the first axis
array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])
np.concatenate([grid, grid], axis=1) # concatenate along the second axis
array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
np.vstack([x, grid])
array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
y = np.array([[99],
[99]])
np.hstack([grid, y])
array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and
np.vsplit. For each of these, we can pass a list of indices giving the split points
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit and np.vsplit are similar
grid = np.arange(16).reshape((4, 4))
grid
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute
repeated operations on values in NumPy arrays. Ufuncs are extremely flexible—before we saw an operation
between a scalar and an array, but we can also operate between two arrays
Ufuncs exist in two flavors: unary ufuncs, which operate on a single input, and binary ufuncs, which operate on
two inputs. We’ll see examples of both these types of functions here.
Array arithmetic
NumPy’s ufuncs make use of Python’s native arithmetic operators. The standard addition, subtraction,
multiplication, and division can all be used.
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
Operator   Equivalent ufunc    Description
//         np.floor_divide     Floor division (e.g., 3 // 2 = 1)
**         np.power            Exponentiation (e.g., 2 ** 3 = 8)
%          np.mod              Modulus/remainder (e.g., 9 % 4 = 1)
Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also understands Python’s built-in absolute
value function.
np.abs()
np.absolute()
x = np.array([-2, -1, 0, 1, 2])
abs(x)
array([2, 1, 0, 1, 2])
The corresponding NumPy ufunc is np.absolute, which is also available under the alias np.abs
np.absolute(x)
array([2, 1, 0, 1, 2])
np.abs(x)
array([2, 1, 0, 1, 2])
Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist are the
trigonometric functions.
np.sin()
np.cos()
np.tan()
inverse trigonometric functions
np.arcsin()
np.arccos()
np.arctan()
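A brief example of these ufuncs on a few angles (again assuming numpy is imported as np):

theta = np.linspace(0, np.pi, 3)  # the angles 0, pi/2, pi
print("sin(theta) =", np.sin(theta))
print("cos(theta) =", np.cos(theta))
print("tan(theta) =", np.tan(theta))

x = [-1, 0, 1]
print("arcsin(x) =", np.arcsin(x))
print("arccos(x) =", np.arccos(x))
print("arctan(x) =", np.arctan(x))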
Exponents and logarithms
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
The inverse of the exponentials, the logarithms, are also available. The basic np.log gives the natural logarithm; if you prefer to compute the base-2 logarithm or the base-10 logarithm, those are available as well.
np.log(x) - is a mathematical function that helps user to calculate Natural logarithm of x where x
belongs to all the input array elements
np.log2(x) - to calculate Base-2 logarithm of x
np.log10(x) - to calculate Base-10 logarithm of x
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
Specialized ufuncs
NumPy has many more ufuncs available like
Hyperbolic trig functions,
Bitwise arithmetic,
Comparison operators,
Conversions from radians to degrees,
Rounding and remainders, and much more
More specialized and obscure ufuncs are available in the submodule scipy.special. If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special.
Gamma function
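For example, the gamma function (a generalized factorial) is available in scipy.special; this sketch assumes SciPy is installed:

from scipy import special

x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))      # gamma function
print("gammaln(x) =", special.gammaln(x))  # natural log of |gamma(x)|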
Aggregates
To reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly
applies a given operation to the elements of an array until only a single result remains.
x = np.arange(1, 6)
np.add.reduce(x)
15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements
np.multiply.reduce(x)
120
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x)
array([ 1, 3, 6, 10, 15])
Outer products
ufunc can compute the output of all pairs of two different inputs using the outer method. This allows you, in one
line, to do things like create a multiplication table.
x = np.arange(1, 6)
np.multiply.outer(x, x)
array([[ 1, 2, 3, 4, 5],
[ 2, 4, 6, 8, 10],
[ 3, 6, 9, 12, 15],
[ 4, 8, 12, 16, 20],
[ 5, 10, 15, 20, 25]])
Example
x=[1,2,3,4]
np.min(x)
1
np.max(x)
4
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column.
By default, each NumPy aggregation function returns the aggregate over the entire array; i.e., np.sum() calculates the sum of all elements of the array.
Example
M = np.random.random((3, 4))
print(M)
M.sum()
6.0850555667307118
Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. The axis typically takes the value 0 or 1: with axis=0 the aggregate is computed down each column, and with axis=1 it is computed across each row.
Example
We can find the minimum value within each column by specifying axis=0
M.min(axis=0)
array([ 0.66859307, 0.03783739, 0.19544769, 0.06682827])
Computation on Arrays: Broadcasting
Broadcasting is a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes. For arrays of the same size, binary operations are performed element by element:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
Broadcasting allows these types of binary operations to be performed on arrays of different sizes. For example, we can just as easily add a scalar to an array:
a + 5
array([5, 6, 7])
We can think of this as an operation that stretches or duplicates the value 5 into the array [5, 5, 5], and adds the
results. The advantage of NumPy’s broadcasting is that this duplication of values does not actually take place.
We can similarly extend this to arrays of higher dimension. Observe the result when we add a one-dimensional
array to a two-dimensional array.
Example
M = np.ones((3, 3))
M
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
M + a
array([[ 1., 2., 3.],
[ 1., 2., 3.],
[ 1., 2., 3.]])
Here the one-dimensional array a is stretched, or broadcast, across the second dimension in order to match the shape of M. In more complicated cases both arrays are stretched: adding a column vector to a row vector broadcasts both to a common shape, and the result is a two-dimensional array.
The light boxes represent the broadcasted values: again, this extra memory is not actually allocated in the
course of the operation, but it can be useful conceptually to imagine that it is.
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays.
• Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer
dimensions is padded with ones on its leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in
that dimension is stretched to match the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
Broadcasting example 1
Let’s look at adding a two-dimensional array to a one-dimensional array:
M = np.ones((2, 3))
a = np.arange(3)
Let’s consider an operation on these two arrays. The shapes of the arrays are:
M.shape = (2, 3)
a.shape = (3,)
We see by rule 1 that the array a has fewer dimensions, so we pad it on the left with ones:
M.shape -> (2, 3)
a.shape -> (1, 3)
By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:
M.shape -> (2, 3)
a.shape -> (2, 3)
The shapes match, and we see that the final shape will be (2, 3):
M+a
array([[ 1., 2., 3.],
[ 1., 2., 3.]])
Broadcasting example 2
Let’s take a look at an example where both arrays need to be broadcast:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
Again, we start by writing out the shapes of the arrays:
a.shape = (3, 1)
b.shape = (3,)
Rule 1 says we must pad the shape of b with ones:
a.shape -> (3, 1)
b.shape -> (1, 3)
And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:
a.shape -> (3, 3)
b.shape -> (3, 3)
Because the result matches, these shapes are compatible. We can see this here:
a+b
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
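Broadcasting example 3 (a minimal sketch illustrating rule 3): if the final shapes disagree in a dimension and neither size is 1, an error is raised.
M = np.ones((3, 2))
a = np.arange(3)
# M.shape = (3, 2); a.shape = (3,) is padded to (1, 3) and stretched to (3, 3).
# The trailing dimensions (2 vs. 3) disagree and neither is 1, so rule 3 applies:
M + a  # raises ValueError: operands could not be broadcast together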
Comparison Operators as ufuncs
NumPy also implements comparison operators such as < (less than) and > (greater than) as element-wise ufuncs. The result of a comparison is always an array with a Boolean data type:
x = np.array([1, 2, 3, 4, 5])
x < 3 # less than
array([ True, True, False, False, False], dtype=bool)
x > 3 # greater than
array([False, False, False, True, True], dtype=bool)
x != 3 # not equal
array([ True, True, False, True, True], dtype=bool)
x == 3 # equal
array([False, False, True, False, False], dtype=bool)
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape. Here is a two-
dimensional example
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])
x < 6
array([[ True, True, True, True],
[False, False, True, True],
[ True, True, False, False]], dtype=bool)
The result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these
Boolean results.
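For example (a short sketch using the two-dimensional array x defined above), Boolean arrays can be counted and tested with ordinary aggregation functions:
print(np.count_nonzero(x < 6))   # how many values are less than 6?
print(np.sum(x < 6))             # equivalent: True is counted as 1
print(np.sum(x < 6, axis=1))     # how many values less than 6 in each row?
print(np.any(x > 8))             # are there any values greater than 8?
print(np.all(x < 10))            # are all values less than 10?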
Boolean operators
Operator Equivalent ufunc
& np.bitwise_and
| np.bitwise_or
^ np.bitwise_xor
~ np.bitwise_not
Example
x<5
array([[False, True, True, True],
[False, False, True, False],
[ True, True, False, False]], dtype=bool)
Masking operation
To select these values from the array, we can simply index on this Boolean array; this is known as a masking
operation.
x[x < 5]
array([0, 3, 3, 3, 2, 4])
What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all
the values in positions at which the mask array is True.
Fancy Indexing
Fancy indexing is like the simple indexing we’ve already seen, but we pass arrays of indices in place of
single scalars. This allows us to very quickly access and modify complicated subsets of an array’s values.
Exploring Fancy Indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at
once.
Types of fancy indexing:
Indexing / accessing multiple values
Array of indices
In multiple dimensions
Standard indexing
Example
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
Array of indices
We can pass a single list or array of indices to obtain the same result.
ind = [3, 7, 4]
x[ind]
array([71, 86, 60])
In multiple dimensions
Fancy indexing also works in multiple dimensions. Consider the following array.
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Standard indexing
Like with standard indexing, the first index refers to the row, and the second to the column.
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
array ([ 2, 5, 11])
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes we’ve
seen.
Example array
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Combine fancy and simple indices
X[2, [2, 0, 1]]
array([10, 8, 9])
We can also combine fancy indexing with slicing:
X[1:, [2, 0, 1]]
array([[ 6, 4, 5],
[10, 8, 9]])
Combine fancy indexing with masking
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
Modifying values with fancy indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array:
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
[ 0 99 99 3 99 5 6 7 99 9]
Using at()
Use the at() method of ufuncs for in-place application of an operator at the specified indices; unlike x[i] += 1, repeated indices are accumulated.
i = [2, 3, 3, 4, 4, 4]
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
[ 0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]
Sorting Arrays
Sorting in NumPy: np.sort and
np.argsort
Python has built-in sort and sorted functions to work with lists; we won't discuss them here because NumPy's np.sort function turns out to be much more efficient and useful for our purposes. By default np.sort uses an O[N log N] quicksort algorithm, though mergesort and heapsort are also available. For most applications, the default quicksort is more than sufficient.
To return a sorted version of an array without modifying the input, you can use np.sort:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
array([1, 2, 3, 4, 5])
A related function is np.argsort, which instead returns the indices of the sorted elements:
i = np.argsort(x)
print(i)
[1 0 3 2 4]
Sorting along rows or columns
A useful feature of NumPy’s sorting algorithms is the ability to sort along specific rows or columns of a
multidimensional array using the axis argument. For example
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
[2 6 7 4 3 7]
[7 2 5 4 1 7]
[5 1 4 0 9 5]]
np.sort(X, axis=0)
array([[2, 1, 4, 0, 1, 5],
[5, 2, 5, 4, 3, 7],
[6, 3, 7, 4, 6, 7],
[7, 6, 7, 4, 9, 9]])
np.sort(X, axis=1)
array([[3, 4, 6, 6, 7, 9],
[2, 3, 4, 6, 7, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 5, 9]])
Partial sorts: partitioning
Sometimes we are not interested in sorting the entire array, but simply want to find the K smallest values. NumPy provides this with np.partition, which takes an array and a number K; the result is a new array with the K smallest values to the left of the partition point:
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in the array, and the remaining array
positions contain the remaining values. Within the two partitions, the elements have arbitrary order.
Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
np.partition(X, 2, axis=1)
array([[3, 4, 6, 7, 6, 9],
[2, 3, 4, 7, 6, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 9, 5]])
Structured Arrays
This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient
storage for compound, heterogeneous data.
NumPy data types
Character Description Example
'b' Byte np.dtype('b')
'i' Signed integer np.dtype('i4') == np.int32
'u' Unsigned integer np.dtype('u1') == np.uint8
'f' Floating point np.dtype('f8') == np.float64
'c' Complex floating point np.dtype('c16') == np.complex128
'S', 'a' String np.dtype('S5')
'U' Unicode string np.dtype('U') == np.str_
'V' Raw data (void) np.dtype('V') == np.void
Consider if we have several categories of data on a number of people (say, name, age, and weight), and we’d
like to store these values for use in a Python program. It would be possible to store these in three separate
arrays.
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
A single structured array with a compound data type can hold all three:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
Boolean masking works too; for example, we can get the names of the people whose age is under 30:
data[data['age'] < 30]['name']
array(['Alice', 'Doug'], dtype='<U10')
Dictionary method
np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
Numerical types can be specified with Python types
np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
List of tuples
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
Finding values
The values are simply a familiar NumPy array
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
Data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1]
0.5
data[1:3]
1 0.50
2 0.75
dtype: float64
Series as generalized NumPy array
The essential difference is that while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be
an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index.
Strings as an index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
We can even use noncontiguous or nonsequential indices.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
2 0.25
5 0.50
3 0.75
7 1.00
dtype: float64
We can make the Series-as-dictionary analogy even more clear by constructing a Series object directly from a
Python dictionary.
For example
sub1={'sai':90,'ram':85,'kasim':92,'tamil':89}
mark=pd.Series(sub1)
mark
sai 90
ram 85
kasim 92
tamil 89
dtype: int64
Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
mark['sai':'kasim']
sai 90
ram 85
kasim 92
Constructing Series objects
data can be a list or NumPy array, in which case index defaults to an integer sequence:
pd.Series([2, 4, 6])
0 2
1 4
2 6
dtype: int64
data can be a scalar, which is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
data can be a dictionary, in which case index defaults to the sorted dictionary keys:
pd.Series({2:'a', 1:'b', 3:'c'})
1 b
2 a
3 c
dtype: object
The index can also be set explicitly, in which case the Series is populated only with the requested keys:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
3 c
2 a
dtype: object
The Pandas DataFrame Object
The DataFrame can be thought of as a sequence of aligned Series objects sharing the same index. To illustrate, define a second dictionary of marks:
sub2={'sai':91,'ram':95,'kasim':89,'tamil':90}
We can use a pair of dictionaries to construct a single two-dimensional object containing this information.
result=pd.DataFrame({'DS':sub1,'FDS':sub2})
result
DS FDS
sai 90 91
ram 85 95
kasim 92 89
tamil 89 90
As in a dictionary, a single column can be selected by name:
result['DS']
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64
Note
In a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return
the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries
rather than generalized arrays, though both ways of looking at the situation can be useful.
Constructing DataFrame objects
From a single Series object. A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
pd.DataFrame(mark, columns=['DS'])
DS
sai 90
ram 85
kasim 92
tamil 89
From a list of dicts. Any list of dictionaries can be made into a DataFrame:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4
Even if some keys in the dictionary are missing, Pandas will fill
them in with NaN (i.e.,“not a number”) values.
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well.
pd.DataFrame({'DS':sub1,'FDS':sub2})
DS FDS
sai 90 91
ram 85 95
kasim 92 89
tamil 89 90
From a two-dimensional NumPy array. Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names:
pd.DataFrame(np.random.rand(3, 2),
columns=['food', 'water'],
index=['a', 'b', 'c'])
food water
a 0.865257 0.213169
b 0.442759 0.108267
c 0.047110 0.905718
From a NumPy structured array:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
The Pandas Index Object
The Index object can be thought of as an immutable array. For example:
ind = pd.Index([2, 3, 5, 7, 11])
ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of
values.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
0.5
Examine the keys/indices and values
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values
i. 'a' in data
True
ii. data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
iii. list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can also be modified with a dictionary-like syntax; just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e'] = 1.25
data
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
Slicing by explicit index
data['a':'c']
a 0.25
b 0.50
c 0.75
dtype: float64
Slicing by implicit integer index
data[0:2]
a 0.25
b 0.50
dtype: float64
Masking
data[(data > 0.3) & (data < 0.8)]
b 0.50
c 0.75
dtype: float64
Fancy indexing
data[['a', 'e']]
a 0.25
e 1.25
dtype: float64
Indexers: loc, iloc, and ix
If a Series has an explicit integer index, an indexing operation such as data[1] will use the explicit index, while a slicing operation like data[1:3] will use the implicit Python-style index. Because of this potential confusion, Pandas provides special indexer attributes. Consider the following Series:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object
loc - the loc attribute allows indexing and slicing that always references the explicit index.
data.loc[1]
'a'
data.loc[1:3]
1 a
3 b
dtype: object
iloc - The iloc attribute allows indexing and slicing that always references the implicit Python-style index.
data.iloc[1]
'b'
data.iloc[1:3]
3 b
5 c
dtype: object
ix - ix is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. (Note that ix has been deprecated and removed in recent versions of Pandas.)
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing
of the column name.
result['DS']
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64
Equivalently, we can use attribute-style access with column names that are strings:
result.DS
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64
This attribute-style column access actually accesses the exact same object as the dictionary-style access:
result.DS is result['DS']
True
Modify the object
Like with the Series objects this dictionary-style syntax can also be used to modify the object, in this case to add
a new column:
result['TOTAL']=result['DS']+result['FDS']
result
DS FDS TOTAL
sai 90 91 181
ram 85 95 180
kasim 92 89 181
tamil 89 90 179
Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the
underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame
index and column labels are maintained in the result
loc
result.loc[:'ram', :'FDS']
DS FDS
sai 90 91
ram 85 95
iloc
result.iloc[:2, :2 ]
DS FDS
sai 90 91
ram 85 95
ix
result.ix[:2, :'FDS']
DS FDS
sai 90 91
ram 85 95
We can also combine masking and fancy indexing in the loc indexer; for example, selecting the DS and FDS columns for the rows whose DS mark is above 89:
result.loc[result.DS > 89, ['DS', 'FDS']]
DS FDS
sai 90 91
kasim 92 89
Modifying values
Indexing conventions may also be used to set or modify values; this is done in the standard way that
you might be accustomed to from working with NumPy.
result.iloc[1, 1] = 70
result
DS FDS TOTAL
sai 90 91 181
ram 85 70 180
kasim 92 89 181
tamil 89 90 179
result['sai':'kasim']
DS FDS TOTAL
sai 90 91 181
ram 85 70 180
kasim 92 89 181
Such slices can also refer to rows by number rather than by index:
result[1:3]
DS FDS TOTAL
ram 85 70 180
kasim 92 89 181
Similarly, direct masking operations are interpreted row-wise rather than column-wise; for example, selecting the rows with a total above 180:
result[result.TOTAL > 180]
DS FDS TOTAL
sai 90 91 181
kasim 92 89 181
Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. We can apply all the arithmetic and other universal functions from NumPy to Pandas objects; in the output, the index is preserved (maintained), as shown below.
For series
x=pd.Series([1,2,3,4])
x
0 1
1 2
2 3
3 4
dtype: int64
For DataFrame
df=pd.DataFrame(np.random.randint(0,10,(3,4)),
columns=['a','b','c','d'])
df
a b c d
0 1 4 1 4
1 8 4 0 4
2 7 7 7 2
Applying a NumPy ufunc such as np.exp to a Series returns another Series with the index preserved; for example, applying np.exp to a Series whose values are 9, 4, 6, and 3 gives:
0 8103.083928
1 54.598150
2 403.428793
3 20.085537
dtype: float64
Similarly, applying a ufunc to the DataFrame df defined above returns a DataFrame whose index and column labels (a, b, c, d) are preserved.
Index Alignment
Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data, as we'll see in the following examples.
Index alignment in Series
Suppose we combine two Series objects, x and y, whose indices only partially overlap:
x + y
1 3.0
2 NaN
3 9.0
4 NaN
5 NaN
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we could determine using
standard Python set arithmetic on these indices.
Any item for which one or the other does not have an entry is marked with NaN, or “Not a Number,” which is
how Pandas marks as missing data.
x.add(y,fill_value=0)
1 3.0
2 3.0
3 9.0
4 7.0
5 6.0
dtype: float64
Index alignment in DataFrames
A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames. Consider two DataFrames, A and B:
A
   A   B
0  1  11
1  5   1
B
   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6
A + B
      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result
are sorted. As was the case with Series, we can use the associated object’s arithmetic method and pass any
desired fill_value to be used in place of missing entries. Here we’ll fill with the mean of all values in A.
fill = A.stack().mean()
A.add(B, fill_value=fill)
A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
Operations between DataFrames and Series
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array. Consider subtracting one row of a two-dimensional array A from the whole array; according to NumPy's broadcasting rules, this happens row-wise:
A - A[0]
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
Handling Missing Data
There are two general strategies for indicating missing data: using a mask that globally indicates missing values, or choosing a sentinel value that marks a missing entry. In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.
The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
NumPy supports fourteen basic integer types once you account for available precisions, signedness, and
endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an
unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new
fork of the NumPy package.
Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.
The first sentinel, None, is a Python object, so it can be used only in arrays of data type 'object':
vals1 = np.array([1, None, 3, 4])
vals1
array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
The other sentinel, NaN, is a special floating-point value:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
dtype('float64')
You should be aware that NaN is a bit like a data virus—it infects any other object it touches. Regardless of the
operation, the result of arithmetic with NaN will be another NaN
1 + np.nan
nan
0 * np.nan
nan
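NumPy does provide some special aggregations that ignore these missing values; a minimal sketch:
vals = np.array([1, np.nan, 3, 4])
np.nansum(vals), np.nanmin(vals), np.nanmax(vals)
# (8.0, 1.0, 4.0)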
x = pd.Series(range(2), dtype=int)
x
0 0
1 1
dtype: int64
x[0] = None
x
0 NaN
1 1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a
NaN value.
Operating on Null Values
Pandas provides several methods for detecting, removing, and replacing null values: isnull(), notnull(), dropna(), and fillna().
isnull()
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
0 False
1 True
2 False
3 True
dtype: bool
notnull()
data.notnull()
0 True
1 False
2 True
3 False
dtype: bool
dropna()
data.dropna()
0 1
2 hello
dtype: object
For a DataFrame, dropna() drops full rows or full columns containing a null value. Consider:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
df.dropna()
0 1 2
1 2.0 3.0 5
df.dropna(axis='columns')
   2
0  2
1  5
2  6
Rows or columns having all null values
You can also specify how='all', which will only drop rows/columns that are all null values.
df[3] = np.nan
df
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
df.dropna(axis='columns', how='all')
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
You can also specify a minimum number of non-null values for the row/column to be kept, via the thresh parameter:
df.dropna(axis='rows', thresh=3)
0 1 2 3
1 2.0 3.0 5 NaN
Filling null values
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64
Fill with a single value
We can fill NA entries with a single value, such as zero:
data.fillna(0)
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64
Fill with previous value
We can specify a forward-fill to propagate the previous value forward:
data.fillna(method='ffill')
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64
Fill with next value
We can specify a back-fill to propagate the next values backward.
data.fillna(method='bfill')
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
Hierarchical Indexing
Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data; that is, data indexed by more than one or two keys.
While Pandas does provide Panel and Panel4D objects that natively handle three-dimensional and four-dimensional data, a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series
and two-dimensional DataFrame objects.
Here we’ll explore the direct creation of MultiIndex objects; considerations around indexing, slicing, and
computing statistics across multiply indexed data; and useful routines for converting between simple and
hierarchically indexed representations of your data.
We can represent two-dimensional data (state populations by year) within a one-dimensional Series by building a MultiIndex from tuples:
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
# creating the multi index
index = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(index)
pop.index.names = ['state', 'year']
With the Series reindexed in this way, we can use familiar slicing notation to access, for example, all data for which the second index is 2010:
pop[:, 2010]
California 37253956
New York 19378102
Texas 25145561
dtype: int64
MultiIndex as extra dimension
We could easily have stored the same data using a simple DataFrame with index and column labels. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame.
pop_df = pop.unstack()
pop_df
2000 2010
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
The stack() method provides the opposite operation:
pop_df.stack()
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Each extra level in a MultiIndex represents an extra dimension of data; for example, we can add another column of demographic data (population under 18) for each state and year:
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
total under18
California 2000 33871648 9267089
2010 37253956 9284094
New York 2000 18976457 4687374
2010 19378102 4318033
Texas 2000 20851820 5906301
2010 25145561 6879014
Universal functions
All the ufuncs and other functionality work with hierarchical indices.
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
2000 2010
California 0.273594 0.249211
New York 0.247010 0.222831
Texas 0.283251 0.273568
Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor.
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
data1 data2
a 1 0.554233 0.356072
2 0.925244 0.219474
b 1 0.441759 0.610054
2 0.171495 0.886688
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('New York', 2000): 18976457,
('New York', 2010): 19378102,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561}
pd.Series(data)
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
Indexing and Slicing a MultiIndex
Multiply indexed Series
Consider the multiply indexed Series of state populations we saw earlier:
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
We can access single elements by indexing with multiple terms:
pop['California', 2000]
33871648
Partial indexing
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index
pop['California']
year
2000 33871648
2010 37253956
dtype: int64
Partial slicing
Partial slicing is available as well, as long as the MultiIndex is sorted.
pop.loc['California':'New York']
State year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype: int64
Sorted indices
With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the
first index
pop[:, 2000]
state
California 33871648
New York 18976457
Texas 20851820
dtype: int64
Other types of indexing and selection work as well; for example, selection based on Boolean masks:
pop[pop > 22000000]
state year
California 2000 33871648
2010 37253956
Texas 2010 25145561
dtype: int64
Selection based on fancy indexing also works:
pop[['California', 'Texas']]
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype: int64
Rearranging Multi-Indices
We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely
control the rearrangement of data between hierarchical indices and columns, and we’ll explore them here.
Sorted and unsorted indices
We'll start by creating some simple multiply indexed data where the indices are not lexicographically sorted; attempting a partial slice of such a series results in an error. Pandas provides a number of convenience routines to perform the required sorting; examples are the sort_index() and sortlevel() methods of the DataFrame. We'll use the simplest, sort_index(), here:
data = data.sort_index()
data
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
c 1 0.741650
2 0.569264
dtype: float64
With the index sorted in this way, partial slicing will work as expected:
data['a':'b']
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
dtype: float64
Stacking and unstacking indices
it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation,
optionally specifying the level to use.
pop.unstack(level=0)
state California New York Texas
year
2000 33871648 18976457 20851820
2010 37253956 19378102 25145561
pop.unstack(level=1)
year 2000 2010
state
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561
The opposite of unstack() is stack(), which here can be used to recover the original series:
pop.unstack().stack()
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype: int64
Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished
with the reset_index method. Calling this on the population dictionary will result in a DataFrame with a state
and year column holding the information that was formerly in the index. For clarity, we can optionally specify
the name of the data for the column representation.
pop_flat = pop.reset_index(name='population')
pop_flat
state year population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
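The opposite operation, building a MultiIndex from column values, is also often useful; a brief sketch using pop_flat from above and the standard set_index method:
pop_flat.set_index(['state', 'year'])
# returns a multiply indexed DataFrame with a single 'population' column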
(Sample rows, for year 2014, of the multiply indexed mock medical DataFrame built from the year/visit row index and subject/type column index defined above; the values are random mock data.)
Combining Datasets
Concat and Append
Simple Concatenation with pd.concat
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of
options that we’ll discuss momentarily
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate()
can be used for simple concatenations of arrays
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1A
2B
3C
4D
5E
6F
dtype: object
Concatenation can also take place column-wise by passing axis=1. For example, given two DataFrames that share a row index:
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df4 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})
print(df3); print(df4); print(pd.concat([df3, df4], axis=1))
    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices,
even if the result will have duplicate indices! Consider this simple example.
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
print(x); print(y); print(pd.concat([x, y]))
    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3
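pd.concat() gives a few ways of handling such repeated indices; a brief sketch using the x and y defined above:
pd.concat([x, y], verify_integrity=True)  # raises ValueError because the indices overlap
pd.concat([x, y], ignore_index=True)      # discards the old indices and creates a new 0..3 index
pd.concat([x, y], keys=['x', 'y'])        # labels the data sources with a hierarchical index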
Categories of Joins
One-to-one joins
Many-to-one joins
Many-to-many joins
One-to-one joins
Perhaps the simplest type of merge is the one-to-one join. As a concrete example, consider the following two DataFrames:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)
df1 df2
employee group employee hire_date
0 Bob Accounting 0 Lisa 2004
1 Jake Engineering 1 Bob 2008
2 Lisa Engineering 2 Jake 2012
3 Sue HR 3 Sue 2014
To combine this information into a single DataFrame, we can use the pd.merge() function
df3 = pd.merge(df1, df2)
df3
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-
one case, the resulting DataFrame will preserve those duplicate entries as appropriate.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4)
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in
both the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps
most clear with a concrete example.
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'], 'skills':
['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
pd.merge(df1, df5)
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
Aggregation and Grouping
Simple Aggregation in Pandas
As with a one-dimensional NumPy array, for a Pandas Series the aggregates return a single value:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
Sum
ser.sum()
2.8119254917081569
Mean
ser.mean()
0.56238509834163142
The same aggregation operations can also be performed on a DataFrame; by default they aggregate within each column.
GroupBy: Split, Apply, Combine
The GroupBy operation can be thought of in three steps:
• The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
• The apply step involves computing some function, usually an aggregate, transformation, or filtering, within
the individual groups.
• The combine step merges the results of these operations into an output array.
Example
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
Column indexing.
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy
object. For example
df=pd.read_csv('D:\iris.csv')
df.groupby('variety')
<pandas.core.groupby.generic.DataFrameGroupBy object at
0x0000023BAADE84C0>
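For example, we can select a single column from the grouped object and then aggregate it; the medians shown in the comment agree with the 50% row of the describe() output under Dispatch methods below:
df.groupby('variety')['petal.length'].median()
# variety
# Setosa        1.50
# Versicolor    4.35
# Virginica     5.55
# Name: petal.length, dtype: float64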
Dispatch methods.
Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through
and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method
of DataFrames to perform a set of aggregations that describe each group in the data.
Example
df.groupby('variety')['petal.length'].describe().unstack()
variety
count Setosa 50.000000
Versicolor 50.000000
Virginica 50.000000
mean Setosa 1.462000
Versicolor 4.260000
Virginica 5.552000
std Setosa 0.173664
Versicolor 0.469911
Virginica 0.551895
min Setosa 1.000000
Versicolor 3.000000
Virginica 4.500000
25% Setosa 1.400000
Versicolor 4.000000
Virginica 5.100000
50% Setosa 1.500000
Versicolor 4.350000
Virginica 5.550000
75% Setosa 1.575000
Versicolor 4.600000
Virginica 5.875000
max Setosa 1.900000
Versicolor 5.100000
Virginica 6.900000
dtype: float64
For the following GroupBy examples we'll use this DataFrame, which pairs a string key with two numeric columns:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
Filtering.
A filtering operation allows you to drop data based on the group properties. For example, we might want to
keep all groups in which the standard deviation is larger than some critical value.
The filter() function should return a Boolean value specifying whether the group passes the filtering.
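A minimal sketch, using the key/data1/data2 DataFrame defined above and filtering on the group-wise standard deviation of data2 (the threshold of 4 is just an illustrative choice):
def filter_func(x):
    # keep only groups whose data2 standard deviation exceeds 4
    return x['data2'].std() > 4

df.groupby('key').filter(filter_func)
# group A (std about 1.4) is dropped; groups B and C are kept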
Transformation.
While aggregation must return a reduced version of the data, transformation can return some transformed
version of the full data to recombine. For such a transformation, the output is the same shape as the input. A
common example is to center the data by subtracting the group-wise mean:
df.groupby('key').transform(lambda x: x - x.mean())
data1 data2
0 -1.5 1.0
1 -1.5 -3.5
2 -1.5 -3.0
3 1.5 -1.0
4 1.5 3.5
5 1.5 3.0
The apply() method.
The apply() method lets you apply an arbitrary function to the group results. The function should take a
DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will
be tailored to the type of output returned.
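A short sketch, again using the key/data1/data2 DataFrame from above: normalize data1 by the group-wise sum of data2 (this particular normalization is only an example):
def norm_by_data2(x):
    # x is a DataFrame containing the rows of one group
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

df.groupby('key').apply(norm_by_data2)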
Pivot Tables
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on
tabular data. The pivot table takes simple column wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data. The difference between pivot
tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a
multidimensional version of GroupBy aggregation. That is, you split, apply, and combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.
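As a hedged sketch of the syntax only (the DataFrame diabetes and the column names below are hypothetical placeholders, chosen to mirror the example output that follows):
# mean of a numeric column for each combination of age (rows) and test class (columns);
# 'diabetes', 'preg', 'age', and 'class' are hypothetical names used purely for illustration
diabetes.pivot_table('preg', index='age', columns='class', aggfunc='mean')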
(Example output: a pivot table computed from a diabetes dataset, with patient age as the row index, the test result class (tested_negative, tested_positive) as the columns, and the mean of a numeric attribute in each cell.)
UNIT V
DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn.
Adjusting the Plot: Line Styles
The line style of a plot can be adjusted with the linestyle keyword; for short, the following codes can be used:
linestyle='-' # solid
linestyle='--' # dashed
linestyle='-.' # dashdot
linestyle=':' # dotted
Linestyle and color codes can be combined into a single non-keyword argument to the plt.plot() function:
plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r'); # dotted red
Axes Limits
The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim()
methods.
Example
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
The plt.axis() method allows you to set the x and y limits with a single call, by passing a list that specifies
[xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5]);
Setting the aspect ratio to 'equal' makes one unit in x equal to one unit in y: plt.axis('equal')
Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title - plt.title()
Label - plt.xlabel()
plt.ylabel()
Legend - plt.legend()
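A minimal sketch putting these labeling calls together (x here is just an assumed NumPy range):
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend();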
Example programs
Line color
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 and 1
plt.plot(x, np.sin(x - 5), color='chartreuse');# all HTML color names supported
Line style
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being
joined by line segments, here the points are represented individually with a dot, circle, or other shape.
Syntax
plt.plot(x, y, 'marker symbol', color='color name');
Example
plt.plot(x, y, 'o', color='black');
The third argument in the function call is a character that represents the type of symbol used for the plotting.
Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of
short string codes.
Example
Various symbols can be used to specify the marker, for example: 'o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd'
plt.plot(x, y, '-ok');
Example
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
plt.ylim(-1.2, 1.2);
Matplotlib provides many named colormaps, grouped into categories, for example:
Diverging
['PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral', ...]
Qualitative
['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3', 'tab10', ...]
Miscellaneous
['flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern', 'gnuplot',
'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv', 'gist_rainbow', 'rainbow', 'jet', ...]
Example programs.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o', color='gray',
markersize=15, linewidth=4,
markerfacecolor='yellow',
markeredgecolor='red',
markeredgewidth=4)
plt.ylim(-1.5, 1.5);
Visualizing Errors
For any scientific measurement, accurate accounting for errors is nearly as important, if not more important, than
accurate reporting of the number itself. For example, imagine that I am using some astrophysical observations to
estimate the Hubble Constant, the local measurement of the expansion rate of the Universe.
In visualization of data and results, showing these errors effectively can make a plot convey much more
complete information.
Types of errors
Basic Errorbars
Continuous Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as
the shorthand used in plt.plot()
In addition to these basic options, the errorbar function has many options to fine tune the outputs. Using
these additional options you can easily customize the aesthetics of your errorbar plot.
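For instance (a short sketch continuing the errorbar example above), the error bars can be made lighter than the points themselves:
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0);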
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does not
have a built-in convenience routine for this type of application, it’s relatively easy to combine primitives
like plt.plot and plt.fill_between for a useful result.
Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a
method of fitting a very flexible nonparametric function to data with a continuous measure of the
uncertainty.
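As a minimal illustration of the idea only (synthetic data with a made-up constant error band, not an actual GPR fit):
import numpy as np
import matplotlib.pyplot as plt
xfit = np.linspace(0, 10, 200)
yfit = np.sin(xfit)
dyfit = 0.3                       # assumed, constant uncertainty for illustration
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                 color='gray', alpha=0.2)
plt.xlim(0, 10);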
Density and Contour Plots
A contour plot can be created with the plt.contour function. It takes three arguments: a grid of x values, a grid of y values, and a grid of z values representing the function to plot. Notice that by default, when a single color is used, negative values are represented by dashed lines and positive values by solid lines.
Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
We'll also specify that we want more lines to be drawn: 20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather than
continuous, which is not always what is desired.
You could remedy this by setting the number of contours to a very high number, but this results in a rather
inefficient plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid of
data as an image.
Example Program
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 5, 0, 5],
origin='lower', cmap='RdGy')
plt.colorbar()
Histograms
Histogram is the simple plot to represent the large data set. A histogram is a graph showing frequency
distributions. It is a graph showing the number of observations within each given interval.
Parameters
plt.hist( ) is used to plot histogram. The hist() function will use an array of numbers to create a histogram,
the array is sent into the function as an argument.
bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted
as a bar whose height corresponds to how many data points are in that bin. Bins are also sometimes called
"intervals", "classes", or "buckets".
normed/density - when True, the histogram is normalized so that it represents a probability density (the bin areas sum to 1) rather than raw counts. (normed is the older parameter name; recent Matplotlib versions use density.)
x - (n,) array or sequence of (n,) arrays Input values, this takes either a single array or a sequence of arrays
which are not required to be of the same length.
histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional
The type of histogram to draw.
'bar' is a traditional bar-type histogram. If multiple data are given the bars are arranged side by side.
'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
'step' generates a lineplot that is by default unfilled.
'stepfilled' generates a lineplot that is by default filled.
Default is 'bar'
align - {'left', 'mid', 'right'}, optional
Controls how each bar is positioned relative to its bin edges.
Default is 'mid'
label - str or None, optional. Default is None
Other parameters
**kwargs - additional Patch properties; the ** syntax allows a variable number of keyword arguments (such as color, alpha, or edgecolor) to be passed through to the underlying bar patches.
Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here’s an example of a more
customized histogram.
plt.hist(data, bins=30, alpha=0.5,histtype='stepfilled', color='steelblue',edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination of
histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of several
distributions
Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Legends
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously saw how
to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the legend in
Matplotlib.
plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend();
Number of columns - We can use the ncol command to specify the number of columns in the legend.
ax.legend(frameon=False, loc='lower center', ncol=2)
fig
We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or
change the padding around the text.
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig
Multiple legends
It is only possible to create a single legend for the entire plot. If you
try to create a second legend using plt.legend() or ax.legend(), it will
simply override the first one. We can work around this by creating a
new legend artist from scratch, and then using the lower-level ax.add_artist() method to manually add the second
artist to the plot
Example
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend(loc='lower center', frameon=True, shadow=True, borderpad=1, fancybox=True)
fig
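A self-contained sketch of the add_artist() approach described above (the line labels are arbitrary):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend import Legend

fig, ax = plt.subplots()
x = np.linspace(0, 10, 1000)
lines = []
styles = ['-', '--', '-.', ':']
for i in range(4):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2), styles[i], color='black')
ax.axis('equal')
# first legend: created as usual for the first two lines
ax.legend(lines[:2], ['line A', 'line B'], loc='upper right', frameon=False)
# second legend: built from scratch and added manually as an extra artist
leg = Legend(ax, lines[2:], ['line C', 'line D'], loc='lower right', frameon=False)
ax.add_artist(leg);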
Color Bars
In Matplotlib, a color bar is a separate axes that can provide a key for the meaning of colors in a plot. For
continuous labels based on the color of points, lines, or regions, a labeled color bar can be a great tool.
The simplest colorbar can be created with the plt.colorbar() function.
Customizing Colorbars
Choosing color map.
We can specify the colormap using the cmap argument to the plotting function that is creating the visualization.
Broadly, we can know three different categories of colormaps:
Sequential colormaps - These consist of one continuous sequence of colors (e.g., binary or viridis).
Divergent colormaps - These usually contain two distinct colors, which show positive and negative
deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps - These mix colors with no particular sequence (e.g., rainbow or jet).
Color limits and extensions
Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance of
plt.Axes, so all of the axes and tick formatting tricks we’ve learned are applicable.
We can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the top and
bottom by setting the extend property.
# I is a two-dimensional array (an image) with some noisy outlier pixels
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
Discrete colorbars
Colormaps are by default continuous, but sometimes you’d like to
represent discrete values. The easiest way to do this is to use the
plt.cm.get_cmap() function, and pass the name of a suitable colormap
along with the number of desired bins.
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
Subplots
Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure.
These subplots might be insets, grids of plots, or other more complicated layouts.
We’ll explore four routines for creating subplots in Matplotlib.
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
For example,
we might create an inset axes at the top-right corner of another
axes by setting the x and y position to 0.65 (that is, starting at
65% of the width and 65% of the height of the figure) and the x
and y extents to 0.2 (that is, the size of the axes is 20% of the
width and 20% of the height of the figure).
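The snippet this describes is roughly the following (standard plt.axes usage, with the figure coordinates assumed above):
ax1 = plt.axes()                        # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])  # inset axes: [left, bottom, width, height]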
For example, a gridspec for a grid of two rows and three columns with some specified width and height space
looks like this:
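The referenced snippet is reconstructed below, following the standard plt.GridSpec usage:
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
# Python slicing syntax lets subplots span multiple grid cells
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);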
Text annotation can be done manually with the plt.text/ax.text command, which will place text at a
particular x/y value.
The ax.text method takes an x position, a y position, a string, and then optional keywords specifying the
color, size, style, alignment, and other properties of the text. Here we used ha='right' and ha='center', where
ha is short for horizontal alignment.
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
Note that by default, the text is aligned above and to the left of the specified coordinates; here the “.” at the
beginning of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The transAxes
coordinates give the location from the bottom-left corner of the axes (here the white box) as a fraction of the axes
size.
The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here the gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected, while the
others remain stationary.
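Arrows can be added in a similar way with the annotate() method; a brief sketch:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')
ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
            arrowprops=dict(facecolor='black', shrink=0.05));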
Three-Dimensional Plotting in Matplotlib
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
plt.show()
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page.
Wireframes and surface plots
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()
Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualized
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# X, Y, Z as defined for the wireframe example above
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none')
ax.set_title('surface')
plt.show()
Surface Triangulations
For some applications, the evenly sampled grids required by
the preceding routines are overly restrictive and
inconvenient.
In these situations, the triangulation-based plots can be very useful.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)  # f(x, y) = sin(sqrt(x**2 + y**2)), as defined in the surface-plot example above
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
Geographic Data with Basemap
We'll use an etopo image (which shows topographical features both on land and under the ocean) as the map background.
Program to display a particular area of the map with latitude and longitude lines:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
def draw_map(m, scale=0.2):
# draw a shaded-relief image
m.shadedrelief(scale=scale)
# lats and longs are returned as a dictionary
lats = m.drawparallels(np.linspace(-90, 90, 13))
lons = m.drawmeridians(np.linspace(-180, 180, 13))
# keys contain the plt.Line2D instances
lat_lines = chain(*(tup[1][0] for tup in lats.items()))
lon_lines = chain(*(tup[1][0] for tup in lons.items()))
all_lines = chain(lat_lines, lon_lines)
# cycle through these lines and set the desired style
for line in all_lines:
line.set(linestyle='-', alpha=0.3, color='r')
Map Projections
The Basemap package implements several dozen such projections, all referenced by a short format code. Here we’ll
briefly demonstrate some of the more common ones.
Cylindrical projections
Pseudo-cylindrical projections
Perspective projections
Conic projections
Cylindrical projection
The simplest of map projections are cylindrical projections, in which lines of constant latitude and longitude
are mapped to horizontal and vertical lines, respectively.
This type of mapping represents equatorial regions quite well, but results in extreme distortions near the
poles.
The spacing of latitude lines varies between different cylindrical projections, leading to different
conservation properties, and different distortion near the poles.
Other cylindrical projections are the Mercator (projection='merc') and the cylindrical equal-area
(projection='cea') projections.
The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon) of the
lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain
vertical; this can give better properties near the poles of the projection.
The Mollweide projection (projection='moll') is one common example of this, in which all meridians are
elliptical arcs
It is constructed so as to
preserve area across the map: though there are
distortions near the poles, the area of small
patches reflects the true area.
Other pseudo-cylindrical projections are the
sinusoidal (projection='sinu') and Robinson
(projection='robin') projections.
The extra arguments to Basemap here refer to
the central latitude (lat_0) and longitude
(lon_0) for the desired map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
Perspective projections are constructed using a particular choice of perspective point, similar to if you
photographed the Earth from a particular point in space (a point which, for some projections, technically lies
within the Earth!).
One common example is the orthographic projection (projection='ortho'), which shows one side of the globe
as seen from a viewer at a very long distance.
Thus, it can show only half the globe at a time.
Other perspective-based projections include the
gnomonic projection (projection='gnom') and
stereographic projection (projection='stere').
These are often the most useful for showing small
portions of the map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);
Conic projections
A conic projection projects the map onto a single cone, which is then unrolled.
This can lead to very good local properties, but regions far from the focus point of the cone may become
very distorted.
One example of this is the Lambert conformal conic projection (projection='lcc').
It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by
lat_1 and lat_2) have well-represented distances, with scale decreasing between them and increasing outside
of them.
Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area
(projection='aea') projection
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55, width=1.6E7, height=1.2E7)
draw_map(m)
• Political boundaries
drawcountries() - Draw country boundaries
drawstates() - Draw US state boundaries
drawcounties() - Draw US county boundaries
• Map features
drawgreatcircle() - Draw a great circle between two points
drawparallels() - Draw lines of constant latitude
drawmeridians() - Draw lines of constant longitude
drawmapscale() - Draw a linear scale on the map
• Whole-globe images
bluemarble() - Project NASA’s blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
etopo() - Draw an etopo relief image onto the map
warpimage() - Project a user-provided image onto the map
Visualization with Seaborn
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for
exploring correlations between multidimensional data, when you’d like to plot all pairs of values against each other.
We’ll demo this with the Iris dataset, which lists measurements of petals and sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', size=2.5);
Faceted histograms
Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid makes this
extremely simple.
We’ll take a look at some data that shows the amount that restaurant staff receive in tips based on various
indicator data
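A hedged sketch of FacetGrid usage with the tips dataset that ships with Seaborn (the column names follow that dataset):
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset('tips')
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));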
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you to view the distribution of a
parameter within bins defined by any other parameter.
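A brief sketch with the same tips data (in recent Seaborn versions factorplot has been renamed catplot):
import seaborn as sns
tips = sns.load_dataset('tips')
g = sns.catplot(x="day", y="total_bill", hue="sex", data=tips, kind="box")
g.set_axis_labels("Day", "Total Bill");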
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint distribution between different
datasets, along with the associated marginal distributions.
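For example, a minimal sketch with the tips data, using a hexagonal-bin joint plot:
import seaborn as sns
tips = sns.load_dataset('tips')
sns.jointplot(x="total_bill", y="tip", data=tips, kind='hex');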
Bar plots
Time series can be plotted with sns.factorplot.