UNIT I
Data science and big data are used almost everywhere in both commercial and
noncommercial settings. Commercial companies in almost every industry use data
science and big data to gain insights into their customers, processes, staff, competition,
and products. Many companies use data science to offer customers a better user
experience, as well as to cross-sell, up-sell, and personalize their offerings. A good
example of this is Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet. Human
resource professionals use people analytics and text mining to screen candidates,
monitor the mood of employees, and study informal networks among coworkers.
People analytics is the central theme in the book Moneyball: The Art of Winning an Unfair Game. In the book (and movie) we saw that the traditional scouting process for American baseball was random, and replacing it with correlated signals changed everything. Relying on statistics allowed them to hire the right players and pit them
against the opponents where they would have the biggest advantage. Financial
institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services. Governmental
organizations are also aware of data's value. Many governmental organizations not only rely on internal data scientists to discover valuable information, but also share their data with the public. You can use this data to gain insights or build data-driven applications. Data.gov is but one example; it's the home of the US Government's open data. A data scientist in a governmental organization gets to work on diverse projects such as detecting fraud and other criminal activity or optimizing project funding. A well-known example was provided by Edward Snowden, who leaked internal
documents of the American National Security Agency and the British Government
Communications Headquarters that show clearly how they used data science and big
data to monitor millions of individuals. Those organizations collected 5 billion data
records from widespread applications such as Google Maps, Angry Birds, email, and
text messages, among many other data sources. Nongovernmental organizations
(NGOs) are also no strangers to using data. They use it to raise money and defend
their causes. The World Wildlife Fund (WWF), for instance, employs data scientists to
increase the effectiveness of their fundraising efforts. Universities use data science in
their research but also to enhance the study experience of their students. The rise of
massive open online courses (MOOC) produces a lot of data, which allows universities
to study how this type of learning can complement traditional classes.
In data science and big data you'll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it's often easy to store structured data in tables within databases or Excel files (figure 1.1). SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases. You may also come across structured data that might give you a hard time storing it in a traditional relational database. Hierarchical data such as a family tree is one such example.
Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email (figure 1.2). Although email contains structured elements such as the sender, title, and body text, it's a challenge to find the number of people who have written an email complaint about a specific employee because so many ways exist to refer to a person, for example. The thousands of different languages and dialects out there further complicate this.
"Graph data" can be a confusing term because any data can be shown in a graph. "Graph" in this case points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model pair-wise relationships between objects. Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. The graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people. Examples of graph-based data can be found on many social media websites.
Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers. MLBAM (Major League Baseball Advanced
Media) announced in 2014 that they'll increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics. High-speed cameras at stadiums will capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines. A company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games. This algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning. It's a remarkable feat that prompted Google to buy the company for their own Artificial Intelligence (AI) development plans.
While streaming data can take almost any of the previous forms, it has an extra property. The data flows into the system when an event happens instead of being loaded into a data store in a batch. Although this isn't really a different type of data, it is treated separately because the process has to adapt to data that arrives continuously. Examples are the "What's trending" section on Twitter, live sporting or music events, and the stock market.
Data science is mostly applied in the context of an organization. When the business asks you to perform a data science project, you'll first prepare a project charter. This charter contains information such as what you're going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables.
The second step is to collect data. You've stated in the project charter which data you need and where you can find it. In this step you ensure that you can use the data in your program, which means checking the existence of, quality of, and access to the data. Data can also be delivered by third-party companies and takes many forms ranging from Excel spreadsheets to different types of databases.
Data collection is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing removes false values from a data source and inconsistencies across data sources, data integration enriches data sources by combining information from multiple data sources, and data transformation ensures that the data is in a suitable format for use in your models.
Data exploration is concerned with building a deeper understanding of your data. You
try to understand how variables interact with each other, the distribution of the data,
and whether there are outliers. To achieve this you mainly use descriptive statistics,
visual techniques, and simple modelling. This step is also known as Exploratory Data
Analysis.
In this phase you use models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique
from the fields of statistics, machine learning, operations research, and so on. Building
a model is an iterative process that involves selecting the variables for the model,
executing the model, and model diagnostics.
Finally, you present the results to your business. These results can take many forms,
ranging from presentations to research reports. Sometimes you'll need to automate
the execution of the process because the business will want to use the insights you
gained in another project or enable an operational process to use the outcome from
your model.
1 The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, the why, and the how of the project. In every serious project this will result in a project charter.
2 The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.
3 Now that you have the raw data, it's time to prepare it. This includes transforming the data from a raw form into data that's directly usable in your models. To achieve this, you'll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.
In reality you won't progress in a linear way from step 1 to step 6. Often you'll regress and iterate between the different phases. This process ensures you have a well-defined research plan, a good understanding of the business question, and clear deliverables before you even start looking at data. The first steps of your process focus on getting high-quality data as input for your models. This way your models will perform better later on. In data science there's a well-known saying: garbage in equals garbage out.
A project starts by understanding the what, the why, and the how of your project (figure 2.2). What does the company expect you to do? And why does management place such a value on your research? Is it part of a bigger strategic picture or a "lone wolf" project originating from an opportunity someone detected? Answering these three questions (what, why, how) is the goal of the first phase, so that everybody knows what to do and can agree on the best course of action. The outcome should be a clear
research goal, a good understanding of the context, well-defined deliverables, and a
plan of action with a timetable. This information is then best placed in a project
charter.
An essential outcome is the research goal that states the purpose of your assignment
in a clear and focused manner. Understanding the business goals and context is
critical for project success.
Clients like to know upfront what they're paying for, so after you have a good understanding of the business problem, try to get a formal agreement on the deliverables.
A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you're going to perform your analysis
■ What resources you expect to use
The next step in data science is to retrieve the required data (figure 2.3). Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won't be involved in this step. Many companies will have already collected and stored the data for you, and what they don't have can often be bought from third parties.
Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.
Your first act should be to assess the relevance and quality of the data that's readily available within your company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals. The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing that data. A data mart is a subset of the data warehouse and geared toward serving a specific business unit. While data warehouses and data marts are home to preprocessed data, data lakes contain data in its natural or raw format. But the possibility exists that your data still resides in Excel files on the desktop of a domain expert.
If data isn't available inside your organization, look outside your organization's walls. Many companies specialize in collecting valuable information. For instance, Nielsen and GFK are well known for this in the retail industry. Other companies provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
Expect to spend a good portion of your project time doing data correction and cleansing, sometimes up to 80%. Most of the errors you'll encounter during the data gathering phase are easy to spot, but being too careless will make you spend many hours solving data issues that could have been prevented during data import. You'll investigate the data during the import, data preparation, and exploratory phases. During data retrieval, you check to see if the data is equal to the data in the source document and look to see if you have the right data types.
The data received from the data retrieval phase is likely to be "a diamond in the rough." Your task now is to prepare it for use in the modeling and reporting phase. Doing so is tremendously important because your models will perform better and you'll lose less time trying to fix strange output. Your model needs the data in a specific format, so data transformation will always come into play.
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from. By "true and consistent representation" we imply that at least two types of errors exist. The first type is the interpretation error, such as when you take the value in your data for granted, like saying that a person's age is greater than 300 years. The second type of error points to inconsistencies between data sources or against your company's standardized values. An example of this class of errors is putting "Female" in one table and "F" in another when they represent the same thing: that the person is female. Another example is that you use Pounds in one table and Dollars in another.
Table 2.2 An overview of common errors
Sometimes you'll use more advanced methods, such as simple modeling, to find and identify data errors; for example, we can run a regression to get acquainted with the data and detect the influence of individual observations on the regression line.
Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain. But data collected by machines or computers isn't free from errors either. For small data sets you can check every value by hand. Detecting data errors when the variables you study don't have many classes can be done by tabulating the data with counts (a frequency table).
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
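A minimal sketch of such a frequency-table check, assuming the answers live in a hypothetical pandas Series; rare spellings stand out immediately and can then be repaired with the same kind of rules:
import pandas as pd

answers = pd.Series(["Good", "Good", "Godo", "Bad", "Bade", "Good"])
print(answers.value_counts())                                # the typos appear with count 1
answers = answers.replace({"Godo": "Good", "Bade": "Bad"})   # apply the correction rules
print(answers.value_counts())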
Whitespace tends to be hard to detect but causes errors like other redundant characters would. In one project, the cleaning during the ETL phase wasn't well executed, and keys in one table contained a whitespace at the end of a string. This caused a mismatch of keys such as "FR " versus "FR".
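A small sketch of the usual fix, assuming a hypothetical country_code column: strip leading and trailing whitespace so that "FR " matches "FR".
import pandas as pd

countries = pd.DataFrame({"country_code": ["FR ", "BE", " NL"]})
countries["country_code"] = countries["country_code"].str.strip()   # remove surrounding whitespace
print(countries["country_code"].tolist())                           # ['FR', 'BE', 'NL']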
Sanity checks are another valuable type of data check. Here you check the value against physically or theoretically impossible values.
Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values. An example is shown in figure 2.6. The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper side when a normal distribution is expected. The high values in the bottom graph can point to outliers when assuming a normal distribution.
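A small sketch of the simplest checks on a hypothetical variable: inspect the minimum and maximum, and flag values far outside the interquartile range.
import numpy as np

values = np.array([12, 14, 13, 15, 14, 13, 950])
print(values.min(), values.max())          # 12 950: the maximum is clearly suspect
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
print(values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)])   # [950]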
Missing values aren't necessarily wrong, but you still need to handle them separately; certain modeling techniques can't handle missing values.
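A sketch of the two common options, on a hypothetical DataFrame: drop the incomplete observations or impute a substitute value such as the column mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 37], "weight": [55.0, 85.5, np.nan]})
print(df.dropna())              # option 1: omit rows with missing values
print(df.fillna(df.mean()))     # option 2: fill missing values with the column mean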
When integrating two data sets, you have to pay attention to their respective units of measurement. Some data sets can contain prices per gallon and others can contain prices per liter. A simple conversion will do the trick in this case.
Your data comes from several different places, and in this substep we focus on integrating these different sources.
You can perform two operations to combine information from different data sets. The first operation is joining: enriching an observation from one table with information from another table. The second operation is appending or stacking: adding the observations of one table to those of another table.
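A sketch of both operations with pandas on hypothetical tables: merge implements joining, and concat implements appending/stacking.
import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "region": ["North", "South"]})
sales_jan = pd.DataFrame({"client_id": [1, 2], "amount": [100, 250]})
sales_feb = pd.DataFrame({"client_id": [1, 2], "amount": [90, 300]})

print(sales_jan.merge(clients, on="client_id"))              # joining: add client info to each sale
print(pd.concat([sales_jan, sales_feb], ignore_index=True))  # appending: stack the monthly tables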
To avoid duplication of data, you virtually combine data with views. In the previous
example we took the monthly data and combined it in a new physical table.
Certain models require their data to be in a certain shape. Transforming your data so it takes a suitable form for data modeling is the goal of this step.
Relationships between an input variable and an output variable aren't always linear. Take, for instance, a relationship of the form y = a * e^(bx). Taking the log of the output variable turns it into log(y) = log(a) + bx, which simplifies the estimation problem dramatically.
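A sketch of that transformation with made-up values of a and b: fitting a straight line to (x, log y) recovers the parameters of the exponential relationship.
import numpy as np

x = np.linspace(0, 5, 50)
y = 2.0 * np.exp(0.8 * x)                        # hypothetical y = a * e^(bx) with a = 2.0, b = 0.8
b_hat, log_a_hat = np.polyfit(x, np.log(y), 1)   # fit log(y) = log(a) + b*x
print(np.exp(log_a_hat), b_hat)                  # approximately 2.0 and 0.8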
Sometimes you have too many variables and need to reduce the number because they don't add new information to the model. Having too many variables in your model makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables. For instance, all the techniques based on a Euclidean distance perform well only up to about 10 variables.
Variables can be turned into dummy variables (figure 2.13). Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.
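A sketch of creating dummy variables with pandas on a hypothetical categorical column; each category becomes its own true/false (1/0) indicator column.
import pandas as pd

df = pd.DataFrame({"gender": ["Female", "Male", "Female"]})
print(pd.get_dummies(df, columns=["gender"]))   # one indicator column per category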
Information becomes much easier to grasp when shown in a picture; therefore, you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
The visualization techniques you use in this phase range from simple line graphs or histograms, as shown in figure 2.15, to more complex diagrams such as Sankey and network graphs.
These plots can be combined to provide even more insight, as shown in figure 2.16. Overlaying several plots is common practice. In figure 2.17 we combine simple graphs into a Pareto diagram, or 80-20 diagram. Figure 2.18 shows another technique: brushing and linking. With brushing and linking you combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
Two other important graphs are the histogram shown in figure 2.19 and the boxplot shown in figure 2.20.
In a histogram a variable is cut into discrete categories and the number of occurrences in each category is summed up and shown in the graph. The boxplot doesn't show how many observations are present but does offer an impression of the distribution within categories. It can show the maximum, minimum, median, and other characterizing measures at the same time.
Histogram: shows the number of observations that fall into each discrete category.
Box plot: shows the maximum, minimum, median, and other characterizing measures at the same time.
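A sketch that draws both graphs for the same made-up scores with matplotlib:
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.normal(loc=160, scale=10, size=200)   # hypothetical weights

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(scores, bins=15)     # counts per discrete category (bin)
ax1.set_title("Histogram")
ax2.boxplot(scores)           # median, quartiles, and extremes in one picture
ax2.set_title("Boxplot")
plt.show()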
The techniques you'll use now are borrowed from the field of machine learning, data mining, and/or statistics. Most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison
You'll need to select the variables you want to include in your model and a modeling technique. Your findings from the exploratory analysis should already give a fair idea of what variables will help you construct a good model.
You’ ll need to consider model performance and whether your project meets all the
requirements to use your model, as well as other factors:
Once you've chosen a model you'll need to implement it in code. Luckily, most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages implement several of the most popular techniques.
We create predictor values that are meant to predict how the target variable behaves. For a linear regression, a "linear relation" between each x (predictor) and the y (target) variable is assumed, as shown in figure 2.22.
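A minimal sketch of such a fit with StatsModels on made-up data (not the book's example):
import numpy as np
import statsmodels.api as sm

x = np.random.random(100)
y = 3.0 * x + np.random.normal(scale=0.1, size=100)   # roughly linear hypothetical target

results = sm.OLS(y, sm.add_constant(x)).fit()         # ordinary least squares
print(results.params)                                 # estimated intercept and slope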
Don't let knn.score() fool you; it returns the model accuracy, but by "scoring a model" we often mean applying it on data to make a prediction:
prediction = knn.predict(predictors)
Now we can use the prediction and compare it to the real thing using a confusion matrix:
metrics.confusion_matrix(target, prediction)
The confusion matrix shows we have correctly predicted 17 + 405 + 5 cases, so that's good.
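A sketch of this scoring step with scikit-learn on hypothetical predictors and a made-up target (the book's data and the exact counts above are not reproduced here):
import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

predictors = np.random.random((200, 3))
target = (predictors[:, 0] > 0.5).astype(int)          # made-up binary target

knn = KNeighborsClassifier(n_neighbors=5).fit(predictors, target)
prediction = knn.predict(predictors)
print(metrics.confusion_matrix(target, prediction))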
You'll be building multiple models from which you then choose the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model. A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward.
The principle here is simple: the model should work on unseen data. The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
Multiple error measures are available, and in figure 2.26 we show the general idea on comparing models. The error measure used in the example is the mean square error. Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and average the squared errors over all predictions.
To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without showing the other 20% of the data to the model.
Once the model is trained, we predict the values for the other 20% of the observations based on those for which we already know the true value, and calculate the model error with an error measure. Then we choose the model with the lowest error. In this example we chose model 1 because it has the lowest total error.
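A sketch of the holdout idea with scikit-learn on generated data: train on 80% of the observations, then compute the mean square error on the unseen 20%.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.random.random((1000, 2))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + np.random.normal(scale=0.05, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))   # error on unseen data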
Many models make strong assumptions, such as independence of the inputs, and you
have to verify that these assumptions are indeed met. This is called model diagnostics.
After you've successfully analyzed the data and built a well-performing model, you're ready to present your findings to the world. Sometimes people get so excited about your work that you'll need to repeat it over and over again because they value the predictions of your models or the insights that you produced.
For this reason, you need to automate your models. This doesn't always mean that you have to redo all of your analysis all the time. Sometimes it's sufficient that you implement only the model scoring; other times you might build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the data science process is where your soft skills will be most useful, and yes, they're extremely important.
Data Mining
Data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long. However, the shorter term, knowledge mining, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material (Figure 1.3).
In addition, many other terms have a similar meaning to data mining, for example, knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery.
The knowledge discovery process is shown in Figure 1.4 as an iterative sequence of
the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared
for mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in
the knowledge base.
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations) or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser subcubes are identified and stored as array structures, whereas sparse subcubes employ compression technology for efficient storage utilization.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP
and MOLAP technology, benefiting from the greater scalability of ROLAP and
the faster computation of MOLAP. For example, a HOLAP server may allow
large volumes of detailed data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server
2000 supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in
relational databases, some database system vendors implement specialized
SQL servers that provide advanced query language and query processing
support for SQL queries over star and snowflake schemas in a read-only
environment.
The median of an ordered set of values is its middle value.
Example 2.7 Median. Let's find the median of the data from Example 2.6. The data are already sorted in increasing order. There is an even number of observations (i.e., 12); therefore, the median is not unique. It can be any value within the two middlemost values of 52 and 56 (that is, the sixth and seventh values in the list). By convention, we assign the average of the two middlemost values as the median; that is, (52 + 56)/2 = 54.
Thus, the median is $54,000.
Suppose that we had only the first 11 values in the list. Given an odd number of values, the median is the middlemost value. This is the sixth value in this list, which has a value of $52,000.
Example 2.8 Mode. The data from Example 2.6 are bimodal. The two modes are $52,000 and $70,000 (each occurs twice).
The quartiles give an indication of a distribution’ s center, spread, and shape. The
first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the
data. The third quartile, denoted by Q3, is the 75th percentile— it cuts off the lowest
75% (or highest 25%) of the data. The second quartile is the 50th percentile. As the
median, it gives the center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3-Q1.
Interquartile range. The quartiles are the three values that split the sorted data set into four equal parts. For the sorted data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of dollars), the quartiles are the third, sixth, and ninth values, respectively, in the sorted list. Therefore, Q1 = $47,000 and Q3 = $63,000, and the interquartile range is IQR = 63 - 47 = $16,000.
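A short sketch of the same computation with NumPy. Note that np.percentile interpolates between values, so its quartiles differ slightly from the textbook convention of taking the third, sixth, and ninth sorted values:
import numpy as np

salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
q1, q2, q3 = np.percentile(salaries, [25, 50, 75])
print(q1, q2, q3, q3 - q1)   # 49.25 54.0 64.75 15.5 (the book's convention gives 47, 54, 63 and IQR 16)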
The variance of N observations x1, x2, ..., xN is σ² = (1/N) Σ (xi − x̄)², where x̄ is the mean value of the observations, as defined in Eq. (2.1). The standard deviation, σ, of the observations is the square root of the variance, σ².
UNIT II
Types of data
THREE TYPES OF DATA
Any statistical analysis is performed on data, a collection of actual
observations or scores in a survey or an experiment. The precise form of a statistical
analysis often depends on whether data are qualitative, ranked, or quantitative.
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
Discrete and Continuous Variables
Quantitative variables can be further distinguished in terms of whether they are discrete or continuous. A discrete variable consists of isolated numbers separated by gaps. Examples include most counts, such as the number of children in a family (1, 2, 3, etc., but never 1 1/2).
A continuous variable consists of numbers whose values, at least in theory, have no restrictions. Examples include amounts, such as weights of male statistics students; durations, such as the reaction times of grade school children to a fire alarm; and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
Independent Variable
In an experiment, the independent variable is the treatment manipulated by the investigator. Since training is assumed to influence communication, it is an independent variable.
Once the data have been collected, any difference between the groups can be interpreted as being caused by the independent variable.
If, for instance, a difference appears in favor of the active-listening group, the
psychologist can conclude that training in active listening causes fewer
communication breakdowns between couples. Having observed this relationship, the
psychologist can expect that, if new couples were trained in active listening, fewer
breakdowns in communication would occur.
Dependent Variable
To test whether training influences communication, the psychologist counts the
number of communication breakdowns between each couple, as revealed by
inappropriate replies, aggressive comments, verbal interruptions, etc., while discussing
a conflict-provoking topic, such as whether it is acceptable to be intimate with a third
person.
In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
Unlike the independent variable, the dependent variable isn't manipulated by the investigator. Instead, it represents an outcome: the data produced by the experiment.
Confounding Variable
Couples willing to devote extra effort to special training might already possess a deeper commitment that co-varies with more active-listening skills. Such an uncontrolled variable, which could also explain the outcome, is known as a confounding variable. You can avoid confounding variables, as in the present case, by assigning subjects randomly to the various groups in the experiment and also by standardizing all experimental conditions, other than the independent variable, for subjects in both groups.
Data are grouped into class intervals with 10 possible values each. The bottom class includes the smallest observation (133), and the top class includes the largest observation (245). The distance between bottom and top is occupied by an orderly series of classes. The frequency (f) column shows the frequency of observations in each class and, at the bottom, the total number of observations in all classes.
2.2 GUIDELINES
The "Guidelines for Frequency Distributions" box lists seven rules for producing a well-constructed frequency distribution. The first three rules are essential and should not be violated. The last four rules are optional and can be modified or ignored as circumstances warrant.
For instance, to obtain the proportion of .06 for the class 130-139, divide the frequency of 3 for that class by the total frequency of 53.
Percentages or Proportions?
To convert the relative frequencies in Table 2.5 from proportions to percentages,
multiply each proportion by 100; that is, move the decimal point two places to the right.
For example, multiply .06 (the proportion for the class 130– 139) by 100 to obtain 6
percent.
For class 130-139 the cumulative frequency is 3, since there are no lower classes.
For class 140-149 the cumulative frequency is 1 + 3 = 4.
For class 150-159 the cumulative frequency is 1 + 3 + 17 = 21.
The cumulative percent for class 130-139 is given by (cumulative frequency / total frequency) * 100.
Example: (3/53) * 100 = 5.66, which rounds to 6.
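The same cumulative bookkeeping can be sketched with NumPy (the three frequencies below are the class counts quoted above; 53 is the total number of observations):
import numpy as np

frequencies = np.array([3, 1, 17])            # classes 130-139, 140-149, 150-159
cumulative = np.cumsum(frequencies)           # array([ 3,  4, 21])
print(cumulative)
print(np.round(cumulative / 53 * 100))        # cumulative percents: [ 6.  8. 40.]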
Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative percentages are referred to as percentile ranks. The percentile rank of a score indicates the percentage of scores in the entire distribution with similar or smaller values than that score.
The weight distribution described in Table 2.2 appears as a histogram in Figure 2.1.
A casual glance at this histogram confirms previous conclusions: a dense
concentration of weights among the 150s, 160s, and 170s, with a spread in the
direction of the heavier weights. Let’ s pinpoint some of the more important features
of histograms.
1. Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of the frequency distribution.
2. Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency.
3. The intersection of the two axes defines the origin at which both numerical
scales equal 0.
4. Numerical scales always increase from left to right along the horizontal axis
and from bottom to top along the vertical axis. It is considered good practice to use wiggly lines to highlight breaks in scale, such as those along the horizontal axis in Figure 2.1, between the origin of 0 and the smallest class of 130-139.
5. The body of the histogram consists of a series of bars whose heights reflect the frequencies for the various classes.
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph.
Draw a vertical line to separate the stems, which represent multiples of 10, from the space to be occupied by the leaves, which represent multiples of 1.
Selection of Stems
Stem values are not limited to units of 10. Depending on the data, other units, such as 1, 100, 1000, or even .1, .01, .001, and so on, can be selected.
For instance, an annual income of $23,784 could be displayed as a stem of 23 (thousands) and a leaf of 784. (Leaves consisting of two or more digits, such as 784, are separated by commas.)
2.9 TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important characteristic of a frequency distribution is its shape. Figure 2.3 shows some of the more typical shapes for smoothed frequency polygons.
Normal
Any distribution that approximates the normal shape. The familiar bell-shaped
silhouette of the normal curve can be superimposed on many frequency distributions.
Bimodal
Any distribution that approximates the bimodal shape in panel B, might, as
suggested previously, reflect the coexistence of two different types of observations in
the same distribution. For instance, the distribution of the ages of residents in a
neighborhood consisting largely of either new parents or their infants has a bimodal
shape.
Positively Skewed
The two remaining shapes in Figure 2.3 are lopsided. A distribution that includes a few extreme observations in the positive (higher-value) direction is positively skewed.
Negatively Skewed
A distribution that includes a few extreme observations in the negative (lower-value) direction is negatively skewed.
2.11 MISLEADING GRAPHS
Four years is the modal term, since the greatest number of presidents, 7, served this term. Note that the mode equals 4 years, the value of the most frequent term, not 7, its frequency.
More Than One Mode
Distributions can have more than one mode (or no mode at all).
3.2 MEDIAN
The median reflects the middle value when observations are ordered from least to most.
3.3 MEAN
The mean is found by adding all scores and then dividing by the number of
scores.
Sample or Population?
Statisticians distinguish between two types of means, the population mean and the sample mean, depending on whether the data are viewed as a population (a complete set of scores) or as a sample (a subset of scores).
Formula for Sample Mean
X̄ (read "X bar") designates the sample mean, and the formula becomes X̄ = ΣX / n. The sample mean is the balance point for a sample, found by dividing the sum of the values of all scores by the number of scores.
3.4 WHICH AVERAGE?
If Distribution Is Not Skewed
When a distribution is not skewed, the values of the mode, median, and mean are similar, and any of them can be used to describe the central tendency of the distribution.
If Distribution Is Skewed
When extreme scores cause a distribution to be skewed, as for the infant death rates for selected countries listed in Table 3.4, the values of the three averages can differ appreciably.
The modal infant death rate of 4 describes the most typical rate (since it occurs most frequently, five times, in Table 3.4).
The median infant death rate of 7 describes the middle-ranked rate (since the United
States, with a death rate of 7, occupies the middle-ranked, or 10th, position among the
19 ranked countries).
The mean infant death rate of 30.00 describes the balance point for all rates (since the
sum of all rates, 570, divided by the number of countries, 19, equals 30.00).
Unlike the mode and median, the mean is very sensitive to extreme scores, or outliers.
Interpreting Differences between Mean and Median
Appreciable differences between the values of the mean and median signal
thepresence of a skewed distribution.
If the mean exceeds the median the underlying distribution is positively skewed.
If the median exceeds the mean, the underlying distribution is negatively skewed.
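A short sketch of this diagnosis, reusing the salary values listed earlier in these notes (in thousands); the mean exceeding the median signals positive skew, and statistics.multimode (Python 3.8+) returns both modes:
import statistics
import numpy as np

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(np.mean(salaries))               # 58.0
print(np.median(salaries))             # 54.0  -> mean > median: positively skewed
print(statistics.multimode(salaries))  # [52, 70]: the distribution is bimodal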
Describing Variability
Variability measures the amount by which scores are dispersed or scattered in a distribution.
In Figure 4.1, each of the three frequency distributions consists of seven scores with the same mean (10) but with different variabilities. Rank the three distributions from least to most variable. Your intuition was correct if you concluded that distribution A has the least variability, distribution B has intermediate variability, and distribution C has the most variability. For distribution A with the least (zero) variability, all seven scores have the same value (10). For distribution B with intermediate variability, the values of scores vary slightly (one 9 and one 11), and for distribution C with most variability, they vary even more (one 7, two 9s, two 11s, and one 13).
FIGURE 4.1
4.2 RANGE
The range is the difference between the largest and smallest scores.
In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to 10); distribution B, the moderately variable, has an intermediate range of 2 (from 9 to 11); and distribution C, the most variable, has the largest range of 6 (from 7 to 13).
Shortcomings of Range
1. The range has several shortcomings. First, since its value depends on only two
scores— the largest and the smallest— it fails to use the information provided
by the remaining scores.
2. The value of the range tends to increase with increases in the total number of scores.
4.3 VARIANCE
The variance is the mean of all squared deviation scores.
The variance also qualifies as a type of mean, that is, as the balance point for some distribution. In the case of the variance, each original score is re-expressed as a deviation from the mean (by subtracting the mean).
FIGURE 4.1
In distribution C, one score coincides with the mean of 10, four scores (two 9s and two
11s) deviate 1 unit from the mean, and two scores (one 7 and one 13) deviate 3 units
from the mean, yielding a set of seven deviation scores: one 0, two – 1s, two 1s, one
– 3, and one 3. (Deviation scores above the mean are assigned positive signs; those
below the mean are assigned negative signs.)
Mean of the Squared Deviations
Multiplying each deviation by itself generates a set of squared deviation scores, all of which are positive. Add the consistently positive values of all squared deviation scores and then divide by the total number of scores to produce the mean of all squared deviation scores, also known as the variance.
4.4 STANDARD DEVIATION
The standard deviation is the square root of the variance. Taking the square root produces a new measure that describes variability in the original units of measurement: the standard deviation is the square root of the mean of all squared deviations from the mean.
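As a quick numeric check, a sketch that computes the variance and standard deviation of distribution C from Figure 4.1 (scores 7, 9, 9, 10, 11, 11, 13), treating the seven scores as a population; the standard deviation matches the 1.77 used in the next paragraph:
import numpy as np

c = np.array([7, 9, 9, 10, 11, 11, 13])
print(np.var(c))    # mean of the squared deviations: about 3.14
print(np.std(c))    # square root of the variance: about 1.77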
Majority of Scores within One Standard Deviation
For instance, among the seven deviations in distribution C, a majority of five
scores deviate less than one standard deviation (1.77) on either side of the mean.
If the distribution of IQ scores for a class of fourth graders has a mean (X̄) of 105 and a standard deviation (s) of 15, a majority of their IQ scores should be within one standard deviation on either side of the mean, that is, between 90 and 120.
FIGURE 4.3
Some generalizations that apply to most frequency distributions
SS represents the sum of squares, Σ directs us to sum over the expression to its right, and (X − μ)² denotes each of the squared deviation scores. To find the sum of squares for a population:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².
where s² and s represent the sample variance and sample standard deviation, and SS is the sample sum of squares; that is, s² = SS / (n − 1) and s = √(SS / (n − 1)).
4.6 DEGREES OF FREEDOM (df)
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more mathematical restrictions, in a sample being used to estimate a population characteristic.
When deviations about the sample mean are used to estimate variability in the population, only n − 1 deviations are free to vary. As a result, there are only n − 1 degrees of freedom, that is, df = n − 1. One degree of freedom is lost because of the zero-sum restriction.
FIGURE 5.1
.10 of these men, that is, one-tenth of 3091 (3091/10), or about 309 men, are 70 inches tall. Only half of the bar at 66 inches is shaded to adjust for the fact that any height between 65.5 and 66.5 inches is reported as 66 inches, whereas eligible applicants must be shorter than 66 inches, that is, 66.0 inches.
FIGURE 5.2
5.2 z SCORES
A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how many standard deviations a score is above or below the mean of its distribution:
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation, respectively.
Converting X to z Scores
To answer the question about eligible FBI applicants, replace X with 66 (the maximum permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation):
z = (66 − 69) / 3 = −1
This informs us that the cutoff height is exactly one standard deviation below the mean. Knowing the value of z, we can use the table for the standard normal curve to find the proportion of eligible FBI applicants. First, however, we'll make a few comments about the standard normal curve.
5.3 STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to standard or z scores will always produce a new distribution that approximates the standard normal curve. The standard normal curve always has a mean of 0 and a standard deviation of 1.
To verify (rather than prove) that the mean of a standard normal distribution equals 0, replace X in the z score formula with μ, the mean of any (nonstandard) normal distribution, and then solve for z:
z = (μ − μ) / σ = 0
To verify that the standard deviation of the standard normal distribution equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one standard deviation above the mean for any (nonstandard) normal distribution, and then solve for z:
z = (μ + 1σ − μ) / σ = 1
Although there is an infinite number of different normal curves, each with its own mean and standard deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.
4. Find the target area. Refer to the standard normal table, using the bottom legend, as the z score is negative. The arrows in Table 5.1 show how to read the table. Look up column A′ to 1.00 (representing a z score of −1.00), and note the corresponding proportion of .1587 in column C′: this is the answer, as suggested in the right part of Figure 5.6. It can be concluded that only .1587 (or .16) of all of the FBI applicants will be shorter than 66 inches.
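The table lookup can be double-checked with SciPy's cumulative distribution function for the standard normal curve; this is just a verification sketch, not part of the original example:
from scipy.stats import norm

z = (66 - 69) / 3            # the FBI cutoff expressed as a z score
print(z, norm.cdf(z))        # -1.0 0.1586..., the proportion below z = -1.00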
Example: Finding Proportions between Two Scores
Look up column A′ to a negative z score of −1.00 (remember, you must imagine the negative sign), and note the corresponding proportion of .1587 in column C′. Likewise, look up column A′ to a z score of −1.67, and note the corresponding proportion of .0475 in column C′. Subtract the smaller proportion from the larger one: .1587 − .0475 = .1112, the proportion between the two z scores.
UNIT III
6.2 SCATTERPLOTS
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. In a positive relationship, as in panel A of Figure 6.2, small values of one variable are paired with small values of the other variable, and large values are paired with large values.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between two variables. In practice, perfect relationships are most unlikely.
Curvilinear Relationship
Sometimes a dot cluster approximates a bent or curved line, as in Figure 6.4, and therefore reflects a curvilinear relationship. Descriptions of these relationships are more complex than those of linear relationships.
Key Properties of r
1. The sign of r indicates the type of linear relationship, whether positive or negative.
2. The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
Sign of r
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign indicates a negative relationship. For example, an r with a plus sign describes the positive relationship between height and weight shown in panel A of Figure 6.2, and an r with a minus sign describes the negative relationship between heavy smoking and life expectancy shown in panel B.
Numerical Value of r
The more closely a value of r approaches either −1.00 or +1.00, the stronger (more regular) the relationship. Conversely, the more closely the value of r approaches 0, the weaker (less regular) the relationship. In Figure 6.3, notice that the values of r shift from .75 to .27 as the analysis for pairs of IQ scores shifts from a relatively strong relationship for identical twins to a relatively weak relationship for foster parents and foster children.
Interpretation of r
Located along a scale from −1.00 to +1.00, the value of r supplies information about the direction of a linear relationship, whether positive or negative, and, generally, information about the relative strength of a linear relationship: relatively weak (and a poor describer of the data) when r is in the vicinity of 0, or relatively strong (and a good describer of the data) when r deviates from 0 in the direction of either +1.00 or −1.00.
Range Restrictions
The value of the correlation coefficient declines whenever the range of possible X or Y scores is restricted.
For example, Figure 6.5 shows a dot cluster with an obvious slope, represented by an r of .70 for the positive relationship between height and weight for all college students. If, however, the range of heights is restricted to students who stand over 6 feet 2 inches (or 74 inches) tall, the abbreviated dot cluster loses its obvious slope because of the more homogeneous weights among tall students. Therefore, as depicted in Figure 6.5, the value of r drops to .10.
Verbal Descriptions
An r of .70 for the height and weight of college students could be translated into "Taller students tend to weigh more."
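A minimal sketch of computing r with NumPy on made-up height/weight pairs (not the textbook's data); np.corrcoef returns a 2x2 correlation matrix whose off-diagonal entry is r:
import numpy as np

height = np.array([64, 66, 68, 70, 72, 74])        # hypothetical heights (inches)
weight = np.array([130, 140, 155, 160, 175, 180])  # hypothetical weights (pounds)
r = np.corrcoef(height, weight)[0, 1]
print(r)    # close to +1.00: a strong positive relationship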
where the two sum of squares terms in the denominator are defined as SSx = Σ(X − X̄)² and SSy = Σ(Y − Ȳ)².
FIGURE 7.2
Predictive Errors
Figure 7.3 illustrates the predictive errors that would have occurred if the regression line had been used to predict the number of cards received by the five friends.
FIGURE 7.3
For example, to predict the number of cards received when X = 13 cards are sent, substitute into the least squares equation:
Y′ = .80(13) + 6.40 = 16.8
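A tiny sketch that wraps the least squares equation quoted above (Y′ = .80X + 6.40) in a function, so predictions for other numbers of cards sent are immediate:
import numpy as np

def predict_cards_received(cards_sent):
    return 0.80 * cards_sent + 6.40   # the least squares equation from the example

print(predict_cards_received(13))                     # 16.8, as in the worked example
print(predict_cards_received(np.array([5, 10, 20])))  # [10.4 14.4 22.4]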
7.5 ASSUMPTIONS
Linearity
Use of the regression equation requires that the underlying relationship be linear. You need to worry about violating this assumption only when the scatterplot for the original correlation analysis reveals an obviously bent or curvilinear dot cluster, such as that illustrated in Figure 6.4. In the unlikely event that a dot cluster describes a pronounced curvilinear trend, consult an advanced statistics technique.
Homoscedasticity
Use of the standard error of estimate, s_y|x, assumes that, except for chance, the dots in the original scatterplot are dispersed equally about all segments of the regression line. You need to worry about violating this assumption only when the scatterplot reveals a dramatically different type of dot cluster such as that shown in Figure 7.4.
Figure 7.4
INTERPRETATION OF r²
The squared correlation coefficient, r², is a measure of predictive accuracy that supplements the standard error of estimate, s_y|x. Even though our ultimate goal is to show the relationship between r² and predictive accuracy, we will initially concentrate on two kinds of predictive errors: those due to the repetitive prediction of the mean and those due to the regression equation.
Predictive Errors
Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean for all five friends, Ȳ, of 12 (shown as the mean line) is used to predict each of their five Y scores. Panel B shows the corresponding predictive errors for all five friends when a series of different Y′ values, obtained from the least squares equation (shown as the least squares line), is used to predict each of their five Y scores.
Positive and negative errors indicate that Y scores are either above or below their corresponding predicted scores.
Overall, as expected, errors are smaller when customized predictions of Y′ from the least squares equation can be used than when only the repetitive prediction of Ȳ can be used.
The error variability for the repetitive prediction of the mean can be designated as SSy: each error (the deviation of Y about Ȳ) is squared and then summed, using the errors for the five friends shown in Panel A of Figure 7.5.
The error variability for the customized predictions from the least squares equation can be designated as SS_y|x; it is obtained in the same way from the errors shown in Panel B of Figure 7.5.
Proportion of Variability
The proportion of predictive error eliminated equals (SSy − SS_y|x) / SSy = r². This result, .64 or 64 percent, represents the proportion or percent gain in predictive accuracy.
Regression Toward the Mean
Table 7.4 lists the top 10 hitters in the major leagues during 2014 and shows how they
fared during 2015. Notice that 7 of the top 10 batting averages regressed downward,
toward the .260s, the approximate mean for all hitters during 2015. Incidentally, it is not
true that, viewed as a group, all major league hitters are headed toward mediocrity.
Hitters among the top 10 in 2014, who were not among the top 10 in 2015, were
replaced by other mostly above-average hitters, who also were very lucky during 2015.
Observed regression toward the mean occurs for individuals or subsets of individuals,
not for entire groups.
Some trainees were praised after very good landings, while others were
reprimanded after very bad landings. On their next landings, praised trainees did more
poorly and reprimanded trainees did better. It was concluded, therefore, that praise
hinders but a reprimand helps performance!
UNIT IV
The Basics of NumPy Arrays
Data manipulation in Python is nearly synonymous with NumPy array manipulation. This section covers NumPy array manipulation: how to access data and subarrays, and how to split, reshape, and join arrays.
import numpy as np
np.random.seed(0)  # seed for reproducibility
x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
dtype: int64
itemsize: 8 bytes
nbytes: 480 bytes
Array Indexing: Accessing Single Elements
In a one-dimensional array, you can access the ith value
(counting from zero) by specifying the desired index in
squarebrackets.
In[5]: x1
Out[5]: array([5, 0, 3, 3, 7, 9])
In[6]: x1[0]
Out[6]: 5
In[7]: x1[4]
Out[7]: 7
In[10]: x2
Out[10]: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
In[11]: x2[0, 0]    # index as [row, column]
Out[11]: 3
In[12]: x2[2, 0]
Out[12]: 1
In[13]: x2[2, -1]
Out[13]: 7
array([5, 0, 3, 3, 7, 9])
In[15]: x1[0] = 3.14159
x1
Out[15]: array([3, 0, 3, 3, 7, 9])
In[16]: x = np.arange(10)
x
Out[16]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In[17]: x[:5]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Out[17]: array([0, 1, 2, 3, 4])
In[18]: x[5:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Out[18]: array([5, 6, 7, 8, 9])
In[19]: x[4:7]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Out[19]: array([4, 5, 6])
In[20]: x[::2]
Out[20]: array([0, 2, 4, 6, 8])
Prints every second element
In[21]: x[1::2]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Out[21]: array([1, 3, 5, 7, 9])
In[22]: x[::-1]
Out[22]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In[23]: x[5::-2]
Out[23]: array([5, 3, 1])
x[6::-2]
array([6, 4, 2, 0])
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple
slices separated by commas.
For example
In[24]: x2
Out[24]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
In[25]: x2[:2, :3]    # two rows, three columns
Out[25]: array([[12,  5,  2],
       [ 7,  6,  8]])
In x2[::-1]          # reverse the rows (these examples use the original x2 values shown in Out[10])
array([[1, 6, 7, 7],
       [7, 6, 8, 8],
       [3, 5, 2, 4]])
In x2[::-1, ::-1]    # reverse rows and columns together
array([[7, 7, 6, 1],
       [8, 8, 6, 7],
       [4, 2, 5, 3]])
x2[::-2]
array([[1, 6, 7, 7],
[3, 5, 2, 4]])
x2[::-2,::-2]
array([[7, 6],
[4, 5]])
x2[::-3]
array([[1, 6, 7, 7]])
x2[::-3,::-3]
array([[7, 1]])
In[30]: print(x2[0])    # equivalent to x2[0, :]
[12  5  2  4]
In[31]: x2_sub = x2[:2, :2]    # a 2x2 subarray (a view, not a copy)
print(x2_sub)
[[12  5]
 [ 7  6]]
if we modify this subarray we’ ll see that the original array is changed
In[33]: x2_sub[0, 0] = 99
print(x2_sub)
[[99  5]
 [ 7  6]]
In[34]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
In[35]: x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99  5]
 [ 7  6]]
If we now modify this subarray, the original array is
nottouched:
In[36]: x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42  5]
 [ 7  6]]
In[37]: print(x2)
[[99 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Reshaping of Arrays
In[38]: grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Note that the size of the initial array must match the size of the reshaped array.
In[39]: x = np.array([1, 2, 3])
x.reshape((1, 3))    # row vector via reshape
Out[39]: array([[1, 2, 3]])
In[40]:
x[np.newaxis, :]
Out[40]: array([[1, 2, 3]])
In[41]: x.reshape((3, 1))    # column vector via reshape
Out[41]: array([[1],
       [2],
       [3]])
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily
accomplished through the routines np.concatenate, np.vstack,
and np.hstack.
In[43]: x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[43]: array([1, 2, 3, 3, 2, 1])
In[48]: grid = np.array([[9, 8, 7],
                         [6, 5, 4]])
np.vstack([x, grid])    # vertically stack the arrays
Out[48]: array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])
In[49]: y = np.array([[99],
                      [99]])
np.hstack([grid, y])    # horizontally stack the arrays
Out[49]: array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])
Splitting of arrays
In[52]: grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]
In[53]: left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
In[1]: L = np.random.random(100)
In[2]: sum(L)
Out[2]: 55.61209116604941
Note that the built-in sum function and the np.sum function are not identical: np.sum is aware of multiple array dimensions and is much faster on NumPy arrays.
Minimum and Maximum
Similarly, Python has built-in min and max functions, used to find the minimum value and maximum value of any given array:
In[4]: big_array = np.random.random(1000000)
In[5]: min(big_array), max(big_array)
Out[5]: (1.1717128136634614e-06, 0.9999976784968716)
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column.
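A short sketch of aggregating along rows or columns with the axis argument:
import numpy as np

M = np.arange(12).reshape((3, 4))
print(M.sum())          # 66: aggregate over the whole array
print(M.min(axis=0))    # [0 1 2 3]: minimum of each column
print(M.max(axis=1))    # [ 3  7 11]: maximum of each row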
In[18]: plt.hist(heights)    # heights and plt come from the president-heights example loaded earlier
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
Figure 2.4
In[3]: a = np.arange(3)
a + 5
Out[3]: array([5, 6, 7])
In[6]: a = np.arange(3)
b = np.arange(3)[:, np.newaxis]
print(a)
print(b)
[0 1 2]    # a
[[0]       # b
 [1]
 [2]]
In[7]: a + b
Out[7]: array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Rules of Broadcasting
Broadcasting in NumPy follows three rules:
Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
Broadcasting example 1
Let's look at adding a two-dimensional array to a one-dimensional array:
In[8]: M = np.ones((2, 3))
a = np.arange(3)
M.shape = (2, 3)
a.shape = (3,)
By the rules, a is padded to shape (1, 3) and then stretched across the rows of M:
In[9]: M + a
Out[9]: array([[1., 2., 3.],
       [1., 2., 3.]])
Broadcasting example 2
An example where both arrays need to be broadcast:
In[10]: a = np.arange(3).reshape((3, 1))
b = np.arange(3)
a.shape = (3, 1)
b.shape = (3,)
out: [[0]
 [1]
 [2]]
[0 1 2]
Rule 1 says we must pad the shape of b with ones:
a.shape -> (3, 1)
b.shape -> (1, 3)
Rule 2 says we stretch each of these ones to match the size of the other array:
a.shape -> (3, 3)
b.shape -> (3, 3)
In[11]: a + b
Out[11]: array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])
Broadcasting example 3
Finally, an example in which the two arrays are not compatible:
In[12]: M = np.ones((3, 2))
a = np.arange(3)
[0 1 2]    # a
M.shape = (3, 2)
a.shape = (3,)
Rule 1 tells us that we must pad the shape of a with ones:
M.shape -> (3, 2)
a.shape -> (1, 3)
By rule 2, the first dimension of a is stretched to match that of M:
M.shape -> (3, 2)    # the second dimension is 2, not 1, so it cannot be stretched
a.shape -> (3, 3)
By rule 3, the final shapes do not match, so these two arrays are incompatible:
In[13]: M + a
ValueError: operands could not be broadcast together with shapes (3,2) (3,)
Broadcasting in Practice
Centering an array. Ufuncs allow a NumPy user to remove the need to explicitly write slow Python loops; broadcasting extends this ability. A common example is centering an array of data:
In: X = np.random.random((10, 3))
X
Out:
([[0.6231582 , 0.62830284, 0.48405648],
[0.4893788 , 0.96598238, 0.99261057],
[0.18596872, 0.26149718, 0.41570724],
[0.74732252, 0.96122555, 0.03700708],
[0.71465724, 0.92325637, 0.62472884],
[0.53135009, 0.20956952, 0.78746706],
[0.67569877, 0.45174937, 0.53474695],
[0.91180302, 0.61523213, 0.18012776],
[0.75023639, 0.46940932, 0.11044872],
[0.86844985, 0.07136273, 0.00521037]])
we can center the X array by subtracting the mean value from each
element in array.(Ex: )
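A minimal sketch of the centering step (assuming a 10x3 data array X as above):
In: Xmean = X.mean(0)       # mean of each of the three columns; shape (3,)
X_centered = X - Xmean      # broadcasting subtracts the column means row by row
X_centered.mean(0)          # the centered column means are zero to machine precision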
Plotting a two-dimensional function: a grid of values z built by broadcasting can be displayed as an image with plt.imshow(z, origin=..., extent=..., cmap=...), where:
■ z is the two-dimensional array to display,
■ origin='lower' puts the [0, 0] index of z at the lower-left corner of the plot,
■ extent gives the left, right, bottom, and top boundaries of the image,
■ cmap selects the color map.
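For instance, a two-dimensional function built by broadcasting a row and a column vector can be displayed this way (a sketch consistent with the parameters described above):
In: x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)    # broadcasts to a 50x50 grid
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5], cmap='viridis')
plt.colorbar();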
As a motivating example for comparisons and Boolean masks, consider Seattle's 2014 daily rainfall data:
In: rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values   # read the PRCP (precipitation) column
inches = rainfall / 254     # convert from tenths of a millimetre to inches
inches.shape
Out: (365,)
A histogram of the daily rainfall:
plt.hist(inches, 40);
Comparison Operators as ufuncs
In[4]: x = np.array([1, 2, 3, 4, 5])
In[5]: x < 3
Out[5]: array([ True,  True, False, False, False], dtype=bool)
These also work element-wise on arrays of any size and shape, for example a two-dimensional array:
In[12]: rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
Out[12]: array([[5, 0, 3, 3],
                [7, 9, 3, 5],
                [2, 4, 7, 6]])
In[13]: x < 6
Out[13]: array([[ True,  True,  True,  True],
                [False, False,  True,  True],
                [ True,  True, False, False]], dtype=bool)
Counting entries
To count the number of True entries in a Boolean array, np.count_nonzero is useful:
In[15]: np.count_nonzero(x < 6)    # how many values are less than 6?
Out[15]: 8
Alternatively, np.sum gives the same count (True is treated as 1 and False as 0), and with the axis argument it counts the values less than 6 in each row of the matrix. Similarly, np.all and np.any check whether all or any values satisfy a condition: here all the elements in the first and third rows are less than 8, while this is not the case for the second row (see the sketch below).
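A sketch of these counting variants, assuming the example array x reconstructed above:
In: np.sum(x < 6)               # True counts as 1, so this matches count_nonzero
Out: 8
In: np.sum(x < 6, axis=1)       # number of values less than 6 in each row
Out: array([4, 2, 2])
In: np.all(x < 8, axis=1)       # are all values in each row less than 8?
Out: array([ True, False,  True], dtype=bool)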
BOOLEAN OPERATORS
We have already seen how to count, say, all days with rain less than four inches, or all days with rain greater than two inches. But what if we want to know about all days with rain less than four inches and greater than one inch? Such compound conditions are expressed with Python's bitwise logic operators &, |, ^, and ~, which NumPy overloads as element-wise ufuncs on (usually Boolean) arrays; a sketch on the rainfall data follows.
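Using the bitwise operators on the inches array from the Seattle example, these questions become one-liners (a sketch, without the actual counts):
In: np.sum(inches < 4)                     # days with less than four inches of rain
In: np.sum(inches > 2)                     # days with more than two inches of rain
In: np.sum((inches > 1) & (inches < 4))    # days with between one and four inches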
A related operation is masking: indexing an array with a Boolean array selects exactly the values where the mask is True. For example:
In[27]: x < 5
Out[27]: array([[False,  True,  True,  True],
                [False, False,  True, False],
                [ True,  True, False, False]], dtype=bool)
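Selecting the masked values themselves is then a single indexing step; a minimal sketch, assuming the example array x shown above:
In: x[x < 5]
Out: array([0, 3, 3, 3, 2, 4])    # the values of x at the True positions of the mask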
Keep in mind that the keywords and/or and the operators &/| are not interchangeable: and and or evaluate the truth of an entire object, while & and | operate bitwise on the bits (or elements) within it.
In[30]: bool(42), bool(0)
Out[30]: (True, False)
In[31]: bool(42 and 0)
Out[31]: False
In[32]: bool(42 or 0)
Out[32]: True
In[33]: bin(42)
Out[33]: '0b101010'    # binary representation
In[34]: bin(59)
Out[34]: '0b111011'    # binary representation
Using and or or on whole NumPy arrays (for Boolean arrays A and B) asks Python for the truth value of the entire array, which is ambiguous and raises an error:
In[38]: A or B
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
5. Fancy Indexing
We'll look at another style of array indexing, known as fancy indexing: passing arrays of indices in place of single scalars.
In[1]: import numpy as np
rand = np.random.RandomState(42)    # 42 is the seed, so the same "random" numbers are produced each run
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
Suppose we want to access three different elements. We could do it like this:
In[2]: [x[3], x[7], x[2]]
Out[2]: [71, 86, 14]
Alternatively, we can pass an array of indices; when the index array has a shape, the result takes that shape:
In[4]: ind = np.array([[3, 7],
                       [4, 5]])
x[ind]
Out[4]: array([[71, 86],
               [60, 20]])
In[5]: X = np.arange(12).reshape((3, 4))
X
Out[5]: array([[ 0,  1,  2,  3],
               [ 4,  5,  6,  7],
               [ 8,  9, 10, 11]])
The first index refers to the row, and the second to the column:
In[6]: row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
Out[6]: array([ 2,  5, 11])
The first value in the result is X[0, 2], the second is X[1, 1],
and the third is X[2, 3]. The pairing of indices in fancy indexing
follows all the broadcasting rules.
If we combine a column vector of row indices with a row vector of column indices, broadcasting gives a two-dimensional result:
In[7]: X[row[:, np.newaxis], col]
Out[7]: array([[ 2,  1,  3],
               [ 6,  5,  7],
               [10,  9, 11]])
Combined Indexing
Fancy indexing can be combined with the other indexing schemes we've seen:
In[9]: print(X)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
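As an illustration (a sketch, not from the source text), fancy indices can be mixed with simple indices and with slices:
In: X[2, [2, 0, 1]]          # fancy indexing combined with a simple row index
Out: array([10,  8,  9])
In: X[1:, [2, 0, 1]]         # fancy indexing combined with a slice
Out: array([[ 6,  4,  5],
            [10,  8,  9]])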
Modifying values with fancy indexing can give surprising results when indices are repeated, because x[i] += 1 reads the value once, increments it, and writes it back, so repeats are overwritten rather than accumulated:
In: x = np.zeros(10)
x[[0, 0]] = [4, 6]          # the second assignment wins, so x[0] ends up 6
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x
Out[21]: array([ 6.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.])   # the value is only 1, since repeated increments are overwritten
To accumulate over repeated indices, use the at() method of the ufunc instead:
In: x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
[ 0.  0.  1.  2.  3.  0.  0.  0.  0.  0.]
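Structured arrays: the outputs that follow presuppose a compound dtype and an empty container; a sketch reconstructing them from the field names and dtypes visible in the outputs (an assumption, not shown in the notes):
In: name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})   # one record per person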
In[5]: data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
In[6]:
data['name']
Out[6]: array(['Alice', 'Bob', 'Cathy', 'Doug'],dtype='<U10')
In[7]:
data[0]
Out[7]: ('Alice', 25, 55.0)
In[8]:
data[-1]['name']
Out[8]: 'Doug'
In[9]:
data[data['age'] <30]['name']
Out[9]: array(['Alice', 'Doug'],dtype='<U10')
In[15]: data['age']
Out[15]: array([25, 45, 37, 19], dtype=int32)
In[16]: data_rec = data.view(np.recarray)
data_rec.age
Out[16]: array([25, 45, 37, 19], dtype=int32)
Series as dictionary
Like a dictionary, the Series object provides a mapping from a
collection of keys to a collection of values:
In[1]: import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],         # values
                 index=['a', 'b', 'c', 'd'])     # keys
data
Out[1]: a    0.25
        b    0.50
        c    0.75
        d    1.00
        dtype: float64
In[2]: data['b']       # dictionary-style access by key
Out[2]: 0.5
We can also use dictionary-like Python expressions to examine the keys:
In[3]: 'a' in data
Out[3]: True
Series objects can even be extended with dictionary-like syntax; assigning to a new key adds an entry:
In[6]: data['e'] = 1.25
data
Out[6]: a    0.25
        b    0.50
        c    0.75
        d    1.00
        e    1.25
        dtype: float64
A Series also acts like a one-dimensional array, supporting slicing, masking, and fancy indexing:
In[7]: data['a':'c']          # slicing by explicit index
Out[7]: a    0.25
        b    0.50
        c    0.75
        dtype: float64
In[9]:
data[(data >0.3) &(data <0.8)]
Out[9]: b 0.50
c 0.75
dtype: float64
In[10]: data[['a', 'e']]      # fancy indexing
Out[10]: a    0.25
         e    1.25
         dtype: float64
Among these, slicing may be the source of the most confusion. Notice that when you are slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when you're slicing with an implicit integer index (i.e., data[0:2]), the final index is excluded from the slice:
data['a':'c']
Out: a    0.25
     b    0.50
     c    0.75
data[0:2]
Out: a    0.25
     b    0.50
Indexers: loc and iloc
Consider a Series whose explicit integer index differs from the implicit positional index (implicit positions 0, 1, 2; explicit index 1, 3, 5; values 'a', 'b', 'c'):
In[11]: data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Out[11]: 1    a
         3    b
         5    c
         dtype: object
Plain [] indexing uses the explicit index:
In[12]: data[1]
Out[12]: 'a'
In[13]: data[1:3]      # slicing uses the implicit (positional) index
Out[13]: 3    b
         5    c
         dtype: object
Because of this potential confusion, Pandas provides special indexer attributes. The loc attribute allows indexing and slicing that always references the explicit index:
In[15]: data.loc[1:3]
Out[15]: 1    a
         3    b
         dtype: object
The iloc attribute allows indexing and slicing that always references the implicit, Python-style positional index:
In[16]: data.iloc[1]
Out[16]: 'b'
In[17]: data.iloc[1:3]
Out[17]: 3    b
         5    c
         dtype: object
A third indexing attribute, ix, was a hybrid of the two and, for Series objects, was equivalent to standard []-based indexing; it has since been deprecated and removed in recent Pandas versions.
DataFrame as a dictionary
In[18]: area = pd.Series({'California': 423967, 'Texas': 695662,
                          'New York': 141297, 'Florida': 170312,
                          'Illinois': 149995})           # area variable
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})                  # population variable
data = pd.DataFrame({'area': area, 'pop': pop})
data
The individual Series that make up the columns can be accessed via dictionary-style indexing of the column name:
In[19]: data['area']
Out[19]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         Name: area, dtype: int64
Equivalently, attribute-style access works for string column names:
In[21]: data.area is data['area']
Out[21]: True
This shorthand fails when the column name clashes with a DataFrame method:
In[22]: data.pop is data['pop']
Out[22]: False      # pop is also a DataFrame method
Matrix transpose
The full DataFrame can be transposed to swap rows and columns (the density column shown here is derived in the example below):
In[25]: data.T
Out[25]:
California Florida Illinois New York Texas
area 4.239670e+05 1.703120e+05 1.499950e+05 1.412970e+05 6.956620e+05
pop 3.833252e+07 1.955286e+07 1.288214e+07 1.965113e+07 2.644819e+07
density 9.041393e+01 1.148061e+02 8.588376e+01 1.390767e+02 3.801874e+01
OUT[28]:
            area       pop
California  423967  38332521
Florida     170312  19552860
Example: adding a derived column produces the density values used above:
In[23]: data['density'] = data['pop'] / data['area']
data
Out[23]:
            area       pop      density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Ufuncs preserve index labels when applied to Pandas objects. For example, with a random Series and DataFrame:
In[2]: rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
Out[2]: 0    6
        1    3
        2    7
        3    4
        dtype: int64
In[3]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)),    # random integers 0-9 in a 3x4 table
                         columns=['A', 'B', 'C', 'D'])
df
Out[3]:    A  B  C  D
        0  6  9  2  6
        1  7  4  3  7
        2  7  2  5  4
Index alignment in DataFrame: when operating on two DataFrames, both indices and columns are aligned, and missing entries become NaN.
In[11]: A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                         columns=list('AB'))
A
Out[11]:    A   B
         0  1  11
         1  5   1
In[12]: B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                         columns=list('BAC'))
B
Out[12]:    B  A  C
         0  4  0  9
         1  5  8  0
         2  9  2  6
In[13]: A + B
Out[13]:      A     B    C
         0   1.0  15.0  NaN
         1  13.0   6.0  NaN
         2   NaN   NaN  NaN
To fill the missing entries with a value instead, use the add() method with fill_value; here the fill is the mean of all values in A:
In: fill = A.stack().mean()      # (1 + 11 + 5 + 1) / 4 = 4.5
A.add(B, fill_value=fill)
Conceptually, A is padded with the fill value 4.5 wherever it has no entry, then aligned with B by row and column label before the addition:
     A    B    C              B  A  C
0    1   11  4.5    +     0   4  0  9
1    5    1  4.5          1   5  8  0
2  4.5  4.5  4.5          2   9  2  6
Out:      A     B     C
     0   1.0  15.0  13.5
     1  13.0   6.0   4.5
     2   6.5  13.5  10.5
Operations on an object dtype are much slower than on native numeric dtypes, because each element is handled through Python objects:
In: for dtype in ['object', 'int']:
        print("dtype =", dtype)
        %timeit np.arange(1E6, dtype=dtype).sum()
        print()
dtype = object
10 loops, best of 3: 78.2 ms per loop
dtype = int
100 loops, best of 3: 3.06 ms per loop
NaN is a special floating-point value that propagates through arithmetic:
In[6]: 1 + np.nan
Out[6]: nan
In[7]: 0 * np.nan
Out[7]: nan
This means that aggregates over values containing NaN are themselves NaN:
In: vals2 = np.array([1, np.nan, 3, 4])
In[8]: vals2.sum(), vals2.min(), vals2.max()
Out[8]: (nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values:
In[9]: np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Out[9]: (8.0, 1.0, 4.0)
Pandas provides several useful methods for detecting, removing, and replacing null values in its data structures:
■ isnull(): generate a Boolean mask indicating missing values
■ notnull(): opposite of isnull()
■ dropna(): return a filtered version of the data
■ fillna(): return a copy of the data with missing values filled or imputed
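As a quick illustration (a minimal sketch, not from the source text), these methods work on a small Series like so:
In: data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
Out: 0    False
     1     True
     2    False
     3     True
     dtype: bool
In: data[data.notnull()]        # equivalent to data.dropna() for a Series
Out: 0        1
     2    hello
     dtype: object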
For DataFrames, consider the following example with missing entries:
In: df = pd.DataFrame([[1,      np.nan, 2],
                       [2,      3,      5],
                       [np.nan, 4,      6]])
df
Out[17]:      0    1  2
         0  1.0  NaN  2
         1  2.0  3.0  5
         2  NaN  4.0  6
Filling null values: consider a Series with some missing entries:
In[23]: data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
Out[23]: a    1.0
         b    NaN
         c    2.0
         d    NaN
         e    3.0
         dtype: float64
We can fill NA entries with a single value, such as zero:
In[24]: data.fillna(0)
Out[24]: a    1.0
         b    0.0      # NaN values filled with 0
         c    2.0
         d    0.0
         e    3.0
         dtype: float64
We can specify a forward fill to propagate the previous value forward:
In[25]: data.fillna(method='ffill')
Out[25]: a    1.0
         b    1.0      # filled with the previous value, 1.0
         c    2.0
         d    2.0
         e    3.0
         dtype: float64
Or we can specify a back fill to propagate the next value backward:
In[26]: data.fillna(method='bfill')
Out[26]: a    1.0
         b    2.0      # filled with the following value, 2.0
         c    2.0
         d    3.0
         e    3.0
         dtype: float64
For DataFrames, the options are similar, but we can also specify an axis along which the fills take place (a fourth column of NaN values, df[3] = np.nan, has been added to the earlier DataFrame):
In[27]: df
Out[27]:      0    1  2   3
         0  1.0  NaN  2 NaN
         1  2.0  3.0  5 NaN
         2  NaN  4.0  6 NaN
In[28]: df.fillna(method='ffill', axis=1)    # fill along each row, from the previous column
Out[28]:      0    1    2    3
         0  1.0  1.0  2.0  2.0
         1  2.0  3.0  5.0  5.0
         2  NaN  4.0  6.0  6.0
Notice that if a previous value is not available during a forward fill, the NA value remains.
Hierarchical Indexing
While Pandas does provide Panel and Panel4D objects that
natively handle three-dimensional and four-dimensional data, a
far more common pattern in practice is to make use of
hierarchical indexing (also known as multi-indexing) to
incorporate multiple index levels within a single index.
Creation of MultiIndex objects
The most straightforward way is to index a Series with a list of tuples:
In[1]: import pandas as pd
import numpy as np
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
Out[2]: (California, 2000)    33871648
        (California, 2010)    37253956
        (New York, 2000)      18976457
        (New York, 2010)      19378102
        (Texas, 2000)         20851820
        (Texas, 2010)         25145561
        dtype: int64
With this tuple-based index, you can straightforwardly index or slice the series:
In[3]: pop[('California', 2010):('Texas', 2000)]    # slice between two index tuples (both ends included)
Out[3]: (California, 2010)    37253956
        (New York, 2000)      18976457
        (New York, 2010)      19378102
        (Texas, 2000)         20851820
        dtype: int64
But if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging, such as pop[[i for i in pop.index if i[1] == 2010]]. The Pandas MultiIndex gives a better way:
In[5]: index = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(index)     # re-express the Series with the hierarchical index
The unstack() method converts a multiply indexed Series into a conventionally indexed DataFrame, turning one index level into columns:
In[8]: pop_df = pop.unstack()
pop_df
Out[8]:                 2000      2010
        California  33871648  37253956
        New York    18976457  19378102
        Texas       20851820  25145561
The stack() method provides the opposite operation:
In[9]: pop_df.stack()
Out[9]: California  2000    33871648
                    2010    37253956
        New York    2000    18976457
                    2010    19378102
        Texas       2000    20851820
                    2010    25145561
        dtype: int64
So why do we need multiple indexing? Just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame.
Now we add another column with the population under 18:
In[10]: pop_df = pd.DataFrame({'total': pop,
                               'under18': [9267089, 9284094,
                                           4687374, 4318033,
                                           5906301, 6879014]})
pop_df
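For example (a sketch based on the totals and under-18 counts above), the MultiIndex lets us compute the fraction of people under 18 by year with one ufunc call and then pivot it into a table:
In: f_u18 = pop_df['under18'] / pop_df['total']    # element-wise division, aligned on the MultiIndex
f_u18.unstack()
Out:                  2000      2010
    California    0.273594  0.249211
    New York      0.247010  0.222831
    Texas         0.283251  0.273568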
Indexing and slicing a multiply indexed Series
In[21]: pop
Out[21]: state       year
         California  2000    33871648
                     2010    37253956
         New York    2000    18976457
                     2010    19378102
         Texas       2000    20851820
                     2010    25145561
         dtype: int64
(The index levels have been given names, e.g. via pop.index.names = ['state', 'year'].)
In[22]: pop['California', 2000]     # population of California in the year 2000
Out[22]: 33871648
In[23]: pop['California']           # partial indexing: all available data for California
Out[23]: year
         2000    33871648
         2010    37253956
         dtype: int64
Simple concatenation with pd.concat
In[6]: ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
Out[6]: 1    A
        2    B
        3    C
        4    D
        5    E
        6    F
        dtype: object
Duplicate indices
One difference between np.concatenate and pd.concat is that Pandas concatenation preserves the original indices, even if the result will have duplicate indices:
In[9]: x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index              # make the indices overlap on purpose
print(x); print(y); print(pd.concat([x, y]))
The result repeats the index labels 0 and 1. If you would rather treat repeated indices as an error, pass verify_integrity=True to pd.concat and catch the exception:
In: try:
        pd.concat([x, y], verify_integrity=True)
    except ValueError as e:
        print("ValueError:", e)
ValueError: Indexes have overlapping values: [0, 1]
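The concat examples rely on a small make_df helper defined earlier in the source text; a sketch of such a helper (names and behaviour assumed):
In: def make_df(cols, ind):
        # build a toy DataFrame whose entries encode their column and row labels
        data = {c: [str(c) + str(i) for i in ind] for c in cols}
        return pd.DataFrame(data, ind)
make_df('AB', [0, 1])
Out:     A   B
     0  A0  B0
     1  A1  B1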
Many-to-many joins
If the key column in both the left and right DataFrame contains duplicates, the result is a many-to-many merge: every matching pair of rows is combined (see the sketch below).
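A minimal sketch of a many-to-many merge (illustrative column names and values, not from the notes):
In: df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                        'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding',
                               'linux', 'spreadsheets', 'organization']})
pd.merge(df1, df5)    # 'group' appears repeatedly in both, so matching rows are combined pairwise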
Planets Data
Pivot Tables: Titanic example
UNIT V
Reference material: https://jakevdp.github.io/PythonDataScienceHandbook/ and https://matplotlib.org/
Simple line plots: a first set of figures.
In: import matplotlib.pyplot as plt
import numpy as np
plt.style.use('classic')
x = np.linspace(0, 10, 100)    # numpy.linspace(start, stop, num) returns num evenly spaced samples over the interval [start, stop]
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
IPython is built to work well with Matplotlib if you specify Matplotlib mode. To enable this mode, you can use the %matplotlib magic command after starting ipython:
In [1]: %matplotlib       # enables the drawing of Matplotlib figures in the IPython environment
Using matplotlib backend: TkAgg
In [2]: import matplotlib.pyplot as plt
The IPython notebook is a browser-based interactive data analysis tool that can combine narrative, code, graphics, HTML elements, and much more into a single executable document. Plotting interactively within the notebook is enabled with the %matplotlib magic:
• %matplotlib notebook will lead to interactive plots embedded within the notebook
• %matplotlib inline will lead to static images of your plot embedded in the notebook
In: x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');
The figure can be saved to a file:
In[5]: fig.savefig('my_figure.png')
In[6]: ls -lh my_figure.png    # ls lists the file; -lh uses a human-readable format showing
                               # permissions, number of links, owner, size, and last-modified time
In[6]: plt.plot(x, np.sin(x - 0), color='blue')
plt.plot(x, np.sin(x - 1), color='g')
plt.plot(x, np.sin(x - 2), color='0.75')
In[7]: plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
plt.plot(x, x + 4, linestyle='-')     # solid
plt.plot(x, x + 5, linestyle='--')    # dashed
plt.plot(x, x + 6, linestyle='-.')    # dashdot
plt.plot(x, x + 7, linestyle=':');    # dotted
Axes limits can be set, and even reversed, with xlim and ylim:
In[10]: plt.plot(x, np.sin(x))
plt.xlim(10, 0)       # reversed x axis
plt.ylim(1.2, -1.2);  # reversed y axis
plt.axis('equal');    # equal aspect ratio: one unit in x equals one unit in y
Labeling Plots
In[14]: plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
In[15]: plt.plot(x, np.sin(x), '-g', label='sin(x)')    # green line, labeled sin
plt.plot(x, np.cos(x), ':b', label='cos(x)')            # blue dotted line, labeled cos
plt.axis('equal')
plt.legend();
In[3]: rng = np.random.RandomState(0)    # fixed seed, so the same random numbers are produced each run
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
In[6]: plt.scatter(x, y, marker='o');
Visualizing Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib call; dy gives the size of the error interval:
In: x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
Continuous Errors
In[4]: from sklearn.gaussian_process import GaussianProcess   # older scikit-learn API; newer versions use GaussianProcessRegressor
gp.fit(xdata[:, np.newaxis], ydata)      # gp is a GaussianProcess model fit to the sample points (xdata, ydata)
xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE)                 # 2*sigma, roughly a 95% confidence region
In[5]: plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                 color='gray', alpha=0.2)
plt.xlim(0, 10);
In: plt.imshow(Z.reshape(Xgrid.shape),      # reshape the 1D Z values onto the 2D grid
               origin='lower', aspect='auto',
               extent=[-3.5, 3.5, -6, 6],   # x and y axis boundaries
               cmap='Blues')                # color map
cb = plt.colorbar()
cb.set_label("density")
for area in [100, 300, 500]:     # legend entries for three reference areas
    plt.scatter([], [], c='k', alpha=0.3, s=area, label=str(area) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, title='City Area')
plt.title('California Cities: Area and Population');
Multiple Legends
In[10]: fig, ax = plt.subplots()
lines = []
styles = ['-', '--', '-.', ':']
x = np.linspace(0, 10, 1000)
for i in range(4):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2), styles[i], color='black')
ax.axis('equal')
Customizing Colorbars
In[1]: import matplotlib.pyplot as plt
plt.style.use('classic')
In[2]: %matplotlib inline
import numpy as np
In[3]: x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])
plt.imshow(I)
plt.colorbar();
Choosing the colormap
In[4]: plt.imshow(I, cmap='gray');
Divergent colormaps usually contain two distinct colors, which show positive and negative deviations from a mean (e.g., RdBu or PuOr).
from matplotlib.colors import LinearSegmentedColormap
def grayscale_cmap(cmap):
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    # convert RGBA colors to perceived grayscale luminance
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    return LinearSegmentedColormap.from_list(cmap.name + "_gray", colors, cmap.N)
def view_colormap(cmap):
    # plot a colormap alongside its grayscale equivalent
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    grayscale = grayscale_cmap(cmap)(np.arange(cmap.N))
    fig, ax = plt.subplots(2, figsize=(6, 2), subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow([colors], extent=[0, 10, 0, 1])
    ax[1].imshow([grayscale], extent=[0, 10, 0, 1])
In[6]: view_colormap('jet')
In[7]: view_colormap('viridis')
In[8]: view_colormap('cubehelix')
In[9]: view_colormap('RdBu')
Multiple Subplots
In[7]: # the axes grid was created with fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)),
                      fontsize=18, ha='center')
fig
In[10]: mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 3000).T
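These correlated samples are presumably intended for a two-dimensional density view; a minimal sketch (an assumption, not from the source text):
In: plt.hist2d(x, y, bins=30, cmap='Blues')    # 2D histogram of the samples
cb = plt.colorbar()
cb.set_label('counts in bin')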
Transforms and text position:
fig.transFigure: the transform associated with the figure (in units of figure dimensions)
In[5]: fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
In[8]: births_by_date.plot(ax=ax)    # plots the births-by-date Series prepared earlier in the source text
Customizing Ticks
Major and Minor Ticks
In[1]: %matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
In[2]: ax = plt.axes(xscale='log', yscale='log')
ax.grid(True)
ax.legend(frameon=False)
ax.axis('equal')
ax.set_xlim(0, 3 * np.pi);
plt.grid(color='w', linestyle='solid')
Since it is tedious to make all of these modifications each time, it is best to change the defaults.
Changing the Defaults: rcParams
Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default style for every plot element; this configuration can be adjusted at any time with the plt.rc convenience routine.
In[4]: IPython_default = plt.rcParams.copy()
In[5]: from matplotlib import cycler
colors = cycler('color',
                ['#EE6666', '#3388BB', '#9988DD',
                 '#EECC55', '#88BB44', '#FFBBBB'])
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
       axisbelow=True, grid=True, prop_cycle=colors)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('xtick', direction='out', color='gray')
plt.rc('ytick', direction='out', color='gray')
plt.rc('patch', edgecolor='#E6E6E6')
plt.rc('lines', linewidth=2)
In[6]: plt.hist(x);
Stylesheets
In[8]: plt.style.available[:5]    # names of the first five available Matplotlib styles
Out[8]: ['fivethirtyeight',
         'seaborn-pastel',
         'seaborn-whitegrid',
         'ggplot',
         'grayscale']
The basic way to switch to a stylesheet is to call:
plt.style.use('stylename')
This will change the style for the rest of the session. Alternatively, the style context manager sets the style temporarily:
with plt.style.context('stylename'):
    make_a_plot()
Let's create a function that will make two basic types of plot:
In[9]: def hist_and_lines():
           np.random.seed(0)
           fig, ax = plt.subplots(1, 2, figsize=(11, 4))
           ax[0].hist(np.random.randn(1000))
           for i in range(3):
               ax[1].plot(np.random.rand(10))
           ax[1].legend(['a', 'b', 'c'], loc='lower left')
Default style
In[10]: plt.rcParams.update(IPython_default);    # reset to the defaults saved earlier
Now let's see how it looks (Figure 4-85):
In[11]: hist_and_lines()
FiveThirtyEight style
In[12]: with plt.style.context('fivethirtyeight'):
            hist_and_lines()
Three-dimensional points can be drawn with scatter3D on a 3D axes (created with ax = plt.axes(projection='3d')):
In: zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');    # scatter the points, colored by z value
• Map features
drawgreatcircle(): draw a great circle between two points
drawparallels(): draw lines of constant latitude
drawmeridians(): draw lines of constant longitude
drawmapscale(): draw a linear scale on the map
• Whole-globe images
bluemarble(): project NASA's Blue Marble image onto the map
shadedrelief(): project a shaded relief image onto the map
etopo(): draw an etopo relief image onto the map
warpimage(): project a user-provided image onto the map
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
In[11]: fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h',    # map projection, high resolution
            lat_0=37.5, lon_0=-119,
            width=1E6, height=1.2E6)
m.shadedrelief()              # draw a shaded relief (satellite-style) background
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# scatter the city data, with color reflecting log10(population) and size reflecting area
m.scatter(lon, lat, latlon=True,
          c=np.log10(population), s=area,
          cmap='Reds', alpha=0.5)
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)                # set the color limits of the current image
# make a legend with dummy points for the area scale
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a,
                label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
           labelspacing=1, loc='lower left');
In[1]: import numpy as np
import pandas as pd
In[2]: rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)    # cumulative sum of random steps: six random walks
In[3]: plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');