FDS Notes
UNIT – I
Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and building applications – Data Mining – Data Warehousing – Basic statistical descriptions of data

INTRODUCTION
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data produced today. It adds methods from computer science to the repertoire of statistics.
Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and noncommercial settings.
Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, competition, and products.
Many companies use data science to offer customers a better user experience, as well as to cross-sell,
up-sell, and personalize their offerings.
Governmental organizations are also aware of data's value. Many governmental organizations not only rely on internal data scientists to discover valuable information, but also share their data with the public.
Nongovernmental organizations (NGOs) use it to raise money and defend their causes.
Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOC) produces a lot of data, which allows
universities to study how this type of learning can complement traditional classes.
Facets of data
In data science and big data you'll come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming
Let's explore all these interesting data types.
Structured data
Structured data is data that depends on a data model and resides in a fixed field within a record. As such, it's often easy to store structured data in tables within databases or Excel files.
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
Unstructured data
Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email.
Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
Even state-of-the-art techniques aren't able to decipher the meaning of every piece of text.
Machine-generated data
Machine-generated data is information that's automatically created by a computer, process, application, or other machine without human intervention.
Machine-generated data is becoming a major data resource and will continue to do so.
The analysis of machine data relies on highly scalable tools, due to its high volume and speed. Examples of machine data are web server logs, call detail records, network event logs, and telemetry.
Graph-based or network data
"Graph data" can be a confusing term because any data can be shown in a graph.
Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
Graph structures use nodes, edges, and properties to represent and store graphical data.
Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
MLBAM (Major League Baseball Advanced Media) announced in 2014 that they'll increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics.
Recently a company called DeepMind succeeded at creating an algorithm that's capable of learning how to play video games.
This algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning.
Streaming data
The data flows into the system when an event happens instead of being loaded into a data store in a batch.
Examples are the "What's trending" on Twitter, live sporting or music events, and the stock market.
Data Science Process
Overview of the data science process
The typical data science process consists of six steps through which you'll iterate, as shown in the figure below.
1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step includes
finding suitable data and getting access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it's time to prepare it. This includes transforming the data from a raw form into data that's directly usable in your models. To achieve this, you'll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data.
You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The
insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as "data modeling" throughout this book). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to
bring out the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the analysis, if
needed. One goal of a project is to change a process and/or make better decisions. You may still need
to convince the business that your findings will indeed change the business process as expected.
This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.
Defining research goals
A project starts by understanding the what, the why, and the how of your project. The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a timetable. This information is then best placed in a project charter.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
A clear research goal
The project mission and context
How you're going to perform your analysis
What resources you expect to use
Proof that it's an achievable project, or proof of concept
Deliverables and a measure of success
A timeline
Retrieving data
The next step in data science is to retrieve the required data. Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won't be involved in this step.
Many companies will have already collected and stored the data for you, and what they don't have can often be bought from third parties.
More and more organizations are making even high-quality data freely available for public and commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need.
Start with data stored within the company (internal data)
Most companies have a program for maintaining key data, so much of the cleaning work may already
be done. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
Data warehouses and data marts are home to preprocessed data, while data lakes contain data in its natural or raw format.
Finding data even within your own company can sometimes be a challenge. As companies grow, their data becomes scattered around many places. The data may be dispersed as people change positions and leave the company.
Getting access to data is another difficult task. Organizations understand the value and sensitivity of
data and often have policies in place so everyone has access to what they need and nothing more.
These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries.
External Data
If data isn't available inside your organization, look outside your organization. Companies provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and Facebook.
More and more governments and organizations share their data for free with the world.
A list of open-data providers should get you started.
Data Preparation (Cleansing, Integrating, Transforming Data)
Your model needs the data in a specific format, so data transformation will always come into play. It’s a good
habit to correct data errors as early on in the process as possible. However, this isn’t always possible in a
realistic setting, so you’ll need to take corrective actions in your program.
Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
The first type is the interpretation error, such as when you take the value in your data for granted, like saying that a person's age is greater than 300 years.
The second type of error points to inconsistencies between data sources or against your company's standardized values.
An example of this class of errors is putting "Female" in one table and "F" in another when they represent the same thing: that the person is female.
Overview of common errors
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, in the figure we use a measure to identify data points
that seem out of place. We do a regression to get acquainted with the data and detect the influence of
individual observations on the regression line.
Data Entry Errors
Data collection and data entry are error-prone processes. They often require human intervention, and this can introduce errors into the chain.
Data collected by machines or computers isn't free from errors either. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
Detecting data errors when the variables you study don’t have many classes can be done by tabulating
the data with counts.
When you have a variable that can take only two values: “Good” and “Bad”, you can create a
frequency table and see if those are truly the only two values present. In the table, the values "Godo" and "Bade" point out that something went wrong in at least 16 cases.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == "Godo":
    x = "Good"
elif x == "Bade":
    x = "Bad"
Redundant Whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
Whitespace causes mismatches in strings, such as "FR " versus "FR", which can lead to dropping the observations that couldn't be matched.
If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will remove the leading and trailing
whitespaces. For instance, in Python you can use the strip() function to remove leading and trailing
spaces.
Fixing Capital Letter Mismatches
Capital letter mismatches are common. Most programming languages make a distinction between "Brazil" and "brazil".
In this case you can solve the problem by applying a function that returns both strings in lowercase, such as .lower() in Python. "Brazil".lower() == "brazil".lower() should result in True.
Impossible Values and Sanity Checks
Here you check the value against physically or theoretically impossible values, such as people taller than 3 meters or someone with an age of 299 years. Sanity checks can be directly expressed with rules:

check = 0 <= age <= 120
Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the upper side when a normal distribution is expected.
8
Dealing with Missing Values
Missing values aren't necessarily wrong, but you still need to handle them separately; certain modeling techniques can't handle missing values. They might be an indicator that something went wrong in your data collection or that an error happened in the ETL process. Common techniques data scientists use are listed in the table.
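Two of the most common techniques, sketched with pandas on a made-up frame (omission and mean imputation; the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "income": [40000, 52000, np.nan]})

dropped = df.dropna()                             # omit rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))   # impute with the column mean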
Integrating data
Your data comes from several different places, and in this substep we focus on integrating these different sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.
The Different Ways of Combining Data
You can perform two operations to combine information from different data sets.
Joining
Appending or stacking
Joining Tables
Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation.
Let's say that the first table contains information about the purchases of a customer and the other table contains information about the region where your customer lives.
Joining the tables allows you to combine the information so that you can use it for your model, as shown in the figure.
Figure. Joining two tables on the item and region key
To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
The number of resulting rows in the output table depends on the exact join type that you use.
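A minimal sketch of such a join with pandas (the key and column names are made up for illustration):

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2], "amount": [50, 75]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# customer_id acts as the key; an inner join keeps only matching rows
enriched = purchases.merge(regions, on="customer_id", how="inner")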
Appending Tables
Appending or stacking tables is effectively adding observations from one table to another table.
One table contains the observations from the month January and the second table contains observations from the month February. The result of appending these tables is a larger one with the observations from January as well as February.
Figure. Appending data from tables is a common operation but requires an equal structure in the tables being appended.
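A minimal sketch of appending with pandas (hypothetical monthly tables with identical columns):

import pandas as pd

january = pd.DataFrame({"date": ["2024-01-05"], "orders": [120]})
february = pd.DataFrame({"date": ["2024-02-03"], "orders": [140]})

# Stacking requires the same structure (columns) in both tables
combined = pd.concat([january, february], ignore_index=True)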
Transforming data
Certain models require their data to be in a certain shape, so transforming your data so it takes a suitable form for data modeling will always come into play.
Relationships between an input variable and an output variable aren't always linear. Take, for instance, a relationship of the form y = ae^(bx). Taking the logarithm of y gives log y = log a + bx, which reduces the estimation to a simple linear problem. Transforming the variables greatly simplifies the estimation problem. Other times you might want to combine two variables into a new variable.
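A small sketch of this transformation with NumPy (made-up a and b; after the log transform, an ordinary linear fit recovers them):

import numpy as np

a, b = 2.0, 0.5
x = np.linspace(0, 5, 50)
y = a * np.exp(b * x)            # hypothetical exponential relationship

# log(y) = log(a) + b * x, so a degree-1 fit recovers b and log(a)
slope, intercept = np.polyfit(x, np.log(y), deg=1)
print(slope, np.exp(intercept))  # approximately 0.5 and 2.0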
Reducing the Number of Variables
Having too many variables in your model makes the model difficult to handle, and certain techniques
don’t perform well when you overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
Data scientists use special methods to reduce the number of variables but retain the maximum amount
of data.
The figure shows how reducing the number of variables makes it easier to understand the key values. It also shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% + component2 = 22.8%). These variables, called "component1" and "component2," are both combinations of the original variables. They're the principal components of the underlying data structure.
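A minimal sketch of this reduction using scikit-learn's PCA on made-up data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))         # hypothetical: 100 observations, 10 variables

pca = PCA(n_components=2)
components = pca.fit_transform(X)      # each component combines the original variables
print(pca.explained_variance_ratio_)   # share of variation per component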
Turning Variables into Dummies
Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.
In this case you'll make separate columns for the classes stored in one variable and indicate it with 1 if the class is present and 0 otherwise.
An example is turning one column named Weekdays into the columns Monday through Sunday. You
use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0 elsewhere.
Turning variables into dummies is a technique that's used in modeling and is popular with, but not exclusive to, economists.
Figure. Turning variables into dummies is a data transformation that breaks a variable that has multiple classes into multiple variables, each having only two possible values: 0 or 1.
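A one-line version of this transformation in pandas (hypothetical weekday column):

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday"]})

# One 0/1 column per class of the original variable
dummies = pd.get_dummies(df["weekday"])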
Exploratory data analysis
During exploratory data analysis you take a deep dive into the data (see the figure below). Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
The goal isn't to cleanse the data, but it's common that you'll still discover anomalies you missed before, forcing you to take a step back and fix them.
The visualization techniques you use in this phase range from simple line graphs or histograms, as shown in the figure below, to more complex diagrams such as Sankey and network graphs.
Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make it easier and, let's admit it, way more fun.
The techniques we described in this phase are mainly visual, but in practice they’re certainly not limited to
visualization techniques. Tabulation, clustering, and other modeling techniques can also be a part of
exploratory analysis. Even building simple models can be a part of this step.
Build the models
With clean data in place and a good understanding of the content, you’re ready to build models with
the goal of making better predictions, classifying objects, or gaining an understanding of the system
that you’re modeling.
This phase is much more focused than the exploratory analysis step, because you know what you’re
looking for and what you want the outcome to be.
Building a model is an iterative process. The way you build your model depends on whether you go with
classic statistics or the somewhat more recent machine learning school, and the type of technique you want to
use. Either way, most models consist of the following main steps:
Selection of a modeling technique and variables to enter in the model
Execution of the model
Diagnosis and model comparison
Model execution
Once you've chosen a model you'll need to implement it in code.
Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn. These packages use several of the most popular techniques.
Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. As you can see in the following code, it's fairly easy to use linear regression with StatsModels or Scikit-learn.
Doing this yourself would require much more effort even for the simple techniques. The following
listing shows the execution of a linear prediction model.
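The original listing isn't reproduced here; the following is a minimal sketch, on made-up data, of what a linear prediction model looks like in each library:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3 * x + rng.normal(size=100)   # hypothetical linear relationship with noise

# StatsModels: add an intercept column explicitly, then fit ordinary least squares
results = sm.OLS(y, sm.add_constant(x)).fit()
print(results.params)              # intercept and slope

# Scikit-learn: the estimator handles the intercept itself
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.intercept_, model.coef_)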
Model diagnostics and model comparison
You'll be building multiple models from which you then choose the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward.
The principle here is simple: the model should work on unseen data. You use only a fraction of your data to estimate the model and the other part, the holdout sample, is kept out of the equation.
The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
Multiple error measures are available, and in the figure we show the general idea on comparing models. The error measure used in the example is the mean square error:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

Mean square error is a simple measure: check for every prediction how far it was from the truth, square this error, and add up the error of every prediction.
The figure above compares the performance of two models to predict the order size from the price. The first model is size = 3 * price and the second model is size = 10.
To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%), without showing the other 20% of data to the model.
Once the model is trained, we predict the values for the other 20% of the observations based on those for which we already know the true value, and calculate the model error with an error measure.
Then we choose the model with the lowest error. In this example we chose model 1 because it has the
lowest total error.
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these
assumptions are indeed met. This is called model diagnostics.
Presenting findings and building applications
Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.
This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s
sufficient that you implement only the model scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last stage of the data science process is where your soft skills will be most useful, and yes, they're extremely important.
Data mining
Data mining is the process of discovering actionable information from large sets of data. Data mining uses
mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
Forecasting: Estimating sales, predicting server loads or server downtime
Risk and probability: Choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes
Recommendations: Determining which products are likely to be sold together, generating recommendations
Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities
Building a mining model is part of a larger process that includes everything from asking questions about the data and creating a model to answer those questions, to deploying the model into a working environment. This process can be defined by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step.
The first step in the data mining process is to clearly define the problem, and consider ways that data can be
utilized to provide an answer to the problem.
What kind of data do you have and what kind of information is in each column? If there are multiple tables, how are the tables related? Do you need to perform any cleansing, aggregation, or processing to make the data usable?
How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of
the business?
Preparing Data
The second step in the data mining process is to consolidate and clean the data that was identified in
the Defining the Problem step.
Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries.
Data cleaning is not just about removing bad data or interpolating missing values, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating mean and standard
deviations, and looking at the distribution of the data. For example, you might determine by reviewing the
maximum, minimum, and mean values that the data is not representative of your customers or business
processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis
for your expectations. Standard deviations and other distribution values can provide useful information about
the stability and accuracy of the results.
Building Models
The mining structure is linked to the source of data, but does not actually contain any data until you process it. When you process the mining structure, SQL Server Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to test how well the model performs. Also,
when you build a model, you typically create multiple models with different configurations and test all models
to see which yields the best results for your problem and your data.
Deploying and Updating Models
After the mining models exist in a production environment, you can perform many tasks, depending on your
needs. The following are some of the tasks you can perform:
Use the models to create predictions, which you can then use to make business decisions.
Create content queries to retrieve statistics, rules, or formulas from the model.
Embed data mining functionality directly into an application. You can include Analysis Management
Objects (AMO), which contains a set of objects that your application can use to create, alter, process,
and delete mining structures and mining models.
Use Integration Services to create a package in which a mining model is used to intelligently separate
incoming data into multiple tables.
Create a report that lets users directly query against an existing mining model.
Update the models after review and analysis. Any update requires that you reprocess the models.
Updating the models dynamically as more data comes into the organization, and making constant changes to improve the effectiveness of the solution, should be part of the deployment strategy.
Data warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
Characteristics of a data warehouse
The main characteristics of a data warehouse are as follows:
Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the overall processes of a business. Such subjects may be sales, promotion, inventory, etc.
Integrated
A data warehouse is developed by integrating data from varied sources into a consistent format.
The data must be stored in the warehouse in a consistent and universally acceptable manner in terms of naming, format, and coding. This facilitates effective data analysis.
Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only. Previous
data is not erased when current data is entered. This helps you to analyze what has happened and when.
Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly or
implicitly. An example of time variance in Data Warehouse is exhibited in the Primary Key, which
must have an element of time like the day, week, or month.
Database vs. Data Warehouse
Although a data warehouse and a traditional database share some similarities, they need not be the same idea. The main difference is that in a database, data is collected for multiple transactional purposes. However, in a data warehouse, data is collected on an extensive scale to perform analytics. Databases provide real-time data, while warehouses store data to be accessed for big analytical queries.
Data Warehouse Architecture
Usually, data warehouse architecture comprises a three-tier structure.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system. Back-end tools are
used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two ways.
The ROLAP or Relational OLAP model is an extended relational database management system that maps multidimensional data operations to standard relational operations.
The MOLAP or multidimensional OLAP model directly acts on multidimensional data and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse. It holds various tools like query tools, analysis tools, reporting tools, and data mining tools.
Data Warehousing integrates data and information collected from various sources into one comprehensive database. For example, a data warehouse might combine customer information from an organization's point-of-sale systems, its mailing lists, website, and comment cards. It might also incorporate confidential information about employees, salary information, etc. Businesses use such components of a data warehouse to analyze customers.
Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns in
vast volumes of data and devising innovative strategies for increased sales and profits.
Types of Data Warehouse
There are three main types of data warehouse.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates decision-support services throughout the enterprise. The advantage of this type of warehouse is that it provides access to cross-organizational information, offers a unified approach to data representation, and allows running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real time. It is often preferred for routine activities like storing employee records. It is required when data warehouse systems do not support the reporting needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region, or business unit.
Every department of a business has a central repository or data mart to store data. The data from the data mart is stored in the ODS periodically. The ODS then sends the data to the EDW, where it is stored and used.
Summary
In this chapter you learned the data science process consists of six steps:
Setting the research goal—Defining the what, the why, and the how of your project in a project charter.
Retrieving data—Finding and getting access to data needed in your project. This data is either found within the company or retrieved from a third party.
Data preparation—Checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
Data exploration—Diving deeper into your data using descriptive statistics and visual techniques.
Data modeling—Using machine learning and statistical techniques to achieve your project goal.
Presentation and automation—Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
UNIT – II
DESCRIBING DATA
THREE TYPES OF DATA
Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category.
Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count. To determine the type of data, focus on a single observation in any collection of observations.
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
The weights can be described not only as quantitative data but also as observations for a quantitative variable, since the various weights take on different numerical values.
By the same token, the replies can be described as observations for a qualitative variable, since the replies to the Facebook profile question take on different values of either Yes or No.
Given this perspective, any single observation can be described as a constant, since it takes on only one value.
Discrete and Continuous Variables
Quantitative variables can be further distinguished as discrete or continuous.
A discrete variable consists of isolated numbers separated by gaps.
Discrete variables can only assume specific values that you cannot subdivide. Typically, you count discrete values, and the results are integers.
Examples
Counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5).
These variables cannot have fractional or decimal values. You can have 20 or 21 cats, but not 20.5.
The number of heads in a sequence of coin tosses.
The result of rolling a die.
The number of patients in a hospital.
The population of a country.
While discrete variables have no decimal places, the average of these values can be fractional. For example, families can have only a discrete number of children: 1, 2, 3, etc. However, the average number of children per family can be 2.2.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous variables have an infinite
number of potential values between any two points. Generally, you measure them using a scale.
Examples of continuous variables include weight, height, length, time, and temperature.
Durations, such as the reaction times of grade school children to a fire alarm; and standardized test scores,
such as those on the Scholastic Aptitude Test (SAT).
Independent and Dependent Variables
Independent Variable
In an experiment, an independent variable is the treatment manipulated by the investigator.
Independent variables (IVs) are the ones that you include in the model to explain or predict changes in the dependent variable.
Independent indicates that they stand alone and other variables in the model do not influence them.
Independent variables are also known as predictors, factors, treatment variables, explanatory variables, input variables, x-variables, and right-hand variables—because they appear on the right side of the equals sign in a regression equation.
It is a variable that stands alone and isn't changed by the other variables you are trying to measure.
For example, someone's age might be an independent variable. Other factors (such as what they eat, how much they go to school, or how much television they watch) will not change a person's age.
The impartial creation of distinct groups, which differ only in terms of the independent variable, has a most
desirable consequence. Once the data have been collected, any difference between the groups can be
interpreted as being caused by the independent variable.
Dependent Variable
When a variable is believed to have been influenced by the independent variable, it is called a dependent
variable. In an experimental setting, the dependent variable is measured, counted, or recorded by the
investigator.
The dependent variable (DV) is what you want to use the model to explain or predict. The values of this variable depend on other variables.
It's also known as the response variable, outcome variable, and left-hand variable. Graphs place dependent variables on the vertical, or Y, axis.
A dependent variable is exactly what it sounds like. It is something that depends on other factors.
Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable. Sometimes a confounding variable occurs because it's impossible to assign subjects randomly to different
conditions.
Describing Data with Tables and Graphs
Frequency Distributions for Quantitative Data
A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class.
When observations are sorted into classes of single values, as in Table 2.1, the result is referred to as a frequency distribution for ungrouped data.
The frequency distribution shown in Table 2.1 is only partially displayed because there are more than 100 possible values between the largest and smallest observations.
A frequency distribution table for ungrouped data is much more informative when the number of possible observed values is fewer than 20. If more values are observed, grouped data is used.
Grouped Data
When observations are sorted into classes of more than one value, according to their frequency of occurrence, the result is referred to as a frequency distribution for grouped data (shown in Table 2.2).
The general structure of this frequency distribution is that the data are grouped into class intervals with 10 possible values each.
The frequency (f) column shows the frequency of observations in each class and, at the bottom, the total number of observations in all classes.
GUIDELINES
OUTLIERS
An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest of the neighboring co-existing values in a data graph or dataset you're working with.
Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
RELATIVE FREQUENCY DISTRIBUTIONS
Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution.
This type of distribution is especially helpful when you must compare two or more distributions based on different total numbers of observations.
The conversion to relative frequencies allows a direct comparison of the shapes of two distributions without having to adjust for the different total numbers of observations.
Constructing Relative Frequency Distributions
To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency for the entire distribution.
Table 2.5 illustrates a relative frequency distribution based on the weight distribution of Table 2.2.
Percentages or Proportions
Some people prefer to deal with percentages rather than proportions because percentages usually lack
decimal points. A proportion always varies between 0 and 1, whereas a percentage always varies between
0 percent and 100 percent.
To convert the relative frequencies, multiply each proportion by 100; that is, move the decimal point two
places to the right.
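Continuing the frequency sketch above, both conversions are one-liners:

rel = freq / freq.sum()   # proportions, between 0 and 1
pct = rel * 100           # percentages, between 0 and 100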
CUMULATIVE FREQUENCY DISTRIBUTIONS
Cumulative frequency distributions show the total number of observations in each class and in all lower-ranked classes.
Cumulative frequencies are usually converted, in turn, to cumulative percentages. Cumulative percentages are often referred to as percentile ranks.
Constructing Cumulative Frequency Distributions
To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of each
class the sum of the frequencies of all classes ranked below it.
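Again continuing the pandas sketch above:

cum = freq.cumsum()                # cumulative frequency per class
cum_pct = cum / freq.sum() * 100   # cumulative percentage (percentile ranks)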
Cumulative Percentages
As has been suggested, if relative standing within a distribution is particularly important, then cumulative frequencies are converted to cumulative percentages.
To obtain this cumulative percentage, the cumulative frequency of the class should be divided by the total frequency of the entire distribution.
Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative
percentages are referred to as percentile ranks.
The percentile rank of a score indicates the percentage of scores in the entire distribution with similar or
smaller values than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights
constitute 80 percent of the entire distribution.
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
Frequency distributions for qualitative data are easy to construct. Simply determine the frequency with which observations occupy each class, and report these frequencies as shown in Table 2.7 for the Facebook profile survey.
Relative and Cumulative Distributions for Qualitative Data
Frequency distributions for qualitative variables can always be converted into relative frequency distributions.
If measurement is ordinal, because observations can be ordered from least to most, cumulative frequencies (and cumulative percentages) can be used.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency distribution. And data can often be described even more vividly by converting frequency distributions into graphs.
GRAPHS FOR QUANTITATIVE DATA
Histograms
A bar-type graph for quantitative data. The common boundaries between adjacent bars emphasize the continuity of the data, as with continuous variables.
A histogram is a display of statistical information that uses rectangles to show the frequency of data items in successive numerical intervals of equal size.
Important features of histograms
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency. (The units along the vertical axis do not have to be the same width as those along the horizontal axis.)
The intersection of the two axes defines the origin at which both numerical scales equal 0.
Numerical scales always increase from left to right along the horizontal axis and from bottom to top along the vertical axis.
The body of the histogram consists of a series of bars whose heights reflect the frequencies for the various classes.
The adjacent bars in histograms have common boundaries that emphasize the continuity of quantitative data for continuous variables.
The introduction of gaps between adjacent bars would suggest an artificial disruption in the data more appropriate for discrete quantitative variables or for qualitative variables.
Figure: Histogram
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons may be constructed directly from frequency distributions.
A. This panel shows the histogram for the weight distribution.
B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for classes on the horizontal axis, and connect them with straight lines.
C. Anchor the frequency polygon to the horizontal axis. First, extend the upper tail to the midpoint of the first unoccupied class on the upper flank of the histogram. Then extend the lower tail to the midpoint of the first unoccupied class on the lower flank of the histogram. Now all of the area under the frequency polygon is enclosed completely.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon.
Stem and Leaf Displays
Another technique for summarizing quantitative data is a stem and leaf display. Stem and leaf displays are ideal for summarizing distributions, such as that for weight data, without destroying the identities of individual observations.
Constructing a Stem and Leaf Display
The leftmost panel of the table re-creates the weights.
To construct the stem and leaf display for the table given below, first note that, when counting by tens, the weights range from the 130s to the 240s.
Arrange a column of numbers, the stems, beginning with 13 (representing the 130s) and ending with 24 (representing the 240s). Draw a vertical line to separate the stems, which represent multiples of 10, from the space to be occupied by the leaves, which represent multiples of 1.
For example
Enter each raw score into the stem and leaf display. As suggested by the shaded coding in Table 2.9, the first raw score of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of 193 reappears as a leaf of 3 on a stem of 19, and the third raw score of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each raw score reappears as a leaf on its appropriate stem.
TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important characteristic of a frequency distribution is its shape. The figure below shows some of the more typical shapes for smoothed frequency polygons (which ignore the inevitable irregularities of real data).
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
As with histograms, equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data. Likewise, equal segments along
the vertical axis reflect increases in frequency. The body of the bar graph consists of a series of bars
whose heights reflect the frequencies for the various words or classes.
A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not some
impossible intermediate value, such as 40 percent Yes and 60 percent No.
Gaps are placed between adjacent bars of bar graphs to emphasize the discontinuous nature of
qualitative data.
MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point of view.
Popular sayings include "Numbers don't lie, but statisticians do" and "There are three kinds of lies—lies, damned lies, and statistics."
Describing Data with Averages
MODE
The mode reflects the value of the most frequently occurring score. In other words,
a mode is defined as the value that has a higher frequency in a given set of values. It is the value that appears the most number of times.
Example:
In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it has appeared in the set twice.
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
When there are two modes in a data set, then the set is called bimodal.
For example, the mode of Set A = {2, 2, 2, 3, 4, 4, 5, 5, 5} is 2 and 5, because both 2 and 5 are repeated three times in the given set.
When there are three modes in a data set, then the set is called trimodal.
For example, the mode of Set A = {2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8} is 2, 5 and 8.
When there are four or more modes in a data set, then the set is called multimodal.
As another example, in a table of wickets taken by a bowler in different matches, if 2 wickets is the value that occurs most frequently, the mode of the given data is 2.
MEDIAN
The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves.
Finding the Median
Order scores from least to most.
If the total number of observations (n) is odd, then the median is the ((n + 1)/2)th term.
If the total number of observations is even, then the median = ½ [(n/2)th term + ((n/2) + 1)th term].
Example 1:
4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution:
n = 15
When we put those numbers in order we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
Median = ((n + 1)/2)th term = ((15 + 1)/2)th term = 8th term
The 8th term in the list is 24.
The median value of this set of numbers is 24.
Example 2:
Find the median of the following:
9, 7, 2, 11, 18, 12, 6, 4
Solution
n = 8
When we put those numbers in order we have:
2, 4, 6, 7, 9, 11, 12, 18
Median = ½ [(n/2)th term + ((n/2) + 1)th term]
= ½ [(8/2)th term + ((8/2) + 1)th term]
= ½ [4th term + 5th term]   (in our list the 4th term is 7 and the 5th term is 9)
= ½ [7 + 9]
= ½ (16)
= 8
The median value of this set of numbers is 8.
MEAN
The mean is found by adding all scores and then dividing by the number of scores.
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers by the total number of numbers.
Types of means
Sample mean
Population mean
Sample Mean
The sample mean is a central tendency measure. The arithmetic average is computed using samples or random values taken from the population. It is evaluated as the sum of all the sample variables divided by the total number of variables.
Population Mean
The population mean can be calculated as the sum of all values in the given data/population divided by the total number of values in the given data/population.
AVERAGES FOR QUALITATIVE AND RANKED DATA
Mode
The mode can always be used with qualitative data.
Median
The median can be used whenever it is possible to order qualitative data from least to most because the level
of measurement is ordinal.
Describing Variability
RANGE
The range is the difference between the largest and smallest scores.
The range in statistics for a given data set is the difference between the highest and lowest values. For example, if the given data set is {2, 5, 8, 10, 3}, then the range will be 10 − 2 = 8.
VARIANCE
Variance is a measure of how data points differ from the mean. A variance is a measure of how far a set of data (numbers) are spread out from their mean (average) value.
Formula
Variance = (Standard deviation)² = σ² = Σ(x − μ)² / n
To find the mean, the values of all scores must be added and then divided by the total number of scores.
Example
X = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7
Solution
Mean = sum(x)/n
n = 10
sum(x) = 5 + 8 + 6 + 10 + 12 + 9 + 11 + 10 + 12 + 7 = 90
Mean: μ = 90/10 = 9
Deviations from the mean:
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2
(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4
Σ(x − μ)² = 16 + 1 + 9 + 1 + 9 + 0 + 4 + 1 + 9 + 4 = 54
σ² = Σ(x − μ)² / n = 54/10 = 5.4
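The same computation in Python's standard library (note that pvariance divides by n, matching the population formula above):

import statistics

x = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]

print(statistics.mean(x))       # 9
print(statistics.pvariance(x))  # 5.4
print(statistics.pstdev(x))     # about 2.32, the standard deviation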
STANDARD DEVIATION
The standard deviation is the square root of the mean of all squared deviations from the mean, that is:
Standard deviation = √variance
Standard Deviation: A rough measure of the average (or standard) amount by which scores deviate on either side of their mean.
Standard Deviation: A Measure of Distance
The mean is a measure of position, but the standard deviation is a measure of distance (on either side of the mean of the distribution).
DEGREES OF FREEDOM (df)
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more mathematical restrictions, in a sample being used to estimate a population characteristic.
Degrees of freedom are the number of independent values that can vary in a statistical analysis. The values of these variables are without constraint, although they do impose restrictions on other variables if the data set is to comply with estimated parameters.
Degrees of Freedom (df): the number of values free to vary, given one or more mathematical restrictions.
Formula
Degrees of freedom: df = n − 1
Example
Consider a data set consisting of five positive integers whose sum must be a multiple of 6. The values are randomly selected as 3, 8, 5, and 4.
The sum of these four values is 20, so we have to choose the fifth integer to make the sum divisible by 6. Therefore the fifth element is 10: only four of the five values were free to vary, so df = 4.
The number of degrees of freedom appears in the denominator of the formulas for the sample variance and standard deviation. In fact, we can use degrees of freedom to rewrite these formulas:

s² = Σ(X − X̄)² / (n − 1)        s = √[Σ(X − X̄)² / (n − 1)]
INTERQUARTILE RANGE (IQR)
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More specifically, the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent) have been trimmed from the original set of scores. Since most distributions are spread more widely in their extremities than their middle, the IQR tends to be less than half the size of the range.
Simply put, the IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
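A minimal sketch with NumPy on made-up scores (np.percentile interpolates, which can differ slightly from the split-the-halves method described above):

import numpy as np

scores = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]   # hypothetical ordered scores

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)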
Normal Distributions and Standard (z) Scores
THE NORMAL CURVE
The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean,
so the right side of the center is a mirror image of the left side.
Properties of the Normal Curve
The normal curve is a theoretical curve defined for a continuous variable, as described in Section 1.6, and noted for its symmetrical bell-shaped form, as revealed in the figure below.
Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.
The normal curve peaks above a point midway along the horizontal spread and then tapers off
gradually in either direction from the peak (without actually touching the horizontal axis, since, in
theory, the tails of a normal curve extend infinitely far).
The values of the mean, median (or 50th percentile), and mode, located at a point midway along the
horizontal spread, are the same for the normal curve.
Properties of a normal distribution
The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.
Different Normal Curves
As a theoretical exercise, it is instructive to note the various types of normal curves that are produced
by an arbitrary change in the value of either the mean (μ) or the standard deviation (σ).
Obvious differences in appearance among normal curves are less important than you might suspect.
Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way once any distance from the mean is expressed in standard deviation units.
z SCORES
A z score can be defined as a measure of the number of standard deviations by which a score is below or above the mean of a distribution. In other words, it is used to determine the distance of a score from the mean. If the z score is positive it indicates that the score is above the mean. If it is negative then the score will be below the mean. However, if the z score is 0 it denotes that the data point is the same as the mean.
To obtain a z score, express any original score, whether measured in inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this deviation into standard deviation units (by dividing by its standard deviation):

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard deviation, respectively, for the normal distribution of the original scores. Since identical units of measurement appear in both the numerator and the denominator of this ratio, the units cancel and z scores are unit-free.
A z score consists of two parts:
1. A positive or negative sign indicating whether it's above or below the mean; and
2. A number indicating the size of its deviation from the mean in standard deviation units.
Converting to z Scores
Example
Suppose on a GRE test a score of 1100 is obtained. The mean score for the GRE test is 1026 and the
population standard deviation is 209. In order to find how well a person scored with respect to the score of an
average test taker, the z score will have to be determined.
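Applying the formula above:

z = (1100 − 1026) / 209 = 74 / 209 ≈ 0.35

so the score is about 0.35 standard deviations above the mean of the average test taker.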
STANDARD NORMAL CURVE
If the original distribution approximates a normal curve, then the shift to standard or z scores will always
produce a new distribution that approximates the standard normal curve. This is the one normal curve for
which a table is actually available.
For a standard normal curve:
Mean = 0
Standard deviation = 1
Standard Normal Table
The standard normal table consists of columns of z scores coordinated with columns of proportions.
Using the Top Legend of the Table
Notice that columns are arranged in sets of three, designated as A, B, and C in the legend at the top of the table. When using the top legend, all entries refer to the upper half of the standard normal curve. The entries in column A are z scores, beginning with 0.00 and ending with 4.00.
Given a z score of zero or more, columns B and C indicate how the z score splits the area in the upper half of the normal curve. As suggested by the shading in the top legend, column B indicates the proportion of area between the mean and the z score, and column C indicates the proportion of area beyond the z score, in the upper tail of the standard normal curve.
Using the Bottom Legend of the Table
Now the columns are designated as A′, B′, and C′ in the legend at the bottom of the table. When using the bottom legend, all entries refer to the lower half of the standard normal curve.
Given a negative z score, columns B′ and C′ indicate how that z score splits the lower half of the normal curve. As suggested by the shading in the bottom legend of the table, column B′ indicates the proportion of area between the mean and the negative z score, and column C′ indicates the proportion of area beyond the negative z score, in the lower tail of the standard normal curve.
FINDING PROPORTIONS
Finding Proportions for One Score
Sketch a normal curve and shade in the target area.
Plan your solution according to the normal table.
Convert X to z.
Find the target area.
FindingProportionsbetweenTwoScores
Sketch a normal curve and shade in the target area, (example, find proportion between 245
to 255)
Planyoursolutionaccordingtothenormaltable.
ConvertXtozbyexpressing255as
Findthetarget area.
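As a sketch only (not from the original notes), the same table lookups can be reproduced with scipy.stats.norm, the standard normal distribution; the mean and standard deviation below are hypothetical values chosen for illustration:

from scipy.stats import norm   # norm.cdf(z) gives the area to the left of z

mu, sigma = 250, 4             # hypothetical mean and standard deviation
# proportion below a single score of 255:
print(norm.cdf((255 - mu) / sigma))
# proportion between the two scores 245 and 255:
z1, z2 = (245 - mu) / sigma, (255 - mu) / sigma
print(norm.cdf(z2) - norm.cdf(z1))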
FINDINGSCORES
So far, we have concentrated on normal curve problems for which Table A must be consulted to find
the unknown proportion (of area) associated with some known score or pair of known scores
NowwewillconcentrateontheoppositetypeofnormalcurveproblemforwhichTableAmustbe consulted to find
the unknown score or scores associated with some known proportion.
ForthistypeofproblemrequiresthatwereverseouruseofTableAbyenteringproportionsin columns B, C, B′, or C′
and finding z scores listed in columns A or A′.
Finding One Score
Sketch a normal curve and, on the correct side of the mean, draw a line representing the target score, as in the figure.
It is often helpful to visualize the target score as splitting the total area into two sectors: one to the left of (below) the target score and one to the right of (above) it.
Plan your solution according to the normal table.
In problems of this type, you must plan how to find the z score for the target score. Because the target score is on the right side of the mean, concentrate on the area in the upper half of the normal curve, as described in columns B and C.
Find z.
Convert z to the target score.
When converting z scores to original scores, it is usually more efficient to use the following equation:

X = μ + (z)(σ)
Finding Two Scores
Sketch a normal curve. On either side of the mean, draw two lines representing the two target scores, as in the figure.
Plan your solution according to the normal table.
Find z.
Convert z to the target scores.
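A small sketch (not from the original notes) of this reverse lookup using scipy.stats.norm.ppf, the inverse of the cumulative table; the mean, standard deviation, and proportion below are hypothetical:

from scipy.stats import norm

mu, sigma = 100, 15              # hypothetical mean and standard deviation
p = 0.95                         # known proportion of area below the unknown score
z = norm.ppf(p)                  # inverse table lookup for the z score
x = mu + z * sigma               # convert z back to an original score: X = mu + z*sigma
print(round(z, 2), round(x, 1))  # 1.64 124.7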
Points to Remember
1. Range = largest value − smallest value in a list
2. Class interval = range / desired number of classes
3. Relative frequency = frequency (f) / Σf
4. Cumulative frequency: add to the frequency of each class the sum of the frequencies of all classes ranked below it.
5. Cumulative percentage = (cumulative f / Σf) × 100
6. Histograms
7. Construction of frequency polygons
8. Stem and leaf display
9. Mode: the value of the most frequent score.
10. For an odd number of terms, Median = {(n + 1)/2}th term/observation. For an even number of terms, Median = ½[(n/2)th term + {(n/2) + 1}th term].
11. Mean = sum of all scores / number of scores;
    Variance = (Standard deviation)² = σ² = Σ(x − μ)²/n
12. Range(X) = Max(X) − Min(X)
15. z score: z = (X − μ)/σ
18. Finding proportions: (a) for one score, (b) between two scores
19. Finding scores: for one score or for two scores
Unit – III
DESCRIBING RELATIONSHIPS
Correlation – Scatter plots – correlation coefficient for quantitative data – computational formula for correlation coefficient – Regression – regression line – least squares regression line – Standard error of estimate – interpretation of r² – multiple regression equations – regression toward the mean
Correlation
Correlation refers to a process for establishing the relationships between two variables. A way to get a general idea about whether or not two variables are related is to plot them on a scatter plot. While there are many measures of association for variables measured at the ordinal or higher level of measurement, correlation is the most commonly used approach.
Types of Correlation
Positive Correlation – the values of the two variables move in the same direction, so that an increase/decrease in the value of one variable is followed by an increase/decrease in the value of the other variable.
Negative Correlation – the values of the two variables move in opposite directions, so that an increase/decrease in the value of one variable is followed by a decrease/increase in the value of the other variable.
No Correlation – there is no linear dependence or relation between the two variables.
SCATTER PLOTS
A scatter plot is a graph containing a cluster of dots that represents all pairs of scores. In other words, scatter plots are graphs that present the relationship between two variables in a data set, representing data points on a two-dimensional plane or Cartesian system.
Construction of scatter plots
The independent variable or attribute is plotted on the X-axis (Fig. 6.1).
The dependent variable is plotted on the Y-axis.
Positive, Negative, or Little or No Relationship?
A dot cluster that slopes from the upper left to the lower right, as in panel B of the figure below, reflects a negative relationship; a cluster that slopes from the lower left to the upper right reflects a positive relationship.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and therefore reflects a linear relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in the figure below, and therefore reflects a curvilinear relationship.
A CORRELATION COEFFICIENT FOR QUANTITATIVE DATA: r
The correlation coefficient, r, is a summary measure that describes the extent of the statistical relationship between two interval or ratio level variables.
Properties of r
The correlation coefficient is scaled so that it is always between −1 and +1.
When r is close to 0, there is little relationship between the variables; the farther r is from 0, in either the positive or negative direction, the greater the relationship between the two variables.
The sign of r indicates the type of linear relationship, whether positive or negative.
The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign indicates a negative relationship.
COMPUTATION FORMULA FOR r
Calculate a value for r by using the following computation formula:

r = [NΣxy − (Σx)(Σy)] / √{[NΣx² − (Σx)²][NΣy² − (Σy)²]}
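As an illustration only (not part of the original notes), NumPy's np.corrcoef computes the same coefficient; the paired scores reuse the data from the regression example later in this unit:

import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])
r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(round(r, 3))            # about 0.98, a strong positive relationship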
REGRESSION
A regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables. A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the explanatory variables.
Regression captures the correlation between variables observed in a data set, and quantifies whether those correlations are statistically significant or not.
A Regression Line
A regression line is a line that best describes the behaviour of a set of data; in other words, it is the line that best fits the trend of the given data.
Types of regression
The two basic types of regression are:
Simple linear regression
Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y.
Multiple linear regression
Multiple linear regression uses two or more independent variables to predict the outcome.
Predictive Errors
Prediction error refers to the difference between the values predicted by a model and the actual values.
LEAST SQUARES REGRESSION LINE
The placement of the regression line minimizes not the total predictive error but the total squared
predictive error, that is, the total for all squared predictive errors. When located in this fashion, the regression
line is often referred to as the least squares regression line.
The Least Squares Regression Line is the line that minimizes the sum of the residuals squared. The
residual is the vertical distance between the observed point and the predicted point, and it is calculated by
subtracting ˆy from y.
Formula
b = [NΣ(xy) − ΣxΣy] / [NΣ(x²) − (Σx)²]
a = (Σy − bΣx) / N
Example
x    y
2    4
3    5
5    7
7    10
9    15
Step 1: Count the values: N = 5.
Step 2: Find the sums: Σx = 26, Σy = 41, Σxy = 263, Σx² = 168.
Step 3: Calculate the slope b:
b = [NΣ(xy) − ΣxΣy] / [NΣ(x²) − (Σx)²]
  = (1315 − 1066) / (840 − 676)
  = 249 / 164
b = 1.5183
Step 4: Calculate the intercept a:
a = (Σy − bΣx) / N
  = (41 − 1.5183 × 26) / 5
a = 0.3049
Step 5: y′ = bx + a
y′ = 1.518x + 0.305
x    y    y′ = 1.518x + 0.305    error (y′ − y)
2    4    3.34                   −0.66
3    5    4.86                   −0.14
5    7    7.89                    0.89
7    10   10.93                   0.93
9    15   13.97                  −1.03
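A brief sketch (not from the original notes) verifying the same slope and intercept with NumPy's least-squares polynomial fit:

import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])
b, a = np.polyfit(x, y, 1)        # degree-1 fit returns (slope, intercept)
print(round(b, 4), round(a, 4))   # 1.5183 0.3049, matching the hand computation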
STANDARD ERROR OF ESTIMATE, s y|x
The standard error of the estimate is a measure of the accuracy of predictions. The regression line is the line that minimizes the sum of squared deviations of prediction (also called the sum of squares error), and the standard error of the estimate is the square root of the average squared deviation.
The standard error of estimate, symbolized as s y|x, complies with the general format for any sample standard deviation, that is, the square root of a sum of squares term divided by its degrees of freedom:

s y|x = √[Σ(y − y′)² / (n − 2)]

Fig. Predictive errors for five friends
Example
Calculate the standard error of estimate for the given X and Y values: X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5.
Solution
Create five columns labeled x, y, y′, y − y′, and (y − y′)², with N = 5.
b = [NΣ(xy) − ΣxΣy] / [NΣ(x²) − (Σx)²]
  = (5 × 66 − 15 × 20) / (5 × 55 − 15²)
  = (330 − 300) / (275 − 225)
b = 30/50 = 0.6
a = (Σy − bΣx) / N
  = (20 − 0.6 × 15) / 5
  = (20 − 9) / 5
a = 11/5 = 2.2
The prediction equation is y′ = 0.6x + 2.2, and the sum of the squared predictive errors is Σ(y − y′)² = 2.4, so
s y|x = √(2.4 / 3) = 0.894
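The same computation as a minimal NumPy sketch (an illustration, not part of the original notes):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
b, a = np.polyfit(x, y, 1)            # 0.6 and 2.2, as computed above
y_pred = b * x + a                    # predicted values y'
ss = np.sum((y - y_pred) ** 2)        # sum of squared predictive errors = 2.4
s_yx = np.sqrt(ss / (len(x) - 2))     # divide by degrees of freedom n - 2
print(round(s_yx, 3))                 # 0.894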
INTERPRETATION OF r²
R-squared (r² or the coefficient of determination) is a statistical measure in a regression model that determines the proportion of variance in the dependent variable that can be explained by the independent variable. In other words, r-squared shows how well the data fit the regression model (the goodness of fit).
R-squared can take any value between 0 and 1. Although the statistical measure provides some useful insights regarding the regression model, the user should not rely only on this measure when assessing a statistical model.
In addition, it does not indicate the correctness of the regression model. Therefore, the user should always draw conclusions about the model by analyzing r-squared together with the other variables in a statistical model.
The most common interpretation of r-squared is how well the regression model explains observed data.
MULTIPLE REGRESSION EQUATIONS
Multiple regression is a statistical technique applied to datasets to draw out a relationship between one response or dependent variable and multiple independent variables.
Multiple regression works by considering the values of the available multiple independent variables and predicting the value of one dependent variable.
Example:
A researcher decides to study students' performance at a school over a period of time. He observed that as lectures moved online, the performance of students started to decline as well. The parameters for the dependent variable "decrease in performance" are various independent variables like "lack of attention, more internet addiction, neglecting studies" and so on.
Formula to find multiple regression
y = b₁x₁ + b₂x₂ + … + bₙxₙ + a
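A small sketch (not from the original notes) of fitting such an equation with NumPy's least-squares solver; the two predictors and the response below are made-up values chosen so that y = 1·x1 + 2·x2 + 1 holds exactly:

import numpy as np

X = np.array([[1.0, 2.0],    # columns: x1, x2
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

# append a column of ones so the intercept a is estimated as well
X1 = np.column_stack([X, np.ones(len(X))])
coef, residuals, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
b1, b2, a = coef
print(b1, b2, a)             # approximately 1.0, 2.0, and 1.0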
REGRESSION TOWARD THE MEAN
Regression toward the mean refers to a tendency for scores, particularly extreme scores, to shrink toward the mean.
In statistics, regression toward the mean (also called reversion to the mean, or reversion to mediocrity) is a concept that refers to the fact that if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean.
Example
A military commander has two units return, one with 20% casualties and another with 50% casualties. He praises the first and berates the second. The next time, the two units return with the opposite results. From this experience, he "learns" that praise weakens performance and berating increases performance.
The Regression Fallacy
The regression fallacy is committed whenever regression toward the mean is interpreted as a real, rather than a chance, effect.
The regression fallacy can be avoided by splitting the subset of extreme observations into two groups.
UNIT – IV
PYTHON LIBRARIES FOR DATA WRANGLING
Basics of NumPy arrays – aggregations – computations on arrays – comparisons, masks, boolean logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and selection – operating on data – missing data – Hierarchical indexing – combining datasets – aggregation and grouping – pivot tables
NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.
NumPy Array Attributes
ndim (the number of dimensions)
shape (the size of each dimension)
size (the total size of the array)
Example
import numpy as np

np.random.seed(0)  # seed for reproducibility
x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
Array Indexing: Accessing Single Elements
Indexing in NumPy will feel quite familiar: it works much like standard Python list indexing.
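A short sketch (not in the original notes) using the x1 and x2 arrays defined above:

x1[0]            # first element of the one-dimensional array
x1[-1]           # negative indices count from the end of the array
x2[0, 0]         # multidimensional arrays take a comma-separated tuple of indices
x2[2, -1] = 99   # indexing can also be used to modify values in place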
Array Slicing: Accessing Subarrays
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
start – the array index at which the slice begins
stop – the array index at which the slice ends (the stop value itself is not included)
step – the stride between elements taken from start to stop
The defaults are start=0, stop=size of dimension, step=1.
Example
x = np.arange(10)
x[:5]   # first five elements
array([0, 1, 2, 3, 4])
x[5:]   # elements after index 5
array([5, 6, 7, 8, 9])
Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas. For example:
x2
array([[12, 5, 2, 4],
       [ 7, 6, 8, 8],
       [ 1, 6, 7, 7]])
x2[:2, :3]   # two rows, three columns
array([[12, 5, 2],
       [ 7, 6, 8]])
x2[:3, ::2]  # all rows, every other (every second) column
array([[12, 2],
       [ 7, 8],
       [ 1, 7]])
Finally, subarray dimensions can even be reversed together:
x2[::-1, ::-1]
array([[ 7, 7, 6,  1],
       [ 8, 8, 6,  7],
       [ 4, 2, 5, 12]])
Reshaping of Arrays
The most flexible way of doing this is with the reshape() method. For example, if you want to put the numbers 1 through 9 in a 3×3 grid, you can do the following:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Array Concatenation and Splitting
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
More than two arrays can be concatenated at once:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1  2  3  3  2  1 99 99 99]
np.concatenate can also be used for two-dimensional arrays:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
np.concatenate([grid, grid])   # concatenate along the first axis
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])
Concatenate along the second axis (zero-indexed):
np.concatenate([grid, grid], axis=1)
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
np.vstack (vertical stack) function
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
np.vstack([x, grid])
array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])
np.hstack (horizontal stack) function
y = np.array([[99],
              [99]])
np.hstack([grid, y])
array([[ 9, 8, 7, 99],
       [ 6, 5, 4, 99]])
Splittingofarrays
Theoppositeofconcatenationis splitting,whichisimplementedbythe functionsnp.split,np.hsplit, and
np.vsplit.Foreach ofthese,wecanpass alist ofindicesgivingthesplit points
x=[1,2,3,99,99,3,2, 1]
x1,x2,x3=np.split(x,[3,5])
print(x1, x2, x3)
[123][9999][321]
The related functions np.vsplit and np.hsplit split along the vertical and horizontal axes of a two-dimensional array (here grid = np.arange(16).reshape((4, 4)), reconstructed from the outputs below):
grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
Computation on NumPy Arrays: Universal Functions
Introducing UFuncs
NumPy provides a convenient interface into just this kind of statically typed, compiled routine. This is known as a vectorized operation.
Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on values in NumPy arrays. Ufuncs are extremely flexible: besides operations between a scalar and an array, we can also operate between two arrays.
Exploring NumPy's UFuncs
Ufuncs exist in two flavors: unary ufuncs, which operate on a single input, and binary ufuncs, which operate on two inputs. We'll see examples of both these types of functions here.
Array arithmetic
NumPy's ufuncs make use of Python's native arithmetic operators. The standard addition, subtraction, multiplication, and division can all be used:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
Absolute value
Just as NumPy understands Python's built-in arithmetic operators, it also understands Python's built-in absolute value function:
np.abs()
np.absolute()
x = np.array([-2, -1, 0, 1, 2])
abs(x)
array([2, 1, 0, 1, 2])
np.abs(x)
array([2, 1, 0, 1, 2])
Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist are the trigonometric functions:
np.sin()
np.cos()
np.tan()
Inverse trigonometric functions:
np.arcsin()
np.arccos()
np.arctan()
Defining an array of angles: theta = np.linspace(0, np.pi, 3)
Compute some trigonometric functions:
print("theta      =", theta)
print("sin(theta) =", np.sin(theta))
print("cos(theta) =", np.cos(theta))
print("tan(theta) =", np.tan(theta))
Exponents and logarithms
Another common type of operation available in NumPy ufuncs is the exponentials:
np.exp(x) – calculates the exponential of all elements in the input array, i.e., e**x (e ≈ 2.71828)
np.exp2(x) – calculates 2**x for all x in the array
np.power(x, y) – calculates x raised to the power y
x = [1, 2, 3]
print("x   =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
The inverses of the exponentials, the logarithms, are also available. The basic np.log gives the natural logarithm; the base-2 and base-10 logarithms have their own functions:
np.log(x) – calculates the natural logarithm of each element of the input array
np.log2(x) – calculates the base-2 logarithm of x
np.log10(x) – calculates the base-10 logarithm of x
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))
Specialized ufuncs
NumPy has many more ufuncs available, including:
Hyperbolic trig functions
Bitwise arithmetic
Comparison operators
Conversions from radians to degrees
Rounding and remainders, and much more
More specialized and obscure ufuncs live in the submodule scipy.special. If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special, for example the gamma function (see the sketch below).
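A brief sketch (not in the original notes) of a few scipy.special routines, following the usual gamma-function example:

from scipy import special

x = [1, 5, 10]
print("gamma(x)     =", special.gamma(x))    # gamma function, a generalized factorial
print("ln|gamma(x)| =", special.gammaln(x))  # log-gamma, numerically safer for large x
print("beta(x, 2)   =", special.beta(x, 2))  # related beta function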
Advanced Ufunc Features
Specifying output
Rather than creating a temporary array, you can write computation results directly to the memory location where you'd like them to be. For all ufuncs, you can do this using the out argument of the function:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[ 0. 10. 20. 30. 40.]
Aggregates
To reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains:
x = np.arange(1, 6)
np.add.reduce(x)
15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x)
120
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x)
array([ 1,  3,  6, 10, 15])
Outer products
Any ufunc can compute the output of all pairs of two different inputs using the outer method. This allows you, in one line, to do things like create a multiplication table:
x = np.arange(1, 6)
np.multiply.outer(x, x)
array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25]])
Aggregations: Min, Max, and Everything in Between
Minimum and Maximum
Python has built-in min and max functions, used to find the minimum value and maximum value of any given array.
For min, max, sum, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself.
np.min() – finds the minimum (smallest) value in the array
np.max() – finds the maximum (largest) value in the array
Example
x = [1, 2, 3, 4]
np.min(x)
1
np.max(x)
4
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column.
By default, each NumPy aggregation function returns the aggregate over the entire array; e.g., np.sum() calculates the sum of all elements of the array.
Example
M = np.random.random((3, 4))
print(M)
M.sum()
6.0850555667307118
Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. The axis is normally either 0 or 1: with axis=0 the aggregate runs down the columns, with axis=1 it runs along the rows.
Example
We can find the minimum value within each column by specifying axis=0:
M.min(axis=0)
array([0.66859307, 0.03783739, 0.19544769, 0.06682827])
Similarly, we can find the maximum value within each row:
M.max(axis=1)
array([0.8967576 , 0.99196818, 0.6687194 ])
Other aggregation functions
NumPy provides many other aggregation functions; most aggregates have a NaN-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point NaN value.

Function Name    NaN-safe Version    Description
np.sum           np.nansum           Compute sum of elements
np.prod          np.nanprod          Compute product of elements
np.mean          np.nanmean          Compute mean of elements
np.std           np.nanstd           Compute standard deviation
np.var           np.nanvar           Compute variance
np.min           np.nanmin           Find minimum value
np.max           np.nanmax           Find maximum value
np.argmin        np.nanargmin        Find index of minimum value
np.argmax        np.nanargmax        Find index of maximum value
np.median        np.nanmedian        Compute median of elements
np.percentile    np.nanpercentile    Compute rank-based statistics of elements
np.any           N/A                 Evaluate whether any elements are true
np.all           N/A                 Evaluate whether all elements are true
Computation on Arrays: Broadcasting
Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.
For arrays of the same size, binary operations are performed on an element-by-element basis:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
Broadcasting allows these types of binary operations to be performed on arrays of different sizes:
a + 5
array([5, 6, 7])
We can think of this as an operation that stretches or duplicates the value 5 into the array [5, 5, 5] and adds the results. The advantage of NumPy's broadcasting is that this duplication of values does not actually take place.
We can similarly extend this to arrays of higher dimension. Observe the result when we add a one-dimensional array to a two-dimensional array.
Example
M = np.ones((3, 3))
M
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
M + a
array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])
Here the one-dimensional array a is stretched, or broadcast, across the second dimension in order to match the shape of M.
Just as before we stretched or broadcast one value to match the shape of the other, in more complex cases both arrays are stretched to match a common shape, and the result is a two-dimensional array.
The light boxes in such diagrams represent the broadcast values: this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
Broadcasting example 1
Let's look at adding a two-dimensional array to a one-dimensional array:
M = np.ones((2, 3))
a = np.arange(3)
Let's consider an operation on these two arrays. The shapes of the arrays are:
M.shape = (2, 3)
a.shape = (3,)
We see by rule 1 that the array a has fewer dimensions, so we pad it on the left with ones:
M.shape -> (2, 3)
a.shape -> (1, 3)
By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:
M.shape -> (2, 3)
a.shape -> (2, 3)
The shapes match, and we see that the final shape will be (2, 3):
M + a
array([[1., 2., 3.],
       [1., 2., 3.]])
Broadcasting example 2
Let's take a look at an example where both arrays need to be broadcast:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
Again, we'll start by writing out the shapes of the arrays:
a.shape = (3, 1)
b.shape = (3,)
Rule 1 says we must pad the shape of b with ones:
a.shape -> (3, 1)
b.shape -> (1, 3)
By rule 2, each of these size-1 dimensions is stretched to match the corresponding size of the other array:
a.shape -> (3, 3)
b.shape -> (3, 3)
Because the result matches, these shapes are compatible. We can see this here:
a + b
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])
Comparisons, Masks, and Boolean Logic
Comparison Operators as ufuncs
We saw that using +, -, *, /, and others on arrays leads to element-wise operations. NumPy also implements comparison operators such as < (less than) and > (greater than) as element-wise ufuncs.
The result of these comparison operators is always an array with a Boolean data type. All six of the standard comparison operations are available:
x = np.array([1, 2, 3, 4, 5])
x < 3   # less than
array([ True,  True, False, False, False], dtype=bool)
x > 3   # greater than
array([False, False, False,  True,  True], dtype=bool)
x != 3  # not equal
array([ True,  True, False,  True,  True], dtype=bool)
x == 3  # equal
array([False, False,  True, False, False], dtype=bool)
Operator    Equivalent ufunc
==          np.equal
!=          np.not_equal
<           np.less
<=          np.less_equal
>           np.greater
>=          np.greater_equal

rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])
x < 6
array([[ True,  True,  True,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]], dtype=bool)
The result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
Boolean operators
Operator    Equivalent ufunc
&           np.bitwise_and
|           np.bitwise_or
^           np.bitwise_xor
~           np.bitwise_not
Example (assuming inches is an array of daily rainfall amounts, as in the usual rainfall example):
np.sum((inches > 0.5) & (inches < 1))
The parentheses are important; because of operator precedence rules, without them the expression would be evaluated as inches > (0.5 & inches) < 1, which is an error.
Using De Morgan's laws, the same count can be computed equivalently as:
np.sum(~((inches <= 0.5) | (inches >= 1)))
Boolean Arrays as Masks
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves. Returning to our x array from before, suppose we want an array of all values in the array that are less than, say, 5.
We can obtain a Boolean array for this condition easily, as we've already seen:
Example
x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])
x < 5
array([[False,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False, False]], dtype=bool)
Masking operation
To select these values from the array, we can simply index on this Boolean array; this is known as a masking operation:
x[x < 5]
array([0, 3, 3, 3, 2, 4])
What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is True.
Fancy Indexing
Fancy indexing is like the simple indexing we've already seen, but we pass arrays of indices in place of single scalars. This allows us to very quickly access and modify complicated subsets of an array's values.
Exploring Fancy Indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once.
Types of fancy indexing:
Indexing/accessing more values
Array of indices
In multiple dimensions
Standard indexing
Example
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
Indexing/accessing more values
Suppose we want to access three different elements. We could do it like this:
[x[3], x[7], x[2]]
[71, 86, 14]
Array of indices
We can pass a single list or array of indices to obtain the same result:
ind = [3, 7, 4]
x[ind]
array([71, 86, 60])
In multiple dimensions
Fancy indexing also works in multiple dimensions. Consider the following array:
X = np.arange(12).reshape((3, 4))
X
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Standard indexing
Like with standard indexing, the first index refers to the row, and the second to the column:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
array([ 2,  5, 11])
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes we've seen.
Example array
print(X)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Combine fancy and simple indices:
X[2, [2, 0, 1]]
array([10,  8,  9])
Combine fancy indexing with slicing:
X[1:, [2, 0, 1]]
array([[ 6,  4,  5],
       [10,  8,  9]])
Combine fancy indexing with masking:
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
Modifying Values with Fancy Indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array.
Modify particular elements by index
For example, imagine we have an array of indices and we'd like to set the corresponding items in an array to some value:
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
[ 0 99 99  3 99  5  6  7 99  9]
Using assignment operators
We can use any assignment-type operator for this. For example:
x[i] -= 10
print(x)
[ 0 89 89  3 89  5  6  7 89  9]
Using at()
Use the at() method of ufuncs when an index is repeated and you want the operation applied once per occurrence (here with repeated indices, reconstructed from the output below):
x = np.zeros(10)
i = [2, 3, 3, 4, 4, 4]
np.add.at(x, i, 1)
print(x)
[ 0.  0.  1.  2.  3.  0.  0.  0.  0.  0.]
Sorting Arrays
Sorting in NumPy: np.sort and np.argsort
Python has built-in sort and sorted functions to work with lists, but we won't discuss them here because NumPy's np.sort function turns out to be much more efficient and useful for our purposes. By default np.sort uses an O[N log N] quicksort algorithm, though mergesort and heapsort are also available. For most applications, the default quicksort is more than sufficient.
Sorting without modifying the input
To return a sorted version of the array without modifying the input, you can use np.sort:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
array([1, 2, 3, 4, 5])
Returning sorted indices
A related function is argsort, which instead returns the indices of the sorted elements:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)
[1 0 3 2 4]
Sorting along rows or columns
A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the axis argument. For example:
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]
np.sort(X, axis=0)   # sort each column of X
array([[2, 1, 4, 0, 1, 5],
       [5, 2, 5, 4, 3, 7],
       [6, 3, 7, 4, 6, 7],
       [7, 6, 7, 4, 9, 9]])
np.sort(X, axis=1)   # sort each row of X
array([[3, 4, 6, 6, 7, 9],
       [2, 3, 4, 6, 7, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 5, 9]])
Partial Sorts: Partitioning
Sometimes we're not interested in sorting the entire array, but simply want to find the K smallest values in the array. NumPy provides this in the np.partition function. np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition and the remaining values to the right, in arbitrary order:
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.
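Just as np.argsort complements np.sort, there is an np.argpartition that returns indices instead of values; a short sketch (not in the original notes):

x = np.array([7, 2, 3, 1, 6, 5, 4])
i = np.argpartition(x, 3)   # indices of the partitioned arrangement
print(x[i])                 # same partitioned order that np.partition(x, 3) returns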
Partitioning in a multidimensional array
Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
np.partition(X, 2, axis=1)
array([[3, 4, 6, 7, 6, 9],
       [2, 3, 4, 7, 6, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 9, 5]])
Structured Arrays
This section demonstrates the use of NumPy's structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data.
NumPy data types
Character    Description              Example
'b'          Byte                     np.dtype('b')
'i'          Signed integer           np.dtype('i4') == np.int32
'u'          Unsigned integer         np.dtype('u1') == np.uint8
'f'          Floating point           np.dtype('f8') == np.float64
'c'          Complex floating point   np.dtype('c16') == np.complex128
'S', 'a'     String                   np.dtype('S5')
'U'          Unicode string           np.dtype('U') == np.str_
'V'          Raw data (void)          np.dtype('V') == np.void
Consider if we have several categories of data on a number of people (say, name, age, and weight), and we'd like to store these values for use in a Python program. It would be possible to store these in three separate arrays:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
Creating a structured array
NumPy can handle this through structured arrays, which are arrays with compound data types. Create a structured array using a compound data type specification as follows:
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
print(data.dtype)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
U10 – Unicode string of maximum length 10
i4 – 4-byte (i.e., 32-bit) integer
f8 – 8-byte (i.e., 64-bit) float
Now fill the array with the lists of values from above:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
Refer to values through index or name
The handy thing with structured arrays is that you can now refer to values either by index or by name:
i. data['name']   # by name
array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
ii. data[0]       # by index
('Alice', 25, 55.0)
iii. data[data['age'] < 30]['name']   # filtering: names where age is under 30
array(['Alice', 'Doug'], dtype='<U10')
Creating Structured Arrays
Dictionary method
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ('U10', 'i4', 'f8')})
dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
Numerical types can be specified with Python types:
np.dtype({'names': ('name', 'age', 'weight'),
          'formats': ((np.str_, 10), int, np.float32)})
List of tuples:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
Specify the types alone:
np.dtype('S10,i4,f8')
dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])
Data Manipulation with Pandas
Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to the sorts of "data munging" tasks that occupy much of a data scientist's time.
Here we will focus on the mechanics of using Series, DataFrame, and related structures effectively.
Introducing Pandas Objects
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures. There are three fundamental Pandas data structures: the Series, DataFrame, and Index.
The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
Finding values
The values are simply a familiar NumPy array:
data.values
array([0.25, 0.5 , 0.75, 1.  ])
Finding the index
The index is an array-like object of type pd.Index:
data.index
RangeIndex(start=0,stop=4,step=1)
Access by index
Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1]
0.5
data[1:3]
1    0.50
2    0.75
dtype: float64
Series as generalized NumPy array
Whereas the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. If we wish, we can use strings as an index.
Strings as an index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
Noncontiguous or nonsequential indices
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
Series as specialized dictionary
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.
Just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.
We can make the Series-as-dictionary analogy even more clear by constructing a Series object directly from a Python dictionary.
For example:
sub1 = {'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89}
mark = pd.Series(sub1)
mark
sai      90
ram      85
kasim    92
tamil    89
dtype: int64
Dictionary-style item access
mark['ram']
85
Array-style slicing
mark['sai':'kasim']
sai      90
ram      85
kasim    92
dtype: int64
Constructing Series objects
From a list or NumPy array:
pd.Series([2, 4, 6])
0    2
1    4
2    6
dtype: int64
A scalar is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])
100    5
200    5
300    5
dtype: int64
Data can be a dictionary, in which case the index defaults to the sorted dictionary keys:
pd.Series({2: 'a', 1: 'b', 3: 'c'})
1    b
2    a
3    c
dtype: object
The index can be explicitly set if a different result is preferred:
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
3    c
2    a
dtype: object
The Pandas DataFrame Object
The fundamental structure in Pandas is the DataFrame. The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
DataFrame as a generalized NumPy array
A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
To demonstrate this, let's first construct a new Series listing the marks of a second subject, then combine it with the mark Series from above:
sub2 = {'sai': 91, 'ram': 95, 'kasim': 89, 'tamil': 90}
result = pd.DataFrame({'DS': pd.Series(sub1), 'FDS': pd.Series(sub2)})
result
       DS   FDS
sai    90   91
ram    85   95
kasim  92   89
tamil  89   90
DataFrame has an index attribute
Like the Series object, the DataFrame has an index attribute that gives access to the index labels:
result.index
Index(['sai', 'ram', 'kasim', 'tamil'], dtype='object')
DataFrame has a columns attribute
The DataFrame has a columns attribute, which is an Index object holding the column labels:
result.columns
Index(['DS', 'FDS'], dtype='object')
DataFrame as specialized dictionary
We can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data:
result['DS']
sai      90
ram      85
kasim    92
tamil    89
Name: DS, dtype: int64
Note
In a two-dimensional NumPy array, data[0] will return the first row. For a DataFrame, data['col0'] will return the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples:
From a single Series object
From a list of dicts
From a dictionary of Series objects
From a two-dimensional NumPy array
From a NumPy structured array
From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
sub1 = pd.Series({'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89})
pd.DataFrame(sub1, columns=['DS'])
       DS
sai    90
ram    85
kasim  92
tamil  89
From a list of dicts
Any list of dictionaries can be made into a DataFrame; for example:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)
   a  b
0  0  0
1  1  2
2  2  4
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "not a number") values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
From a dictionary of Series objects
pd.DataFrame({'DS': sub1, 'FDS': pd.Series(sub2)})
       DS   FDS
sai    90   91
ram    85   95
kasim  92   89
tamil  89   90
From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:
pd.DataFrame(np.random.rand(3, 2),
             columns=['food', 'water'],
             index=['a', 'b', 'c'])
       food     water
a  0.865257  0.213169
b  0.442759  0.108267
c  0.047110  0.905718
From a NumPy structured array
A Pandas DataFrame operates much like a structured array, and can be created directly from one:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
array([(0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
   A    B
0  0  0.0
1  0  0.0
2  0  0.0
The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
Index as immutable array
The Index object in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:
ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB   # intersection
Int64Index([3, 5, 7], dtype='int64')
indA ^ indB   # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')
Data Indexing and Selection
Data Selection in Series
A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Keeping these two analogies in mind helps us understand the patterns of data indexing and selection in these arrays:
Series as dictionary
Series as one-dimensional array
Indexers: loc, iloc, and ix
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
data['b']
0.5
Examine the keys/indices and values
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
i. 'a' in data
True
ii. data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
iii. list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Modifying a Series object
Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e'] = 1.25
data
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
Series as one-dimensional array
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays, that is, slices, masking, and fancy indexing.
Slicing by explicit index
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64
Slicing by implicit integer index
data[0:2]
a    0.25
b    0.50
dtype: float64
Masking
data[(data > 0.3) & (data < 0.8)]
b    0.50
c    0.75
dtype: float64
Fancy indexing
data[['a', 'e']]
a    0.25
e    1.25
dtype: float64
Indexers: loc, iloc, and ix
Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
loc – the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1:3]
1    a
3    b
dtype: object
iloc – the iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1]
'b'
data.iloc[1:3]
3    b
5    c
dtype: object
Data Selection in DataFrame
DataFrame as a dictionary
DataFrame as two-dimensional array
Additional indexing conventions
DataFrame as a dictionary
The first analogy we will consider is the DataFrame as a dictionary of related Series objects.
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
result['DS']
sai      90
ram      85
kasim    92
tamil    89
Name: DS, dtype: int64
Attribute-style access with column names that are strings
result.DS
sai      90
ram      85
kasim    92
tamil    89
Name: DS, dtype: int64
This attribute-style access reaches the very same object as the dictionary-style access:
result.DS is result['DS']
True
Modifying the object
Like with the Series objects, this dictionary-style syntax can also be used to modify the object, in this case to add a new column:
result['TOTAL'] = result['DS'] + result['FDS']
result
       DS   FDS  TOTAL
sai    90   91   181
ram    85   95   180
kasim  92   89   181
tamil  89   90   179
DataFrame as two-dimensional array
Transpose
We can transpose the full DataFrame to swap rows and columns:
result.T
Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), while the DataFrame index and column labels are maintained in the result.
loc
result.loc[:'ram', :'FDS']
     DS  FDS
sai  90   91
ram  85   95
iloc
result.iloc[:2, :2]
     DS  FDS
sai  90   91
ram  85   95
ix
result.ix[:2, :'FDS']
     DS  FDS
sai  90   91
ram  85   95
Masking and fancy indexing
In the loc indexer we can combine masking and fancy indexing, as in the following:
result.loc[result['TOTAL'] > 180, ['DS', 'FDS']]
       DS  FDS
sai    90   91
kasim  92   89
Modifying values
Indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
result.iloc[1, 1] = 70
       DS   FDS  TOTAL
sai    90   91   181
ram    85   70   180
kasim  92   89   181
tamil  89   90   179
Additional indexing conventions
Slicing row-wise
result['sai':'kasim']
       DS   FDS  TOTAL
sai    90   91   181
ram    85   70   180
kasim  92   89   181
Such slices can also refer to rows by number rather than by index:
result[1:3]
       DS   FDS  TOTAL
ram    85   70   180
kasim  92   89   181
Masking row-wise
result[result['TOTAL'] > 180]
       DS   FDS  TOTAL
sai    90   91   181
kasim  92   89   181
Operating on Data in Pandas
Pandas inherits much of this functionality from NumPy and its ufuncs, so Pandas has the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
For unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output.
For binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc.
Here we are going to see how the universal functions work on Series and DataFrames through:
Index preservation
Index alignment
Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. We can use all arithmetic and special universal functions as in NumPy on Pandas objects. In the outputs, the index is preserved (maintained), as shown below.
For a Series
x = pd.Series([1, 2, 3, 4])
x
0    1
1    2
2    3
3    4
dtype: int64
For a DataFrame
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)),
                  columns=['a', 'b', 'c', 'd'])
df
   a  b  c  d
0  1  4  1  4
1  8  4  0  4
2  7  7  7  2
For a universal function (here we use the exponential as an example):
Ufuncs for a Series
np.exp(x)
0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64
Ufuncs for a DataFrame
np.exp(df)
             a            b            c          d
0     2.718282    54.598150     2.718282  54.598150
1  2980.957987    54.598150     1.000000  54.598150
2  1096.633158  1096.633158  1096.633158   7.389056
Index Alignment
Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data.
Index alignment in Series
Suppose we are combining two different data sources; the indices will be aligned accordingly:
x = pd.Series([2, 4, 6], index=[1, 3, 5])
y = pd.Series([1, 3, 5, 7], index=[1, 2, 3, 4])
x + y
1    3.0
2    NaN
3    9.0
4    NaN
5    NaN
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we could determine using standard Python set arithmetic on these indices.
Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data:
x.add(y, fill_value=0)
1    3.0
2    3.0
3    9.0
4    7.0
5    6.0
dtype: float64
Index alignment in DataFrame
A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames (rng below is a NumPy RandomState object, e.g. rng = np.random.RandomState(42)):
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A
   A   B
0  1  11
1  5   1
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B
   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6
A + B
      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we'll fill with the mean of all values in A.
fill = A.stack().mean()
A.add(B, fill_value=fill)
      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5
Mapping between Python operators and Pandas methods
Python operator    Pandas method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()
Operations between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array:
A = rng.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])
A - A[0]
array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])
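The same convention holds in Pandas; a short sketch (not in the original notes) showing both the default row-wise behaviour and a column-wise operation via the axis keyword, reusing the array A from above:

import numpy as np
import pandas as pd

A = np.array([[3, 8, 2, 4],
              [2, 6, 4, 8],
              [6, 1, 3, 8]])
df = pd.DataFrame(A, columns=list('QRST'))
print(df - df.iloc[0])               # subtracts the first row from every row
print(df.subtract(df['R'], axis=0))  # axis=0 subtracts a column from every column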
Handling Missing Data
A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.
Missing Data in Pandas
The way in which Pandas handles missing values is constrained by its NumPy foundation, which does not have a built-in notion of NA values for non-floating-point data types.
NumPy supports fourteen basic integer types once you account for available precisions, signedness, and endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new fork of the NumPy package.
Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.
None: Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects).
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
NaN: Missing numerical data
NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
dtype('float64')
You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:
1 + np.nan
nan
0 * np.nan
nan
NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably:
pd.Series([1, np.nan, 2, None])
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA:
x = pd.Series(range(2), dtype=int)
x
0    0
1    1
dtype: int64
x[0] = None
x
0    NaN
1    1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value.
Pandas handling of NAs by type
Type class    Conversion when storing NAs    NA sentinel value
floating      No change                      np.nan
object        No change                      None or np.nan
integer       Cast to float64                np.nan
boolean       Cast to object                 None or np.nan
Note: In Pandas, string data is always stored with an object dtype.
Operating on Null Values
There are several useful methods for detecting, removing, and replacing null values in Pandas data structures:
isnull() – generate a Boolean mask indicating missing values
notnull() – opposite of isnull()
dropna() – return a filtered version of the data
fillna() – return a copy of the data with missing values filled or imputed
Detecting null values
Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
isnull()
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
0    False
1     True
2    False
3     True
dtype: bool
notnull()
data.notnull()
0     True
1    False
2     True
3    False
dtype: bool
Dropping null values
dropna()
data.dropna()
0        1
2    hello
dtype: object
Dropping null values in a DataFrame
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
df.dropna()
     0    1  2
1  2.0  3.0  5
Dropping values by column or row
We can drop NA values along a different axis; axis=1 drops all columns containing a null value:
df.dropna(axis='columns')
   2
0  2
1  5
2  6
Rows or columns having all null values
You can also specify how='all', which will only drop rows/columns that are all null values:
df[3] = np.nan
df
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
df.dropna(axis='columns', how='all')
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
Specific number of non-null values (thresh)
The thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
     0    1  2   3
1  2.0  3.0  5 NaN
Filling null values
Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
Fill with a single value
We can fill NA entries with a single value, such as zero:
data.fillna(0)
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
Fill with the previous value
We can specify a forward-fill to propagate the previous value forward:
data.fillna(method='ffill')
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
Fill with the next value
We can specify a back-fill to propagate the next values backward:
data.fillna(method='bfill')
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
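For DataFrames the options are similar, with an additional axis along which the fill happens; a short sketch (not in the original notes), reusing the df with NaNs from above:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
print(df.fillna(0))                       # fill every NA with a single value
print(df.fillna(method='ffill', axis=1))  # forward-fill along each row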
Hierarchical Indexing
Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
While Pandas does provide Panel and Panel4D objects that natively handle three-dimensional and four-dimensional data, a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.
Here we'll explore the direct creation of MultiIndex objects; considerations around indexing, slicing, and computing statistics across multiply indexed data; and useful routines for converting between simple and hierarchically indexed representations of your data.
A Multiply Indexed Series
The Pandas MultiIndex
A tuple-based index is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have on it. We can create a multi-index from tuples as follows:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
# creating the multi-index
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Hierarchical representation of the data
pop = pop.reindex(index)
pop
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Here the first two columns of the Series representation show the multiple index values, while the third column shows the data.
Access all data with the second index
pop[:, 2010]
California    37253956
New York      19378102
Texas         25145561
dtype: int64
MultiIndex as extra dimension
We could easily have stored the same data using a simple DataFrame with index and column labels. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
Naturally, the stack() method provides the opposite operation:
pop_df.stack()
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Add a new column in a multiply indexed DataFrame
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014
Universal functions
All the ufuncs and other functionality work with hierarchical indices:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
ExplicitMultiIndex constructors
YoucanconstructtheMultiIndexfromasimplelist ofarrays,givingthe indexvalueswithineach level.
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
MultiIndex(levels=[['a','b'],[1,2]],
labels=[[0,0, 1,1],[0, 1,0, 1]])
A MultiIndex from a list of tuples:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
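You can similarly construct a MultiIndex from the Cartesian product of single indices with the standard pd.MultiIndex.from_product constructor (output shown in the same older-Pandas style as above):
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])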
MultiIndex level names
It is convenient to name the levels of the MultiIndex. You can accomplish this by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact.
pop.index.names=['state','year']
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype:int64
MultiIndex for columns
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels
of indices, the columns can have multiple levels as well.
41
Unit –IV
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Multiply indexed Series
pop
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype:int64
Access single elements
We can access single elements by indexing with multiple terms:
pop['California', 2000]
33871648
Partial indexing
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index:
pop['California']
year
2000 33871648
2010 37253956
dtype:int64
Partial slicing
Partial slicing is available as well, as long as the MultiIndex is sorted.
pop.loc['California':'New York']
state year
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
dtype:int64
Sorted indices
With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the first index:
pop[:,2000]
state
California 33871648
New York 18976457
Texas 20851820
dtype:int64
Other types of indexing and selection
Selection based on Boolean masks
pop[pop>22000000]
state year
California  2000    33871648
2010 37253956
Texas 2010 25145561
dtype:int64
Selection based on fancy indexing
pop[['California','Texas']]
state year
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
dtype:int64
Rearranging Multi-Indices
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
char  int
a     1      0.003001
      2      0.164974
c     1      0.741650
      2      0.569264
b     1      0.001693
      2      0.526226
dtype: float64
If we try a partial slice of this non-sorted index (for example, data['a':'b']), Pandas raises an error because the MultiIndex is not lexicographically sorted; sorting fixes this:
data = data.sort_index()
data
char int
a 1 0.003001
2 0.164974
b 1 0.001693
2 0.526226
c 1 0.741650
2 0.569264
dtype:float64
Withtheindexsortedinthisway,partialslicingwillworkas expected:
data['a':'b']
char  int
a     1      0.003001
      2      0.164974
b     1      0.001693
      2      0.526226
dtype: float64
Stacking and unstacking indices
It is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use.
pop.unstack(level=0)
state  California  New York  Texas
year
2000   33871648    18976457  20851820
2010   37253956    19378102  25145561
pop.unstack(level=1)
year        2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished
with the reset_index method. Calling this on the population dictionary will result in a DataFrame with a state
and year column holding the information that was formerly in the index. For clarity, we can optionally specify
the name of the data for the column representation.
pop_flat = pop.reset_index(name='population')
pop_flat
   state       year  population
0  California  2000  33871648
1  California  2010  37253956
2  New York    2000  18976457
3  New York    2010  19378102
4  Texas       2000  20851820
5  Texas       2010  25145561
Data Aggregations on Multi-Indices
We've previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on.
For example, let's return to our health data (you can create your own DataFrame/Series):
health_data
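A minimal sketch of such a level-based aggregation, averaging the two visits in each year of the health_data frame built above (note that in recent Pandas versions the level keyword on aggregation methods is deprecated in favor of an explicit groupby, shown in the comment):
# average out the measurements in the two visits each year
data_mean = health_data.mean(level='year')
# equivalent in newer Pandas:
# data_mean = health_data.groupby(level='year').mean()
data_mean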
Combining Datasets
Concat and Append
Simple Concatenation with pd.concat
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we'll discuss momentarily.
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays.
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
Concatenation in DataFrames
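The examples below use a small make_df helper, which is not defined in these notes; a minimal sketch of the conventional definition (following the Python Data Science Handbook):
def make_df(cols, ind):
    """Quickly make a DataFrame with values like A0, A1, B0, B1, ..."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)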
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))
df1                df2                pd.concat([df1, df2])
   A   B              A   B              A   B
1  A1  B1          3  A3  B3          1  A1  B1
2  A2  B2          4  A4  B4          2  A2  B2
                                      3  A3  B3
                                      4  A4  B4
Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices! Consider this simple example.
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
print(x); print(y); print(pd.concat([x, y]))
x                  y                  pd.concat([x, y])
   A   B              A   B              A   B
0  A0  B0          0  A2  B2          0  A0  B0
1  A1  B1          1  A3  B3          1  A1  B1
                                      0  A2  B2
                                      1  A3  B3
The append() method
Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2). (Note that append() was deprecated and later removed in modern Pandas; pd.concat() is the recommended replacement.)
print(df1); print(df2); print(df1.append(df2))
df1                df2                df1.append(df2)
   A   B              A   B              A   B
1  A1  B1          3  A3  B3          1  A1  B1
2  A2  B2          4  A4  B4          2  A2  B2
                                      3  A3  B3
                                      4  A4  B4
Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.
Categories of Joins
One-to-one joins
Many-to-one joins
Many-to-many joins
One-to-one joins
One-to-one joins are perhaps the simplest type of merge. As a concrete example, consider df1, which lists each employee's group, and df2, which lists each employee's hire date.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)
df1                              df2
  employee        group            employee  hire_date
0      Bob   Accounting          0     Lisa       2004
1     Jake  Engineering          1      Bob       2008
2     Lisa  Engineering          2     Jake       2012
3      Sue           HR          3      Sue       2014
To combine this information into a single DataFrame, we can use the pd.merge() function:
df3 = pd.merge(df1, df2)
df3
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4)
Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps most clear with a concrete example.
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
pd.merge(df1, df5)
  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
Aggregation and Grouping
Computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset.
Simple Aggregation in Pandas
As with a one-dimensional NumPy array, the aggregates of a Pandas Series return a single value.
rng=np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64
Sum
ser.sum()
2.8119254917081569
Mean
ser.mean()
0.56238509834163142
The same operations can also be performed on a DataFrame.
Listing of Pandas aggregation methods
Aggregation        Description
count()            Total number of items
first(), last()    First and last item
mean(), median()   Mean and median
min(), max()       Minimum and maximum
std(), var()       Standard deviation and variance
mad()              Mean absolute deviation
prod()             Product of all items
sum()              Sum of all items
GroupBy: Split, Apply, Combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.
Example
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
  key  data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
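The split-apply-combine operation itself can then be computed in a single line; a minimal sketch using the df above:
# split on the key column, apply a sum within each group, and combine
df.groupby('key').sum()
     data
key
A       3
B       5
C       7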
The GroupBy object
The GroupBy object is a very flexible abstraction. The most important operations made available by a GroupBy are aggregate, filter, transform, and apply.
Column indexing.
The GroupBy object supports column indexing in the same way as the DataFrame, and returns a modified GroupBy object. For example:
df = pd.read_csv(r'D:\iris.csv')
df.groupby('variety')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023BAADE84C0>
df.groupby('variety')['petal.length']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000023BAADE8490>
df.groupby('variety')['petal.length'].sum()
variety
Setosa         73.1
Versicolor    213.0
Virginica     277.6
Name: petal.length, dtype: float64
Iteration over groups.
The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame, as sketched below.
This can be useful for doing certain things manually, though it is often much faster to use the built-in apply functionality, which we will discuss momentarily.
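A minimal sketch of such iteration with the iris grouping above (the group names and shapes depend on the actual file):
# each iteration yields the group's key and the corresponding sub-DataFrame
for (name, group) in df.groupby('variety'):
    print("{0:30s} shape={1}".format(name, group.shape))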
Dispatch methods.
Through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects. For example, you can use the describe() method of DataFrames to perform a set of aggregations that describe each group in the data.
Example
df.groupby('variety')['petal.length'].describe().unstack()
variety
count Setosa 50.000000
Versicolor 50.000000
Virginica 50.000000
mean Setosa 1.462000
Versicolor 4.260000
Virginica 5.552000
std Setosa 0.173664
Versicolor 0.469911
Virginica 0.551895
min Setosa 1.000000
Versicolor 3.000000
Virginica 4.500000
25% Setosa 1.400000
Versicolor 4.000000
Virginica 5.100000
50% Setosa 1.500000
Versicolor 4.350000
Virginica 5.550000
75% Setosa 1.575000
Versicolor 4.600000
Virginica 5.875000
max Setosa 1.900000
Versicolor 5.100000
Virginica 6.900000
dtype: float64
Aggregate, filter, transform, and apply
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
df
  key  data1  data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
Aggregation.
We’re now familiar with GroupBy aggregations with sum(), median(), and the like, but the aggregate() method
allows for even more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates
at once. Here is a quick example combining all these:
df.groupby('key').aggregate(['min',np.median,max])
    data1                data2
    min  median  max     min  median  max
key
A     0     1.5    3       3     4.0    5
B     1     2.5    4       0     3.5    7
C     2     3.5    5       3     6.0    9
Filtering.
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value.
The filter() function should return a Boolean value specifying whether the group passes the filtering; a sketch follows.
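A minimal sketch using the df defined above (the threshold of 4 is arbitrary):
def filter_func(x):
    # keep only groups whose data2 standard deviation exceeds 4
    return x['data2'].std() > 4

df.groupby('key').filter(filter_func)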
Transformation.
While aggregation must return a reduced version of the data, transformation can return some transformed
version of the full data to recombine. For such a transformation, the output is the same shape as the input. A
common example is to center the data by subtracting the group-wise mean:
df.groupby('key').transform(lambda x: x - x.mean())
data1 data2
0 -1.5 1.0
1 -1.5 -3.5
2 -1.5 -3.0
3 1.5 -1.0
4 1.5 3.5
5 1.5 3.0
The apply() method.
The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned, as in the sketch below.
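A minimal sketch with the df above, normalizing the first column by the group-wise sum of the second (a conventional example from the Python Data Science Handbook):
def norm_by_data2(x):
    # x is the DataFrame of values for one group
    x['data1'] /= x['data2'].sum()
    return x

df.groupby('key').apply(norm_by_data2)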
Pivot Tables
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen across not a one-dimensional index, but across a two-dimensional grid.
Pivot Table Creation
import numpy as np
import pandas as pd
df = pd.read_csv(r'D:\diabetes.csv')
df.pivot_table('preg', index='age', columns='Class').sample(10)
age
63     5.500000        NaN
28     3.440000   2.000000
61     7.000000   4.000000
69     5.000000        NaN
45     7.285714   7.375000
62     6.500000   1.000000
53     2.000000   6.250000
68     8.000000        NaN
23     1.516129   1.857143
52    13.000000   3.428571
The two value columns correspond to the two values of the Class column.
UNIT – V
DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn.
Simple Line Plots
The simplest of all plots is the visualization of a single function y = f(x). Here we will take a first look at creating a simple plot of this type.
The figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels.
The axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization.
Different forms of color representation:
specify color by name            - color='blue'
short color code (rgbcmyk)       - color='g'
grayscale between 0 and 1        - color='0.75'
hex code (RRGGBB from 00 to FF)  - color='#FFDD44'
RGB tuple, values 0 to 1         - color=(1.0, 0.2, 0.3)
all HTML color names supported   - color='chartreuse'
Shorthand assignments:
linestyle='-'    # solid
linestyle='--'   # dashed
linestyle='-.'   # dashdot
linestyle=':'    # dotted
Axes Limits
1
Unit – V
The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() methods.
Example
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
(Passing the limits in reverse order, as here, displays the axis reversed.)
The plt.axis() method allows you to set the x and y limits with a single call, by passing a list that specifies [xmin, xmax, ymin, ymax]:
plt.axis([-1, 11, -1.5, 1.5]);
Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title  - plt.title()
Labels - plt.xlabel(), plt.ylabel()
Legend - plt.legend()
Example programs
Line color
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0, 0.2, 0.3))   # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
Line style
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')   # solid
plt.plot(x, x + 5, linestyle='--')  # dashed
plt.plot(x, x + 6, linestyle='-.')  # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
plt.legend();
Simple Scatter Plots
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape.
Syntax
plt.plot(x, y, 'type of symbol', color);
Example
plt.plot(x, y, 'o', color='black');
The third argument in the function call is a character that represents the type of symbol used for the plotting. Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of short string codes.
Example
Various symbols used to specify markers: ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']
plt.plot(x, y, '-ok');
Additional arguments in plt.plot()
We can specify some other parameters related with the scatter plot which make it more attractive. They are color, marker size, linewidth, marker face color, marker edge color, marker edge width, etc.
Example
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2);
Scatter Plots with plt.scatter
A second, more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function:
plt.scatter(x, y, marker='o');
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data, as in the sketch below.
Notice that the color argument is automatically mapped to a color scale (shown here by the colorbar() command), and the size argument is given in pixels.
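A minimal sketch of this per-point mapping on random data (the c and s arguments are standard plt.scatter parameters):
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)          # one color value per point
sizes = 1000 * rng.rand(100)    # one size per point
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar();  # show the color scale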
Cmap - the colormap used in a scatter plot gives different color combinations.
Perceptually Uniform Sequential
['viridis', 'plasma', 'inferno', 'magma']
Sequential
['Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds', 'YlOrBr', 'YlOrRd',
 'OrRd', 'PuRd', 'RdPu', 'BuPu', 'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn']
Sequential (2)
['binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink', 'spring', 'summer',
 'autumn', 'winter', 'cool', 'Wistia', 'hot', 'afmhot', 'gist_heat', 'copper']
Diverging
['PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu', 'RdYlBu', 'RdYlGn', 'Spectral',
 'coolwarm', 'bwr', 'seismic']
Qualitative
['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3',
 'tab10', 'tab20', 'tab20b', 'tab20c']
Miscellaneous
['flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern', 'gnuplot',
 'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv', 'gist_rainbow', 'rainbow',
 'jet', 'nipy_spectral', 'gist_ncar']
Example programs
Simple scatter plot
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='yellow',
         markeredgecolor='red',
         markeredgewidth=4)
plt.ylim(-1.5, 1.5);
Visualizing Errors
For any scientific measurement, accurate accounting for errors is nearly as important, if not more important, than accurate reporting of the number itself. For example, imagine that I am using some astrophysical observations to estimate the Hubble Constant, the local measurement of the expansion rate of the Universe.
In visualization of data and results, showing these errors effectively can make a plot convey much more complete information.
Types of errors
Basic Errorbars
Continuous Errors
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn-whitegrid')
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
Here the fmt is a format code controlling the appearance of lines and points, and has the same syntax as the shorthand used in plt.plot().
In addition to these basic options, the errorbar function has many options to fine-tune the outputs. Using these additional options you can easily customize the aesthetics of your errorbar plot.
plt.errorbar(x, y, yerr=dy, fmt='o', color='black', ecolor='lightgray', elinewidth=3, capsize=0);
Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does not have a built-in convenience routine for this type of application, it's relatively easy to combine primitives like plt.plot and plt.fill_between for a useful result.
Here we'll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a method of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty.
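A minimal sketch of the plotting idea using plt.fill_between directly, where a fixed ±dy band around a curve stands in for the GPR uncertainty (the full GPR example requires Scikit-Learn):
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
y = np.sin(x)
dy = 0.3  # assumed, constant uncertainty band

plt.plot(x, y, color='gray', label='fit')
# shade the region between y - dy and y + dy
plt.fill_between(x, y - dy, y + dy, color='lightgray', alpha=0.5)
plt.legend();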
Density and Contour Plots
To display three-dimensional data in two dimensions, we can use contours or color-coded regions. There are three Matplotlib functions that can be helpful for this task:
plt.contour for contour plots,
plt.contourf for filled contour plots, and
plt.imshow for showing images.
Visualizing a Three-Dimensional Function
A contour plot can be created with the plt.contour function. It takes three arguments:
a grid of x values,
a grid of y values, and
a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the contour levels.
The way to prepare such data is to use the np.meshgrid function, which builds two-dimensional grids from one-dimensional arrays:
Example
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are represented by dashed lines, and positive values by solid lines.
Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
We'll also specify that we want more lines to be drawn: 20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
One potential issue with this plot is that it is a bit "splotchy." That is, the color steps are discrete rather than continuous, which is not always what is desired.
You could remedy this by setting the number of contours to a very high number, but this results in a rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional grid of data as an image.
There are a few potential gotchas with imshow():
plt.imshow() doesn't accept an x and y grid, so you must manually specify the extent [xmin, xmax, ymin, ymax] of the image on the plot.
plt.imshow() by default follows the standard image array definition where the origin is in the upper left, not in the lower left as in most contour plots. This must be changed when showing gridded data.
plt.imshow() will automatically adjust the axis aspect ratio to match the input data; you can change this by setting, for example, plt.axis(aspect='image') to make x and y units match.
Example program
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 5, 0, 5],
           origin='lower', cmap='RdGy')
plt.colorbar()
Histograms
A histogram is a simple plot to represent a large dataset. A histogram is a graph showing frequency distributions: the number of observations within each given interval.
Parameters
plt.hist() is used to plot a histogram. The hist() function will use an array of numbers to create a histogram; the array is sent into the function as an argument.
bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is plotted as a bar whose height corresponds to how many data points are in that bin. Bins are also sometimes called "intervals", "classes", or "buckets".
normed - Histogram normalization is a technique to distribute the frequencies of the histogram over a wider range than the current range. (In newer Matplotlib, this parameter has been replaced by density.)
x - (n,) array or sequence of (n,) arrays. Input values; this takes either a single array or a sequence of arrays which are not required to be of the same length.
histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional. The type of histogram to draw.
align - {'left', 'mid', 'right'}, optional. Controls how bars sit on the bin edges:
'left': bars are centered on the left bin edges.
'mid': bars are centered between the bin edges.
'right': bars are centered on the right bin edges.
Default is 'mid'.
orientation - {'horizontal', 'vertical'}, optional. If 'horizontal', barh will be used for bar-type histograms and the bottom kwarg will be the left edges.
color - color or array_like of colors or None, optional. Color spec or sequence of color specs, one per dataset. Default (None) uses the standard line color sequence.
label - str or None, optional. Default is None.
Other parameter
**kwargs - Patch properties; this allows us to pass a variable number of keyword arguments to a Python function. ** denotes this type of argument.
Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the display; here's an example of a more customized histogram.
plt.hist(data, bins=30, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none');
The plt.hist docstring has more information on other customization options available. I find this combination of histtype='stepfilled' along with some transparency alpha to be very useful when comparing histograms of several distributions.
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
Two-Dimensional Histograms and Binnings
We can create histograms in two dimensions by dividing points among two-dimensional bins.
We would define x and y values. Here, for example, we'll start by defining some data: x and y arrays drawn from a multivariate Gaussian distribution.
A simple way to plot a two-dimensional histogram is to use Matplotlib's plt.hist2d() function.
Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
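A related standard routine, plt.hexbin, bins the same data into a grid of hexagons instead of squares; a minimal sketch reusing x and y from above:
# hexagonal binning of the same (x, y) samples
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')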
Legends
Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously saw how to create a simple legend; here we'll take a look at customizing the placement and aesthetics of the legend in Matplotlib.
plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend();
Customizing Plot Legends
Location and turn off the frame - We can specify the location and turn off the frame with the loc and frameon parameters.
ax.legend(loc='upper left', frameon=False)
fig
Rounded box, shadow and frame transparency
We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame, or change the padding around the text.
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig
Choosing Elements for the Legend
The legend includes all labeled elements by default. We can change which elements and labels appear in the legend by using the objects returned by plot commands.
The plt.plot() command is able to create multiple lines at once, and returns a list of created line instances. Passing any of these to plt.legend() will tell it which to identify, along with the labels we'd like to specify:
y = np.sin(x[:, np.newaxis] + np.pi * np.arange(0, 2, 0.5))
lines = plt.plot(x, y)
plt.legend(lines[:2], ['first', 'second']);
# Applying labels individually.
plt.plot(x, y[:, 0], label='first')
plt.plot(x, y[:, 1], label='second')
plt.plot(x, y[:, 2:])
plt.legend(framealpha=1, frameon=True);
Multiple legends
Through the standard interface it is only possible to create a single legend for the entire plot. If you try to create a second legend using plt.legend() or ax.legend(), it will simply override the first one. We can work around this by creating a new legend artist from scratch, and then using the lower-level ax.add_artist() method to manually add the second artist to the plot.
Example
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('classic')
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
lines = ax.plot(x, np.sin(x), '-b', x, np.cos(x), '--r')
ax.legend(lines, ['sine', 'cosine'], loc='lower center', frameon=True,
          shadow=True, borderpad=1, fancybox=True)
fig
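To actually place two legends on the axes above, a minimal sketch using the low-level Legend artist (line labels here are illustrative):
from matplotlib.legend import Legend
# restrict the first legend to the first line
ax.legend([lines[0]], ['sine'], loc='upper right', frameon=False)
# build a second legend for the other line and attach it by hand
leg = Legend(ax, [lines[1]], ['cosine'], loc='lower right', frameon=False)
ax.add_artist(leg);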
Color Bars
In Matplotlib, a colorbar is a separate axes that can provide a key for the meaning of colors in a plot. For continuous labels based on the color of points, lines, or regions, a labeled color bar can be a great tool.
The simplest colorbar can be created with the plt.colorbar() function, as sketched below.
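A minimal sketch; the array I built here is also the one assumed by the customization snippets that follow:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])  # a 2D array to display
plt.imshow(I, cmap='gray')
plt.colorbar();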
Customizing Colorbars
Choosing the color map.
We can specify the colormap using the cmap argument to the plotting function that is creating the visualization. Broadly, we can know three different categories of colormaps:
Sequential colormaps - These consist of one continuous sequence of colors (e.g., binary or viridis).
Divergent colormaps - These usually contain two distinct colors, which show positive and negative deviations from a mean (e.g., RdBu or PuOr).
Qualitative colormaps - These mix colors with no particular sequence (e.g., rainbow or jet).
Color limits and extensions
Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance of plt.Axes, so all of the axes and tick formatting tricks we've learned are applicable.
We can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the top and bottom by setting the extend property.
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
Discrete colorbars
Colormaps are by default continuous, but sometimes you'd like to represent discrete values. The easiest way to do this is to use the plt.cm.get_cmap() function, and pass the name of a suitable colormap along with the number of desired bins. (In newer Matplotlib, plt.get_cmap() or matplotlib.colormaps is preferred.)
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
Subplots
Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single figure.
These subplots might be insets, grids of plots, or other more complicated layouts.
We'll explore four routines for creating subplots in Matplotlib:
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
plt.axes: Subplots by Hand
The most basic method of creating an axes is to use the plt.axes function. As we've seen previously, by default this creates a standard axes object that fills the entire figure.
plt.axes also takes an optional argument that is a list of four numbers in the figure coordinate system.
These numbers represent [left, bottom, width, height] in the figure coordinate system, which ranges from 0 at the bottom left of the figure to 1 at the top right of the figure.
For example, we might create an inset axes at the top-right corner of another axes by setting the x and y position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the x and y extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure).
import matplotlib.pyplot as plt
import numpy as np
ax1 = plt.axes()  # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
Vertical subplots
The equivalent of the plt.axes() command within the object-oriented interface is fig.add_axes(). Let's use this to create two vertically stacked axes.
fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
                   xticklabels=[], ylim=(-1.2, 1.2))
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
                   ylim=(-1.2, 1.2))
x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x));
We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper panel (at position 0.5) matches the top of the lower panel (at position 0.1 + 0.4).
If the bottom value is changed in the second axes, the two plots become separated from each other, for example:
ax2 = fig.add_axes([0.1, 0.01, 0.8, 0.4])
plt.subplot: Simple Grids of Subplots
Matplotlib has several convenience routines to align columns or rows of subplots.
The lowest level of these is plt.subplot(), which creates a single subplot within a grid, as sketched below.
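A minimal sketch (a conventional demonstration), placing a label in each cell of a 2x3 grid:
import matplotlib.pyplot as plt
for i in range(1, 7):
    plt.subplot(2, 3, i)  # (rows, columns, panel number)
    plt.text(0.5, 0.5, str((2, 3, i)),
             fontsize=18, ha='center')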
plt.subplots: The Whole Grid in One Go
The approach just described can become quite tedious when you're creating a large grid of subplots, especially if you'd like to hide the x- and y-axis labels on the inner plots.
For this purpose, plt.subplots() is the easier tool to use (note the s at the end of subplots).
Rather than creating a single subplot, this function creates a full grid of subplots in a single line, returning them in a NumPy array.
The arguments are the number of rows and number of columns, along with optional keywords sharex and sharey, which allow you to specify the relationships between different axes.
Here we'll create a 2x3 grid of subplots, where all axes in the same row share their y-axis scale, and all axes in the same column share their x-axis scale:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
Note that by specifying sharex and sharey, we've automatically removed inner labels on the grid to make the plot cleaner.
plt.GridSpec: More Complicated Arrangements
To go beyond a regular grid to subplots that span multiple rows and columns, plt.GridSpec() is the best tool. The plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is recognized by the plt.subplot() command.
For example, a gridspec for a grid of two rows and three columns with some specified width and height space looks like the sketch below.
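A minimal sketch (following the book's conventional example) in which cells span multiple rows and columns:
import matplotlib.pyplot as plt
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
plt.subplot(grid[0, 0])    # top-left cell
plt.subplot(grid[0, 1:])   # top row, spanning columns 1-2
plt.subplot(grid[1, :2])   # bottom row, spanning columns 0-1
plt.subplot(grid[1, 2]);   # bottom-right cell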
Text and Annotation
The most basic types of annotations we will use are axes labels and titles; here we will see some further visualization and annotation options.
Text annotation can be done manually with the plt.text/ax.text command, which will place text at a particular x/y value.
The ax.text method takes an x position, a y position, a string, and then optional keywords specifying the color, size, style, alignment, and other properties of the text. Here we used ha='right' and ha='center', where ha is short for horizontal alignment.
Transforms and Text Position
We anchored our text annotations to data locations. Sometimes it's preferable to anchor the text to a position on the axes or figure, independent of the data. In Matplotlib, we do this by modifying the transform.
Any graphics display framework needs some scheme for translating between coordinate systems.
Mathematically, such coordinate transformations are relatively straightforward, and Matplotlib has a well-developed set of tools that it uses internally to perform them (the tools can be explored in the matplotlib.transforms submodule).
There are three predefined transforms that can be useful in this situation.
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
Note that by default, the text is aligned above and to the left of the specified coordinates; here the "." at the beginning of each string will approximately mark the given coordinate location.
The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The transAxes coordinates give the location from the bottom-left corner of the axes (here the white box) as a fraction of the axes size.
The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here the gray box) as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the transData coordinates that will be affected, while the others remain stationary.
Arrows and Annotation
Along with tick marks and text, another useful annotation mark is the simple arrow.
Drawing arrows in Matplotlib is often harder than you might expect: although a plt.arrow() function is available, the arrows it creates are SVG (scalable vector graphics) objects that are subject to the varying aspect ratio of your plots, so the result is rarely what the user intended. A better choice is plt.annotate(), where the arrow style is controlled through the arrowprops dictionary, which has numerous options available; a sketch follows.
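A minimal sketch of plt.annotate with arrowprops (the annotated point on the cosine curve is illustrative):
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')
# draw an arrow from the text position (xytext) to the data point (xy)
ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
            arrowprops=dict(facecolor='black', shrink=0.05));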
Three-Dimensional Plotting in Matplotlib
We enable three-dimensional plots by importing the mplot3d toolkit, included with the main Matplotlib installation.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
With this 3D axes enabled, we can now plot a variety of three-dimensional plot types.
Three-Dimensional Points and Lines
The most basic three-dimensional plot is a line or scatter plot created from sets of (x, y, z) triples.
In analogy with the more common two-dimensional plots discussed earlier, we can create these using the ax.plot3D and ax.scatter3D functions.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
plt.show()
Three-Dimensional Contour Plots
mplot3d contains tools to create three-dimensional relief plots using the same inputs.
Like two-dimensional ax.contour plots, ax.contour3D requires all the input data to be in the form of two-dimensional regular grids, with the Z data evaluated at each point.
Here we'll show a three-dimensional contour diagram of a three-dimensional sinusoidal function.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
Sometimes the default viewing angle is not optimal, in which case we can use the view_init method to set the elevation and azimuthal angles.
ax.view_init(60, 35)
fig
Wireframes and Surface Plots
Two other types of three-dimensional plots that work on gridded data are wireframes and surface plots.
These take a grid of values and project it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualize.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='viridis', edgecolor='none')
ax.set_title('surface')
plt.show()
Surface Triangulations
For some applications, the evenly sampled grids required by the preceding routines are overly restrictive and inconvenient.
In these situations, the triangulation-based plots can be very useful.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
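Building on the scattered points above, the samples can then be triangulated into a surface with the standard ax.plot_trisurf routine; a minimal sketch:
# triangulate the scattered (x, y, z) samples into a surface
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z, cmap='viridis', edgecolor='none');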
Geographic Data with Basemap
One common type of visualization in data science is that of geographic data.
Matplotlib's main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the mpl_toolkits namespace.
Basemap is a useful tool for Python users to have in their virtual toolbelts.
Installation of Basemap: once you have the Basemap toolkit installed and imported, geographic plots also require the PIL package in Python 2, or the pillow package in Python 3.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
            lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
We'll use an etopo image (which shows topographical features both on land and under the ocean) as the map background.
Program to display a particular area of the map with latitude and longitude lines:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6,
            lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='r')
Map Projections
The Basemap package implements several dozen such projections, all referenced by a short format code. Here we'll briefly demonstrate some of the more common ones:
Cylindrical projections
Pseudo-cylindrical projections
Perspective projections
Conic projections
Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant latitude and longitude are mapped to horizontal and vertical lines, respectively.
This type of mapping represents equatorial regions quite well, but results in extreme distortions near the poles.
The spacing of latitude lines varies between different cylindrical projections, leading to different conservation properties, and different distortion near the poles.
Other cylindrical projections are the Mercator (projection='merc') and the cylindrical equal-area (projection='cea') projections.
The additional arguments to Basemap for this view specify the latitude (lat) and longitude (lon) of the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180,)
draw_map(m)
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain vertical; this can give better properties near the poles of the projection.
The Mollweide projection (projection='moll') is one common example of this, in which all meridians are elliptical arcs.
It is constructed so as to preserve area across the map: though there are distortions near the poles, the area of small patches reflects the true area.
Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson (projection='robin') projections.
The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0) for the desired map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
            lat_0=0, lon_0=0)
draw_map(m)
Perspective projections
Perspective projections are constructed using a particular choice of perspective point, similar to if you photographed the Earth from a particular point in space (a point which, for some projections, technically lies within the Earth!).
One common example is the orthographic projection (projection='ortho'), which shows one side of the globe as seen from a viewer at a very long distance.
Thus, it can show only half the globe at a time.
Other perspective-based projections include the gnomonic projection (projection='gnom') and stereographic projection (projection='stere').
These are often the most useful for showing small portions of the map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
            lat_0=50, lon_0=0)
draw_map(m);
Conic projections
A conic projection projects the map onto a single cone, which is then unrolled.
This can lead to very good local properties, but regions far from the focus point of the cone may become very distorted.
One example of this is the Lambert conformal conic projection (projection='lcc').
It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by lat_1 and lat_2) have well-represented distances, with scale decreasing between them and increasing outside of them.
Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area (projection='aea') projection.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            lon_0=0, lat_0=50, lat_1=45, lat_2=55,
            width=1.6E7, height=1.2E7)
draw_map(m)
• Political boundaries
drawcountries() - Draw country boundaries
drawstates() - Draw US state boundaries
drawcounties() - Draw US county boundaries
• Map features
drawgreatcircle() - Draw a great circle between two points
drawparallels() - Draw lines of constant latitude
drawmeridians() - Draw lines of constant longitude
drawmapscale() - Draw a linear scale on the map
• Whole-globe images
bluemarble() - Project NASA's blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
etopo() - Draw an etopo relief image onto the map
warpimage() - Project a user-provided image onto the map
Plotting Data on Maps
Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto a map background.
There are many map-specific functions available as methods of the Basemap instance. Some of these map-specific methods are:
contour()/contourf() - Draw contour lines or filled contours
imshow() - Draw an image
pcolor()/pcolormesh() - Draw a pseudocolor plot for irregular/regular meshes
plot() - Draw lines and/or markers
scatter() - Draw points with markers
quiver() - Draw vectors
barbs() - Draw wind barbs
drawgreatcircle() - Draw a great circle
Visualization with Seaborn
The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.
Histograms, KDE, and densities
In statistical data visualization, often all you want is to plot histograms and joint distributions of variables. We have seen that this is relatively straightforward in Matplotlib.
Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with sns.kdeplot:
import numpy as np
import pandas as pd
import seaborn as sns
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    sns.kdeplot(data[col], shade=True)
Histograms and KDE can be combined using distplot (deprecated in newer Seaborn in favor of displot/histplot):
sns.distplot(data['x'])
sns.distplot(data['y']);
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other; a sketch follows.
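A minimal sketch using Seaborn's built-in iris dataset (sns.load_dataset fetches it from the seaborn-data repository):
import seaborn as sns
iris = sns.load_dataset("iris")
# plot every pairwise relationship, colored by species
sns.pairplot(iris, hue='species', height=2.5);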
Faceted histograms
Sometimes the best way to view data is via histograms of subsets. Seaborn's FacetGrid makes this extremely simple.
We'll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data; a sketch follows.
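A minimal sketch with Seaborn's built-in tips dataset (the tip_pct column is derived here; the bin edges are illustrative):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
# one histogram panel per (sex, time) combination
grid = sns.FacetGrid(tips, row='sex', col='time', margin_titles=True)
grid.map(plt.hist, 'tip_pct', bins=np.linspace(0, 40, 15));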
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you to view the distribution of a parameter within bins defined by any other parameter. (In newer Seaborn, factorplot has been renamed catplot.)
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint distribution between different datasets, along with the associated marginal distributions; a sketch follows.
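A minimal sketch using the tips data from above (kind='hex' requests hexagonally binned cells):
with sns.axes_style('white'):
    sns.jointplot(x="total_bill", y="tip", data=tips, kind='hex')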
Bar plots
Time series can be plotted with sns.factorplot.