ST2195 Complete
VLE Materials
There are extra materials such as videos and test quizzes for this block on the VLE – you can
find them here: https://emfss.elearning.london.ac.uk/course/view.php?id=382#section-1
What Is Data?
“Data is a set of values of subjects with respect to qualitative or quantitative variables.”
Wikipedia
• Vector or matrix
• Tensor (high-order matrix)
• Image, or text
Notice the key words computational (data science typically involves some sort of algorithmic
methods written in code), statistical (statistical inference enables us to draw conclusions and
make predictions), and real world (we are talking about deriving insight not into some artificial
process, but into some “truth” in the real world).
We can also think of data science as the union of the various techniques that are required to
accomplish the above. In other words, something like:
This definition is also useful because it emphasizes that all these areas are crucial to achieving
the goals of data science. In this course we will explore the programming aspect of all of
the above areas.
Another way to conceptualise data science is by thinking in terms of what it is not, or more precisely
what it is not (just); see below.
Data Science Is Not (Just) Machine Learning
Making good machine-learning-based predictions can be an important part of data science, but
the truly hard elements of data science also involve the other tasks described below, such as collecting, processing and visualising data.
Data vs Information
Data
• Raw, unorganized facts that need to be processed
• Unusable until it is organized
Information
• Created when data is processed, organized, and structured
• Needs to be put in an appropriate context in order to become useful
Examples of Data Science Tasks
Below we provide some indicative examples
• Data collection
• Data processing (wrangling)
• Data visualisation
• Training and applying algorithms from fields such as machine learning, statistics, data mining,
optimisation, image processing, etc.
Clearly the above tasks require the use of computers and in particular programming.
(Computer) programming: The process of producing an executable computer program that
performs a specific task, like the ones above. The purpose of programming is to find a sequence
of instructions that automate the implementation of the task for solving a given problem.
Programming Language: The source code of a program is written in one or more languages
that are intelligible to humans, rather than machine code, which is directly executed by the
central processing unit. In this course we will explore the computer programming languages R
and Python in the context of data science, i.e. to perform data science tasks.
Open Source Software
• Both R and Python are open source, i.e. free computer software which the user can
modify and distribute within the terms of a licence
• Collaborative development has created diverse and very powerful software ecosystems
• Modular structure permits users to build an environment exactly suited to their needs.
GitHub
• GitHub is a code hosting platform for version control and collaboration
• Based on Git
o Version control system for tracking changes in computer files and coordinating
work on those files among multiple people
o Created in 2005 by Linus Torvalds
• Largest host of source code in the world
References
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer.
Source-Code Editors and IDEs for R
and Python
Source-Code Editors
Programming is the process of instructing a computer about how to perform a task. The
predominant way of doing so nowadays is by creating a set of instructions in one or more
programming languages, like R and Python.
That set of instructions forms the source code for the task at hand, and is typically provided in
the form of text, using special, language-specific syntax. Then, depending on the
implementation of the language, another computer program, called a compiler, converts the
set of instructions into a lower-level language that the computer and operating system
understand, producing an executable program that the user can run to perform the task.
Another implementation of programming languages allows the instructions to be executed
directly by a computer program called an interpreter.
So, for writing programs we can use any text editor as a source code editor. For example, you
can open your favourite text editor (e.g. Notepad in Windows, TextEdit on macOS, or a plain
text editor in Linux) and type in it the following:
az <- LETTERS
az[c(8, 5, 12, 12, 15)]
Congratulations! You have authored an R script that first creates a variable az and assigns to
it the vector with all upper-case letters of the English alphabet, and then subsets that vector to
keep only the 8th, 5th, 12th, 12th and 15th letters (in that order). If we pass the instructions in
the R script to R (and we will see how we can do so later), then we will get the output "H"
"E" "L" "L" "O".
Of course, a standard text editor may miss some features that are extremely helpful when
authoring programs, such as autocompletion, brace matching, automatic code indentation,
syntax highlighting, automatic syntax checks while code is being written, etc.
As a result, a wide range of source-code editors has been developed that provide a range of
such features in an attempt to simplify and speed up working with source code in one or even
multiple languages at a time. Source-code editors come either as part of integrated
development environments (IDEs; more on this later), or as stand-alone programs.
Below, we list the source-code editors that are not formally part of an IDE and that were most
popular (in decreasing order of popularity) amongst the participants of the 2019 Stack Overflow
Developer Survey who program with at least one of R and Python.
All the editors above provide rich code development features for both R and Python.
IDEs
As stated in Wikipedia “an integrated development environment (IDE) is a software
application that provides comprehensive facilities to computer programmers for software
development.”
An IDE typically provides a source-code editor, automation tools for writing programs and
compiling source code, a debugger, and utilities and ready processes to test a program. Also,
the boundaries between an IDE and some of the editors we listed above are nowadays not
very clear, as some editors have plugins that can make them extremely powerful IDEs
(e.g. Notepad++, Vim, Atom and Emacs).
R Installation
To use R, you first need to download your copy of it. R is free and easy to download and
install on major platforms such as Windows, macOS and Linux.
To download R, simply open your internet browser and go to the web page of the
Comprehensive R Archive Network (CRAN) and follow the instructions. For example, with the
current layout of the CRAN web pages, to install R on Windows do the following:
The web-page at the link in step 3 above provides more details about the Windows
installation process and some frequently asked questions.
If you have a macOS system, then, after step 1 above click Download R for (Mac) OS X and
download and run the linked .pkg file.
A new major release of R happens once a year, and there are typically a few minor releases
within the year. It is a good idea to update your installation regularly, especially when a
major release becomes available. Note that after updating to a major version you will need to
reinstall your packages.
Using R
There are several ways to use R after installation. Some popular ways are
• Through the Command Prompt (Windows) or a terminal (on Linux and macOS)—the
minimalist's choice
• Through the default R GUI (on Windows and macOS)—simple and effective
• Through RStudio—powerful and widely used
Another way, popular among developers and scientists, is through the Emacs Speaks Statistics (ESS)
package for the Emacs editor. ESS provides an extremely rich development experience if
you are familiar with Emacs, but we will not cover it here.
Assignment Operator
Before we continue with the various ways of using R, we need to know how we can—slightly
informally put—give names to the results of an operation for further manipulation later. For example, the statement
a <- 987654321/123456789
will compute the ratio 987654321/123456789 (did you know that this ratio is almost 8?) on
the right of <- using floating point arithmetic, store the result in computer memory and then
bind it to the name a on the left of <-. For example, we can then do
print(a, 15)
[1] 8.0000000729
The symbol <- (a less-than sign followed by a dash) is R's assignment operator. Another common convention
in recent versions of R is to use = for assignment (as in Python and other programming
languages). That is fine, but <- is formally more valid because i) it highlights the direction of
assignment, and ii) formally a and 987654321/123456789 are not really equal (by definition
of assignment, a = b is not the same statement as b = a!). For this reason, we will be
using <- for assignment in all R code in these notes. Note that the statement
987654321/123456789 -> a
also binds the result of 987654321/123456789 to the name a, though it is less commonly
used.
Note here that we said "the value bound to a" instead of "the value of a." This is specific to
how R works. In R, values do not have names (!); instead, names have values associated
with them! So, a is simply a reference to a value. For example, if we type b <- a, R does not
create another copy of the already-computed result of 987654321/123456789, but rather creates
a binding of the already existing value in memory to another name, b! See the "Names
and values" section of the Advanced R book for more details on this.
Nevertheless, for convenience, you may see “value/object a” in these notes being used as a
shorthand to “value/object that is bound to the name a.”
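As a small illustration of this binding behaviour, consider the following sketch:
a <- 987654321/123456789
b <- a        # no copy is made; b is simply another name bound to the same value
b <- b + 1    # modifying b binds b to a new value, leaving the value bound to a untouched
print(a, 15)  # still 8.0000000729
print(b, 15)  # 9.0000000729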
Through the Command Prompt or a Terminal
To use R this way on Windows, you will first need to set the path variable to include your R
installation. For example, in Windows 10 this can be done as follows:
R is a scripting language, which basically means that you can run commands as soon as you
type them. When R is running, the greater-than sign (>) is R’s “prompt,” which indicates that
R is ready for you to enter commands. You can then type in R commands here to work with
R interactively. When you complete typing your command, press Enter and R will do the
computation and print the answer. In the above screenshots, we have issued the command
print("Hello :D") at the R prompt, and we end up seeing
[1] "Hello :D"
The text after [1] is the result from the R code that you wrote after the greater-than sign.
Another example is computing the numeric value of \(\log(2) + 3\), assigning it to a variable
a, and then printing the value of a:
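A minimal version of that example:
a <- log(2) + 3   # log() is the natural logarithm in R
print(a)          # [1] 3.693147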
For more complex computations you may want to use a source-code editor or an IDE for
writing the R script. You can then either execute the script interactively by copying and
pasting its lines to an R prompt, or execute it non-interactively in a Command Prompt or
terminal by doing Rscript /path/to/script.R, where /path/to/script.R is the R script
that you wish to run.
Through the Default R GUI
On Windows and macOS machines the R installer installs by default a custom graphical user
interface (GUI) for R.
For Windows users, you can open R’s GUI by double-clicking on the R icon on the desktop
or in the start menu. By doing so, you should get a window that looks like this:
As you can see in these screenshots, you can interact with R by typing in commands. The R
GUIs offer several useful features and utilities, such as an in-built editor to write code, help
pages, devices to display graphics, command history and completion, etc.
Through RStudio
Downloading and Installing RStudio
RStudio can be downloaded from here. It comes in two flavours: one is RStudio Desktop,
which is a stand-alone computer application, and the other is RStudio Server, which runs on a
remote server and allows access to RStudio through a web browser. For the purposes of this
course, you will only need to use the free version of RStudio Desktop. After hitting the
"Download" button you will be redirected to a web page with instructions and a list of
RStudio installers for a wide range of operating systems.
RStudio Interface
The default RStudio interface is set up as a four-pane workspace, organised as described below.
Note that you can customize the pane layout slightly if you go to the Tools menu → Global
Options. Here is a screenshot of what RStudio looks like:
1. Top left:
• The Source pane is where you write your R scripts. You can also run selected lines
from this pane by going to the line you wish to run, and by typing CTRL + Enter
(Windows, Linux) or CMD + Enter (Mac) or hitting the “Run” button on the top-right
of the pane. The “Source” button will run all lines from the active script.
2. Top right:
• Environment tab: Shows the current workspace in this session and shows the list of R
objects that you have created in the session
• History tab: Shows the history of all previous commands
• Connections tab: Makes it possible to easily connect to a variety of data sources
• Tutorial tab: Tutorials for R hosted by the learnr package
3. Bottom left:
• The Console pane, where you type in R commands (described below)
4. Bottom right:
• Tabs such as Files, Plots, Packages (used below to install packages), Help (where documentation is displayed), and Viewer
The console pane opens up an R console where you can type in your commands like you
would when using the terminal or default R GUIs. Here you can type code in directly.
However, for more complicated tasks, it is recommended to put your code in scripts, so that
you can have a complete record of what you did in a session or what you want to do. In this
way, you can share your code with others or continue developing it later on if needed. R
script files typically end with the “.R” suffix.
You can write scripts in the Source pane and you can open a new script by clicking on File
→ New File → R Script. After writing a script, you can copy and paste it in the R console.
RStudio allows you to send the current line or the currently selected text to the R console by
using Ctrl + Enter (Windows, Linux) or CMD + Enter (Mac).
If R is ready to accept commands, the R console will show a > prompt to receive a command.
If it is waiting for more data because the command is not complete yet, the console will show
a + prompt.
It is also possible to use the R function setwd(), which stands for “set working directory.”
For example, setwd("/path/to/directory") will set the working directory to
/path/to/directory. Note that in Windows systems paths are customarily given using the
backslash ("\") as a separator, but "\" is a special character in strings. So, on Windows machines
the R convention is to either do setwd("/path/to/directory") or "escape" the special character
by doing setwd("\\path\\to\\directory"). The former is recommended if you are planning to
share your code or if you are working on multiple operating systems.
You can also print your current directory with getwd(), which stands for “get working
directory”
> getwd()
[1] "/Users/yiannis/Documents"
Here, R tells us that the current working directory is the Documents directory in the home
directory of the user "yiannis".
Installing Packages
You can install any R package available in CRAN by typing
install.packages("packagename"), where packagename is the name of the package you
wish to install.
In RStudio, packages can be installed through a GUI after pressing the install button at the
top of the Packages tab.
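For example, to install and then load the tm package (which we will use later in these notes for text processing), you would do something like:
install.packages("tm")   # downloads and installs tm from CRAN (requires an internet connection)
library("tm")            # loads the installed package into the current R session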
Getting Help
R has an in-built help facility that can be useful whenever we want to get more information
on functions or features. To get more information about any specific function, for example if
we wanted to get some documentation for the function rnorm() (which is a function to
generate random samples from a normal distribution), we type help(rnorm) or ?rnorm.
If we run ?rnorm in the RStudio console, the documentation of the rnorm() function appears
in the Help tab. If we run this in one of the R GUIs or through the Windows Command
Prompt, the documentation opens by default in HTML format. If R is running on a terminal
in macOS or Linux, the documentation opens by default in place in the terminal.
Another useful command is help.start(), which will open the HTML version of R’s online
documentation, which you can browse for help with R and with the installed packages.
Markdown
Markdown is a markup language that consists of a set of rules for adding formatting
elements (e.g. boldface, italics, headers, paragraphs, lists, code blocks, images, etc.) to plain
text documents.
John Gruber developed Markdown having in mind a plain-text format that is easy to read and
write (so you can use your favourite text editor to author it), and that can subsequently be easily
converted to HTML. In fact, John Gruber's Markdown came with a software tool, written in
the Perl programming language, that takes care of the text-to-HTML conversion.
For example, all text before the current sentence can be written in Markdown syntax as:
**Markdown** is a [markup
language](https://en.wikipedia.org/wiki/Markup_language) that consists
of a set of rules for adding formatting elements (e.g. boldface,
italics, headers, paragraphs, lists, code blocks, images, etc.) to
plain text documents.
John Gruber developed Markdown having in mind a plain-text format that
is easy to read and write (so you can use your favourite text editor to
author it), and that can subsequently be easily converted to HTML. In
fact, John Gruber's Markdown came with a software tool, written in the
[Perl programming language](https://en.wikipedia.org/wiki/Perl), that
takes care of the text-to-HTML conversion.
In the years after its release, a wide range of Markdown variants sprang up (see,
for example, the Markdown flavours listed on CommonMark's repositories), mainly in an
effort to extend the syntax to handle more complex constructs than what the original
Markdown implementation provides, such as tables, footnotes, math expressions, and
citations.
A landmark development in the markup arena was John MacFarlane's Pandoc, an
impressive software tool that can convert between a wide range of markup formats and file
types (see https://pandoc.org/index.html for a list).
Pandoc supports some key Markdown variants, including Pandoc's Markdown, which
significantly enriches John Gruber's Markdown while keeping the same easy-to-read,
easy-to-write principles.
R Markdown
R Markdown combines the strengths of Pandoc’s Markdown and R to produce an authoring
framework for data science that allows the combination of text (e.g. describing a narrative),
code, and results in a single document, which can in turn, be converted to a wide range of
formats (like PDFs, HTML pages, Word documents, etc) using Pandoc.
This makes R Markdown an excellent tool for sharing your analyses and findings with others,
either by directly sharing the R Markdown file or by rendering a report from it in an alternative
format (e.g. PDF, HTML or Word document).
An R Markdown file has the “.Rmd” extension and typically hosts three types of content:
• YAML metadata, where some configuration options are provided to guide the R
Markdown build process
• Text, for the narrative part of the R Markdown file, written in Markdown syntax
• Code chunks, where R code is placed and options are provided to determine what
happens with the code and its outputs
Below are the contents of an example R Markdown file. The YAML metadata are given first
and are enclosed between two lines of three dashes (---). Then the body follows, with text in
Markdown syntax and two code chunks, which are the parts that start with three backticks,
typically followed by the language name, as in ```{r}, and end with another three backticks
(```).
---
title: "Hello Rmd"
author: "[Luke Skywalker](https://en.wikipedia.org/wiki/Luke_Skywalker)"
date: 21 December 2112
---
## My First R Markdown File
I can also ask R to run code and return the results. For example, I can
ask R to print the quote
```{r quote}
print("Do. Or do not. There is no try")
```
I can also do complex arithmetic. For example, if your R installation
could do infinite arithmetic you could see that `1/81` has all single
digits numbers from 0 to 9 repeating in its decimal, except 8!
```{r arithmetic}
print(1/81, 15)
```
To render the R Markdown file into an output document we need the rmarkdown package, which can be installed with
install.packages("rmarkdown")
Then, save the contents of the example R Markdown file above into a file called hello-Rmd.Rmd.
We then load the rmarkdown package and render hello-Rmd.Rmd by doing
library("rmarkdown")
render("/path/to/hello-Rmd.Rmd")
where /path/to/ is the path to the directory where you saved hello-Rmd.Rmd. This directory will
now have an HTML file (which is the default output format) called hello-Rmd.html, which,
if opened in a browser window, will look like the screenshot below.
Screenshot of the HTML output from knitting hello-Rmd.Rmd.
Notice how all code chunks have been evaluated and their results have been returned. Make
sure to check the render() function's help pages (type ?rmarkdown::render) for more
options on output formats and a wealth of rendering options. RStudio has excellent built-in
support for authoring and rendering R Markdown files into a range of formats through its
GUI. We will demo that support in the screencast of the current section.
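As a sketch of those options (assuming the rmarkdown package is installed), the same file can be rendered to other formats by changing the output_format argument; rendering to PDF additionally requires a LaTeX installation:
library("rmarkdown")
render("/path/to/hello-Rmd.Rmd", output_format = "word_document")   # Word output
render("/path/to/hello-Rmd.Rmd", output_format = "pdf_document")    # PDF output (needs LaTeX)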
R Notebooks
The R Notebook mode is a special way of working with R Markdown files, where the code
chunks can be executed independently and interactively, with the chunk output shown
immediately under the code chunks. So, R notebooks provide a finer level of interaction with
the R Markdown contents than simply rendering the Rmd file does, which makes them great
for developing and testing the output of code chunks on the fly.
R Markdown documents can be used as notebooks, and R notebooks can be rendered to other
file types for publication or sharing. In fact, RStudio opens R Markdown files in notebook
mode by default. For example, below is a screenshot where we have opened Luke
Skywalker's R Markdown file in RStudio and run only the first code chunk (by simply hitting
the play button next to the chunk).
Notice how the result is displayed directly after the first code chunk, and that the second code
chunk has no output.
The screencast of this segment shows how to create an R Markdown file in RStudio, how to
use it as an R notebook, and how to render it in various formats.
Useful Links and Resources
• Markdown guide: A free online reference guide that explains how to use Markdown
• Markdown cheat sheet: A quick reference to the Markdown syntax
• R Markdown cheat sheet: A PDF giving concise set of notes to be used for quick
reference when working with R Markdown. The PDF also provides a quick reference
to Markdown syntax
• RStudio’s R Markdown Quick Tour: A web-page providing a quick intro to R
Markdown
• Notebooks with R Markdown: Video from the talk that J. J. Allaire gave at useR!2016
conference, introducing R notebooks
• R Markdown: The definitive guide: A book that is free to read online (click the link)
and is exactly what the title says
• R Markdown section from the R for data science book
References
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize,
and model data (1st ed.). O’Reilly Media.
Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. Chapman
& Hall/CRC.
Installing and Working With Python
Setup
The simplest and standard way to use Python is by first installing Anaconda. Anaconda is an
open-source, cross-platform distribution of the Python programming language. It contains several
packages and applications; in this set of notes we will work with three of them, namely
QtConsole, Jupyter Notebook, and Spyder, to perform basic operations in Python.
Running Python via QtConsole
One way to program in Python is by writing Python scripts, in other words text files
that end with the extension .py, and then executing them with the Python interpreter. Nevertheless, it is
generally more efficient to use an interactive session via QtConsole, which provides a highly
productive environment supporting a number of useful features such as tab completion,
integrated help, inline figures, etc.
A simple way to launch QtConsole (and likewise Jupyter Notebook and Spyder), which works on
both macOS and Windows, is to open the Anaconda Navigator and click Launch on the QtConsole
icon.
You should get a similar result to the picture below. Typing
print('Hello Python')
at the console prompt prints Hello Python. You can also save that single line in a file called hello_python.py and then execute the file from within QtConsole with
run hello_python.py
If you get the output Hello Python as before, you have just executed your first Python script!
To navigate within the QtConsole the following commands are needed:
See below how these commands were used to navigate to the directory
/users/kalogerk/Hello and run the program hello_python.py:
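In QtConsole (which runs an IPython console) this would look something like the following sketch, where cd changes the working directory and run executes the script:
cd /users/kalogerk/Hello
run hello_python.py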
Running Python via Spyder
Spyder is an IDE for Python that also comes with Anaconda. Its interface consists of three main windows:
1. The Editor window, which can be used to write your scripts, such as the hello_python.py program.
2. The Object Inspector, where you can access variables, plots and files. In the picture
above we see the files in the working directory (where we have saved the file
hello_python.py).
3. The Console, which is actually a Python console. In the first line in the image above we
typed
runfile('hello_python.py')
We are now going to run the previous program but this time using Spyder. To execute the
program, either
• Select Run –> Run from the menu (or press F5), and confirm the Run settings if required.
• Click the green play button.
In the image of the previous section we did the above and got the Hello Python output.
Jupyter Notebooks
Notebooks integrate code and its output into a single document that combines visualizations,
narrative text, mathematical equations, and other rich media. In other words: it is a single
document where you can run code, display the output, and also add explanations, formulas,
charts, and make your work more transparent, understandable, repeatable, and shareable. Using
Notebooks is now a major part of the data science workflow at companies across the globe. If
your goal is to work with data, using a Notebook will speed up your workflow and make it easier
to communicate and share your results.
• Jupyter Notebooks are open-source, web-browser-based applications to create and
share documents that contain
o Live code
o Equations
o Visualizations
o Explanatory text.
• The name, Jupyter, comes from the core programming languages that it supports:
Julia, Python, and R. Nevertheless, Jupyter notebooks are most frequently
used to code in Python, despite the fact that there is support for 40 languages.
• Notebook files have .ipynb extension and can be easily shared, e.g. on GitHub
In order to launch Jupyter Notebook, open the Anaconda Navigator and click Launch on the Jupyter Notebook icon,
i.e. in the same way as with QtConsole and Spyder.
Using Jupyter Notebooks
To create a new notebook select from the menu
File → New Notebook → Python 3
Your webpage should look like this
Notice the cell type set to Code below the help menu. This means that you can type Python code
in the cell which you can run when finished by pressing the Run button.
You can also use the notebook for writing text. For this change the cell type to Markdown, type
your Markdown code and hit the Run button when finished.
For example, typing the following in a code cell and pressing Run prints Hello:
print('Hello')
Task 2: Your first Jupyter notebook. Get Python to print Hello Python using a Jupyter notebook.
Solution for Task 2: See the Hello_Python.ipynb Jupyter notebook.
More on the print Function
You can put several words together using commas, e.g.
Note, however, that when you do that the different words are separated with spaces, giving you 7
. rather than 7.. An alternative option is to use + instead of the comma:
Note that the + approach requires its inputs to be text, i.e. strings. Hence, the command
str(10) is used to convert the number 10 into text.
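As a sketch of the same idea (the specific values here are assumptions, not the original examples):
print(3 + 4, '.')          # the comma separates items with a space: 7 .
print(str(3 + 4) + '.')    # + concatenates strings, so there is no space: 7.
print('Ten as text: ' + str(10))   # str() converts the number 10 into text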
Task 3: Print the sentence "4 people out of 3 struggle with math" in the code cell below. Use both
, and + to connect the words.
which will open a web page in a different tab. You can then print the page into PDF using your
operating system’s facilities
Task 4: Open the Jupyter notebook you created for tasks 1 and 2 or the Hello_Python.ipynb
Jupyter notebook. Export to HTML and PDF files.
Version Control Systems: Git
When developing a project, it is desirable to allow:
• Changes to all files (or to the whole project), and who made them, to be recorded over time
so that they can be undone or reviewed
• Multiple developers to work simultaneously on the project's files
• Developers to experiment with new features without "breaking" the latest stable
version of the project.
A version control system (VCS) is the system that undertakes the tedious task of keeping
track of the changes to all project’s files and who made them, allowing users to recover any
previous version at any given time.
For example, R is developed using a VCS called Subversion and RStudio is developed using
a VCS called Git. You can browse R’s Subversion repository at https://svn.r-project.org/R/,
and RStudio’s Git repository at https://github.com/rstudio/rstudio.
VCS are useful for all sorts of projects. For example, while developing these lecture notes we
had to somehow keep track of the edits and changes that we made to the files, and for that we
used Git. You can also use Git to track changes to an R Markdown file or a Jupyter notebook.
Installation
The Getting Started section of Git project’s documentation provides detailed instructions on
how to download and install Git in Windows, macOS, and Linux.
If you have installed Git successfully, then issuing git --version in a terminal or the
Windows Command Prompt should return the version of your Git installation. On my
system this returns git version 2.28.0:
A Git repository can end up being on your computer in one of two ways:
• A local directory holding a project’s files and folders that is not under version control
is turned into a Git repository
• A remote Git repository is cloned into your computer from elsewhere.
Then, the Git utilities (what we installed earlier) allow you to interact with repositories in all
sorts of ways. For example, you can view the history of a file, revert the file to a previous
version, add new files into the project, commit changes to the files, create branches, merge
changes between branches, etc.
Let’s start by creating our first local Git repository. Before we start, you will need to tell Git
your name and email address. These will be added to each commit, so that when you start
collaborating with others on your project, you can all identify who made each commit. Open
the Command Prompt or a terminal and do
git config --global user.name "NAME"
git config --global user.email "EMAIL"
after you replace NAME and EMAIL with your full name and email address.
Setting up a New Git Repository
First, create a directory called gitABC to host the files of your first Git project. In order to
create a Git repository in gitABC, in the Command Prompt or terminal change to the gitABC
directory (e.g. using the cd command). For example, in my macOS Terminal app, I do
cd ~/Repositories/gitABC/
Then, we type
git init
There is no Git repository already in gitABC, so this command returns the message
Initialized empty Git repository in ~/Repositories/gitABC/.git/. We have just
been successful in creating an empty Git project in gitABC, ready to host our project files!
Let’s now create the first file of gitABC project. Using your favourite source-code editor or
IDE create an R script with the following R code
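A sketch of such a script, consistent with the bug fix and the extra line discussed later in these notes (the exact value assigned to a is an assumption):
a <- "Hello Git!"   # the specific string assigned to a is an assumption
print(b)            # bug: b does not exist; as discussed later, this should have been print(a)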
Save it in gitABC as hello-git.R. However, hello-git.R is not yet version controlled!
In order to make Git track changes to the file, the first step is to stage it. Continuing
in the Command Prompt or terminal under the gitABC directory, we do
git add hello-git.R
The current status of our repository can be checked with the git status command. For
example, in my terminal this returns
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: hello-git.R
The staging area is the place where all changes have to go before they get committed to version
control. The staging area is there to give you control over which changes to your project files
should be committed and which should not. To commit the staged changes, we do
git commit -m "my first commit"
hello-git.R is now under version control and there is nothing further to commit.
96a5cf1 is the commit ID, which is a SHA-1 hash. The text after the -m modifier is the
commit message, which helps to keep a quick record of what happened in each commit. So,
try to make those as short and as informative as possible!
Making Changes
Let’s now create a new file under gitABC called README.txt with the following text
Also, if you paid close attention, the R code in hello-git.R will not work. In fact, running it
in an R prompt returns an error
Clearly our code had a bug! print(b) should have really been print(a). Just open the
hello-git.R file again, fix the bug, and save it.
$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: hello-git.R
Untracked files:
(use "git add <file>..." to include in what will be committed)
README.txt
no changes added to commit (use "git add" and/or "git commit -a")
Git has figured out that hello-git.R has been modified, that there is a new untracked file called
README.txt in the gitABC project, and that there are no changes in the staging area. Again,
we do
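A sketch of the commands that stage the two files and commit them (the commit message used here is an assumption):
git add hello-git.R README.txt
git commit -m "fix bug and add README"
Running git log afterwards shows both commits: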
$ git log
commit a722f3e84e74957e1bda4cc52301748ffe8cceb9 (HEAD -> master)
Author: ...
Date: Wed Sep 23 12:00:01 2020 +0100
commit 96a5cf143fce135a7db11aeb5776ea9bff353fcc
Author: ...
Date: Wed Sep 23 12:05:02 2020 +0100
my first commit
where I replaced my name and email above with "...". I typically do git log --pretty=oneline
for a more compact view that involves only the commit identifier and the
commit message. See here for the list of available options for printing logs.
Branching
It is often the case that we want to diverge from the current state of the project and continue
to do work without messing with that state. We can certainly do this by copying the project
directory elsewhere and continue to work there, but this is expensive both in time and space
for large projects, and can easily cause confusion.
Git offers a lightweight and neat way to support multiple parallel lines of development
without messing with the project's main state, called branching. A new branch, here called
my-first-branch, can be created with
git branch my-first-branch
Note that we are still in the master branch; checking the branch we are on gives
$ git branch
* master
my-first-branch
with the * pointing to master. In order to change to the new branch we created, we need to
checkout that branch:
git checkout my-first-branch
Let’s now open hello-git.R again, add a new line print(paste(a, "Hello
branching!")) and save the file. hello-git.R should now have the following lines
We can compare the state of the project in my-first-branch and master using the git
diff command followed by the names of the two branches we want to compare,
e.g. git diff master my-first-branch. In my terminal, this gives a diff
showing the new line that has been added in hello-git.R (the line prefixed with +).
Merging
Once the development in one branch is complete we may want all the changes in that branch
to be incorporated into another. For example, currently my-first-branch is one commit
ahead of master. In order to update master with the changes in my-first-branch we need
to merge my-first-branch into master. This can be done with the git merge command.
If you are curious, first switch back to the master branch with git checkout master and open
hello-git.R to see that it is in the state it was in before we switched to my-first-branch,
i.e. without the new line we added! Then, you can merge my-first-branch into it by doing
git merge my-first-branch
Updating a722f3e..bc934c3
Fast-forward
hello-git.R | 1 +
1 file changed, 1 insertion(+)
You have now successfully brought the changes you made in my-first-branch into master
and you can continue development, either by creating a new branch or directly on master.
In large-scale projects there can be conflicts when a merge is performed. Resolving conflicts
is a process that requires familiarity with the project and experience. See GitHub’s guide on
resolving conflicts for more information.
Also make sure to check out the Git Cheat Sheets for a collection of cheat sheets and manuals
about using Git (and GitHub).
References
Chacon, S., & Straub, B. (2014). Pro git: Everything you need to know about git (2nd ed.).
Apress.
Repository Hosting Services and
Collaboration Platforms: GitHub
An illustration of version control through Git, from Anita Cheng's blog post
about Git for non-developers (and total newbies)
GitHub
Until now we have been working on a local repository. When working on collaborative
projects, it is often necessary to push the changes you made on the branches of your local
repository to a remote repository, so that others can pick them up and continue with
development.
There are multiple ways to create remote repositories; however, the most popular solution is to
host them on a repository hosting service like GitHub, GitLab or Bitbucket, which nowadays
also act as collaboration platforms.
In this course we are going to use GitHub. So, go on and create an account on GitHub
(choose the free plan), and use the same email address as you used when you set your Git
credentials earlier.
Cloning
A core feature that Git offers is the ability to clone a remote repository into your computer, in
order to continue development locally and, possibly, then push your changes. For example,
you can easily get the whole Git repository (files, history, and everything!) of pandas (a
popular Python data analysis toolkit) by doing
git clone https://github.com/pandas-dev/pandas.git
(make sure you first navigate to the directory where you want to keep the pandas repository). The
URL used above can be found on the pandas GitHub page by clicking "Code."
Forking
On GitHub you can also fork a repository, i.e. create your own copy of it under your GitHub
account. You may then want to clone the forked project onto your computer to work locally, and push
your changes. Note that your changes are pushed to your fork of the project and not to the
project you forked! You can let the project developers know about your edits by raising a
Pull request, where you describe the changes you made and allow the developers to review
them. See GitHub's guide on creating a pull request from a fork for more details.
GitHub Education
Go to https://education.github.com/ and get the Student Developer Pack for some cool
freebies.
Useful Links and Resources
• Happy Git and GitHub with R: A book by Jenny Bryan aimed at people who use R for
data analysis or who work on R packages. The book explains how Git/GitHub are
used in data science and what are the differences with the use of Git/GitHub for pure
software development.
• Git Cheat Sheets: A collection of cheat sheets and manuals about using Git and
GitHub.
• Git and GitHub section of the 2nd edition of the R packages book: An accessible
introduction to Git and GitHub, with focus on projects related to R through RStudio.
Block 02 Data
Structured, Semi-Structured, and
Unstructured Data
VLE Resources
There are extra resources for this Block on the VLE – including quizzes and videos please
make sure you review them as part of your learning. You can find them here:
https://emfss.elearning.london.ac.uk/course/view.php?id=382#section-2
There are broadly three categories of data: structured, semi-structured and unstructured. This
categorization is useful as it can direct us as to what solutions may be used for storing and
analyzing the various kinds of data, or what steps need to be taken in order to analyze the
data. However, this categorization is not as clear-cut as one would hope, and there are data
sets that are a combination of structured, unstructured and semi-structured data.
Structured Data
Structured data is data that is organized according to a predetermined set of rules. A
characteristic of structured data is that they are relatively straightforward to read, filter and
analyze, and that there is a wealth of statistical and machine learning procedures for directly
drawing insights from them. Structured data are typically represented in tabular format, in
one or more different tables, with predetermined relationships between the different rows and
columns. For example, the data that a smartwatch or a smartphone collects when someone is
jogging includes records of latitude, longitude, altitude, distance run, heart rate, and current
speed at regular times. So, we can organize the data in a tabular format as follows
The organization of the data in this way prescribes that the set of values recorded by the
smartwatch or smartphone at each timestamp will be organized in rows, and each row will
consist of the values that each variable takes, organized in a way so that each column of the
resulting table contains the values of one and only one variable. We also know exactly how to
organize a new record of the same information in the table. The existence of a predetermined
set of rules for structuring the data makes it easier to organize, search, and analyze the
information within it. To appreciate this, try to quickly spot all records after 18:40:36 on
2013-06-01, or determine whether at those timestamps the average speed is greater than 3
metres per second.
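As a small sketch of why this structure helps, here is the same kind of query in R, using a few records with the same layout as the jogging data (the values are rounded from the file shown later in these notes):
run <- data.frame(
  timestamp = c("2013-06-01 18:40:35", "2013-06-01 18:40:37", "2013-06-01 18:40:38"),
  speed = c(4.65, 2.00, 4.21)
)
late <- subset(run, timestamp > "2013-06-01 18:40:36")  # records after 18:40:36
late
mean(late$speed) > 3                                    # is the average speed above 3 m/s?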
There is a wealth of structured data sets generated both by machines and humans. Examples
of structured data sets that you have encountered include stock prices, addresses, identity
information (e.g. social security numbers, names, surname, etc), sensory data produced by
smartphones (such as the above example), flight arrival and departure details, etc.
Relational database management systems (RDBMS), which are typically queried using SQL (a
language you will encounter later), are very powerful tools for the efficient storing and handling of
structured data (e.g. searching, filtering, joining, sorting, etc.).
Unstructured Data
Unstructured data sets are data sets for which it is difficult to have a predetermined set of
rules for organizing them. Examples include images, videos, sounds (music tracks for
example), text, websites, and books. In fact, the vast majority of the data that we interact with
nowadays is unstructured!
You can appreciate the challenges involved with unstructured data if you try a simple
exercise: organize the text in this paragraph in a tabular format. You will see that there is
more than one way of doing so. Would you organize the text by words, by sentences, by
letters? Would you keep the order the words appear in? What happens with punctuation?
Clearly, the lack of structure makes unstructured data more difficult to search, organize and
analyze. This is the reason why many machine learning and statistical procedures for
unstructured data rely on the appropriate transformation of unstructured data into a structured
form. For example, topic models can be used for the discovery of topics or themes in massive
and, otherwise unstructured, collections of documents (see, for example, Blei, 2012 for a
review of basic topic models). While topic models provide insights for unstructured data,
they typically require the transformation of unstructured text into a document-term matrix
which is then structured data! As an example in R, we use the tm R package to convert text
from four documents into a document-term matrix.
library("tm")
doc1 <- "I love programming in R and hate programming in Python"
doc2 <- "I love programming in Python and hate programming in R"
doc3 <- "I love programming in Python and R"
doc4 <- "I hate programming"
## Build a corpus and a document-term matrix
corpus <- Corpus(VectorSource(c(doc1, doc2, doc3, doc4)))
dt_mat <- DocumentTermMatrix(corpus)
as.matrix(dt_mat)
    Terms
Docs and hate love programming python
   1   1    1    1           2      1
   2   1    1    1           2      1
   3   1    0    1           1      1
   4   0    1    0           1      0
Of course, the document-term matrix is nowhere close to capturing all the information that is
in the unstructured text!
The specifics of topic modelling are beyond the scope of this course, but if you want to run
the above commands on your own, you first need to type install.packages("tm") in R in
order to install the tm R package. Also, if you are curious, you can browse through the help
pages of Corpus(), VectorSource(), and DocumentTermMatrix() (e.g. by typing ?Corpus,
?VectorSource, and ?DocumentTermMatrix, respectively) to see what each of these
functions does.
Semi-Structured Data
Semi-structured data does not conform to relational databases such SQL, but still contains
some level of organization through semantic elements like tags and metadata. An example of
semi-structured data are Markdown documents! For example, the contents of the hello-
Rmd.Rmd R Markdown file that we encountered in previous weeks are
I can also ask R to run code and return the results. For example, I can
ask R to print the quote
```{r quote}
print("Do. Or do not. There is no try")
```
I can also do complex arithmetic. For example, if your R installation
could do infinite arithmetic you could see that `1/81` has all single
digits numbers from 0 to 9 repeating in its decimal, except 8!
```{r arithmetic}
print(1/81, 15)
```
Clearly, there is quite a bit of unstructured data in the above, like free text and code. But
there is a certain level of organization in R Markdown files: # is used for titles, > for quotes,
``` for code blocks, []() for links, and so on. So, extracting the code from an R
Markdown document, or counting the number of links, paragraphs, sections or quotes, is a
relatively straightforward process.
References
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
File Formats for Data Exchange
Plain Text
The simplest way to store information in a computer is in plain text. Plain text files are
perhaps the most basic (and perhaps the most reliable) form of data storage. They are not the
most sophisticated or efficient to read and write, but they have the advantage of being
readable and writable across platforms and with virtually any text editor. The typical
extension for plain text files is .txt, but they can also be encountered with other extensions or
no extension at all.
Just as an example, if you open the nile.txt file in your favourite text editor (even your
source-code editor), you will see
1120 1160 963 1210 1160 1160 813 1230 1370 1140 995 935 1110 994 1020 960
1180 799 958 1140 1100 1210 1150 1250 1260 1220 1030 1100 774 840 874 694
940 833 701 916 692 1020 1050 969 831 726 456 824 702 1120 1100 832 764 821
768 845 864 862 698 845 744 796 1040 759 781 865 845 944 984 897 822 1010
771 676 649 846 812 742 801 1040 860 874 848 890 744 749 838 1050 918 986
797 923 975 815 1020 906 901 1170 912 746 919 718 714 740
The data are from Cobb (1978) and represent successive measurements of the annual flow of
the river Nile (in \(10^8\) \(m^3\)) at Aswan, between 1871 and 1970. nile.txt contains the
flow measurements in a single line, separated by spaces.
If you open the sunset-salvo file with a text editor you will see
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data
This is a quote extracted from an academic paper worth reading, titled "Sunset Salvo," written
by one of the most prominent statisticians, John W. Tukey (Tukey, 1986). The
data in this file can be either the letters and spaces or the words themselves if, for example,
one wants to model the frequency or order of letters or words.
Delimiter-Separated Values
As we discussed, a widespread way to represent structured data is in two-dimensional arrays.
For example, the jogging data we encountered in the previous section is a two-dimensional
array:
Such data structures are ubiquitous; each line is a data record (in our case a record coming
from the smartwatch or smartphone), and each of these records consists of the values that one
or more variables take.
A common way that such structured data are stored and shared is in text files containing
delimiter-separated values. Such files are organized as follows: each line corresponds to a
separate record, and the values of the record are separated by a special character (the
delimiter), with the most common delimiters being the comma, the tab, and the colon.
For example, if you open the tab-delimited file (running_dat.tsv) with the above data in a
text editor you should see
"timestamp","latitude","longitude","altitude","distance","heart_rate","spee
d"
2013-06-01 18:40:29,50.813805,-
1.7126063,80.2000122,1805.9399512,133,4.06005860000005
2013-06-01 18:40:30,50.8138298,-
1.7126487,80,1810.0000098,133,4.55004880000001
2013-06-01 18:40:31,50.8138543,-
1.7127005,79.7999878,1814.5500586,133,2.97998050000001
2013-06-01 18:40:32,50.8138709,-
1.7127338,79.7999878,1817.5300391,133,2.96997069999998
2013-06-01 18:40:33,50.8138757,-
1.7127769,79.5999756,1820.5000098,133,3.65002439999989
2013-06-01 18:40:34,50.8138862,-
1.712826,79.5999756,1824.1500342,133,3.22998040000016
2013-06-01 18:40:35,50.8139051,-
1.7128616,79.4000244,1827.3800146,133,4.65002449999997
2013-06-01 18:40:36,50.8139326,-
1.7129115,79.4000244,1832.0300391,133,4.14990229999989
2013-06-01 18:40:37,50.8139517,-1.7129626,79.2000122,1836.1799414,133,2
2013-06-01 18:40:38,50.8139537,-
1.7129945,79.2000122,1838.1799414,133,4.21008299999994
2013-06-01 18:40:39,50.8139622,-
1.7130533,79,1842.3900244,133,5.18994140000018
TSV, and particularly CSV, files are widely used for storing and sharing data, and can be
opened by many applications, including most spreadsheet programs (like Microsoft Excel and
Libre Office Calc), and have excellent support in many programming languages (including R
and Python). Despite the C in “CSV file” standing for comma, you may encounter files with
“.csv” extension using different delimiters (e.g. colon, semi-colon, or tab).
You may be wondering what happens if one of the data values is or includes the delimiter;
this is a well-studied topic known as delimiter collision and most modern spreadsheet
applications and programming languages try hard to deal with it; see, for example,
Wikipedia’s delimiter page for more information.
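In R, such files can typically be read with a single short command; a sketch, assuming the running.csv file used in the spreadsheet examples later in these notes is in the working directory:
run <- read.csv("running.csv")   # read.delim() would be used for a tab-delimited file
head(run)                        # show the first few records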
XML
XML (Extensible Markup Language) is a markup language (like Markdown is) that defines a
set of rules for encoding information in objects called XML documents. XML is a format for
organizing structured and semi-structured data in a way that is both human-readable and
machine-readable. XML documents are typically stored in plain text files with extension
.xml.
Since its proposal by W3C in 1998, the XML format has found a diverse range of uses,
ranging from a way to store and share structured and semi-structured data, up to defining
other popular file formats. In fact, you are using XML files on a daily basis; for example,
Word document files with .docx extension from recent versions of Microsoft Office are
bundles of many XML documents (see this blog post for more information).
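An abridged sketch of the statisticians.xml document discussed below, consistent with the JSON version given later in this section:
<?xml version="1.0" encoding="UTF-8"?>
<!-- A list of famous statisticians -->
<records>
  <statistician>
    <name>John</name>
    <middle>Wilder</middle>
    <surname>Tukey</surname>
    <dob>16/06/1915</dob>
    <wiki>https://en.wikipedia.org/wiki/John_Tukey</wiki>
  </statistician>
  <statistician>
    <name>Karl</name>
    <middle></middle>
    <surname>Pearson</surname>
    <dob>27/03/1857</dob>
    <wiki>https://en.wikipedia.org/wiki/Karl_Pearson</wiki>
  </statistician>
</records>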
Let’s introduce some terminology. Clearly, an XML document is a string of characters. XML
documents typically begin with an XML declaration like <?xml version="1.0"
encoding="UTF-8"?> above. Such declarations describe information about the XML
document. The characters making up the XML document are divided into markup and
content. Whatever is between < and > is markup, and anything that is not markup is content.
For example, in the line <surname>Tukey</surname>, <surname> and </surname> are
markup, while Tukey is content. <surname> and </surname> are instances of start and end
tags, respectively, which are the main tag types XML supports. Also,
<statistician>
<name>Karl</name>
<middle></middle>
<surname>Pearson</surname>
<dob>27/03/1857</dob>
<wiki>https://en.wikipedia.org/wiki/Karl_Pearson</wiki>
</statistician>
is an element and
<surname>Pearson</surname>
is one of its child elements. Finally, XML supports comments by enclosing text between <!--
and -->. For example, <!-- A list of famous statisticians -->.
XML is a great format for storing hierarchical data. For example, as is apparent by the
statisticians example above, the contents of an XML document can be graphically
represented as a tree. The top node is records, which has six children nodes statistician,
each of which has five children nodes, name, middle, surname, dob, and wiki.
An XML parser is a special bit of software that processes XML documents and passes the
structured information within them to another application or writes it to a file. Mainstream
programming languages, like R and Python, have packages and modules that allow parsing
XML files.
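As a sketch of XML parsing in R, using the xml2 package (which would need to be installed first with install.packages("xml2")) and assuming a statisticians.xml file like the one sketched above is in the working directory:
library("xml2")
statisticians <- read_xml("statisticians.xml")
xml_text(xml_find_all(statisticians, ".//surname"))   # extract all the surnames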
More technical details on XML (e.g. metadata and attributes on tags, XML schemas, etc) and
extensions (e.g. XML namespaces) can be found at Wikipedia’s XML page.
JSON
JavaScript Object Notation (JSON) is a data interchange file format, which, like XML, aims
at being both human-readable and machine-readable. JSON was originally used in JavaScript,
but many modern programming languages (like R and Python) provide great support for
generation and parsing of JSON-format data.
JSON is seeing a diverse range of technological applications, being used for storing and
sharing data sets, for communication between different devices and programs (e.g. server-
client communication), for interfacing with APIs, etc. In fact, your web browser and the apps
on your mobile phone or tablet are creating, processing, sending and receiving hundreds of
JSON files every day. One of the reasons for its popularity is that it is a lightweight format to
store and transmit information, and it is easy and fast to parse (using a JSON parser). For
example, here is the same statisticians data set, this time in JSON format:
[
{
"name": "Ronald",
"middle": "Aylmer",
"surname": "Fisher",
"dob": "17/02/1890",
"wiki": "https://en.wikipedia.org/wiki/Ronald_Fisher"
},
{
"name": "William",
"middle": "Sealy",
"surname": "Gosset",
"dob": "13/08/1876",
"wiki": "https://en.wikipedia.org/wiki/William_Sealy_Gosset"
},
{
"name": "David",
"middle": "Roxbee",
"surname": "Cox",
"dob": "15/07/1924",
"wiki": "https://en.wikipedia.org/wiki/David_Cox_(statistician)"
},
{
"name": "Thomas",
"middle": null,
"surname": "Bayes",
"dob": "07/04/1761",
"wiki": "https://en.wikipedia.org/wiki/Thomas_Bayes"
},
{
"name": "Karl",
"middle": null,
"surname": "Pearson",
"dob": "27/03/1857",
"wiki": "https://en.wikipedia.org/wiki/Karl_Pearson"
},
{
"name": "John",
"middle": "Wilder",
"surname": "Tukey",
"dob": "16/06/1915",
"wiki": "https://en.wikipedia.org/wiki/John_Tukey"
}
]
Clearly, this file contains the same information as statisticians.xml but in a less verbose
manner (fewer characters; e.g. there are no end tags), sacrificing a bit in terms of the detail that
can be transmitted along with the file (e.g. no support for comments or metadata, no
namespaces, etc.).
JSON supports some basic variable types, including strings, numbers, Booleans, null, arrays,
and objects. For example, the below
{
"language": "Python",
"release": 1991,
"os": ["Linux", "macOS", "Windows"],
"oo": true,
"pastnames": null
}
is a JSON object (what is between { and }), with five key-value combinations;
e.g. “language” is a key and “Python” is the value. The value of “release” is a number, the
value of “os” is an array, the value of “oo” (object-oriented) is a Boolean, and “pastnames” is
null.
More details on JSON and extensions can be found at Wikipedia’s JSON page and at
json.org.
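As a sketch of JSON parsing in R, using the jsonlite package (which would need to be installed first with install.packages("jsonlite")) and assuming the statisticians data is saved as statisticians.json in the working directory:
library("jsonlite")
statisticians <- fromJSON("statisticians.json")
statisticians$surname   # fromJSON() simplifies this array of objects into a data frame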
Spreadsheets
A spreadsheet is a computer application for the organization, analysis and storage of tabular
data. Examples of spreadsheet applications are Microsoft Excel, LibreOffice Calc and Apple Numbers.
They all support a variety of formats for storing tabular data. For example, opening up
running.csv in Microsoft Excel one sees the following (you get a similar picture if you open
running.csv in LibreOffice)
Screenshot of Microsoft Excel showing the data in running.csv
The data are opened in a sheet, which is part of a workbook. A sheet is customarily a two-
dimensional array with rows named with numbers and columns named by letters. Every
combination of number and letter corresponds to a cell, and each value in the data occupies a
single cell.
Modern spreadsheet applications allow for the computation of summaries using in-cell
formulas, provide data manipulation utilities and tools to carry out a range of statistical
analyses and produce visualizations. A description of spreadsheet features is out of the scope
of this course. YouTube provides a range of Excel quick tours for a range of levels (e.g. this
one is for absolute beginners and this one is for more advanced users). Also, LibreOffice
provides a range of guides for its applications, including Calc; see here.
Spreadsheets also provide alternative formats for saving tabular data, along with any
formulas or visualizations that have been produced. For example, the default file extension
for Microsoft Excel is .xlsx, for LibreOffice Calc is .ods (which are both bundles of XML
files), and for Apple Numbers is .numbers.
Other Data Formats
There is a wide variety of other file formats (and database systems!) to store structured, semi-
structured, and unstructured data. Many file formats are associated with software for the
statistical analysis of data, like SAS, SPSS and MINITAB, or have been invented for
particular purposes, such as the nifti data format that is used in neuroimaging applications.
There are also binary data formats, that is data formats where the data are not stored or
communicated as text, as is done, for example, for CSV, TSV, XML, and JSON. Examples of
such data formats are NASA’s HDF5 (hierarchical data format) and UCAR’s netCDF data
files (network common data form), which allow users to store scientific data in array-oriented
ways, including descriptions, labels, formats, units, etc. Other binary data formats are those
used for images, such as PNG (portable network graphics) and JPEG (Joint Photographic
Experts Group format), and for videos, such as MP4 (MPEG-4).
References
Cobb, G. W. (1978). The problem of the Nile: Conditional solution to a changepoint problem.
Biometrika, 65(2), 243–251.
Tukey, J. W. (1986). Sunset salvo. The American Statistician, 40(1), 72–76.
Import/Export of Data-Exchange
Files in R
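A minimal sketch of the import discussed below (assuming nile.txt, a text file with the annual
flow of the river Nile, is in the working directory):
nile <- scan("nile.txt")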
Note here that we assign the result of scan("nile.txt") to a variable called nile, and we
assume that nile.txt is in our working directory (type getwd() to see your working
directory). If nile.txt is not in your working directory, section “Installing and Interacting
with R” describes how you can change it. Alternatively, you can use
scan("/path/to/nile.txt"), where /path/to/ is the path to the directory where nile.txt
lives, or, if you are using the terminal, you can navigate into the directory where nile.txt is
and start R from there. In what follows, we assume that all the data files we import data from
are in the working directory.
Each call to scan() can read a particular type of data. For example, the above line of code
read the contents of nile.txt as double (double precision numbers).
typeof(nile)
[1] "double"
The typeof() function can be used to determine the R internal type or storage mode of any
object (here of nile).
So, a histogram of the annual flow of the river Nile between 1871 and 1970 is
hist(nile)
If we want to read the contents of a file as characters, for example the file sunset-salvo-sem, which contains a famous quote by Tukey, we do
The;combination;of;some;data;and;an;aching;desire;for;an;answer;does;not;en
sure;that;a;reasonable;answer;can;be;extracted;from;a;given;body;of;data
Hence, the words are delimited by semi-colon (;), and we need to tell that to scan() in order
to extract them:
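A sketch of the two calls just described (first reading the file as-is, then with the separator set
to a semi-colon):
scan("sunset-salvo-sem", what = character())
sunset_words <- scan("sunset-salvo-sem", what = character(), sep = ";")
sunset_words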
scan() is one of the most basic utilities for reading data in R. It may seem simple at first but
it offers a wide range of options, including skipping lines, reading only the first few lines of a
file and reading different data types in tabular data; for example, we can read the data in
running_dat.tsv by skipping the first line with the names, specifying that the file is tab-
delimited (sep = "\t"), and that the first column is a character and the others are numbers
(note that the type of data is specified here by simply supplying a list of a character and a
number to what).
scan("running_dat.tsv", what = list("", 1.0, 1.0, 1.0, 1.0, 1.0, 1.0), sep
= "\t", skip = 1)
[[1]]
[1] "2013-06-01 18:40:29" "2013-06-01 18:40:30" "2013-06-01 18:40:31"
[4] "2013-06-01 18:40:32" "2013-06-01 18:40:33" "2013-06-01 18:40:34"
[7] "2013-06-01 18:40:35" "2013-06-01 18:40:36" "2013-06-01 18:40:37"
[10] "2013-06-01 18:40:38" "2013-06-01 18:40:39"
[[2]]
 [1] 50.81381 50.81383 50.81385 50.81387 50.81388 50.81389 50.81391 50.81393
 [9] 50.81395 50.81395 50.81396
[[3]]
[1] -1.712606 -1.712649 -1.712700 -1.712734 -1.712777 -1.712826 -1.712862
[8] -1.712911 -1.712963 -1.712994 -1.713053
[[4]]
 [1] 80.20001 80.00000 79.79999 79.79999 79.59998 79.59998 79.40002 79.40002
 [9] 79.20001 79.20001 79.00000
[[5]]
 [1] 1805.94 1810.00 1814.55 1817.53 1820.50 1824.15 1827.38 1832.03 1836.18
[10] 1838.18 1842.39
[[6]]
 [1] 133 133 133 133 133 133 133 133 133 133 133
[[7]]
 [1] 4.060059 4.550049 2.979981 2.969971 3.650024 3.229980 4.650024 4.149902
 [9] 2.000000 4.210083 5.189941
Export
A convenient way to write lines to a file is the function cat(). cat() is used to output
objects either in the standard output connection (typically directly in the R prompt), or to a
file. For example, a call like the one sketched at the end of this paragraph
will create a file called nile1.txt in the working directory, where the values stored in nile
are separated by ,. See ?cat for more information. Other useful, low-level functions to
export lines whose help pages you may want to check out are write and writeLines.
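A minimal sketch (assuming nile holds the values read from nile.txt earlier):
cat(nile, file = "nile1.txt", sep = ",")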
Delimiter-Separated Values
Import
The R function read.table() is the principal means for importing tables from text files
containing delimiter-separated values (like CSV files, for example). It uses scan() internally
but it packs it in a convenient interface. For example, we can import the data in
running_dat.tsv by doing
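A sketch of the call (its arguments are explained in the next few sentences):
running <- read.table("running_dat.tsv", header = TRUE, sep = "\t")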
str(running)
The ability to automatically figure out the variable types comes in extremely handy when it
comes to large structured data sets or data sets that we are not otherwise familiar with.
Secondly, we tell read.table() that the first row of running_dat.tsv has the variable
names, by setting the argument header to TRUE, and that running_dat.tsv is tab-separated.
or simply
read.csv("running_dat.csv")
             timestamp latitude longitude altitude distance heart_rate    speed
1  2013-06-01 18:40:29 50.81381 -1.712606 80.20001  1805.94        133 4.060059
2  2013-06-01 18:40:30 50.81383 -1.712649 80.00000  1810.00        133 4.550049
3  2013-06-01 18:40:31 50.81385 -1.712700 79.79999  1814.55        133 2.979981
4  2013-06-01 18:40:32 50.81387 -1.712734 79.79999  1817.53        133 2.969971
5  2013-06-01 18:40:33 50.81388 -1.712777 79.59998  1820.50        133 3.650024
6  2013-06-01 18:40:34 50.81389 -1.712826 79.59998  1824.15        133 3.229980
7  2013-06-01 18:40:35 50.81391 -1.712862 79.40002  1827.38        133 4.650024
8  2013-06-01 18:40:36 50.81393 -1.712911 79.40002  1832.03        133 4.149902
9  2013-06-01 18:40:37 50.81395 -1.712963 79.20001  1836.18        133 2.000000
10 2013-06-01 18:40:38 50.81395 -1.712994 79.20001  1838.18        133 4.210083
11 2013-06-01 18:40:39 50.81396 -1.713053 79.00000  1842.39        133 5.189941
In fact, read.csv() (for comma-separated values using . for decimal points) and
read.csv2() (for semi-colon-separated values using , for decimal points) are convenience
functions that are the same as read.table() apart from having different default arguments
(e.g. read.csv() has sep = ",", dec = "." and header = TRUE by default; see
?read.table for details and type read.csv, without parentheses, at the R prompt to see that it simply calls
read.table()). The same is true for read.delim() (for tab-delimited files using . for
decimal points) and read.delim2() (for tab-delimited files using , for decimal points). For
example, treating sunset-salvo-sem as semi-colon-delimited tabular data gives
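A sketch of the call (the result is a one-row data frame with the words as values and the
default column names V1 to V28):
read.csv2("sunset-salvo-sem", header = FALSE)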
Note here that we had to set header = FALSE in order to let read.csv2() (and implicitly
read.table()) know that sunset-salvo-sem does not have names and the imported data
get the default column names V1 to V28. Otherwise,
read.csv2("sunset-salvo-sem")
[1] The combination of some data and
[7] an aching desire for. an.1 answer
[13] does not ensure that a reasonable
[19] answer.1 can be extracted from a.1
[25] given body of.1 data.1
<0 rows> (or 0-length row.names)
which imports the words in sunset-salvo-sem as the column names and imports no (0 rows
of) observations! Now, try to import the data in running_dat.csv with header = FALSE to
see what happens.
readr is a noteworthy R package when it comes to importing tabular data from files, offering
fast reading of tabular data, the option to provide column specifications, and many other
features. See the readr’s vignettes for an introduction.
Export
We also can write data in files with delimiter-separated values. For example, if for some
reason we wanted to write the data in running in a file with &&-delimited values with decimal
characters represented by * (both quite atypical choices!), we do
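A sketch of the export (assuming running is the data frame imported earlier; the remaining
arguments of write.table() are left at their defaults):
write.table(running, file = "running.dat", sep = "&&", dec = "*")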
We can quickly inspect the lines in the file that was generated by doing
readLines("running.dat")
[1]
"\"timestamp\"&&\"latitude\"&&\"longitude\"&&\"altitude\"&&\"distance\"&&\"
heart_rate\"&&\"speed\""
[2] "\"1\"&&\"2013-06-01 18:40:29\"&&50*813805&&-
1*7126063&&80*2000122&&1805*9399512&&133&&4*06005860000005"
[3] "\"2\"&&\"2013-06-01 18:40:30\"&&50*8138298&&-
1*7126487&&80&&1810*0000098&&133&&4*55004880000001"
[4] "\"3\"&&\"2013-06-01 18:40:31\"&&50*8138543&&-
1*7127005&&79*7999878&&1814*5500586&&133&&2*97998050000001"
[5] "\"4\"&&\"2013-06-01 18:40:32\"&&50*8138709&&-
1*7127338&&79*7999878&&1817*5300391&&133&&2*96997069999998"
[6] "\"5\"&&\"2013-06-01 18:40:33\"&&50*8138757&&-
1*7127769&&79*5999756&&1820*5000098&&133&&3*65002439999989"
[7] "\"6\"&&\"2013-06-01 18:40:34\"&&50*8138862&&-
1*712826&&79*5999756&&1824*1500342&&133&&3*22998040000016"
[8] "\"7\"&&\"2013-06-01 18:40:35\"&&50*8139051&&-
1*7128616&&79*4000244&&1827*3800146&&133&&4*65002449999997"
[9] "\"8\"&&\"2013-06-01 18:40:36\"&&50*8139326&&-
1*7129115&&79*4000244&&1832*0300391&&133&&4*14990229999989"
[10] "\"9\"&&\"2013-06-01 18:40:37\"&&50*8139517&&-
1*7129626&&79*2000122&&1836*1799414&&133&&2"
[11] "\"10\"&&\"2013-06-01 18:40:38\"&&50*8139537&&-
1*7129945&&79*2000122&&1838*1799414&&133&&4*21008299999994"
[12] "\"11\"&&\"2013-06-01 18:40:39\"&&50*8139622&&-
1*7130533&&79&&1842*3900244&&133&&5*18994140000018"
See ?write.table for more options. There are also the convenience functions write.csv()
and write.csv2() that simply have different default arguments to write.table().
XML
Import
There are several ways to import XML data into R, with the most commonly used being the
XML and xml2 R packages.
Of the two, XML is perhaps the more feature-rich, providing options for very fine
control of the process of importing and exporting XML files (see the pages of the Omega
project for tutorials and details), but xml2 is particularly straightforward to use, because it
takes care of several details relating to XML files automatically (see the xml2 page for
details).
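A minimal sketch of the import with xml2 (assuming statisticians.xml is in the working
directory):
library("xml2")
stats_people <- read_xml("statisticians.xml")
stats_people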
Printing the object stats_people we see the first tag <records> and a printout of its 6
<statistician> children. xml2 provides functions to interrogate stats_people.
For example, xml_name() and xml_children() can be used to extract the tag name of
stats_people, and its children.
xml_name(stats_people)
[1] "records"
xml_children(stats_people)
{xml_nodeset (6)}
[1] <statistician>\n <name>Ronald</name>\n <middle>Aylmer</middle>\n
<surn ...
[2] <statistician>\n <name>William</name>\n <middle>Sealy</middle>\n
<surn ...
[3] <statistician>\n <name>David</name>\n <middle>Roxbee</middle>\n
<surna ...
[4] <statistician>\n <name>Thomas</name>\n <middle/>\n
<surname>Bayes</sur ...
[5] <statistician>\n <name>Karl</name>\n <middle/>\n
<surname>Pearson</sur ...
[6] <statistician>\n <name>John</name>\n <middle>Wilder</middle>\n
<surnam ...
If we want to extract the surnames of the famous statisticians we first have to find all children
with tag <surname>. This can be done using XPath (XML Path language) syntax, which
allows us to select nodes from an XML document (see the w3schools.com tutorial for learning
XPath). We do
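A sketch of the XPath query:
surname_nodes <- xml_find_all(stats_people, "//surname")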
xml_text(surname_nodes)
[1] "Fisher" "Gosset" "Cox" "Bayes" "Pearson" "Tukey"
library("XML")
stats_people_df <- xmlToDataFrame("statisticians.xml")
stats_people_df
name middle surname dob
1 Ronald Aylmer Fisher 17/02/1890
2 William Sealy Gosset 13/08/1876
3 David Roxbee Cox 15/07/1924
4 Thomas Bayes 07/04/1761
5 Karl Pearson 27/03/1857
6 John Wilder Tukey 16/06/1915
wiki
1 https://en.wikipedia.org/wiki/Ronald_Fisher
2 https://en.wikipedia.org/wiki/William_Sealy_Gosset
3 https://en.wikipedia.org/wiki/David_Cox_(statistician)
4 https://en.wikipedia.org/wiki/Thomas_Bayes
5 https://en.wikipedia.org/wiki/Karl_Pearson
6 https://en.wikipedia.org/wiki/John_Tukey
Bear in mind though, that in most cases, it is not possible to represent data in XML files in
tabular form. See ?xmlToDataFrame for which data types are supported and about what
xmlToDataFrame() can do.
Export
Objects like stats_people, i.e. objects that represent XML documents or nodes (in R
terminology, objects of class xml_document or xml_node), can be written to disk using the
write_xml() function from xml2. This is useful, for example, if we have an R script whose
results need to be exported as XML to be later read by another utility in a data-analytic
pipeline.
For example, if we want to remove Thomas Bayes’ record from stats_people and write the
result on disk we do
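A sketch of this (the output file name is illustrative):
xml_remove(xml_children(stats_people)[[4]])
write_xml(stats_people, "statisticians-no-bayes.xml")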
We knew that Thomas Bayes’ record was at the fourth node (which is why [[4]] above is used
to extract the fourth record) by looking at the order in which the surnames appeared when we
extracted them earlier.
library("jsonlite")
stats_people <- fromJSON("statisticians.json")
stats_people
name middle surname dob
1 Ronald Aylmer Fisher 17/02/1890
2 William Sealy Gosset 13/08/1876
3 David Roxbee Cox 15/07/1924
4 Thomas <NA> Bayes 07/04/1761
5 Karl <NA> Pearson 27/03/1857
6 John Wilder Tukey 16/06/1915
wiki
1 https://en.wikipedia.org/wiki/Ronald_Fisher
2 https://en.wikipedia.org/wiki/William_Sealy_Gosset
3 https://en.wikipedia.org/wiki/David_Cox_(statistician)
4 https://en.wikipedia.org/wiki/Thomas_Bayes
5 https://en.wikipedia.org/wiki/Karl_Pearson
6 https://en.wikipedia.org/wiki/John_Tukey
There is also a function read_json(), which will read the contents of the JSON document
and bring it into R as a nested list (we will see what this is in later sections), which is somewhat
closer to the hierarchy present in a JSON document.
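In the other direction, jsonlite’s toJSON() converts an R object to JSON; a sketch using the
running data frame imported earlier:
running_json <- toJSON(running)
running_json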
This long string is the minified version of the JSON representation of the data in running. We
can prettify it (or indent the document) to more clearly see what toJSON() does by doing
prettify(running_json)
[
{
"timestamp": "2013-06-01 18:40:29",
"latitude": 50.8138,
"longitude": -1.7126,
"altitude": 80.2,
"distance": 1805.94,
"heart_rate": 133,
"speed": 4.0601
},
{
"timestamp": "2013-06-01 18:40:30",
"latitude": 50.8138,
"longitude": -1.7126,
"altitude": 80,
"distance": 1810,
"heart_rate": 133,
"speed": 4.55
},
{
"timestamp": "2013-06-01 18:40:31",
"latitude": 50.8139,
"longitude": -1.7127,
"altitude": 79.8,
"distance": 1814.5501,
"heart_rate": 133,
"speed": 2.98
},
{
"timestamp": "2013-06-01 18:40:32",
"latitude": 50.8139,
"longitude": -1.7127,
"altitude": 79.8,
"distance": 1817.53,
"heart_rate": 133,
"speed": 2.97
},
{
"timestamp": "2013-06-01 18:40:33",
"latitude": 50.8139,
"longitude": -1.7128,
"altitude": 79.6,
"distance": 1820.5,
"heart_rate": 133,
"speed": 3.65
},
{
"timestamp": "2013-06-01 18:40:34",
"latitude": 50.8139,
"longitude": -1.7128,
"altitude": 79.6,
"distance": 1824.15,
"heart_rate": 133,
"speed": 3.23
},
{
"timestamp": "2013-06-01 18:40:35",
"latitude": 50.8139,
"longitude": -1.7129,
"altitude": 79.4,
"distance": 1827.38,
"heart_rate": 133,
"speed": 4.65
},
{
"timestamp": "2013-06-01 18:40:36",
"latitude": 50.8139,
"longitude": -1.7129,
"altitude": 79.4,
"distance": 1832.03,
"heart_rate": 133,
"speed": 4.1499
},
{
"timestamp": "2013-06-01 18:40:37",
"latitude": 50.814,
"longitude": -1.713,
"altitude": 79.2,
"distance": 1836.1799,
"heart_rate": 133,
"speed": 2
},
{
"timestamp": "2013-06-01 18:40:38",
"latitude": 50.814,
"longitude": -1.713,
"altitude": 79.2,
"distance": 1838.1799,
"heart_rate": 133,
"speed": 4.2101
},
{
"timestamp": "2013-06-01 18:40:39",
"latitude": 50.814,
"longitude": -1.7131,
"altitude": 79,
"distance": 1842.39,
"heart_rate": 133,
"speed": 5.1899
}
]
Export
We can write tabular objects like running as JSON to disk, using jsonlite’s write_json()
function. For example,
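A minimal sketch (the file name is illustrative):
write_json(running, "running.json")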
Spreadsheets
There are various packages for reading spreadsheet formats. For example, the readxl R package
provides excellent out-of-the-box support for reading Microsoft Excel files (with extensions
.xls for older versions, and .xlsx for newer versions of Microsoft Excel). See readxl’s
page for demonstrations. Also, the writexl package allows writing R objects with rectangular data
into .xlsx files. See writexl’s page for demonstrations.
writexl is part of a collection of packages by the rOpenSci initiative, which is worth taking a
look at.
The readODS R package provides utilities to read and write Open Document Spreadsheets
(with extension .ods) files from R. See the package’s vignettes for more information.
R supports several basic data types, including:
• Logical
• Character
• Integer
• Numeric
• Factor.
Objects with these data types can be organized into data structures, such as
• Atomic vector
• Matrix
• Array
• List
• Data frame.
Logical
One of the most useful data types in R (and programming in general) is logical. An object of
type logical in R has two possible values, TRUE and FALSE. For example, if we do
a <- TRUE
b <- FALSE
we have created two objects of type logical with names a and b and values TRUE and FALSE,
respectively.
class(a)
[1] "logical"
R provides the key logical operators to carry out Boolean algebra. In particular,
Operator Description
& AND (conjunction)
| OR (disjunction)
! NOT (negation)
So, we can easily reproduce the results of the basic Boolean algebra operations in R
a & a
[1] TRUE
b & b
[1] FALSE
a & b
[1] FALSE
a | a
[1] TRUE
b | b
[1] FALSE
a | b
[1] TRUE
!a
[1] FALSE
!b
[1] TRUE
Character
A character object represents text (or string) values in R. We use double or single quotation
marks to represent text, and we can combine single and double quotes if we want to make
quotes part of the string. For example,
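A few sketches (the object names and values are illustrative):
str1 <- "Data"
str2 <- 'Science'
str3 <- 'He said "hello"'   # double quotes inside single quotes
cat(str1, str2, "\n")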
Adding "\n" tells R to break a line after concatenating the strings. See ?Quotes for a
descriptions of the various “character constants” (also known as special characters) in R.
On the other hand, if we omit the quotation marks, R looks for an object with that name:
str6 <- Structures
Error in eval(expr, envir, enclos): object 'Structures' not found
In the last code chunk, R tries, unsuccessfully, to find an object called Structures and
assign its value to str6. Note that this could be potentially dangerous, and definitely not what
we had in mind (str6 <- "Structures"), if an object named Structures actually existed!
x <- 2.5
class(x)
[1] "numeric"
z <- 2
class(z)
[1] "numeric"
Even though the number 2 is an integer, R assigns z above as numeric by default. If we want
an integer value we have to explicitly write
z <- as.integer(5)
z
[1] 5
class(z)
[1] "integer"
z <- 5L
z
[1] 5
class(z)
[1] "integer"
As you may expect, R can be used to do basic arithmetic. Below are the basic arithmetic
operators:
• Addition: +
• Subtraction: -
• Multiplication: *
• Division: /
• Exponentiation: ^ or **
• Modulo: %%
# Addition
3 + 2
[1] 5
# Subtraction
5 - 2
[1] 3
# Multiplication
5 * 3
[1] 15
5 * -2
[1] -10
5 * 2.5
[1] 12.5
# Division
5 / 3
[1] 1.666667
5 / -2
[1] -2.5
5 / 2.5
[1] 2
# Exponentiation
2**2
[1] 4
5^2.5
[1] 55.9017
5^-2
[1] 0.04
# Modulus
10 %% 3
[1] 1
9 %% 3
[1] 0
We can also compare numbers with the result of the comparison being a logical. For example
2 > 4
[1] FALSE
exp(pi) < pi^exp(1)
[1] FALSE
log(4) < 4
[1] TRUE
a <- 2.321
a == 2.321
[1] TRUE
We can convert (coerce) objects from one type to another using the as.*() family of
functions. For example,
as.character(FALSE)
[1] "FALSE"
as.character(1.123)
[1] "1.123"
as.integer(1.123)
[1] 1
as.logical(4)
[1] TRUE
as.logical(0)
[1] FALSE
as.numeric(1L)
[1] 1
as.numeric("abc")
Warning: NAs introduced by coercion
[1] NA
as.list(mtcars)
$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8
19.7
[31] 15.0 21.4
$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
$disp
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6
275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0
350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
$hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230
66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
$drat
[1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07
2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22
3.62
[31] 3.54 4.11
$wt
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435
3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
$qsec
[1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90
17.40
[13] 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30
15.41
[25] 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60
$vs
[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
$am
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
$gear
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
$carb
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
We can also test the type of an object using is.*() functions. For example,
is.character(3)
[1] FALSE
is.character("hello")
[1] TRUE
is.logical(4)
[1] FALSE
is.logical(TRUE)
[1] TRUE
is.numeric(4)
[1] TRUE
is.numeric(FALSE)
[1] FALSE
is.complex(FALSE)
[1] FALSE
is.complex(43.3)
[1] FALSE
is.complex(43.3i + 2)
[1] TRUE
NA and NaN
NA (not available) is a logical constant in R of length 1 which contains a missing value
indicator. For example, if you compute the mean of a character, or try to convert a character
to numeric, you get NA
mean("a")
Warning in mean.default("a"): argument is not numeric or logical: returning
NA
[1] NA
as.numeric("b")
Warning: NAs introduced by coercion
[1] NA
NAs occur often when subsetting vectors, as we see in the next section.
Another special object is NaN (not a number), which is often confused with NA. For
example, the logarithm and the square root of \(-1\) are NaN, simply because they do not exist in the
reals.
log(-1)
Warning in log(-1): NaNs produced
[1] NaN
sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN
Note though that NaN is always NA, but NA is not always NaN, as the following tests show.
is.na(NA)
[1] TRUE
is.nan(NA)
[1] FALSE
is.na(NaN)
[1] TRUE
is.nan(NaN)
[1] TRUE
Also, note that almost every operation performed with NA or NaN will return NA or NaN.
NA + 1
[1] NA
NA + NaN
[1] NA
NaN * 3
[1] NaN
3^NaN
[1] NaN
but
length(NA)
[1] 1
length(NaN)
[1] 1
Vectors
Vectors are one-dimensional arrays that can hold data. They are simple objects to store
numeric data, character data and logical data. Bear in mind that a vector can contain only one
single data type. You cannot have different data types inside the same vector.
In R, we can create a vector with the combine function c(); the vector elements are
separated by commas inside the parentheses. We can also give names to the elements of our
vectors, which is very useful for keeping track of what each element represents. To name the
elements of a vector, you can use the names() function, as in the sketch below.
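A minimal sketch (the values and names match the marks vector used below):
marks <- c(80, 75, 90, 99, 100)
names(marks) <- c("John", "Jane", "Richard", "Emma", "Tim")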
Now that we know how to create vectors, we can perform calculations with them. These are
performed element-wise. Here is an example:
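For instance (a sketch reusing the marks vector above):
marks / 10                   # divides every element by 10
marks + c(1, 2, 3, 4, 5)     # element-wise addition of two vectors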
In addition, here are some useful functions that can be used with vectors:
marks
John Jane Richard Emma Tim
80 75 90 99 100
length(marks)
[1] 5
sum(marks)
[1] 444
min(marks)
[1] 75
max(marks)
[1] 100
mean(marks)
[1] 88.8
median(marks)
[1] 90
sd(marks)
[1] 11.16692
var(marks)
[1] 124.7
marks2 <- c(85, 50, 64, 95, 45)
marks2
[1] 85 50 64 95 45
cov(marks, marks2)
[1] 27.95
cor(marks, marks2)
[1] 0.1152433
sort(marks, decreasing = FALSE)
Jane John Richard Emma Tim
75 80 90 99 100
sort(marks, decreasing = TRUE)
Tim Emma Richard John Jane
100 99 90 80 75
We can even make element-wise comparisons between vectors using relational operators; for
example, we can compare average daily weekday temperatures for two weeks, stored in
temperature_week_1 and temperature_week_2 respectively. Now, suppose we want to
select specific elements of a vector. R lets us do this using square brackets [ ]. Finally, you
can also pass a logical vector inside the square brackets, and R will only return the elements
that correspond to TRUE; for week 1, we may only want the day(s) where the temperature was
higher than in week 2. The sketch below illustrates all of these operations.
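A sketch (the temperatures match the vectors used in the next subsection):
temperature_week_1 <- c(Monday = 10, Tuesday = 11, Wednesday = 12, Thursday = 15, Friday = 13)
temperature_week_2 <- c(Monday = 10, Tuesday = 9, Wednesday = 13, Thursday = 15, Friday = 16)
temperature_week_1 > temperature_week_2                       # element-wise comparison
temperature_week_1[c(1, 5)]                                   # selection by index
temperature_week_1[temperature_week_1 > temperature_week_2]   # logical selection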
Factors
We first construct a character vector called colour_vector with five character values “red,”
“blue,” “yellow,” “green” and “grey.” Using the factor() function, we convert it into a
factor variable called colour_vector_factor. After printing colour_vector_factor, R
conveniently displays the levels (or the categories) of our factor variable.
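A sketch of this construction:
colour_vector <- c("red", "blue", "yellow", "green", "grey")
colour_vector_factor <- factor(colour_vector)
colour_vector_factor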
There are two types of categorical variables, namely “ordinal categorical variables” and
“nominal categorical variables.” Nominal categorical variables have categories without an
implied order, that is, it is not possible to say that “one category ranks higher or lower than
the other.” An example is hair colour. On the other hand, ordinal categorical variables have
categories with a natural ordering. For instance, suppose the categorical variable
customer_satisfaction has the categories “Low,” “Medium” and “High.” Then,
clearly, “Low” ranks below “Medium,” which in turn ranks below “High.”
Sometimes, the factor levels don’t have entirely meaningful names. For example, we might
have “L,” “M” and “H” instead of “Low” , “Medium” and “High.” Fortunately, R lets us
change the names of the levels (or categories) using the levels() function as follows:
Note that we did not specify the order of the levels. R automatically ranked the categories in
alphabetical order. This is why “H” is before “L” and “L” is before “M.” When using the
levels() function, we should be careful to respect this ordering, else R will not name the
levels correctly.
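A sketch of renaming levels (the underlying values are illustrative):
satisfaction <- factor(c("H", "M", "L", "H", "M"))
levels(satisfaction)                          # "H" "L" "M", in alphabetical order
levels(satisfaction) <- c("High", "Low", "Medium")
satisfaction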
To explicitly instruct R to order the factors, we should set ordered = TRUE when calling the
factor() function. We can also specify the levels in the same command. Here is an
example:
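A sketch (the speeds are illustrative values):
speed_vector <- c("medium", "slow", "fast", "medium", "fast")
speed_factor <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
speed_factor[1] > speed_factor[2]    # is "medium" ranked above "slow"?
summary(speed_factor)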
Here, we create a factor variable called speed_factor with three levels, namely “slow,”
“medium” and “fast” in that order. We then compare a few elements of the speed_factor
variable to see whether they rank above or below each other. Finally, we use the summary()
function to obtain a convenient summary of our factor.
Matrices
A matrix is similar to a vector in the sense that it holds elements of the same data type.
However, a matrix is two-dimensional whereas a vector is only one-dimensional. This means
that the elements of a matrix are arranged into a number of rows and columns. You cannot
have different data types inside the same matrix.
You can also create matrices from a collection of vectors. Here, we will be using the
temperature vectors from the previous subsection:
temperature_week_1
Monday Tuesday Wednesday Thursday Friday
10 11 12 15 13
temperature_week_2
Monday Tuesday Wednesday Thursday Friday
10 9 13 15 16
# Combine temperature_week_1 and temperature_week_2 into a single vector
temperature_combined <- c(temperature_week_1, temperature_week_2)
temperature_combined
   Monday   Tuesday Wednesday  Thursday    Friday    Monday   Tuesday Wednesday
       10        11        12        15        13        10         9        13
 Thursday    Friday
       15        16
# Converting into a matrix
temperature_week_1_and_2 <- matrix(temperature_combined, byrow = FALSE, nrow = 2)
temperature_week_1_and_2
     [,1] [,2] [,3] [,4] [,5]
[1,]   10   12   13    9   15
[2,]   11   15   10   13   16
# R does not throw any error, but does the above matrix make sense in our context? What is wrong?
# Let us try setting byrow = TRUE
temperature_week_1_and_2 <- matrix(temperature_combined, byrow = TRUE, nrow = 2)
temperature_week_1_and_2
     [,1] [,2] [,3] [,4] [,5]
[1,]   10   11   12   15   13
[2,]   10    9   13   15   16
# This matrix makes more sense as the data for each week is shown on a separate row
We can also use the rownames() and colnames() functions to give names to the rows and
the columns of a matrix. For example,
temperature_week_1_and_2
[,1] [,2] [,3] [,4] [,5]
[1,] 10 11 12 15 13
[2,] 10 9 13 15 16
weeks <- c("Week 1", "Week 2")
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
rownames(temperature_week_1_and_2) <- weeks
colnames(temperature_week_1_and_2) <- weekdays
temperature_week_1_and_2
Monday Tuesday Wednesday Thursday Friday
Week 1 10 11 12 15 13
Week 2 10 9 13 15 16
Just like vectors, we can perform a lot of interesting calculations on matrices. In addition to
the functions introduced in the previous subsection, here are four additional functions that are
useful when working with matrices:
• rowSums(): Returns a vector where each element represents the total of the values of
the corresponding row.
• colSums(): Returns a vector where each element represents the total of the values of
the corresponding column.
• rbind(): Adds a row or multiple rows to a matrix. It can also be used to merge
multiple matrices or vectors together by row.
• cbind(): Adds a column or multiple columns to a matrix. It can also be used to merge
multiple matrices or vectors together by column.
In the following demo, we are interested in the daily earnings of two merchants, John and
Jane over a period of one week. First, we create an earnings vector for John
(earnings_John) and Jane (earnings_Jane), respectively. Then, we combine them into a
named matrix (earnings_combined) and we display the results:
If we want to add the earnings for Saturday and Sunday, we first create a named matrix to
store the weekend earnings (earnings_combined_weekend) and then merge it with
earnings_combined using cbind() to create a new matrix called earnings_whole_week.
As the name suggests, the latter contains the earnings for Jane and John for the entire week.
Now, suppose we want to add a third merchant called Tim. We can do so using the rbind()
function:
We can find these using the colSums() and rowSums() functions respectively:
Just like vectors, we can also select specific matrix elements. Here are some examples:
Finally, just like vectors, we can perform element-wise arithmetic operations with matrices:
Note that R performed the matrix multiplication element-wise. This is not the standard way
that matrices are multiplied. To perform matrix multiplication in the standard way, we can
use the %*% operator like this:
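A minimal sketch of the difference (illustrative 2 × 2 matrices):
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)
A * B       # element-wise multiplication
A %*% B     # standard matrix multiplication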
Arrays
An array is the multi-dimensional extension of matrices in R. In fact, a matrix can be viewed
as a two-dimensional array. Just like vectors and matrices, an array can only store one data
type at a time.
To create an array, we use the array() function. The two main arguments of the array()
function are:
• data: This is a vector which contains the elements to be stored in the array.
• dim: This specifies the dimension of the array. For example, if we set dim = c(4, 3,
2), we are telling R to create an array consisting of 2 matrices, each with 4 rows and 3
columns.
• dimnames: This optional argument specifies the names of the dimensions.
Again, you cannot have different data types inside the same array.
, , 2
[,1] [,2]
[1,] 3 8
[2,] 6 9
[3,] 7 9
Suppose there are only 3 students in a school. Suppose further that there are 3 terms in an
academic year and that each student takes 5 tests during each term. The names of the students
are “John,” “Jane” and “Tim.” Assume that we have 9 vectors of length 5, each containing
the test results for one student for one term.
As an exercise, construct the 9 vectors using the above names, filling in some numeric values
for the test results.
Currently, this data is stored in 9 different objects (9 different vectors). We want to store
them in one single object, that is, we want to store them in one single array. There are many
ways we can define the dimensions of our array but a natural choice is to store the values in
an array consisting of three \(5 \times 3\) matrices, where each matrix represents the marks
for one student only. Inside a matrix, each row will represent a test and each column will
correspond to a term.
dim(marks_combined)
[1] 5 3 3
We can use the dimnames argument of the array() function to add names to the dimensions
of the array.
, , John
Term 1 Term 3
Test 1 74 85
Test 2 72 90
Test 3 71 72
Test 4 79 77
Test 5 90 86
, , Jane
Term 1 Term 3
Test 1 63 55
Test 2 43 72
Test 3 62 64
Test 4 85 47
Test 5 65 75
, , Tim
Term 1 Term 3
Test 1 81 80
Test 2 83 94
Test 3 90 95
Test 4 84 87
Test 5 94 80
# Selecting John and Tim's Test 2 results for Term 2
marks_combined_upgraded["Test 2", "Term 2", c("John", "Tim")]
John Tim
80 77
Finally, we can perform calculations across the array elements using the apply() function.
This function takes three main arguments:
• X: the array to operate on
• MARGIN: the dimension(s) over which the function should be applied (e.g. c(2) for the second dimension)
• FUN: the function to apply.
In the following examples, we use the sum() function to calculate various interesting totals.
# Find total marks for all students, across all tests and terms
sum(marks_combined_upgraded)
[1] 3437
# Find total marks for all students per term
apply(marks_combined_upgraded, c(2), sum)
Term 1 Term 2 Term 3
1136 1142 1159
# Find the total marks for all students per test per term
apply(marks_combined_upgraded, c(1,2), sum)
Term 1 Term 2 Term 3
Test 1 218 237 220
Test 2 198 184 256
Test 3 223 253 231
Test 4 248 238 211
Test 5 249 230 241
# (Less useful in this context) Find the total marks for all students per
test
apply(marks_combined_upgraded, c(1), sum)
Test 1 Test 2 Test 3 Test 4 Test 5
675 638 707 697 720
Lists
Lists can hold elements of different data types. They allow us to store multiple objects under
one single object in an orderly fashion. Lists can store vectors, matrices, data frames and even
other lists.
To create a list, we use the list() function. The arguments of the list() function represent
the components of the list.
Below is an example where we create a list containing a vector, a matrix and a data frame.
We create the data frame my_data_frame by selecting the first 8 rows of the built-in mtcars
dataset.
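A sketch of the construction (the vector’s values are illustrative; the matrix and data frame
match the components [[2]] and [[3]] displayed below):
my_vector <- 1:5
my_matrix <- matrix(c(4, 6, 7, 1), nrow = 2)
my_data_frame <- mtcars[1:8, ]
my_list <- list(my_vector, my_matrix, my_data_frame)
my_list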
[[2]]
[,1] [,2]
[1,] 4 7
[2,] 6 1
[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Again, if we want to name the components of our list, we use the names() function:
$`I am a Matrix`
[,1] [,2]
[1,] 4 7
[2,] 6 1
Now, it’s time to learn how to access the components of our list. We can use double square
brackets [[ ]] with the position or the name of a component, or the $ notation followed by the
component’s name.
Data Frames
A data frame in R is a matrix-like object where each column contains values of one variable,
with one big advantage: it can store different data types. While each column can only contain
values of a single data type, we can have columns of different data types. Another
requirement is that each column should contain the same number of items or observations.
In the previous section, we saw the built-in mtcars data frame. The name is an abbreviated
version of Motor Trend Car Road Tests. According to the dataset description (see ?mtcars),
“the data was extracted from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32 automobiles.”
Here is the dataset
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Each column represents one aspect of the cars and each row represents all the aspects of one
car.
This dataset contains only 32 observations, so we can display it entirely on one single page.
However, what if we have a very long dataset? Can we only display part of it to gain a feel of
the data? The answer is yes. To do so, we can use the head() and tail() functions. Let’s try
this with another built-in dataset called iris:
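A sketch:
head(iris)   # first six rows
tail(iris)   # last six rows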
Another way to get rapid insight into a data frame is via the str() function. Calling str() on
a data frame returns the total number of observations, the total number of variables and, for
each variable, its name, its data type and its first few values:
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1
1 1 1 ...
We can create data frames through the data.frame() function. As arguments, we simply
pass the vectors that will become the various columns of our data frame. While the vectors
can be of different data types, they need to be of the same length.
Suppose there are 5 students in a class and they sit an exam where the pass mark is 75 out of
100. Let us create a data frame called Class_DF that contains the variables Student, Marks,
Rank and Passed:
# Creating variables
Student <- c("Jane", "John", "Tim", "Michael", "Emma")
Marks <- c(80,95,98.5,78.4,50.25)
Rank <- as.integer(c(3,2,1,4,5))
Passed <- c(TRUE,TRUE,TRUE,TRUE,FALSE)
# Creating Data Frame
Class_DF <- data.frame(Student, Marks, Rank, Passed)
str(Class_DF)
'data.frame': 5 obs. of 4 variables:
$ Student: chr "Jane" "John" "Tim" "Michael" ...
$ Marks : num 80 95 98.5 78.4 50.2
$ Rank : int 3 2 1 4 5
$ Passed : logi TRUE TRUE TRUE TRUE FALSE
Class_DF
Student Marks Rank Passed
1 Jane 80.00 3 TRUE
2 John 95.00 2 TRUE
3 Tim 98.50 1 TRUE
4 Michael 78.40 4 TRUE
5 Emma 50.25 5 FALSE
Observe that each column has a different data type. We would not have been able to store this
data in a matrix, so, in this sense, data frames are super convenient. In fact, data frames are
very popular. You are very likely to encounter them regularly when working with R. In a
later week, we will use the dplyr package to do all sorts of interesting operations on data
frames.
Just like matrices, we can select elements of data frames in the usual way by using square
brackets [ ].
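For example (illustrative selections):
Class_DF[2, ]     # second row, i.e. all of John's information
Class_DF[, 2]     # second column, i.e. the marks
Class_DF[2, 4]    # John's Passed value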
In the above demo, we have used selection by index. We could have done the same thing via
selection by name. Data frames are formally a special kind of list. So, we can also use the $
notation to select specific columns:
Class_DF$Student
[1] "Jane" "John" "Tim" "Michael" "Emma"
Class_DF$Marks
[1] 80.00 95.00 98.50 78.40 50.25
Class_DF$Rank
[1] 3 2 1 4 5
Class_DF$Passed
[1] TRUE TRUE TRUE TRUE FALSE
# Selecting name of second student
Class_DF$Student[2]
[1] "John"
Now, suppose we want to select rows based on some condition, for example only the students
who passed the exam, or only those who scored above a certain mark. To subset a data frame
based on such a condition, we use the subset() function.
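A sketch with two illustrative conditions:
subset(Class_DF, Passed == TRUE)
subset(Class_DF, Marks > 90)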
is.data.frame(mtcars)
[1] TRUE
is.matrix(mtcars)
[1] FALSE
is.list(mtcars)
[1] TRUE
as.matrix(iris[1:5, ])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 "5.1" "3.5" "1.4" "0.2" "setosa"
2 "4.9" "3.0" "1.4" "0.2" "setosa"
3 "4.7" "3.2" "1.3" "0.2" "setosa"
4 "4.6" "3.1" "1.5" "0.2" "setosa"
5 "5.0" "3.6" "1.4" "0.2" "setosa"
as.list(matrix(1:10, nrow = 2))
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
[[6]]
[1] 6
[[7]]
[1] 7
[[8]]
[1] 8
[[9]]
[1] 9
[[10]]
[1] 10
is.list(letters[1:4])
[1] FALSE
Dependencies
An essential part of Python is to incorporate dependencies, also known as packages. These are
modules of existing code and functions which you can base your code upon. They may actually
include code written by you. A wide range of dependencies is available with the Anaconda
installation so you will probably only need to activate them. But if you need to install a package
with Anaconda you need to use conda commands in the QtConsole (macOS) or the Anaconda
prompt (Windows). For example, conda install Quandl. Sometimes a package is not
available as a conda package but can be installed by pip install, for example pip install
Quandl.
To find out more about conda please consult this guide. That guide also provides a handy cheat-
sheet if you want to learn, for example, about updating packages.
To use a dependency in a Python program, the basic code is import package_name, optionally followed by as alias to give the package a shorter name.
For example, the below code is using the alias np for the Python package numpy and printing the
number \(\pi\):
import numpy as np
print(np.pi)
3.141592653589793
Note that the alias np above (or any other alias you want to use) needs to be placed before pi,
i.e. np.pi. In fact, using an alias is optional, yet recommended practice. For more on the numpy
package see the dedicated set of notes that follows.
The basic data types and data structures in Python include:
1. float
2. int
3. Strings
4. Boolean
5. Tuples
6. Lists
7. Sets
8. Dictionaries
a=4.6
print(type(a))
<class 'float'>
b=10
print(type(b))
<class 'int'>
Strings are for text and have type str. The code below demonstrates Python’s character
processing capabilities along with their output:
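A sketch (the strings are illustrative):
s1 = "Data"
s2 = 'Science'
s3 = s1 + " " + s2        # concatenation
print(s3)
print(s3.upper())
print(len(s3))
print(type(s3))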
Finally Boolean type is bool and is typically used for the results of true/false statements. For
example,
k=1>3
print(k)
False
print(type(k))
<class 'bool'>
Data Structures
Data structures or collections in Python are containers that are used to store collections of data
of potentially different types. There are four collection data types in the Python programming
language:
• Tuples are collections which are ordered and unchangeable. A tuple can have duplicate
members.
• Lists are collections which are ordered and changeable. A list can have duplicate
members.
• Sets are collections which are unordered and unindexed. A set cannot have duplicate
members.
• Dictionaries are collections which are unordered, changeable and indexed. A dictionary
cannot have duplicate members.
Once a tuple is created, you cannot change its values. Tuples are unchangeable, or immutable
as it also is called.
You can access tuple items by referring to the index number, inside square brackets.
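A sketch (illustrative values):
thistuple = ("apple", "banana", "cherry")
print(thistuple[1])    # prints the second item, 'banana'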
Python indexing: Note that in Python the first item of a tuple (or any other collection) has index 0,
not 1! Similarly the second item has index 1, the third item has index 2 and so on. This is
different to R where the first item is indexed with 1, the second with 2 and so on.
Negative indexing is also possible and it means starting counting from the end; -1 refers to the
last item, -2 refers to the second last item etc.
You can specify a range of indexes by specifying where to start and where to end the range.
When specifying a range, the return value will be a new tuple with the specified items.
For example, if we want to return the third, fourth, and the fifth item, we can use the following
code:
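A sketch (illustrative values):
thistuple = ("apple", "banana", "cherry", "orange", "kiwi", "melon")
print(thistuple[2:5])  # ('cherry', 'orange', 'kiwi')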
For example if we want to print the second item of the list, we can use the following code:
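Lists are created with square brackets; this assumes a list such as the following (illustrative
values consistent with the output below):
list1 = ["apple", "banana", "cherry"]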
print(list1[1])
banana
A set is a collection which is unordered and unindexed. In Python, sets are defined with curly
brackets.
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(thisdict)
{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
Items of a dictionary can be accessed by referring to their key names, inside square brackets:
thisdict["model"]
'Mustang'
You can change the value of a specific item by referring to its key name:
For example if we want to change the “year” to 2018, we can use the following code:
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict["year"] = 2018
Adding an item to the dictionary is done by using a new index key and assigning a value to it:
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict["color"] = "red"
print(thisdict)
{'brand': 'Ford', 'model': 'Mustang', 'year': 1964, 'color': 'red'}
The pop() method removes the item with the specified key name:
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict.pop("model")
'Mustang'
print(thisdict)
{'brand': 'Ford', 'year': 1964}
Useful Links and Resources
• Data structures documentation from python.org
Working With NumPy and Pandas
** Note: The code chunks below should be run in the following order **
NumPy
NumPy is the key package for numerical computing in Python. It allows you to perform
operations on whole blocks of data without writing loops. The fast numerical processing it permits
makes it the basis of most visualisation packages as well more sophisticated scientific and
machine learning packages.
To activate it in Spyder scripts or Jupyter notebooks, simply add the following line at the
beginning of your script or notebook.
import numpy as np
Note that the code blocks that follow in this page assume that NumPy has been activated. So
make sure you execute the above line beforehand.
NumPy Arrays
A NumPy array, or else ndarray, is a grid of values of the same type, which is indexed by a
tuple of non-negative integers. The number of dimensions is the rank of the array; the shape of
an array is a tuple of integers giving the size of the array along each dimension.
We can create a ndarray object by using the array() function. Also the function type() gives
us the type of the array.
To create an ndarray, we can also pass a list, tuple or any array-like object into the array()
function, and it will be converted into an ndarray:
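A sketch:
a = np.array([1, 2, 3])        # from a list
print(a)
print(type(a))
b = np.array((4.0, 5.0, 6.0))  # from a tuple
print(b)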
Arrays may have dimensions of 0 (single numbers, strings or Booleans), 1 (vectors), 2 (matrices)
or even 3 (tensors).
NumPy also provides many methods to create some commonly used arrays:
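For instance (a few of the available constructors):
print(np.zeros((2, 3)))   # 2x3 array of zeros
print(np.ones(4))         # array of ones
print(np.eye(3))          # 3x3 identity matrix
print(np.arange(5))       # [0 1 2 3 4]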
In the example below np.sqrt is a vectorised calculation on a NumPy array set up using the
function arange. In other words the function np.sqrt() is applied to all the elements of the array
rng at the same time rather than each of them separately.
rng=np.arange(10)
print(rng)
[0 1 2 3 4 5 6 7 8 9]
print(type(rng))
<class 'numpy.ndarray'>
print(np.sqrt(rng))
[0. 1. 1.41421356 1.73205081 2. 2.23606798
2.44948974 2.64575131 2.82842712 3. ]
NumPy Operations
NumPy provides some handy methods to summarise the data. For example
a=np.array([[1.0,2.0,4.0],[-1.0,2.0,-5.0]])
print(a.shape)
(2, 3)
print(a)
[[ 1. 2. 4.]
[-1. 2. -5.]]
print(a.sum(axis=0))
[ 0. 4. -1.]
print(a.sum(axis=1))
[ 7. -4.]
print(a.sum())
3.0
Note the Python indexing feature we saw on Python collections: axis=0 refers to the first
dimension of the 2-D array above; namely the rows. Similarly axis=1 refers to its second
dimension; namely the columns.
NumPy can also be used for matrix operations such as transposing and multiplying matrices:
a=np.array([[1.0,2.0,4.0],[-1.0,2.0,-5.0]])
b=np.transpose(a) # b is the transpose of a
print(b.shape)
(3, 2)
print(b)
[[ 1. -1.]
[ 2. 2.]
[ 4. -5.]]
c=np.dot(a,b) #c = a * b
print(c)
[[ 21. -17.]
[-17. 30.]]
c=a @ b #c = a * b as before
print(c)
[[ 21. -17.]
[-17. 30.]]
a=np.array([[1.0,2.0,4.0],[-1.0,2.0,-5.0]])
print(a)
[[ 1. 2. 4.]
[-1. 2. -5.]]
print(a[0:2,1:3])
[[ 2. 4.]
[ 2. -5.]]
You can also set the values in a NumPy array using simple Boolean expressions:
np.random.seed(0)
arr=np.random.random((3,3))
print(np.round(arr,3))
[[0.549 0.715 0.603]
[0.545 0.424 0.646]
[0.438 0.892 0.964]]
arr[arr<0.5]=0
print(arr)
[[0.5488135 0.71518937 0.60276338]
[0.54488318 0. 0.64589411]
[0. 0.891773 0.96366276]]
print(np.sum((arr<0) & (arr>1))) # The operator '&' stands for 'and'
0
print(np.sum((arr<0) | (arr>0.7))) # The operator '|' stands for 'or'
3
Pandas
The pandas package provides a major tool in the Python ecosystem. While it is possible to work
with data with other Python tools such as the NumPy package (see the notes on data structures
with Python), the key difference is that pandas is designed for working with heterogeneous data.
Moreover, it gives Python functionality similar to R data frames and integrates powerful
data-handling tools such as data import, aggregation and cleaning.
Pandas is included in the Anaconda installation. To activate it in Spyder scripts or Jupyter
notebooks, simply add the following lines in the beginning of your program.
import pandas as pd
Note: The code blocks that follow in this document assume that Pandas has been activated. So
make sure you execute the above line beforehand. Some of the blocks also require NumPy so
make sure this is activated as well.
Pandas Series
A Series is a one-dimensional labelled array capable of holding any data type (integers, strings,
floating point numbers, etc.). The axis labels are collectively referred to as the index. The basic
method to create a Series is to call:
s = pd.Series(data, index=index)
The object data can be several things, for example a Python dictionary or a ndarray. For
ndarrays, the index must be the same length as data. If no index is passed, one will be created
having values [0, …, len(data) - 1].
np.random.seed(1)
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a 1.624345
b -0.611756
c -0.528172
d -1.072969
e 0.865408
dtype: float64
Pandas provides a nice tool to get basic summaries from your data set. The following code
creates the pandas data frame df and applies the describe() function on it.
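A sketch along those lines (reusing the small data frame that also appears below):
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
df = pd.DataFrame(d)
print(df.describe())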
Also, in a very similar manner as with NumPy, we can use Boolean indexing to generate a
subset of your data:
d = {'one': [1., 2., 3., 4.],
'two': [4., 3., 2., 1.]}
df=pd.DataFrame(d)
print(df[df>0])
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
Finally, indexing also works in a similar manner with NumPy. The additional feature here is that
we can use the rows/column names in addition to the numbered indices.
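A sketch using the data frame df from above:
print(df.loc[0, 'one'])    # label-based indexing (row label 0, column 'one')
print(df.iloc[1:3, 0])     # position-based indexing (rows 1-2, first column)
print(df.loc[:, 'two'])    # all rows of the column 'two'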
File Formats
We have already described several file formats that provide standard ways to encode information
to be stored in a file. They can usually be identified by looking at the file extension; e.g. a file
saved with name Data in CSV format will appear as Data.csv. By noticing .csv extension we
can clearly identify that it is a comma-separated file and data is stored in a specific way, in this
case a tabular format.
An essential skill for a data scientist is to be able to understand the underlying structure of
various file formats, and their advantages and disadvantages. There may also be situations in
which decisions need to be taken on how to store data in order to improve performance.
Next we will look into various file formats and provide information on how to import their
contents in Python.
The above text is written in the attached file called Sentence.txt and can be read in various
ways, one of them being with the code below:
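A minimal sketch:
with open('Sentence.txt') as f:
    text = f.read()
print(text)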
To load data in Python from a CSV file, you can use the pandas package. The code below
activates the pandas package (first line), reads the data from the attached file titanic.csv
and stores it in the pandas data frame df (second line), which we then view by simply typing its
name (third line).
import pandas as pd
df = pd.read_csv('titanic.csv')
df
To export pandas data frames to CSV files, a line of the form df.to_csv(r'Path to
store the exported CSV file\File Name.csv') can be used. To load data from a JSON
file, you can use pandas’ read_json() function:
import pandas as pd
df = pd.read_json('example.json')
df
To export pandas Data Frames to JSON files the following line can be used df.to_json(r'Path
to store the exported JSON file\File Name.json')
To read an XML file into Python, the ElementTree package can be used, for example with the code
below:
import xml.etree.ElementTree as et
tree = et.parse('example.xml')
root = tree.getroot()
XML files contain a single root that contains all the data. In the above code the ElementTree
package is activated (1st line), the file is read and parsed into the object tree (2nd line) and
the root of the XML file is extracted (3rd line).
The code below illustrates how to read and access the data from the XML file.
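A generic sketch (the tag names depend on the structure of example.xml):
for child in root:
    print(child.tag, child.attrib)
    for grandchild in child:
        print(grandchild.tag, grandchild.text)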
A database management system (DBMS) is the software used to:
• Store
• Retrieve / query
• Update
• Manage
• Control access to
data in a database.
Relational Databases
In a relational database:
• All data is represented in terms of tuples and stored in tables with columns and rows
• The structure on how the data is stored is pre-defined and imposed by the programmer
• Structured data can be held.
Nonrelational Databases
An alternative to the relational database model is the nonrelational database model. Nonrelational
databases, also known as NoSQL, store data in a different format to relational tables. Their key
feature is that they allow unstructured and semi-structured data to be stored and manipulated.
There are many different kinds of NoSQL databases, including:
• Graph database: A graph database stores data in terms of entities (or nodes) and the
relationships (or edges) between entities. It is suitable for data whose relations are well
represented as a graph
• Document-oriented database: A document-oriented database stores semi-structured
data (or document-oriented information) in the form of JSON-like documents.
Document databases store all information for a given object in a single instance in the
database, and every stored object can be different from every other
o Below shows a JSON-like document storing information about two books.
Note that data is stored in nested key-value pairs (e.g. year is the key and
1995 is the value for the first book) and the information available for each
book is different (e.g. edition and description are only available for the
first book).
[
  {
    "year" : 1995,
    "title" : "An introduction to database systems",
    "info" : {
      "authors" : "C. J. Date",
      "edition" : 6,
      "subject" : "database management",
      "description" : "A comprehensive treatment of database technology. Features of this edition include: a proposal for rapprochement between object-oriented and relational technologies; expanded treatment of distributed databases; and chapters on functional dependencies, views, domains and missing information.",
      "publisher" : "Reading, Mass. : Addison-Wesley Pub. Co."
    }
  },
  {
    "year" : 2016,
    "title" : "Introduction to computation and programming using Python: with application to understanding data",
    "info" : {
      "authors" : "Guttag, John V",
      "subject" : ["Computer programming", "Python (Computer program language)"],
      "publisher" : "The MIT Press"
    }
  }
]
RDBMS Terminology
Relation/Table
In a relational database, a relation is a table, with its rows representing a set of records (called
tuples). For example, a university database may have a relation (table) called Student to
store the information for all students, a table Course for the course information and a table
Grade for the grade of students in different courses:
Student:
Course:
Grade:
Attribute
In relational databases, the attribute is a column in the table (or relation). Each attribute has a
type (or domain). For example, the Student table above has the attributes
• student_id (for student ID)
• name
• year
Tuple
A tuple is a set of attribute values. In a relational database, a tuple is a row in a table. For
example, each row in the Student table is a tuple storing the information for one student.
Schema
The schema is the description on how the database and the database tables are constructed.
For example, the Student table can have the schema:
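For example, in SQLite syntax, the schema could look along these lines (the column types are
an assumption):
CREATE TABLE Student (
  student_id INTEGER PRIMARY KEY,
  name       TEXT,
  year       INTEGER
);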
Key
A primary key is the attribute used to uniquely identify a tuple, or the set of attributes whose
combined values are unique. For the examples above, the primary keys are:
• Student: student_id
• Course: course_id
• Grade: student_id and course_id
A foreign key is an attribute or a set of attributes in a relational database table that provides a
link between data in two tables. For example, student_id is the foreign key to link between
the tables Grade and Student.
A relational database management system (RDBMS) is the software used to:
• Create (databases/tables)
• Manipulate (insert, update and delete tables/tuples)
• Query
a relational database.
SQLite
SQLite is the most widely used database engine. Two other popular database engines are
MySQL and PostgreSQL. SQLite:
• Is lightweight
• Does not need configuration and uses little space
• Does not require a separate server process like other RDBMSs do—SQLite reads and writes to a single file on your disk
• Is self-contained, in the sense that it has very few dependencies
• Offers full-featured SQL.
Creating Databases
In this course we use DB Browser for SQLite (DB4S) to create, query and edit database
files compatible with SQLite. You can download it from the DB Browser for SQLite web page.
Click on New Database and use University as the name of the database. You can safely
close the “Edit table definition” pop-up window by hitting Cancel.
Screenshot on how to create a database
Browse Data
Click Browse Data to browse the data. When clicking on the list from Table, you should see
three tables Student, Grade, and Course. Select the table that you want to browse.
Screenshot on how to browse data
Manipulating Databases
Add a New Table
Add a new table Teacher:
Click Browse Data to browse the data. When clicking on the list from Table, you should see
four tables. Select the table Teacher and it will display an empty table.
Delete a Table
We can remove the Teacher table in the following way:
Browse the data again and you should now see only three tables on the list from Table.
Screenshot of list of tables
Insert Tuples/Rows
Insert the year 1 student Harper Taylor with student ID 202029744 to Student:
Browse the table Student and make sure the new student information is in the table.
Update Tuples/Rows
Update the student ID of student Harper Taylor to 201929744:
Browse the table Student and make sure the student information is updated in the table.
Screenshot of the table with an updated row
Delete Tuples/Rows
Delete the record for the student Harper Taylor from table Student:
Browse the table Student and make sure the row is deleted.
SQL
SQL (Structured Query Language) is a programming language used with relational databases to:
• Create (databases/tables)
• Manipulate (insert, update and delete tables/tuples)
• Query
In addition, SQL:
• Is easy to learn: SQL commands are like English sentences, which makes it easy to
read and write SQL queries
• Has well-defined standards: the American National Standards Institute (ANSI) and the
International Organization for Standardization (ISO) have officially adopted the
standard “Database Language SQL” language definition.
Manipulating Databases
We have already seen how to use the DB Browser for SQLite graphical interface to add and
delete tables and rows in a database. Here, we show how to do the same operations via SQL
commands. We can run SQL code directly in DB Browser for SQLite in the following way:
1. Click on Execute SQL and type your SQL code in the top text box.
2. Click the run button to execute the SQL code.
Below is a screenshot of a SQL command executed in DB Browser for SQLite
In what follows we will work with the University database that we have created previously
using DB Browser for SQLite.
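Add a Table
The CREATE TABLE statement used at this point is not reproduced in these notes; a minimal sketch that is consistent with the Teacher table described below (attributes staff_id and name) is:
CREATE TABLE Teacher (
  staff_id TEXT PRIMARY KEY,
  name TEXT
);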
Now, click Browse Data to browse the database. When you click on the list Table, you
should get four tables. Browse the table Teacher and it will display an empty table with
attributes “staff_id” and “name.”
Screenshot of the newly created empty table
Delete a Table
We can remove the Teacher table by the DROP TABLE command:
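The code chunk is not shown here; the command implied by the text is simply:
DROP TABLE Teacher;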
If you browse now, you will see that there are only three tables left.
Screenshot of list of tables
Insert Tuples/Rows
Insert the year 1 student “Harper Taylor” with student ID 202029744 to Student by using the
INSERT INTO command:
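The code chunk is not shown here; a sketch of the INSERT INTO statement described above (assuming the Student attributes student_id, name and year) is:
INSERT INTO Student (student_id, name, year)
VALUES ('202029744', 'Harper Taylor', 1);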
Browse the table Student and make sure the new student information is in the table.
Screenshot of the table with a newly added row
Update Tuples/Rows
Update the student ID of “Harper Taylor” to 201929744 by using the UPDATE command:
UPDATE Student
SET student_id = "201929744"
WHERE name = "Harper Taylor"
Screenshot on how to update a row
Browse the table Student and make sure the new student information is in the table.
Screenshot of the table with an updated row
Delete Tuples/Rows
Now, delete the record for the student “Harper Taylor” from table Student:
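The code chunk is not shown here; a sketch of the DELETE statement described above is:
DELETE FROM Student
WHERE name = 'Harper Taylor';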
Browse the table Student and make sure that there is no row for “Harper Taylor.”
SQL Queries
SQL queries are used to select data from a database. The basic SQL query has the form:
SELECT A_1,A_2,…,A_n
FROM R_1,R_2, …,R_m
WHERE conditions
Example 1: Conditions
We would like to return the student ID of all students who are in ST101:
SELECT student_id
FROM Grade
WHERE course_id = 'ST101'
If we want to return all the attributes (rather than just student_id), we can use * instead of listing the attribute names:
SELECT *
FROM Grade
WHERE course_id = 'ST101'
Example 2: Multiple Tables
We would like to return the names of all students who are in ST101. This requires information from both the Grade and the Student tables:
SELECT Student.name
FROM Grade, Student
WHERE Grade.course_id = 'ST101' AND Student.student_id = Grade.student_id
• On the WHERE line we use AND to specify that both conditions Grade.course_id =
'ST101' and Student.student_id = Grade.student_id need to be satisfied.
• In this query we use Student.student_id and Grade.student_id instead of
student_id to specify which table the attribute is from. This is because the attribute
student_id exists in both Student and Grade, and we need to indicate which table
the attribute is from to eliminate ambiguity.
• We have Student.name instead of name and Grade.course_id instead of course_id
to specify which table the attribute is from. However, this is not necessary here; it is
perfectly fine to instead write:
SELECT name
FROM Grade, Student
WHERE course_id = 'ST101' AND Student.student_id = Grade.student_id
Screenshot of the query result
The reason we can do this is because the attribute name is only in the table Student and the
attribute course_id is only in the table Grade.
If we want to order the result, we can use the ORDER BY clause. For example, if we want to
order the names of the students in alphabetical order:
SELECT Student.name
FROM Grade, Student
WHERE Grade.course_id = 'ST101' AND Student.student_id = Grade.student_id
ORDER BY Student.name
Screenshot of the query result
Example 3: Multiple Conditions
We would like to return the names of the courses taken by Ava Smith or Freddie Harris:
SELECT Course.name
FROM Student, Grade, Course
WHERE (Student.name = 'Ava Smith' OR Student.name = 'Freddie Harris')
AND Student.student_id = Grade.student_id
AND Course.course_id = Grade.course_id
Screenshot of the query result
Note:
• OR is used because only one of the conditions Student.name = 'Ava Smith' and
Student.name = 'Freddie Harris' needs to be satisfied.
• If you look at the output closely, there are a few duplicate rows, which we can remove
using DISTINCT:
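A sketch of the same query with DISTINCT added (the rest of the query is unchanged):
SELECT DISTINCT Course.name
FROM Student, Grade, Course
WHERE (Student.name = 'Ava Smith' OR Student.name = 'Freddie Harris')
AND Student.student_id = Grade.student_id
AND Course.course_id = Grade.course_id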
Example 4: Aggregation
We would like to calculate the average mark for each course according to the value of
course_id:
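The code chunk is not shown here; a sketch consistent with the discussion that follows (the average is computed with the AVG aggregation function and its column is not yet renamed) is:
SELECT course_id, AVG(final_mark)
FROM Grade
GROUP BY course_id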
The attribute name for the average mark looks different from other attributes. We can rename
it using the AS clause:
SELECT course_id, AVG(final_mark) as avg_mark
FROM Grade
GROUP BY course_id
GROUP BY groups rows that have the same attribute value. It is often used to group rows
before applying aggregation functions to an attribute of the group. Common aggregation
functions are:
• AVG
• COUNT
• SUM
• MAX
• MIN
In our example, we group the rows based on the course_id, and summarize the grouped
rows by the average final_mark for each group.
SQL Joins
In the previous examples, we have used conditions like Student.student_id =
Grade.student_id in the WHERE clause in order to query multiple tables. The same queries
can be written more concisely using JOIN clauses.
A JOIN clause combines rows from two or more tables based on related column(s) between
them. The SQL language offers many different types of joins. Note that not all types of JOIN
are implemented by the database engines.
• INNER JOIN: Select rows that have matching values in both tables based on the given
columns
• NATURAL JOIN: Similar to INNER JOIN except that there is no need to specify which
columns are used for matching values
• OUTER JOIN: Unlike INNER JOIN, unmatched rows in one or both tables can be
returned. There are LEFT, RIGHT and FULL OUTER JOIN. SQLite only supports LEFT
OUTER JOIN. For LEFT OUTER JOIN, all the records from the left table are included in
the result.
• CROSS JOIN: Return the Cartesian product of the two joined tables, by matching all
the values from the left table with all the values from the right table.
INNER JOIN
As in Example 2, below we get the records about students who took the course with course
ID ST101, but we also sort the student names in alphabetical order:
SELECT *
FROM Grade, Student
WHERE Grade.course_id = 'ST101' AND Student.student_id = Grade.student_id
ORDER BY Student.name
Screenshot of the query result
The same query can be written using a JOIN clause:
SELECT *
FROM Student JOIN Grade ON Student.student_id = Grade.student_id
WHERE course_id = 'ST101'
ORDER BY Student.name
Note that
• We write JOIN instead of INNER JOIN as by default INNER JOIN is used when you do
not specify the join type.
• The ON keyword specifies on what condition you want to join the tables.
• If you look at the result, student_id appears twice—this is because all the columns
from both tables are returned.
• By using the JOIN clause, we separate the logic of combining the tables
(Student.student_id = Grade.student_id) and the other condition (course_id
= 'ST101'), which makes the SQL query more readable.
Instead of joining using ON, we can use USING with JOIN if the columns that we are joining
have the same name. For example:
SELECT *
FROM Student JOIN Grade USING(student_id)
WHERE course_id = 'ST101'
ORDER BY Student.name
Screenshot of the query result
Note that
• The USING keyword specifies which column is used to select rows that have matching
values in both tables.
• If you look at the result, student_id appears only once now.
NATURAL JOIN
The same query can be written even more concisely using NATURAL JOIN:
SELECT *
FROM Student NATURAL JOIN Grade
WHERE course_id = 'ST101'
ORDER BY Student.name
Screenshot of the query result
Note that:
• We do not specify how to join the two tables. The join condition is automatically
identified.
• If you look at the result, student_id appears only once.
LEFT JOIN
When we run the following SQL commands to use INNER JOIN to combine the tables
Student and Grade
SELECT *
FROM Student INNER JOIN Grade USING (student_id)
ORDER BY Student.name
Screenshot of the query result
the records of the students Ben Johnson and Dan Norris are not shown, because there are no
corresponding records for these two students in the table Grade. If we instead use LEFT
OUTER JOIN to combine the tables Student and Grade:
SELECT *
FROM Student LEFT JOIN Grade USING (student_id)
ORDER BY Student.name
Screenshot of the query result
• All the students from the left table Student are included (including Ben Johnson and
Dan Norris)
• The students with no corresponding record in the right table Grade have NULL
values in the attributes course_id and final_mark from Grade.
CROSS JOIN
When we run the following query:
SELECT *
FROM Student CROSS JOIN Grade
ORDER BY Student.name
Screenshot of the query result
we get 56 rows, which is the number of rows in Student (7) times the number of rows in Grade (8).
One way to use query results in further analysis is to export them from DB Browser for SQLite to an external file (e.g. a CSV file) and then load that file into R or Python. Another (and probably a better) way is to do it all in R and Python—we can connect R and Python to the database to create, update and query as appropriate. We can then work on the query result directly, without the need to save and load the query result via an external file. Similarly, we can directly store the analyzed data back to the database.
In these pages we will focus on how to interact with databases in R and Python. Here we will
continue to use SQLite as the database engine, and we will use the RSQLite and DBI R
packages to show that what we have done so far in DB Browser for SQLite can also
be done within R. To highlight this point, we will use the same examples we used before.
In this notebook, we will show how to connect and create databases and tables using R.
Connecting to Databases
We load the library DBI and use the function dbConnect() to create an object, conn, to
connect to the SQLite driver to manipulate the database university.db. If the database
university.db exists in your working directory, the following code chunk will remove it.
# install.packages("RSQLite")
library(DBI)
if (file.exists("university.db"))
file.remove("university.db")
[1] TRUE
conn <- dbConnect(RSQLite::SQLite(), "university.db")
We can list all tables in university.db using the function dbListTables() from DBI.
# list all tables
dbListTables(conn)
character(0)
Nothing is returned because we have not created any tables in the database yet.
Creating Tables
Now we are going to create some tables in the database university.db. Like before, we will
create the tables using data saved in CSV files. We first read the CSV files into
data.frame objects in R:
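The code chunk is not reproduced in these notes; a sketch that assumes the same CSV files used in the Python notebook later (student.csv, course.csv and grade.csv in the working directory) is:
student <- read.csv("student.csv")
course <- read.csv("course.csv")
grade <- read.csv("grade.csv")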
We then copy the data frames student, grade and course to tables in the database
university.db using DBI’s dbWriteTable() function:
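A sketch of the corresponding dbWriteTable() calls (table names chosen to match those used throughout these notes):
dbWriteTable(conn, "Student", student)
dbWriteTable(conn, "Course", course)
dbWriteTable(conn, "Grade", grade)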
dbListTables(conn)
[1] "Course" "Grade" "Student"
We can also browse any table in the database using the function dbReadTable() from DBI.
For example, we can browse the Student table by:
dbReadTable(conn, "Student")
student_id name year
1 201921323 Ava Smith 2
2 201832220 Ben Johnson 3
3 202003219 Charlie Jones 1
4 202045234 Dan Norris 1
5 201985603 Emily Wood 1
6 201933222 Freddie Harris 2
7 201875940 Grace Clarke 2
We can list the attributes (fields) of a table using the function dbListFields():
dbListFields(conn, "Student")
[1] "student_id" "name" "year"
We can also check whether the database has been properly created by opening the database in DB
Browser for SQLite and browsing the tables from there.
Manipulating Databases
The simplest way to manipulate databases is to use the dbExecute() function. This function
executes SQL statements and returns the number of rows affected. This allows us to leverage
SQL commands to manipulate databases from within R. There are also some other functions
in DBI that are specialized to perform certain tasks.
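Creating a Table
The code chunk that creates the Teacher table is not reproduced here; a minimal sketch using DBI's dbCreateTable() (consistent with the commented dbExecute() alternative below, although dbCreateTable() does not set a primary key) is:
dbCreateTable(conn, "Teacher",
              c(staff_id = "TEXT", name = "TEXT"))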
# Alternative:
# dbExecute(conn,
# "CREATE TABLE Teacher (
# staff_id TEXT PRIMARY KEY,
# name TEXT)")
dbListTables(conn)
[1] "Course" "Grade" "Student" "Teacher"
dbListFields(conn, "Teacher")
[1] "staff_id" "name"
dbReadTable(conn, "Teacher")
[1] staff_id name
<0 rows> (or 0-length row.names)
Deleting a Table
We can remove a table by using the function dbRemoveTable():
dbRemoveTable(conn, "Teacher")
# Alternative:
# dbExecute(conn,
# "DROP TABLE Teacher")
dbListTables(conn)
[1] "Course" "Grade" "Student"
Inserting Tuples/Rows
Below we insert the year 1 student “Harper Taylor” with student ID 202029744 to Student
by using the function dbAppendTable():
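The code chunk is not shown in these notes; a sketch using dbAppendTable() is:
dbAppendTable(conn, "Student",
              data.frame(student_id = 202029744,
                         name = "Harper Taylor",
                         year = 1))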
When we browse the table, we can see the new row has been added.
dbReadTable(conn, "Student")
student_id name year
1 201921323 Ava Smith 2
2 201832220 Ben Johnson 3
3 202003219 Charlie Jones 1
4 202045234 Dan Norris 1
5 201985603 Emily Wood 1
6 201933222 Freddie Harris 2
7 201875940 Grace Clarke 2
8 202029744 Harper Taylor 1
Updating Tuples/Rows
Below, we update the student ID of student Harper Taylor to 201929744 by dbExecute().
There is no specific function for updating a row in DBI.
dbExecute(conn,
"Update Student
SET student_id = '201929744'
WHERE name = 'Harper Taylor'")
[1] 1
When we browse the table, we can see the row has changed.
dbReadTable(conn, "Student")
student_id name year
1 201921323 Ava Smith 2
2 201832220 Ben Johnson 3
3 202003219 Charlie Jones 1
4 202045234 Dan Norris 1
5 201985603 Emily Wood 1
6 201933222 Freddie Harris 2
7 201875940 Grace Clarke 2
8 201929744 Harper Taylor 1
Deleting Tuples/Rows
Below, we delete the record for the student Harper Taylor from table Student using
dbExecute(). There is no specific function for deleting a row in DBI.
dbExecute(conn,
"DELETE FROM Student
WHERE name = 'Harper Taylor'")
[1] 1
When we browse the table, we can see the row has been removed.
dbReadTable(conn, "Student")
student_id name year
1 201921323 Ava Smith 2
2 201832220 Ben Johnson 3
3 202003219 Charlie Jones 1
4 202045234 Dan Norris 1
5 201985603 Emily Wood 1
6 201933222 Freddie Harris 2
7 201875940 Grace Clarke 2
dbDisconnect(conn)
Querying Databases
To query the database from R, we connect to university.db again and list the tables:
library(DBI)
conn <- dbConnect(RSQLite::SQLite(), "university.db")
dbListTables(conn)
[1] "Course" "Grade" "Student"
We should see the three tables Student, Course, and Grade that we have created before.
We can run SQL queries with the function dbGetQuery(). The first argument is the database connection and the second argument is the SQL query statement, exactly as we used it in previous sections when queries were sent using DB Browser for SQLite.
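The query used at this point is not shown in these notes; a sketch that is consistent with the output displayed below (the final marks of the students in ST101) is:
q1 <- dbGetQuery(conn,
                 "SELECT final_mark
                  FROM Grade
                  WHERE course_id = 'ST101'")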
class(q1)
[1] "data.frame"
Note that dbSendQuery() only sends the SQL query to the database engine and executes it there. It does
not extract any records.
dbFetch(q1)
final_mark
1 78
2 60
3 47
When we run dbFetch(), the executed query result will then be fetched.
The other queries from the previous sections can be run in the same way. For example, to get the names of the students in ST101 in alphabetical order:
dbGetQuery(conn,
"SELECT Student.name
FROM Student NATURAL JOIN Grade
WHERE course_id = 'ST101'
ORDER BY Student.name")
name
1 Ava Smith
2 Charlie Jones
3 Emily Wood
and to get the distinct names of the courses taken by Ava Smith or Freddie Harris:
dbGetQuery(conn,
"SELECT DISTINCT Course.name
FROM (Student NATURAL JOIN Grade) S JOIN Course on Course.course_id =
S.course_id
WHERE S.name ='Ava Smith' or S.name ='Freddie Harris'")
name
1 programming for data science
2 Managing and Visualising Data
3 Databases
After we finish manipulating the database, we can close the connection using the function
dbDisconnect():
dbDisconnect(conn)
The dplyr R package provides a set of “verbs” for data manipulation:
• The mutate() function adds new variables that are transformations of existing variables.
• The select() function picks variables based on their names.
• The filter() function picks cases based on their values.
• The summarize() function reduces multiple values down to a single summary.
• The arrange() function changes the ordering of the rows.
dplyr verbs operate on data frames, but they also apply, almost seamlessly, to database tables!
In particular, dplyr allows you to use database tables as if they were data frames, by internally
converting dplyr code into SQL commands using dbplyr. See the SQL Translation article in dbplyr’s pages for a comparison of dplyr syntax with the corresponding SQL syntax and for more information.
Pipe Operator
All of the dplyr functions take a data frame (or a tibble) as the first argument. In this way,
the %>% operator from the magrittr R package can be used, so that the user does not need to save
intermediate objects or nest verbs. For example, the statement x %>% f(y) is equivalent to
f(x, y), and the result from one step is “piped” into the next step. You can think of the pipe
operator as “then.”
Querying Databases Using dplyr
Connecting to the Database
Again, we first connect to the database university.db and list all the tables. We should see
the three tables Student, Course and Grade that we have created before.
library(DBI)
conn <- dbConnect(RSQLite::SQLite(), "university.db")
dbListTables(conn)
[1] "Course" "Grade" "Student"
By creating references to the tables as done above, we can treat student_db, grade_db, and
course_db as data frames, and use dplyr functionality to query the database.
The function filter() selects the rows in grade_db which satisfy the condition
course_id == "ST101":
We can use the function show_query() to show the SQL query that dbplyr produced, when
we run the code above:
show_query(q1)
<SQL>
SELECT *
FROM `Grade`
WHERE (`course_id` = 'ST101')
Getting Names of Students in Alphabetic Order
If we want to work on more than one table in dplyr, we can use join or set operations. In this
example we show how to use inner_join(). arrange() is then used to order the query
result.
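The original code chunk is not shown; one possible chain that is consistent with the generated SQL displayed below is:
q2 <- student_db %>%
  inner_join(grade_db, by = "student_id") %>%
  filter(course_id == "ST101") %>%
  arrange(name) %>%
  select(name)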
show_query(q2)
<SQL>
SELECT `name`
FROM (SELECT `LHS`.`student_id` AS `student_id`, `name`, `year`,
`course_id`, `final_mark`
FROM `Student` AS `LHS`
INNER JOIN `Grade` AS `RHS`
ON (`LHS`.`student_id` = `RHS`.`student_id`)
)
WHERE (`course_id` = 'ST101')
ORDER BY `name`
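Similarly, the code chunk for the next query (the distinct names of the courses taken by Ava Smith or Freddie Harris) is not shown; one possible chain consistent with the generated SQL below, using join suffixes ".student" and ".course" to disambiguate the two name columns, is:
q3 <- student_db %>%
  inner_join(grade_db, by = "student_id") %>%
  inner_join(course_db, by = "course_id", suffix = c(".student", ".course")) %>%
  filter(name.student == "Ava Smith" | name.student == "Freddie Harris") %>%
  select(name.course) %>%
  distinct()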
show_query(q3)
<SQL>
SELECT DISTINCT `name.course`
FROM (SELECT `student_id`, `LHS`.`name` AS `name.student`, `year`,
`LHS`.`course_id` AS `course_id`, `final_mark`, `RHS`.`name` AS
`name.course`, `capacity`
FROM (SELECT `LHS`.`student_id` AS `student_id`, `name`, `year`,
`course_id`, `final_mark`
FROM `Student` AS `LHS`
INNER JOIN `Grade` AS `RHS`
ON (`LHS`.`student_id` = `RHS`.`student_id`)
) AS `LHS`
INNER JOIN `Course` AS `RHS`
ON (`LHS`.`course_id` = `RHS`.`course_id`)
)
WHERE (`name.student` = 'Ava Smith' OR `name.student` = 'Freddie Harris')
Note that the computer-generated SQL code that dplyr created internally is more complicated
than the SQL code we wrote for the same query.
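The code chunk for the aggregation query is not shown; a sketch consistent with the generated SQL below is:
q4 <- grade_db %>%
  group_by(course_id) %>%
  summarize(avg_mark = mean(final_mark, na.rm = TRUE))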
show_query(q4)
<SQL>
SELECT `course_id`, AVG(`final_mark`) AS `avg_mark`
FROM `Grade`
GROUP BY `course_id`
Disconnecting From the Database
After we finish manipulating the database, we can close the connection using the function
dbDisconnect() from DBI:
dbDisconnect(conn)
** Note: The code chunks below should be run in the following order **
# This makes sure you can run this notebook multiple times without errors
import os
try:
os.remove('University.db')
except OSError:
pass
We use the sqlite3 module to connect to the database University.db (the file is created if it does not exist):
import sqlite3
conn = sqlite3.connect('University.db')
We then read the CSV files into pandas DataFrames:
import pandas as pd
student = pd.read_csv("student.csv")
course = pd.read_csv("course.csv")
grade = pd.read_csv("grade.csv")
We then write the records stored in the DataFrames student, grade and course as tables in the
database University.db using the DataFrame method to_sql():
# index = False ensures the DataFrame row index is not written into the SQL tables
student.to_sql('Student', con = conn, index = False)
course.to_sql('Course', con = conn, index = False)
grade.to_sql('Grade', con = conn, index = False)
Again, we can check whether the database has been created properly by opening it in DB Browser
for SQLite and browsing the tables.
To execute SQL commands from Python, we first create a cursor object:
c = conn.cursor()
After that we can execute the SQL commands we learned before using the cursor methods execute()
and fetchall(). For example, if we want to get all the tables in the database, we can run:
c.execute('''
SELECT name
FROM sqlite_master
WHERE type='table'
''')
<sqlite3.Cursor object at 0x1447c1f80>
c.fetchall()
[('Student',), ('Course',), ('Grade',)]
We can see there are three tables in the database. If we want to browse the table Student, we
can run the following (here we display the results as a pandas DataFrame):
import pandas as pd
q = c.execute("SELECT * FROM Student").fetchall()
pd.DataFrame(q)
0 1 2
0 201921323 Ava Smith 2
1 201832220 Ben Johnson 3
2 202003219 Charlie Jones 1
3 202045234 Dan Norris 1
4 201985603 Emily Wood 1
5 201933222 Freddie Harris 2
6 201875940 Grace Clarke 2
Note here we combine the use of execute() and fetchall() in one line.
Add a Table
We can add a new table Teacher by running the CREATE TABLE command through execute() and then committing the change:
c.execute('''
CREATE TABLE Teacher (staff_id TEXT PRIMARY KEY,
name TEXT)
''')
<sqlite3.Cursor object at 0x1447c1f80>
conn.commit() # save (commit) the changes
c.execute('''
SELECT name
FROM sqlite_master
WHERE type='table'
''').fetchall()
[('Student',), ('Course',), ('Grade',), ('Teacher',)]
Delete a Table
We can delete a table by running the SQL command through execute():
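The code chunk is not shown here; a sketch of the command implied by the text is:
c.execute("DROP TABLE Teacher")
conn.commit()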
c.execute('''
SELECT name
FROM sqlite_master
WHERE type='table'
''').fetchall()
[('Student',), ('Course',), ('Grade',)]
Insert Tuples/Rows
Insert the year 1 student Harper Taylor with student ID 202029744 to Student:
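The code chunk is not shown here; a sketch of the INSERT statement (assuming the attributes student_id, name and year) is:
c.execute('''
INSERT INTO Student (student_id, name, year)
VALUES ('202029744', 'Harper Taylor', 1)
''')
conn.commit()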
When we browse the table, we can see the new row is added.
Update Tuples/Rows
Update the student ID of student Harper Taylor to 201929744:
c.execute('''
UPDATE Student
SET student_id = "201929744"
WHERE name = "Harper Taylor"
''')
<sqlite3.Cursor object at 0x1447c1f80>
conn.commit()
When we browse the table, we can see the row has changed.
Delete Tuples/Rows
Delete the record for the student Harper Taylor from table Student:
c.execute('''
DELETE FROM Student
where name = "Harper Taylor"
''')
<sqlite3.Cursor object at 0x1447c1f80>
conn.commit()
When we browse the table, we can see the row has been removed.
conn.close()
Querying Databases
To query the database from Python, we connect to University.db again, create a cursor and check which tables are available:
import sqlite3
conn = sqlite3.connect('University.db')
c = conn.cursor()
c.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
[('Student',), ('Course',), ('Grade',)]
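The queries creating q1 to q4 are not reproduced in these notes; below, sketches consistent with the displayed outputs are shown before each output. For q1 (the final marks of the students in ST101):
q1 = c.execute('''
SELECT final_mark
FROM Grade
WHERE course_id = 'ST101'
''').fetchall()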
import pandas as pd
pd.DataFrame(q1)
0
0 78
1 60
2 47
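A sketch for q2 (the names of the students in ST101, in alphabetical order):
q2 = c.execute('''
SELECT Student.name
FROM Student NATURAL JOIN Grade
WHERE course_id = 'ST101'
ORDER BY Student.name
''').fetchall()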
pd.DataFrame(q2)
0
0 Ava Smith
1 Charlie Jones
2 Emily Wood
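A sketch for q3 (the distinct names of the courses taken by Ava Smith or Freddie Harris, written with explicit JOIN ... ON clauses):
q3 = c.execute('''
SELECT DISTINCT Course.name
FROM Student
JOIN Grade ON Student.student_id = Grade.student_id
JOIN Course ON Course.course_id = Grade.course_id
WHERE Student.name = 'Ava Smith' OR Student.name = 'Freddie Harris'
''').fetchall()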
pd.DataFrame(q3)
0
0 programming for data science
1 Managing and Visualising Data
2 Databases
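A sketch for q4 (the average final mark per course):
q4 = c.execute('''
SELECT course_id, AVG(final_mark)
FROM Grade
GROUP BY course_id
''').fetchall()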
pd.DataFrame(q4)
0 1
0 ST101 61.666667
1 ST115 82.333333
2 ST207 66.500000
conn.close()
Computer Programming
Computer programming is the process of writing instructions that a computer can follow in order
to perform a particular task. Programming is very much like writing down a cooking recipe: the
aim is to provide precise instructions so that, if they are followed to the letter by someone else, the
prepared meal will have the same taste and look as what you had in mind. So, as with cooking
recipes, the first and most critical stage of the programming process is to be absolutely clear on
what is the goal of the program or what is the problem that the program solves. In other words, to
define what the program does. For example, a spreadsheet application should enable the
organization and analysis of rectangular data, a word processor should make it easy to edit and
format text, optical character recognition (OCR) software should be able to identify the characters
in an image and extract them in a text file, and so on.
The second stage is to plan the steps that the program should take to fulfil its purpose. Typically,
the set of steps takes some inputs (analogous to ingredients and cooking equipment) and
produces predetermined outputs (analogous to the prepared meal). For example, the input of an
OCR program is the image of a document and the output is the recognized text, possibly in a text
file. This sequence of steps is called a programming algorithm. Note here that there may be more
than one algorithm that produces the same output starting from the same input, each having its
own advantages and disadvantages. Typically, the aim is to produce algorithms that are fast and
reliable, and require predictable time and resources to run.
Of course, the more we want the computer to do, the more instructions we need to provide. Let
us come back to the recipe analogy. It is not so difficult to write a recipe for making pasta. Just
boil water, throw in the pasta for 10 minutes, add salt, drain, and serve. But how about cooking
curry or making a birthday cake? It definitely requires more steps and ingredients.
Avoid Spaghetti Programming
It is very important to spend some time carefully planning your program; otherwise you are likely to
create something that is difficult to understand, maintain or extend. “Spaghetti program” is a
phrase commonly used to describe poorly planned or structured programs. Below is an example
of a spaghetti program written in the BASIC language (one of the early programming languages)
that first prints “This line prints first”, then “This line prints second” and then ends.
10 GOTO 50
20 PRINT "This line prints second"
30 END
40 GOTO 20
50 PRINT "This line prints first"
60 GOTO 40
Clearly, this program is more convoluted than it should be. Line 10 (the first line) sends the
computer to line 50 (the fifth line). Then line 50 gets the computer to print This line prints
first. Afterwards the computer follows the command in line 60 that sends it to line 40, which in
turn sends it to line 20. There it tells the computer to print This line prints second. After that,
the computer automatically follows the command on the next line, which is line 30 and tells it to
end. Even though the program consists of only six lines, the GOTO command gets the computer to
move between lines, making it very difficult for a human to understand the flow and the purpose
of this program. With several hundreds of lines of code, this kind of coding will create
programs that are almost impossible to understand or maintain, and should be avoided.
Structured Programming
It is essential to impose some structure to keep programs organized. Programs are usually
divided into three distinct parts:
• Sequences
• Branches or conditional statements
• Iterations or loops
Sequences
A sequence is a set of steps that the computer should follow one after another. An example is
the following running calorie calculator that takes as inputs the user’s age, weight, average heart
rate and time the user ran and outputs the calories burned.
1. Calculate [(Age x 0.2017) + (Weight x 0.09036) + (Heart Rate x 0.6309) - 55.0969] x Time / 4.184
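As an illustration only (this code is not part of the original notes), the sequence above could be written as a short R function; the formula and variable names are taken directly from the step listed:
calories_burned <- function(age, weight, heart_rate, time) {
  # Apply the calorie formula from the step above
  ((age * 0.2017) + (weight * 0.09036) + (heart_rate * 0.6309) - 55.0969) * time / 4.184
}

# Example usage (hypothetical inputs)
calories_burned(age = 30, weight = 70, heart_rate = 150, time = 30)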
Iterations
Sometimes you may want the computer to run the same commands over and over again. For
example, a program might ask the user for a password. If the user types an invalid password, the
program displays an error message and asks the user to type the password again. If you wanted
your program to ask the user for a password three times, you could write the same group of
commands to ask for the password three times, but that would be wasteful. Not only would this
force you to type the same commands multiple times, but if you wanted to modify these
commands, you’d have to modify them in three different locations as well. Loops are basically a
shortcut to writing one or more commands multiple times.
An iteration, or loop, consists of two parts: the block of commands to be repeated, and a condition that determines when the repetition stops.
Subprograms
The idea is that writing a large program may be tough, but writing a small program is easier. So if
you keep dividing the tasks of your program into smaller and smaller parts, eventually you can
write a small, simple program that can solve that task. Then you can put these small programs
together like building blocks, and you’ll have a well-organized big program. Also, if you need to
modify part of the large program, you just need to find the small program that needs changing,
modify it, and plug it back into the larger program. Ideally, each small program should be small
enough to fit on a single sheet of paper. This makes each small program easy to read,
understand, and modify. When you divide a large program into smaller programs, each small
program is a subprogram.
Comments
Consider the following chunk of R code:
A <- 2
B <- 3
C <- sqrt(A * A + B * B);
A and B are multiplied by themselves, the results are added together, and then the square root of
the sum is stored in C. But this does not tell us why this code is doing this. Below we see the
same set of commands but with some comments added to the code:
A <- 2
B <- 3
# Calculates the hypotenuse of a triangle (C) using Pythagoras' theorem: C^2 = A^2 + B^2
C <- sqrt(A * A + B * B)
Even if one does not know (or care) about Pythagoras’ theorem, the comments are helpful in
understanding what the purpose of this line is. Another example with more comments and
information is given below
# This formula uses the epidemic model to calculate the spread of a flu epidemic as
# a function of time. P is the current population of a city, t is time measured in
# weeks, c is the number of people in contact with an infected person and B is a
# constant value that can be determined by the initial parameters of the flu epidemic
P <- 10000
B <- 1
c <- 2000
t <- 2
F <- P / (1 + (B * exp(-c * t)))
Comments are therefore useful to explain both what code does and how it works. But having too
many comments in a computer program is usually a sign that the code is too complicated. So,
instead of writing lengthy comments, a good programmer aims to write code which is as self-
explanatory as possible and uses comments only if necessary.
A program can be documented using comments that explain the purpose of one or more lines of
code or of entire subprograms. Comments can also be used to provide information such as
identifying the original programmer, their contact details, the license of the program, etc.
References
Wang, W. (2008). Beginning programming all-in-one desk reference for dummies. Wiley.
Variables, Control Flow Structures
and Functions
** Note: The code chunks below should be run in the following order **
Variables
Variables provide a way to associate names with storage points in the computer that the
programs can access and manipulate if needed. Variables can have different data types such as
float, int, string, vector, array, matrix etc. Perhaps you are familiar with the term variable
in mathematics. Programming variables are similar. A variable is a way of saving a piece of
information with a specific name that we can easily reuse several times in our code. In line with
the idea of variability, a programming variable allows us to easily change a value throughout our
code.
Consider the following chunk of R code based on an example from Guttag (2013):
pi <- 3
radius <- 11
area <- pi * (radius^2)
radius <- 14
First, we assign the names pi and radius to different storage points that contain data of type
int in this case. Then we assign the name area to a third storage point containing the value of
the product between the square of radius and pi. This is represented graphically in the left
panel of the figure below:
Variables assignment; image taken from Guttag (2013)
The last line of code then changes radius to another value radius <- 14 as shown in the right
panel of the figure above. Note that this assignment has no effect on the value of the variable
area which remains 3*(11**2). By the way, if you believe that the actual value of pi, reflecting
the famous number \(\pi\) is not 3, you are right. This was done for illustration purposes only. A
value of 3.14159 would offer a more accurate answer.
At this point it may be relevant to clarify the difference between variables and data. Variables
reflect the assignment to storage points that contain data. So in that sense and if you are familiar
with the term random variable in probability and statistics and the notation \(X\) vs \(x\), a
programming variable refers to the mathematical function denoted by \(X\), whereas data relate to
the realised value of a random variable \(x\).
Note also that the names assigned are important as programs are also read by humans. It is not
always easy to write programs that work correctly and programmers often devote a lot of time to
understand why programs do not behave in the way they should. A good choice of variable
names can provide substantial help in making computer programs easier to read. Consider the
two code chunks:
a = 3.14159
b = 11.2
c = a * (b**2)
and
pi = 3.14159
diameter = 11.2
area = pi * (diameter**2)
As far as R or Python is concerned, the two code chunks do exactly the same thing. Looking at
the first chunk, we see nothing wrong with the code. But a quick glance at the code in the second
chunk immediately raises suspicions as to whether the program really does what the
programmer wanted to; either the variable diameter should have been named radius, or
diameter should have been divided by 2 before squaring it in the calculation of the area.
In R and Python, variable names can contain upper-case and lower-case letters and digits (but
they cannot start with a digit), and the dot or underline characters. Variable names are case-
sensitive.
Conditional Statements
It is possible to create programs that can make decisions based on the inputs you provide or the
value that a variable has during run time.
With conditional statements we can write programs that determine whether a particular part of
the code should be executed instead of another.
We start with the if statement. The basic forms of if statements encountered in R are
if (condition) true_action
if (condition) true_action else false_action
If the condition in the above code evaluates to TRUE, the true_action is performed; if the
condition evaluates to FALSE, the false_action is performed. Of course, specifying a
false_action is optional. If the actions are compound statements of more than one command,
they need to be contained within curly brackets {}:
x <- 105
if (x > 100) {
print("A")
print("B")
}
[1] "A"
[1] "B"
We can also have nested if statements. For example, the following code chunk will print “A” if
\(x > 90\), “B” if \(80 < x \leq 90\), and “C” otherwise (if \(x \leq 80\)).
x <- 80
if (x > 90) {
print("A")
} else if (x > 80) {
print("B")
} else
print("C")
[1] "C"
Try running the above with different values for x to confirm that the nested if statements do what
we want them to.
Note that the condition should be or evaluate to Boolean type, i.e. either TRUE or FALSE. Most
other inputs will generate an error.
A similar conditional statement in R is the switch() statement. It lets us replace code like:
x <- "c"
if (x == "a") {
"A"
} else if (x == "b") {
"B"
} else if (x == "c") {
"C"
} else {
stop("Invalid `x` value")
}
[1] "C"
x <- "b"
switch(x,
"a" = "A",
"b" = "B",
"c" = "C",
stop("Invalid `x` value")
)
[1] "B"
Iterations
Iterations (also known as loops) are programming structures that repeat a sequence of
instructions until a specific condition is met. Programmers use iterations extensively to cycle
through values, compute sums of numbers, repeat actions, and many other things. Oftentimes, a
certain operation needs to be repeated many times with little variation in order to achieve a goal.
Rather than copying and pasting the commands over and over again, a loop can be used
instead. In addition to simplifying writing programs, loops offer another advantage. Imagine there
is an error in the code you copied and pasted several times. You will then have to go and correct
this error at every instance of this code. But if a loop had been used, the correction would only
need to be made once!
There are three main types of iterations: the for loop, the while loop and the repeat loop. The
for loops are used to iterate over items in a vector. They have the following basic form:
for (item in vector) perform_action
For each item in vector, perform_action is called once; then the loop moves to the next item
in the vector, and so on. For example, the following code prints all the numbers between 1
and 10:
for (i in 1:10) {
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
To illustrate the points made earlier consider the naive alternative code for printing all the
numbers between 1 and 10 as above.
print(1);print(2);print(3);print(4);print(5);print(6);print(7);print(8);print(9);pr
int(10)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Now imagine what would happen if we wanted to print all numbers up to 10000; or if we
accidentally copied and pasted prin instead of print. A handy rule of thumb in programming is
to try to avoid copying and pasting more than twice.
It is possible to also terminate a for loop early. There are two ways to do it:
• next stops the current iteration and moves to the next item in vector.
• break stops the loop.
for (i in 1:10) {
if (i < 3)
next
print(i)
if (i >= 5)
break
}
[1] 3
[1] 4
[1] 5
It is a good idea to pre-define the objects that will hold the outputs of the loop, otherwise the loop
can be very slow. In the code below we begin by defining the empty vector k as the vector that
will hold the results from each iteration of the for loop. Note that it has length 3, which is the
same as the length of the vector we iterate over.
k <- numeric(3)
for (i in 1:3) {
k[i] = i^4
}
print(k)
[1] 1 16 81
The for loops are useful if you know in advance the set of values that you want to iterate over. If
you don’t know, there are two related tools with more flexible specifications:
• while (condition) action: the action is repeated for as long as the condition evaluates to TRUE.
• repeat action: the action is repeated forever, until a break statement is reached.
We can rewrite any for loop and use while instead. We can also rewrite any while loop and
use repeat instead. But the converses are not true. More specifically, while is more flexible
than for, and repeat is more flexible than while. It is good practice, however, to use the
simplest solution to a problem, so it is usually recommended to use for if it suits the purpose of
the program and only consider alternatives if they are really needed.
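As an illustration only (not part of the original notes), here is the same simple loop written with for, while and repeat; all three print the numbers 1 to 3:
for (i in 1:3) {
  print(i)
}

i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}

i <- 1
repeat {
  print(i)
  i <- i + 1
  if (i > 3) break
}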
Functions
Functions are essential in splitting a program into smaller parts and allow us to automate common
tasks in a more powerful and general way than copying and pasting; recall the rule of avoiding
copy and paste more than twice. Writing a function has three big advantages over using copy
and paste:
• You can give a function a name that makes your code easier to understand.
• As requirements change, you only need to update code in one place, instead of many.
• You eliminate the chance of making accidental mistakes when you copy and paste
(e.g. updating a variable name in one place, but not in another).
Writing a Function
A function can be defined with the following code:
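The template itself is not reproduced in these notes; a minimal sketch consistent with the description that follows (the name func_name, the reserved word function, arguments and a body within curly brackets) is:
func_name <- function(arg_1, arg_2) {
  # statements forming the body of the function
}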
The reserved word function is used to declare a function in R. The statements within the curly
brackets form the body of the function. As with loops, these brackets are optional if the body
contains only a single statement. Also, a function has inputs that are termed arguments.
Finally, the function above is given the name func_name.
As an example, let us create a function called raise(). It takes two arguments,
computes the first argument raised to the power specified in the second argument, and prints the
result accompanied by a few words:
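The definition is not reproduced in these notes; a sketch that is consistent with the calls and outputs below (with the second argument named y and given the default value 2, as discussed afterwards) is:
raise <- function(x, y = 2) {
  paste(x, "to the power of", y, "is", x^y)
}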
The above uses the base R function paste() to concatenate string variables. It is good practice
to check the function after writing it by calling it a few times using different arguments, and testing
that the output is as expected.
raise(8, 2)
[1] "8 to the power of 2 is 64"
8^2
[1] 64
raise(2, 8)
[1] "2 to the power of 8 is 256"
2^8
[1] 256
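Calling the function without the second argument uses the default value of y (assumed here to be 2):
raise(8)
[1] "8 to the power of 2 is 64"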
Note that the default value of y is used only when no information for y is provided.
Note that in R, it is not necessary to include a return statement. R automatically returns the
value of the last expression evaluated in the body of the function.
In a similar manner to getting inches from centimetres, we can transform inches into yards:
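The definitions are not reproduced in these notes; a sketch (assuming a cm_to_inches() helper like the one referred to above, with 2.54 cm per inch and 36 inches per yard) is:
cm_to_inches <- function(cm) {
  cm / 2.54
}

inches_to_yards <- function(inches) {
  inches / 36
}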
What about converting centimetres to yards? In addition to the previous approach, we can also
do it by composing the two functions we have already created:
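A sketch of the composed function:
cm_to_yards <- function(cm) {
  inches_to_yards(cm_to_inches(cm))
}

cm_to_yards(91.44)
[1] 1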
The above can also be achieved with a single line of code by nesting functions:
inches_to_yards(cm_to_inches(91.44))
[1] 1
This is a demonstration of how larger programs are built: basic operations are defined and then
combined.
Naming Functions
As with variables it is important to remember that functions are not only read by computers but
also by humans; hence their name or the comments they contain are important. Below are some
good practice recommendations:
• Aim for short function names, as long as they are informative about what the function does. If
this is hard, being informative is more important than being short, since IDEs such as
RStudio typically offer autocomplete, thus making it easy to type long names.
• A frequently used convention is to use verbs for function names and nouns for their
arguments although there are exceptions, e.g. mean() is more convenient than
compute_mean().
References
Guttag, J. V. (2013). Introduction to computation and programming using Python. MIT Press.
Wickham, H. (2019). Advanced R. Taylor & Francis Inc. CRC Press Inc.
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and
model data (1st ed.). O’Reilly Media.
Exceptions, Error Handling and
Debugging in R
Outline
This page focuses on what happens when there are errors in our code. It has adapted material
from chapters 8 and 22 of Wickham (2019). The aim is to present actions for addressing the
errors and the tools that are available to help. We will cover ways to fix unanticipated
problems (debugging), show how functions can be designed to communicate problems and
what actions can be taken based on those communications (condition handling). Finally, we
will go over strategies to avoid common problems before they occur (defensive
programming). We will illustrate concepts in R; the material of next week will focus on how
to perform these actions in Python.
Debugging is the task of fixing unexpected problems in your code. Although there are some
guidelines on performing it, it is usually considered an art. Of course, not all errors are
unexpected. Some common problems, for example, include a non-existent file or the wrong
type of input. Using conditions, such as errors, warnings, and messages, when writing
functions offers a great help:
• Errors are raised by stop() and force all execution in a function to terminate. They
are used when there is no way for a function to continue.
• Warnings are generated by warning() and are used to display potential problems,
e.g. log(-1).
• Messages are generated by message() and are used to give informative output in a
way that can easily be suppressed by the user (see ?suppressMessages).
A general strategy for debugging consists of the following steps:
1. Realize that you have a bug: You cannot fix a bug until you know it exists. Hence, it
is always a good idea to test your code before you finish. Automated testing is a more
involved option in that direction.
2. Make it repeatable: Once you have determined you have a bug in your code, you
need to be able to reproduce it on demand. Without this, it becomes extremely
difficult to isolate its cause and to confirm that you have successfully fixed it.
Generally, you will start with a big block of code that you know causes the error and
then slowly whittle it down to get to the smallest possible snippet that causes the
error. Binary search is particularly useful for this. To do a binary search, you
repeatedly remove half of the code until you find the bug. This is fast because, with
each step, you reduce the amount of code to look through by half. As you work on
creating a minimal example, you will also discover similar inputs that do not trigger
the bug. Make sure you take note of them as they will be helpful when diagnosing the
cause of the bug.
3. Figure out where it is: If you are lucky, one of the tools in the following section will
help you to quickly identify the line of code that is causing the bug. Usually, however,
you will have to think a bit more about the problem. It is a good idea to adopt the
scientific method. Generate hypotheses, design experiments to test them, and record
your results. This may seem like a lot of work, but a systematic approach will end up
saving you time.
4. Fix it and test it: Once you have found the bug, you need to figure out how to fix it
and to check that the fix actually worked. Again, it is very useful to have automated
tests in place. Not only does this help to ensure that you have actually fixed the bug, it
also helps to ensure you have not introduced any new bugs in the process. In the
absence of automated tests, make sure to carefully record the correct output, and
check against the inputs that previously failed.
5. Internet search: In some cases, it may be a good idea to do an internet search on the
error message. There is a chance that this is a common error with a known solution.
Debugging Tools
There are three key debugging tools in R:
• RStudio’s error inspector and traceback(), which list the sequence of calls that lead
to the error.
• RStudio’s “Rerun with Debug” tool and options(error = browser), which open
an interactive session where the error occurred.
• RStudio’s breakpoints and the browser() function, which open an interactive session
at an arbitrary location in the code.
When an error occurs in RStudio, two options appear to the right of the error message: “Show Traceback” and “Rerun with
Debug.” If you click “Show Traceback” you see the sequence of calls that led to the error.
Sometimes this is enough information to let you track down the error and fix it. However,
sometimes it is not. traceback() shows where the error occurred, but not why. The next
useful tool is the interactive debugger, which allows you to pause execution of a function and
interactively explore its state.
Browsing on Error
The easiest way to enter the interactive debugger is through RStudio’s “Rerun with Debug”
tool. This reruns the command that created the error, pausing execution where the error
occurred. You are now in an interactive state inside the function, and you can interact with
any object defined there. You will see the corresponding code in the editor (with the
statement that will be run next highlighted), objects in the current environment in the
“Environment” pane, the call stack in a “Traceback” pane, and you can run arbitrary R code
in the console. There are a few special commands you can use in debug mode. You can
access them either with the RStudio toolbar or with the keyboard:
• Next: Executes the next step in the function. Be careful if you have a variable named
n; to print it you will need to do print(n).
• Step into: Works like Next, but if the next step is a function, it will step into that
function so you can work through each of that function's lines.
• Finish: Finishes execution of the current loop or function.
• Continue: Leaves interactive debugging and continues regular execution of the
function. This is useful if you have fixed the bad state and want to check that the
function proceeds correctly.
• Stop: Stops debugging, terminates the function, and returns to the global workspace.
Use this once you have figured out where the problem is, and you are ready to fix it
and reload the code.
As well as entering an interactive console on error, you can enter it at an arbitrary code
location by using either an RStudio breakpoint or browser().
Condition Handling
There are three main tools for handling conditions programmatically in R:
• try() gives you the ability to continue execution even when an error occurs.
• tryCatch() lets you specify handler functions that control what happens when a
condition is signalled.
• withCallingHandlers() is a variant of tryCatch() that establishes local handlers,
whereas tryCatch() registers exiting handlers.
For example, consider the function
f1 <- function(x) {
log(x)
10
}
f1("x")
stops with the error message “Error in log(x): non-numeric argument to mathematical
function.” However, if you wrap the statement that creates the error in try(), the error
message will be printed but execution will continue:
f2 <- function(x) {
try(log(x))
10
}
f2("a")
Error in log(x) : non-numeric argument to mathematical function
[1] 10
As an example, consider a simple function that checks if an argument is an even number. You
might write the following:
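The function is not reproduced in these notes; a minimal sketch of such a check (the exact definition in the original may differ) is:
is_even <- function(x) {
  x %% 2 == 0
}

is_even(4)
[1] TRUE
is_even("hello")
Error in x%%2 : non-numeric argument to binary operator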
You can see that providing a string causes this function to raise an error. You could imagine
though that you want to use this function across a list of different data types, and you only
want to know which elements of that list are even numbers. You might think to write the
following:
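The code at this point is not reproduced in these notes; one way to sketch the idea, wrapping each call in tryCatch() so that unsuitable elements return NA instead of stopping execution, is:
values <- list(2, "a", 3, TRUE)

sapply(values, function(v) {
  tryCatch(is_even(v), error = function(e) NA)
})
[1]  TRUE    NA FALSE FALSE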
Defensive Programming
Defensive programming is the strategy of making code fail in a well-defined manner even
when something unexpected occurs. A key principle of defensive programming is to signal an
error as soon as something wrong is discovered. In R, the “fail fast” principle is implemented
in three ways:
• Be strict about what you accept. For example, if your function is not vectorized in its
inputs, but uses functions that are, make sure to check that the inputs are scalars (see the sketch after this list).
• Avoid functions that use non-standard evaluation, like subset, transform, and with.
These functions save time when used interactively, but because they make
assumptions to reduce typing, when they fail, they often fail with uninformative error
messages.
• Avoid functions that return different types of output depending on their input. The
two biggest offenders are [ and sapply().
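As an illustration of the first point (not part of the original notes), a strict version of is_even() that fails fast on unsuitable inputs could look like this:
is_even_strict <- function(x) {
  # Fail fast: require a single numeric value
  stopifnot(is.numeric(x), length(x) == 1)
  x %% 2 == 0
}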
References
Wickham, H. (2019). Advanced R. Taylor & Francis Inc. CRC Press Inc.
ST2195 Programming for Data
Science
Block 5 Variables, Mutability and
Aliasing in Python and R
VLE Resources
There are extra resources for this Block on the VLE – including quizzes and videos. Please
make sure you review them as part of your learning. You can find them here:
https://emfss.elearning.london.ac.uk/course/view.php?id=382#section-5
** Note: The code chunks below should be run in the following order **
Outline
In this page we focus on presenting some programming concepts in Python. Before doing this,
we first return to the definitions of data types we have learned previously in the notes on Python
data structures, and describe their association with objects. We then talk about mutability and
aliasing in Python. Finally, we compare how objects in Python and R behave differently.
In Python, every value is an object and every object has a type, which we can inspect with the function type():
type(1)
<class 'int'>
type(1.0)
<class 'float'>
type("1")
<class 'str'>
We can change the type of an object to another type by casting. For example:
type("1")
<class 'str'>
type(int("1"))
<class 'int'>
The first object "1" has the type string whereas the second object int("1") has the type int
as the function int() converts the argument value to an integer (i.e. a object with type int).
What do you expect the outcome of the following code to be?
int("a")
If we apply the + operator to two int objects:
1 + 2
3
we get another object of type int and value 3, which is the sum of 1 and 2.
If we do:
1 + "a"
we will get the error message TypeError: unsupported operand type(s) for +: 'int'
and 'str', as we are not allowed to “add” an int and a string together. The type of the objects
decides what you can and cannot do with them.
The type of an object also decides the behaviour of the operation. If we apply the summation
operator on two strings:
"hello"+"world"
'helloworld'
we are not “summing” the strings in the numerical sense; instead, we concatenate the strings to
get a new string. The type therefore dictates the behaviour of the objects.
What if we run the same code in R?
## R code
"hello"+"world"
Does it work?
In Python we can assign an object to a variable with the = operator:
pi = 3.14
Here we assign the object with type float and value 3.14 to the variable with name pi. So now
we can retrieve the value by using the variable pi:
pi
3.14
The variable will have the same type as the object that is assigned to it:
print(type(pi))
<class 'float'>
print(type(3.14))
<class 'float'>
Similar to R, we can assign a value to a variable using an expression; see the third line of the
code below:
pi = 3.14
radius = 11
area = pi * (radius**2)
area
379.94
We can rebind the variable to another object by simply assigning another object to it. For
example:
pi = "hello"
pi
'hello'
Now pi is bound to the object with value "hello".
The type of a variable depends on the object assigned to it. Now the type of pi is string, as it
binds to the object "hello".
type(pi)
<class 'str'>
Note that this assignment has no effect on the value of the variable area, which remains as
3.14*(11**2) = 379.94:
area
379.94
Naming a Variable
Similar to R, variable names can contain upper-case and lower-case letters, digits (but they
cannot start with a digit) and underline characters, and variable names are case-sensitive. Unlike
R, you cannot use dots as part of a variable name. We will see later that the dot is used to get
attributes from an object.
In Python, some of the words are reserved for specific purposes (known as keywords) and you
are not allowed to use them as variable names. Examples of keywords are if, else, for, def,
etc. If you run the following:
if = 3
you will get an error with the error message: SyntaxError: invalid syntax.
Mutability
In the page “Python and Data Structures” we have seen that we can modify objects like list.
For objects that can be modified, we say they are mutable. Examples of mutable object types
are list and dict.
For example, we can modify a list:
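The code chunk is not reproduced in these notes; a minimal sketch (using a hypothetical list called numbers) is:
numbers = [1, 2, 3]
numbers[0] = 10   # modify the first element in place
print(numbers)
[10, 2, 3]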
Remember that in Python indices start from 0 (in R they start from 1).
On the other hand, we cannot modify immutable objects like tuple or string. For example if we
try to modify an element in a tuple:
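A minimal sketch (the exact example in the original is not shown):
t = (1, 2, 3)
t[0] = 10   # raises a TypeError because tuples are immutable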
It will give the error message TypeError: 'tuple' object does not support item
assignment.
Aliasing
Consider the following code:
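The code chunk is not reproduced here; a sketch that matches the discussion below and the R comparison that follows is:
colours = ["red", "blue", "green"]
colours_2 = colours
colours_2[0] = "orange"
print(colours_2)
['orange', 'blue', 'green']
print(colours)
['orange', 'blue', 'green']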
While you probably expected colours_2 to have "orange" as its first element, you probably did not
expect that colours would have "orange" as its first element too! What is wrong? After all, if we write
similar code in R:
# R code
r_colours <- c("red", "blue", "green")
r_colours_2 <- r_colours
r_colours_2[1] <- "orange"
print(r_colours_2)
print(r_colours)
[1] "orange" "blue" "green"
[1] "red" "blue" "green"
r_colours would not be changed! Why do similar blocks of code behave differently in R and
Python?
To understand further, it is worth taking a closer look at what the assignment of an object to a variable
is actually doing behind the scenes in Python. Below we use Python Tutor to visualise how the
first 3 lines of the Python code bind the object (the list) to the variable names colours and
colours_2, and change the object through colours_2. Please press the next button to see
what is actually happening for each line of the code.
As we can see in the visualisation above, when we run colours_2 = colours, what we are
copying is the reference to the list but not the list itself. In other words, what we are doing is
assigning the list that the variable colours referred to to colours_2. Effectively now the
variables colours and colours_2 are bound to the same list. You can think of colours and
colours_2 as just two names for the same object. The list itself is not copied. When we run
the third line, we modify the list through the variable name colours_2. As the variable colours
refers to the same list as colours_2, when we retrieve the list by the variable name
colours, we will see the first element is "orange" instead of "red".
We can verify that the variables colours and colours_2 refer to the same object by using
the function id():
id(colours)
5441935496
id(colours_2)
5441935496
The function id() returns a unique identifier of an object, and we can see that id(colours) and
id(colours_2) return the same value.
If we want to actually make a copy of the list (the object), we need to do it explicitly by using
the copy() method:
colours_3 = colours.copy()
colours_3[0] = "black"
print(colours_3)
['black', 'blue', 'green']
print(colours)
['orange', 'blue', 'green']
When working in Python, remember that a container can be modified through any variable bound to
it. If you want to make sure no other variable can modify the container, explicitly create a
copy when assigning to another variable, as shown in the code block above.
What about immutable objects, such as numbers and strings? Consider the following code:
pi = 3.14
print(pi)
3.14
pi = "hello"
print(pi)
hello
Are we not changing the object in the third line? While we indeed “update” the value associated
with the variable pi in the second assignment, the object created by the first assignment is not
changed. What actually happens in the second assignment is that a new object with the value
"hello" is created and bound to the variable pi. We can confirm this by printing
the identity of the object bound to pi:
pi = 3.14
print(id(pi))
5441320784
pi = "hello"
print(id(pi))
5441941320
We can see that the identity has changed, showing that pi on lines 1 and 3 is bound to different
objects. We did not change the immutable object; instead we created a new immutable object.
Therefore, we do not need to worry about the aliasing issue with immutable objects, as shown in the example below:
pi = 3.14
a = pi
print(id(pi))
5441320904
print(id(a))
5441320904
pi = "hello"
print(a)
3.14
print(id(pi))
5441416304
print(id(a))
5441320904
# R code
r_colours <- c("red", "blue", "green")
r_colours[1] <- "orange"
r_colours
[1] "orange" "blue" "green"
# R code
r_colours <- c("red", "blue", "green")
tracemem(r_colours)
r_colours_2 <- r_colours
tracemem(r_colours_2)
[1] "<0x7fbb2ee79008>"
[1] "<0x7fbb2ee79008>"
Note that no message is shown as no object is copied. Both r_colours and r_colours_2 have
the same memory address.
However, when we change r_colours,
# R code
r_colours[1] <- "orange"
tracemem[0x7fbb2ee79008 -> 0x7fbb2f304308]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
We can see the message from the tracemem() function printing tracemem[old_address ->
new_address], which tells us that r_colours is now pointing to a new object. When more than
one variable points to an object and we modify the object through one of the variables
(e.g. r_colours[1] <- "orange"), R makes a copy of the object, applies the change, and then
binds the copy to the variable that triggered the change. The other variables still point to the old
object.
We can verify that r_colours is pointing to a new object and r_colours_2 is pointing to the old
object with the tracemem() function:
# R code
tracemem(r_colours)
tracemem(r_colours_2)
[1] "<0x7fbb2f304308>"
[1] "<0x7fbb2ee79008>"
Note that the address for r_colours has changed, but the address for r_colours_2 has not
changed. It is still pointing to the old object.
Therefore, the first element of r_colours has changed to "orange" but not r_colours_2:
# R code
print(r_colours)
print(r_colours_2)
[1] "orange" "blue" "green"
[1] "red" "blue" "green"
** Note: The code chunks below should be run in the following order **
Conditional Statements
We start with the if statement. Its basic forms, first without and then with an else clause, are as follows:
if condition:
    true_action_1
    true_action_2
    ...

if condition:
    true_action_1
    true_action_2
    ...
else:
    false_action_1
    false_action_2
    ...
For example:
mark = 80
if mark >= 50:
    print("pass")
pass
While the control statements in R and Python are quite similar, there are several differences
between them; the most notable one is how blocks of code are delimited, which we discuss next.
More on Indentation
In Python, each line of code in a block must be indented by the same amount. This is in
contrast to many other languages (like R) where indentation is optional and curly brackets {}
are used to define blocks.
In Python, different indentation may produce different outcomes, and some indentation may result
in errors. For example:
mark = 30
if mark >= 50:
    print("pass")
    print("congrats!")
and
mark = 30
if mark >= 50:
    print("pass")
print("congrats!")
congrats!
provide different results. Why is this the case? Note that the line print("congrats!") is not part
of the control flow in the second example. Therefore, it will be executed no matter whether the
condition mark >= 50 is satisfied.
The code below will trigger the error message IndentationError: unexpected indent, as the
if statement should not be indented.
mark = 80
    if mark >= 50:
        print("pass")
Similarly, the code below results in the same error message, as the indentation should be the
same for both print statements.
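The original code chunk is not shown here; one example (our own sketch) that produces the same error is:
mark = 80
if mark >= 50:
    print("pass")
      print("congrats!")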
The code below results in the same error message as well, as the if and the else statement
should be at the same level.
if mark >= 50:
    print("pass")
        else:
            print("fail")
Again, the code below results in the same error message. Why is it the case?
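The original code chunk is not shown here; one illustrative sketch (our own assumption) that also produces the same error is:
mark = 80
if mark >= 50:
    print("pass")
print("done")
    print("congrats!")
The last print is indented even though the line before it does not open a new block, so Python again reports an unexpected indent.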
Indentation in R vs Python
In R the use of indentation is optional: you have the freedom to choose whether and how to
indent your code. Indentation is often used in R to make the code easier to read. If we
rewrite the examples that resulted in errors above using R syntax, they run fine in R:
## R code
mark = 80
if (mark >= 50)
print("pass")
[1] "pass"
## R code
if (mark >= 50)
print("pass")
[1] "pass"
## R code
if (mark >= 50) {
print("pass")
print("congrats")
}
[1] "pass"
[1] "congrats"
While the above R code is valid, it is not recommended to write code with non-standard
indentation, as it makes the code more difficult to understand and is likely to cause confusion.
Nested if
As in R, we can nest an if statement within an if statement. In Python we also have the elif
keyword, which is short for ‘else if’. The following two code chunks give the same result:
mark = 65
if mark >= 70:
    print("distinction")
elif mark >= 60:
    print("merit")
else:
    print("pass")
merit
if mark >= 70:
    print("distinction")
else:
    if mark >= 60:
        print("merit")
    else:
        print("pass")
merit
Iterations
There are two main types of iterations in Python, the for and while loops.
for Loop
for loops are used to iterate over a collection of objects. They have the basic form for element in collection: followed by an indented block of actions. For example:
for i in [1,2,3,4,5]:
    print(i)
1
2
3
4
5
Similar to R, it is possible to alter the flow of a for loop: continue skips the rest of the current
iteration, while break terminates the loop early. For example:
for i in [1,2,3,4,5]:
    if i < 2:
        continue
    # for i < 2 the code below will not be executed
    print(i)
    if i >= 4:
        break # the loop stops here
2
3
4
while Loop
for loops are useful if you know in advance the set of values that you want to iterate over. If
you do not, you can use a while loop, which has the basic form:
while condition:
    action
For example, it is more appropriate to use while loop for the following example:
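The original code chunk is not shown here; a minimal sketch (the prompt text and variable name are our own) is:
number = int(input("Enter a positive integer: "))
while number <= 0:
    number = int(input("Not a positive integer. Try again: "))
print("Thank you!")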
This is because we do not know in advance how many times we will need to ask before the user provides the required input.
Useful Links and Resources
• Guttag, J.V. (2013). Introduction to computation and programming using Python. Section
3.2.
• Official Python tutorial on control flow
Function and Scope in Python
** Note: The code chunks below should be run in the following order **
Writing a Function
A function can be defined with the following code:
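The chunk showing the general form is not shown here; schematically (with placeholder names of our own), a function definition looks like:
def function_name(arguments):
    '''
    explanation of what the function does (the docstring)
    '''
    statements
    return value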
The keyword def tells Python that the following block of code defines a function. Again,
indentation is important. The explanation at the beginning of the block (the docstring) tells users what the
function does. The return statement can be omitted if the function does not return any value.
def circle_area(radius):
    '''
    This function calculates the area of a circle given the radius
    input: radius
    output: area = pi * radius ^ 2
    '''
    pi = 3.14
    return (radius**2)*pi
We can call the function by using the name of the function with ():
circle_area(2)
12.56
In contrast to R, if you want to return a value from a function, you have to use the return
statement. If you write the function circle_area in the following way:
def circle_area(radius):
    '''
    This function calculates the area of a circle given the radius
    input: radius
    output: area = pi * radius ^ 2
    '''
    pi = 3.14
    (radius**2)*pi
When you call the function nothing is returned or, in other words, the value None is returned.
circle_area(2)
Scope
In Python, variables are only available within the region in which they are created. Such a region
is called a scope.
# Example 1
def add_a(x):
    x = x+a
    return x

a = 8
z = add_a(10)
print(z)
18
Does the above code work, despite a not being defined inside the function add_a()? Yes, it works because
a is global and can be used anywhere (including inside a function).
Press next to visualise the code and understand what is happening when we execute each line
of the code.
As we can see, we first have the function add_a() defined in the global frame. Running the line
a = 8 puts a in the global frame. When we call the function add_a(),
a local frame add_a is created. Note that it is created when the function add_a() is called, not
when it is defined. The object x is in this local frame with value 10, and it is then updated to 18
after executing the line x = x + a; a is available within the function because a is global, so it is
available everywhere. The function add_a hits the return statement with the return value 18. The
return value is then assigned to z in the global frame and the local frame add_a is discarded.
# Example 2
def add_b(x):
    b = 8
    x = x + b
    return x

z = add_b(10)
print(b)
If you run the code above, you will get the error NameError: name 'b' is not defined. Why
is it the case? Press next to visualise the code and understand what is happening when we
execute each line of the code.
Like the previous example, we first have the function add_b() defined in the global frame. When
we call the function add_b(), a local frame add_b is created. This time the
object b is in this local frame with value 8. The function add_b hits the return statement; the local
frame add_b is then discarded and so is the variable b. Therefore, we get an error when we try to
print(b) in the main body.
LEGB Rule
Previously, we saw that if we try to get a variable in the local frame and it is not defined
there, the global one is used if it is available. What if the same variable name is available in both
the local and the global scope? Consider the following code (Example 3):
# Example 3
def add_1(x):
    x = x + 1
    return x

x = 10
z = add_1(12)
print(z)
13
What is the value of z? What is the value of x? Let us look at the visualisation of the above code.
Please press next to see how each line works behind the scene.
As we can see, we first have the function add_1 defined in the global frame. Running the line
x = 10 puts x in the global frame. When we call the function add_1(), a local frame add_1 is
created. The x in this local frame with value 12 is different from the x
with value 10 in the global frame; they are two different objects. Therefore, when we do x = x + 1
in the function add_1, we see that x in the add_1 frame changes to 13 but x in the global
frame stays 10. The function add_1 hits the return statement with the return value 13. The
return value is then assigned to z in the global frame and the local frame add_1 (together with
the local x) is discarded.
The LEGB rule can be thought of as a name look up procedure. LEGB stands for:
• L: Local scope
• E: Enclosing scope, which exists for nested functions. The outer function is the enclosing
scope for the inner function (see the nested-function sketch after the examples below).
• G: Global scope
• B: Built-in scope, a scope that is available and automatically loaded whenever you run
Python. It contains names of built-in functions, keywords, exceptions, etc. Examples are:
o Keywords: if, def, etc.
o Functions: print, len, etc.
Python looks the name up sequentially in the local scope (“L”), the enclosing scope (“E”), the
global scope (“G”) and the built-in scope (“B”), starting from the scope in which the executed line
sits. If nothing is found after searching all four scopes, Python throws an error. We can use the LEGB
rule to explain Examples 1 to 3 above:
• Example 1: We search for the variable a when we are in the local scope. As the variable
name a is only available in the global scope, the a in the global scope is used.
• Example 2: We search for the variable b when we are in global scope. As the variable
name b is not available (b was in the local scope but it is not available in global scope
and was not retrievable after the function has returned), Python cannot find the variable
b.
• Example 3: The variable x is available in both the local and the global scope. Following the
LEGB rule, when running the line x = x + 1, we first search for x in the local scope.
As x is found in the local scope, the value of x in the global scope is not used inside
the function.
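The enclosing (“E”) scope is not used in Examples 1 to 3. As an additional sketch (our own example, not from the original page), consider a nested function:
def outer():
    y = 5             # y lives in the enclosing scope of inner()
    def inner():
        return y + 1  # y is not local to inner(), so Python finds it in the enclosing scope
    return inner()

print(outer())
6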
In R, a similar rule applies. See the same examples above implemented in R:
# Example 1
add_a <- function(x) {
x <- x+a
return (x)
}
a <- 8
z <- add_a(10)
print(z)
[1] 18
# Example 2
add_b <- function(x){
b <- 8
x <- x + b
return (x)
}
z <- add_b(10)
print(b)
Error in print(b): object 'b' not found
# Example 3
add_1 <- function(x) {
x <- x + 1
return (x)
}
x <- 10
z <- add_1(12)
print(z)
[1] 13
Functions With Mutable Objects
Consider the following code:
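The definition of add_one_number() is not shown here; a sketch consistent with the output below (the implementation details are our assumption) is:
def add_one_number(nums, new_num):
    nums.append(new_num)
    return nums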
nums = [1,3,2,4]
new_nums = add_one_number(nums, 5)
print(new_nums)
[1, 3, 2, 4, 5]
print(nums)
[1, 3, 2, 4, 5]
Given what you have learned about mutable objects, aliasing and scoping, are you surprised by
the result? If you are confused, take a look at the following visualisation:
When dealing with mutable objects in a function, you need to be careful about unintentional side
effects. For the example above, if you do not wish to change the original
list, the function should be written as:
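Again the original chunk is not shown; a sketch that avoids mutating the input (assuming the same function name) is:
def add_one_number(nums, new_num):
    new_nums = nums.copy()   # copy first so the original list is not modified
    new_nums.append(new_num)
    return new_nums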
nums = [1,3,2,4]
new_nums = add_one_number(nums, 5)
print(new_nums)
[1, 3, 2, 4, 5]
print(nums)
[1, 3, 2, 4]
Exceptions in Python
Until now, you have probably seen various error messages when writing Python programs.
Reading the error messages is often the first step to figuring out what is wrong in your code.
There are two main types of error:
• SyntaxError: When your syntax is not correct and Python cannot parse your program.
IndentationError is one example of SyntaxError. It occurs when you do not indent
blocks of code appropriately.
• Exceptions: Something else is wrong even though your syntax is correct. There are many
different types of exceptions; below we list some of the common ones:
o NameError: Attempt to retrieve a variable that is not defined in the local or global
scope.
o TypeError: Providing arguments with some inappropriate types.
o ValueError: Type of the argument is correct but the value is not.
o IndexError: When you try to access an element with the index that is out of
range.
There are many more types of exceptions; you can learn more from the official Python documentation on built-in exceptions.
The following blocks of code trigger different error messages in Python:
# SyntaxError
print "hello"
which gives the error message SyntaxError: Missing parentheses in call to 'print'.
Did you mean print("hello")?.
# IndentationError
def print_hello():
print("hello")
# NameError
hello
Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'hello'
is not defined
Detailed traceback:
File "<string>", line 1, in <module>
# TypeError
1+'2'
Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: unsupported o
perand type(s) for +: 'int' and 'str'
Detailed traceback:
File "<string>", line 1, in <module>
# ValueError
int('a')
Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: invalid lite
ral for int() with base 10: 'a'
Detailed traceback:
File "<string>", line 1, in <module>
# IndexError
num = [1,2,3]
num[3]
Error in py_call_impl(callable, dots$args, dots$keywords): IndexError: list index o
ut of range
Detailed traceback:
File "<string>", line 1, in <module>
Handling Exceptions
Until now, whenever we encountered an exception, we let the program crash and went back to the
code to track down the problem and fix it. This is not ideal and, in some situations, it is not even a
feasible strategy. For example, consider the following example:
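The original chunk is not shown here, but from the try version below it is the same calculation without any exception handling:
nomin_str = input("Enter a number: ")
nomin = int(nomin_str)
denom_str = input("Enter another number: ")
denom = int(denom_str)
num = nomin / denom
print(f"Dividing {nomin} by {denom} gives {num}.")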
As the input is from the users, we cannot guarantee that they will always provide a valid input. If
they accidentally give "a" as the input, then you will get the error ValueError: invalid
literal for int() with base 10: 'a' and the program will crash. Even if numbers are
provided as the input, if the second number provided is 0, then we will get a
ZeroDivisionError.
Using try-except to Handle Exceptions
A better approach is to handle the exception by the try and except blocks:
try:
    nomin_str = input("Enter a number: ")
    nomin = int(nomin_str)
    denom_str = input("Enter another number: ")
    denom = int(denom_str)
    num = nomin / denom
    print(f"Dividing {nomin} by {denom} gives {num}.")
except:
    print("Wrong input.")
If an exception is raised in the try clause during execution, the rest of the try clause is
skipped. The exception is handled by the except statement and the except clause is
executed. Now the program does not crash even if the user provides a wrong input; only the
printout "Wrong input." is given.
We can use separate except clauses to handle different types of exceptions:
try:
    nomin_str = input("Enter an integer: ")
    nomin = int(nomin_str)
    denom_str = input("Enter another integer: ")
    denom = int(denom_str)
    num = nomin / denom
    print(f"Dividing {nomin} by {denom} gives {num}.")
except ValueError:
    print("Input provided cannot be converted into a number.")
except ZeroDivisionError:
    print("Zero cannot be the denominator.")
except:
    print("Some other error.")
The ValueError clause handles the case when users provide non-numerical input, and the
ZeroDivisionError clause handles the case when the second input is zero. The last except handles all other
possible exceptions. Try to run the code above with different inputs and see what printout you
get.
We can also use the else and finally clauses together with try and except: the else clause runs only when no exception is raised in the try clause, and the finally clause always runs, whether or not an exception occurred. For example:
count = 0
while True:
    try:
        nomin_str = input("Enter an integer: ")
        nomin = int(nomin_str)
        denom_str = input("Enter another integer: ")
        denom = int(denom_str)
        num = nomin / denom
    except ValueError:
        print("Input provided cannot be converted into an integer.")
    except ZeroDivisionError:
        print("Zero cannot be the denominator.")
    except:
        print("Some other error.")
    else:
        print(f"Dividing {nomin} by {denom} gives {num}.")
        break
    finally:
        count += 1
        print(f"Attempt {count}.")
Raise Exceptions
Above we have shown how to stop exceptions from crashing the program by catching them. Sometimes we
instead want to raise our own exceptions, to give more precise information about what went wrong.
Consider the following code:
try:
    nomin_str = input("Enter an integer: ")
    nomin = int(nomin_str)
    denom_str = input("Enter another integer: ")
    denom = int(denom_str)
    num = nomin / denom
    print(f"Dividing {nomin} by {denom} gives {num}.")
except ValueError:
    raise ValueError("Input provided cannot be converted into a number.")
except ZeroDivisionError:
    raise ZeroDivisionError("Zero cannot be the denominator.")
except:
    raise ValueError("Wrong input.")
If you give "a" as the first input, you will get the error ValueError: Input provided cannot
be converted into a number. This error message gives additional information to the generic
error message ValueError: invalid literal for int() with base 10: 'a'.
Debugging
There are several tools that can help us find and fix bugs in Python programs:
• Spyder
• Python Tutor
• Putting print() where you suspect there may be a bug in the program
Below we illustrate how to use Python Tutor and the print() function for debugging. For
Spyder, we will show it in the live session. Let us consider the following code, where the function
get_sorted_numbers() aims to return a sorted list. The list returned from the function does not
seem to be sorted:
def get_sorted_numbers(nums):
    '''
    input: a list of numbers
    output: a list of sorted numbers
    '''
    sorted(nums)
    return nums
nums = [1,3,2,1]
sorted_num = get_sorted_numbers(nums)
print(sorted_num)
[1, 3, 2, 1]
What is wrong? Let us try to examine the code using Python Tutor:
We can see visually from Python Tutor that the function sorted() does not change the order of
the list nums. Such visualisation is very helpful for understanding each line of the code
and seeing whether any of the lines does not work as we expected.
Next, we use the print() function to see if nums is actually sorted by the function sorted(). We
put the print() function before and after the line sorted(nums) to see if the line has actually
changed the list nums:
def get_sorted_numbers(nums):
    '''
    input: a list of numbers
    output: a list of sorted numbers
    '''
    print("before sort", nums)
    sorted(nums)
    print("after sort", nums)
    return nums
nums = [1,3,2,1]
sorted_num = get_sorted_numbers(nums)
before sort [1, 3, 2, 1]
after sort [1, 3, 2, 1]
print(sorted_num)
[1, 3, 2, 1]
We can see that the list nums is the same before and after sorted(), showing us that something
must be wrong with the line sorted(nums): sorted() returns a new sorted list and does not modify
its argument. Now we can proceed and correct the problem using something like:
def get_sorted_numbers(nums):
    '''
    input: a list of numbers
    output: a list of sorted numbers
    '''
    return sorted(nums)
nums = [1,3,2,1]
sorted_num = get_sorted_numbers(nums)
print(sorted_num)
[1, 1, 2, 3]
Assertions
Previously in “Exceptions, Error Handling and Debugging in R”, we talked about the “defensive
programming” strategy. One way to implement this strategy is to use the assert statement to
check whether the program is running as expected; if it is not, an AssertionError exception is raised.
Assertions can be used to:
• Halt the program as soon as some unexpected condition occurs. This makes it easier to
discover and locate bugs.
• Check that an output is appropriate before returning it, so that a bad value does not cause
unexpected consequences downstream.
The following function aims to sort the list of numbers:
def get_sorted_numbers(nums):
    sorted(nums)
    assert len(nums) < 2 or all([nums[i] <= nums[i+1] for i in range(len(nums)-1)]), \
        "The list " + str(nums) + " is not sorted"
    return nums
The assert statement checks that each number is larger than or equal to the previous one. If
the condition is not satisfied, an AssertionError is raised and the error message The list
... is not sorted is shown. This tells us that there must be a bug in this function.
** Note: The code chunks below should be run in the following order **
Classes
In the previous sections, we defined our own functions. Here we will define our own type of
objects by creating a new class. Classes allow us to bundle data and functionality together
(encapsulation). To motivate this, consider representing a point in two dimensions; we first take a
function-based approach:
def dist_from_origin(point):
    '''
    return the distance of a point from the origin.
    point should be in the form of (x, y), where x and y are numbers.
    '''
    assert len(point) == 2, "points must be 2 dimensional"
    try:
        return (point[0] ** 2 + point[1] ** 2) ** (1/2)
    except:
        raise ValueError("Input is not in the form of (x, y), where x and y are numbers.")
The above code chunks only define the functions required for the calculations. We have to call
the functions to execute them, for example:
point = (3,4)
dist_from_origin(point)
5.0
With the implementation above, the calculations and the data are separated. Also note that objects
that are not points can also be passed to the functions, although an error is then likely to
be raised. This implementation is in the “style” of functional programming, in which we solve a
problem with functions:
point_1 -> [dist_from_origin] -> distance
Also, the function does not have any side effects (e.g. it does not mutate the input, modify global
variables or print anything), and the output of the function depends only on the input.
This is very similar to how we solve the mathematical problem \(y = f(x)\). For the example above,
we have x = (3,4) (a point), f = dist_from_origin (the function to calculate the distance from
the origin) and y = 5 (the distance).
Now let us solve the same problem with an object-oriented approach, by defining a class:
class Point:
    '''
    The Point class represents a point in 2 dimensions.
    '''
    def __init__(self, x, y):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            self.x = x
            self.y = y
        else:
            raise ValueError('x and y must be numbers')

    def dist_from_origin(self):
        '''
        return the distance of the point from the origin.
        '''
        return (self.x ** 2 + self.y ** 2) ** (1/2)
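The original class also defines further methods (dist(), is_same() and mid_point(), as can be seen in the tab-completion listing further below); those chunks are not shown here, so the bodies below are only our sketch of what they might look like:
    def dist(self, other):
        '''
        return the distance between this point and another Point
        '''
        return ((self.x - other.x) ** 2 + (self.y - other.y) ** 2) ** (1/2)

    def is_same(self, other):
        '''
        return True if this point has the same coordinates as another Point
        '''
        return self.x == other.x and self.y == other.y

    def mid_point(self, other):
        '''
        return the midpoint between this point and another Point, as a new Point
        '''
        return Point((self.x + other.x) / 2, (self.y + other.y) / 2)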
The class Point defined above bundles the data (x, y coordinates) and the functionalities
(calculate the distance between two points, find the midpoint, etc.) of points together:
• Data: self.x and self.y store the x and y coordinate information of a given point. The
values are assigned in the __init__() method.
• Functionalities: four different methods are defined. The method dist_from_origin(),
for example, calculates the distance of the point from the origin, using the point data
stored in self.x and self.y.
The code in the above code chunk is written in an object-oriented programming (OOP) style, in
which we solve the problem by creating classes. We can now create an instance of the class:
pt = Point(3,4)
This creates an object of type Point representing a point with x and y coordinates 3 and 4.
What does Python do when we call pt = Point(3,4)? Python first creates an instance of the
class Point (which is self) and the special initialising function __init__() is called automatically
when an object of the class is created. __init__() allows the class to initialise the attributes of
the instance self, and in our case we assign the value 3 to self.x and 4 to self.y. The
instance of Point created is then bound to the variable name pt. As the instance of Point has
the attributes x and y, we can retrieve them by using the dot notation:
print(pt.x)
3
print(pt.y)
4
Calling the Methods in Point
Methods are functions of a class. Every method (including __init__()) requires self to be the
first argument. To use a method, we again use the dot notation. For example:
pt.dist_from_origin()
5.0
Note that while the method dist_from_origin() requires an argument self, we do not need to
provide such an argument when calling it, as self is provided automatically. In terms
of the result, calling pt.dist_from_origin() is the same as calling:
Point.dist_from_origin(pt)
5.0
And pt is given as self for the method. Note that you should always use object.method()
(e.g. pt.dist_from_origin()) instead of Class.method(object)
(e.g. Point.dist_from_origin(pt)), although they should in principle provide the same result.
Encapsulation
Encapsulation With Classes
Encapsulation is a concept of bundling data and methods together and restricting the direct
access to data of the object. We have seen that the use of classes allows us to bundle data and
methods together. We now show how we can restrict the direct access of the data with the
example below:
class BankAccount:
    '''
    BankAccount represents the bank account, with methods to deposit and withdraw money from it
    '''
    def __init__(self, balance = 0):
        assert balance >= 0, "balance has to be non-negative."
        self._balance = balance

    def get_balance(self):
        '''
        Return current balance
        '''
        return self._balance
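The deposit() and withdraw() methods used further below are not shown in the chunk above; a sketch consistent with the behaviour illustrated later (the exact implementation is our assumption) is:
    def deposit(self, amount):
        '''
        Deposit amount into the account and return the amount deposited
        '''
        assert amount > 0, "deposit amount has to be positive."
        self._balance += amount
        return amount

    def withdraw(self, amount):
        '''
        Withdraw amount from the account if the balance allows it
        '''
        if amount > self._balance:
            print("Withdraw fail: withdraw amount is more than the balance.")
            return 0
        self._balance -= amount
        return amount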
One thing you may have noticed is that we named the attribute for the account balance
_balance. A variable name with a leading underscore (_) indicates that the variable is intended for
internal use. Pressing tab after typing object. in a Python terminal lists all the methods and
data that are intended to be visible to users. For example, for the Point object pt, we can see all the methods
and data that we have defined:
>>> pt = Point(3,4)
>>> pt.
pt.dist( pt.dist_from_origin( pt.is_same( pt.mid_point(
pt.x pt.y
account_a = BankAccount(100)
account_a.deposit(20)
20
print(account_a.get_balance())
120
money = account_a.withdraw(200)
Withdraw fail: withdraw amount is more than the balance.
print(account_a.get_balance())
120
money = account_a.withdraw(50)
print(account_a.get_balance())
70
Through encapsulation and data hiding with classes, we prevent users from manipulating the data in
arbitrary ways and ensure data integrity.
One thing to notice is that while _balance is marked for internal use, Python still allows you to
change the value:
account_a._balance = -100
print(account_a.get_balance())
-100
You should not do this (even though you are able to).
Functions
If we want to rewrite the solution with functions instead of creating a new class, we can do it in
this way:
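The function definitions are not shown in the original chunk; a sketch consistent with the output below is:
def deposit(balance, amount):
    assert amount > 0, "deposit amount has to be positive."
    return balance + amount

def withdraw(balance, amount):
    if amount > balance:
        print("Withdraw fail: withdraw amount is more than the balance.")
        return balance
    return balance - amount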
balance = 100
balance = deposit(balance, 20)
print(balance)
120
balance = withdraw(balance, 200)
Withdraw fail: withdraw amount is more than the balance.
print(balance)
120
balance = withdraw(balance, 50)
print(balance)
70
Note that the data can be directly accessed, and functions and data are separated.
Suppose now that we also want to represent a bank account that allows an overdraft up to some limit. With classes, we can define a new child class based on BankAccount:
class OverdraftAccount(BankAccount):
    '''
    OverdraftAccount represents a bank account with an overdraft limit
    '''
    def __init__(self, balance, overdraft_limit):
        assert overdraft_limit > 0, "overdraft limit has to be positive."
        assert balance > -overdraft_limit, "balance exceeds overdraft limit"
        self._balance = balance
        self._overdraft_limit = overdraft_limit

    def get_overdraft_limit(self):
        '''
        Return the overdraft limit
        '''
        return self._overdraft_limit
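The redefined withdraw() method discussed below is not shown in the chunk above; a sketch of what it might look like (our assumption, consistent with the overdraft behaviour illustrated later) is:
    def withdraw(self, amount):
        '''
        Withdraw amount as long as the resulting balance stays within the overdraft limit
        '''
        if self._balance - amount < -self._overdraft_limit:
            print("Withdraw fail: withdraw amount exceeds the overdraft limit.")
            return 0
        self._balance -= amount
        return amount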
The first line of the code tells Python to define a new class OverdraftAccount based on
BankAccount. It inherits all the methods from BankAccount, which is why we do not need to
define deposit() or get_balance().
Note that we do redefine some methods that are already available in BankAccount, like withdraw().
This is because we want the withdraw() method of OverdraftAccount to behave differently from
the withdraw() method of BankAccount, to reflect the fact that an OverdraftAccount allows a negative
balance as long as it is within the overdraft limit. This is called polymorphism: a single
interface is provided to entities of different types.
Let us rerun the code above except we change the account type from BankAccount to
OverdraftAccount:
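The rerun chunk and its output are not shown in the original; an illustrative sketch (using the withdraw() method sketched above, so the exact behaviour is an assumption) is:
account_b = OverdraftAccount(100, 300)
account_b.deposit(20)
print(account_b.get_balance())    # 120
money = account_b.withdraw(200)
print(account_b.get_balance())    # -80, allowed because of the overdraft limit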
Functions
For the function approach, we modify the function withdraw() in order to accommodate an
overdraft limit.
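The modified function is not shown in the original chunk; a sketch consistent with the calls and output below is:
def withdraw(balance, amount, overdraft_limit = 0):
    if balance - amount < -overdraft_limit:
        print("Withdraw fail: withdraw amount exceeds the overdraft limit.")
        return balance
    return balance - amount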
balance = 100
overdraft_amount = 300
balance = deposit(balance, 20)
print(balance)
120
balance = withdraw(balance, 200, overdraft_amount)
print(balance)
-80
balance = withdraw(balance, 50, overdraft_amount)
print(balance)
-130
When using classes for this problem, we define a new child class OverdraftAccount to
accommodate the other type of bank account, the one with an overdraft limit; we do not need to change
the original implementation of BankAccount. With functions, on the other hand, we do not
write a new function, but we do need to modify the original implementation of withdraw().
Programming Paradigms
A programming paradigm is a way or style of programming. The programs we have written so far can
be described as:
• Procedural: Programs are lists of instructions that tell the computer what to do with the
program’s input.
• Functional: Programming decomposes a problem into a set of functions, and avoids
changes of state as much as possible
• Object oriented: Programming decomposes a problem into a set of objects.
All programs we have written before this week were mostly procedural with functional style. We
solve the problems by writing functions or a list of procedures.
The code we have written so far in Python is not strictly functional. In functional programming,
functions should only take inputs and produce outputs; they should not have any side effects (like
calling print() or mutating a list), and they should not have any internal state that affects the output
produced for a given input. While we have highlighted the dangers of mutating the input and of using
global variables in a function, we did not prohibit them. Also, we often have some
internal state in a function (e.g. x = (point_1[0] + point_2[0]) / 2 in the function
mid_point()). Nevertheless, we have written our code in a functional style, in the sense that
we solve the problem by a flow of functions. For the rest of these notes, the term functional
programming refers to the style of decomposing a problem into a series of functions, avoiding
side effects where possible.
This week we introduced how to write our own classes, so we can also solve problems by writing
different objects. Above we have seen that in Python we can solve the same problem with a
more procedural/functional approach or a more object-oriented approach. Python is
a multi-paradigm programming language and supports all three programming paradigms
mentioned above.
Which paradigm to use depends on the situation:
• How the problem can be naturally represented: If the problem can be easily/naturally
decomposed into objects, then object-oriented programming (OOP) is likely to be
more suitable. If the problem can be easily/naturally decomposed into functions, then a
functional programming style is likely to be more suitable.
• Readiness to use: OOP tends to require more code that takes longer to write compared
to functional programming style. So if you want something fast, functional programming
style may be more suitable
• Future updates: If you foresee having new types of objects with the same interface
(e.g. more types of bank accounts in the future, all with the same withdraw() and
deposit() interface), then OOP is more suitable, as all you need to do is create a new child
class. If you foresee having the same type of objects but new functionalities
(e.g. data exploration when you are unsure what analysis to do on a piece of data), then a
functional programming style is more suitable, as all you need to do is create a new function.
• Data hiding: It is easier to do data hiding in OOP than functional programming style.
Remember different programming paradigms are different tools for you to solve a problem. You
do not have to stick to one paradigm, and it is likely that you will use a mixture of paradigms for a
program.
Object-Oriented Programming in R
R provides multiple systems for OOP (e.g. S3, S4, R6). S3 and S4 are provided by base R, while
R6 is provided by the R6 R package. Generally in R, functional programming is much more
important than OOP because it is easier to solve complex problems by breaking them down into
simpler functions than simple objects. See Chapter 10 of the Advanced R textbook for more
details.
The example below illustrates how polymorphism is provided by the default S3 OOP system in R:
x <- rnorm(100)
y <- x + rnorm(100)
mod <- lm(y~x) # fit a linear model
print(class(x))
[1] "numeric"
print(class(mod))
[1] "lm"
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.290494 -0.734283 -0.008068 -0.050195 0.743667 2.429998
summary(mod)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.3788 -0.7515 -0.0380 0.8287 2.3064
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.007164 0.106501 0.067 0.947
x 1.006858 0.105160 9.574 1.02e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.064 on 98 degrees of freedom
Multiple R-squared: 0.4833, Adjusted R-squared: 0.478
F-statistic: 91.67 on 1 and 98 DF, p-value: 1.015e-15
Note here the function summary() behaves differently depending on the type of objects passed
to it.
Data Wrangling
Data wrangling (also known as data munging or data carpentry) refers to the design,
implementation, and execution of processes that take us from raw, typically unstructured,
data to data that are more appropriate for subsequent data-analytic processes. We have
already seen an instance of data wrangling in “Structured, Semi-Structured and Unstructured
Data,” where we used the tm R package to convert unstructured text to a document-term
matrix, which is a structured data format that can be used for topic modelling.
When wrangling data, it is important to:
• Think carefully about, and design, the data wrangling processes that are appropriate for what
you need to do;
• Implement the processes in a way that is reproducible, reusable and shareable.
Investing time to implement and share the data wrangling processes using open-source tools
(e.g. git), technologies (e.g. Markdown and R Markdown), and programming languages (like
Python and R) will get you a long way in the above directions.
The article “Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific
Contributions” by Benureau, F. C. Y and Rougier, N. P (Benureau & Rougier, 2018) is a
highly-recommended read about reproducibility and reusability of code.
The main data wrangling activities involve at least a combination of the following:
• Discovering patterns in data: For example, identifying correlations and patterns
through basic data analysis and visualizations
• Structuring data: For example, subsetting, merging, re-ordering, transforming,
reshaping, etc
• Cleaning and validating data: For example, identifying missing or misrecorded data
that can impact the accuracy of subsequent data-analytic processes
• Enriching data: For example, thinking of what other data sources should be used to
maximize the value of the data
You have already learned how to go about most of those operations in previous weeks, and
you will learn more in the upcoming weeks. It is important to know that during data
wrangling activities most often ad-hoc decisions need to be taken! It is good practice to
record what decisions have been taken, or even “parameterize” those decisions in order to
come back to them, for example to test how they impact the outcome of an analysis.
For example, the data in heart_rates.csv is a runner’s heart rates as recorded by a smart
watch during jogging.
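The chunk that reads the data is not shown here; a sketch (assuming the file has columns time and heart_rate, with time converted to a date-time) is:
heart_rates <- read.csv("heart_rates.csv")
heart_rates$time <- as.POSIXct(heart_rates$time)
str(heart_rates)
head(heart_rates)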
From a quick look using str() and head() we see that heart_rates is a data frame with
times and the corresponding heart rate recordings. However, a closer look reveals some
issues:
summary(heart_rates)
time heart_rate
Min. :2013-06-08 08:04:42 Min. : 7.0
1st Qu.:2013-06-08 08:09:38 1st Qu.:153.0
Median :2013-06-08 08:14:35 Median :156.0
Mean :2013-06-08 08:14:34 Mean :151.5
3rd Qu.:2013-06-08 08:19:31 3rd Qu.:158.0
Max. :2013-06-08 08:24:27 Max. :266.0
NA's :501
There are 501 missing heart rate values, which may be because the heart rate monitor could
not take a measurement at those particular timestamps. A more serious issue, though, is that the
minimum recorded heart rate is 7, which seems a bit low for someone jogging, and the maximum
recorded heart rate is 266, which is a bit high for a human! A plot
of the heart rates versus time reveals some issues.
plot(heart_rates)
First, there seems to be a clear evolution of the heart rates for about 10 minutes from around
70 to 180, which seems normal. But the clear evolution stops abruptly after minute 15,
possibly because the heart rate monitor fell off. There also appear to be some noisy heart rate
observations during the workout, which may point to an issue with the device.
Depending on what we want to do with this data, using it as is in statistical modelling may
lead to misleading conclusions. We can alleviate some of those problems by writing a short
function that removes “implausible” observations. For example, the below function replaces
any heart rate outside the interval (min_hr, max_hr) with NA.
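The function itself is not shown in the original chunk; a minimal sketch of such a function (the name trim matches its use below; the default limits are our assumption) is:
trim <- function(data, min_hr = -Inf, max_hr = Inf) {
  hr <- data$heart_rate
  # which() ignores existing NAs, so only observed out-of-range values are replaced
  data$heart_rate[which(hr < min_hr | hr > max_hr)] <- NA
  data
}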
So, we can now set heart rates that are less than 70 or greater than 200 to NA as implausible, before
passing the data to further analysis and modelling.
hr <- trim(heart_rates, min_hr = 70, max_hr = 200)
plot(hr)
In addition to that, after we complete the analysis and modelling, we can come back and
adjust the values of min_hr and max_hr and repeat the analyses to see how our decisions
regarding the data impact our conclusions!
In the following weeks we will discuss how missing values can be handled also as part of
machine learning pipelines.
References
Benureau, F. C. Y., & Rougier, N. P. (2018). Re-run, repeat, reproduce, reuse, replicate:
Transforming code into scientific contributions. Frontiers in Neuroinformatics, 11, 69.
Data Wrangling Operations in R
Introduction
In this page, we will focus on some R packages and functionality that are extremely useful
during data wrangling processes. In particular, we will introduce the *apply family of functions,
reshaping between wide and long formats with the reshape2 package, stacking, subsetting, and the dplyr verbs.
*apply Functions
The *apply() family is a collection of functions in R that is used to manipulate slices of data
from matrices, arrays, lists and data frames in a repetitive way without explicitly writing
loops. These functions apply a function to each element of the selected input data. Below, we
focus on the following *apply() functions: apply(), lapply(), sapply() and mapply().
Please check the help pages of each of these functions for a complete description of their
arguments.
apply()
The apply() function takes a data frame or matrix as input and returns an output of vector,
list or array type. The apply() function is mainly used to avoid explicit uses of loops.
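The code chunk referred to below is not shown here; a sketch consistent with the summary output shown later (my_matrix is a 5 x 4 matrix containing the numbers 1 to 20) is:
my_matrix <- matrix(1:20, nrow = 5, ncol = 4)
# sum of each row
apply(my_matrix, MARGIN = 1, FUN = sum)
[1] 34 38 42 46 50
# sum of each column
apply(my_matrix, MARGIN = 2, FUN = sum)
[1] 15 40 65 90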
In the above code chunk, we use the apply() function to calculate the sum of each row of the
matrix by setting MARGIN = 1, and the sum of each column of the matrix by setting
MARGIN = 2.
In both examples, we applied the sum() function to the margins. However, we could have
used any other function, even user-defined ones, as long as they can work with the inputs
specified by the margin we use. For example, if we want to get basic summaries of each of
the columns we can do
apply(my_matrix, 2, summary)
[,1] [,2] [,3] [,4]
Min. 1 6 11 16
1st Qu. 2 7 12 17
Median 3 8 13 18
Mean 3 8 13 18
3rd Qu. 4 9 14 19
Max. 5 10 15 20
As we would expect from the summary() function applied to a vector, the third column of the
output above has
min(my_matrix[, 3])
[1] 11
quantile(my_matrix[, 3], 0.25)
25%
12
median(my_matrix[, 3])
[1] 13
mean(my_matrix[, 3])
[1] 13
quantile(my_matrix[, 3], 0.75)
75%
14
max(my_matrix[, 3])
[1] 15
lapply()
The function lapply() takes a vector or list X and applies the function FUN to each of its
elements. Then, lapply() will output a list which is of the same length as X, where each
element is the outcome of applying the function FUN on the corresponding element of X.
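The chunk that creates Name_list is not shown here; a sketch consistent with the output below is:
Name_list <- list(c("John", "Jane", "Tim", "Michael", "Emma"),
                  c("Doe", "Smith", "Williams", "Taylor", "Wilson"))
Name_list
[[1]]
[1] "John"    "Jane"    "Tim"     "Michael" "Emma"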
[[2]]
[1] "Doe" "Smith" "Williams" "Taylor" "Wilson"
# Convert to lower case
Names_lower <- lapply(X = Name_list, FUN = tolower)
Names_lower
[[1]]
[1] "john" "jane" "tim" "michael" "emma"
[[2]]
[1] "doe" "smith" "williams" "taylor" "wilson"
str(Names_lower)
List of 2
$ : chr [1:5] "john" "jane" "tim" "michael" ...
$ : chr [1:5] "doe" "smith" "williams" "taylor" ...
In the above code chunk, we worked with a list of strings containing the first and last name of
five people. We used lapply() together with the tolower() function to convert the names
to lower case.
sapply()
The function sapply() works in the same way as lapply() but simplifies (hence the s) the
output to return a vector or a matrix.
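The chunk that defines raise_power() and numbers is not shown here; a sketch consistent with the description and output below is:
raise_power <- function(x, power) {
  x^power
}
numbers <- list(1:5, 6:10)
numbers
[[1]]
[1] 1 2 3 4 5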
[[2]]
[1] 6 7 8 9 10
# Using the sapply() function to raise each element of the numbers vector
to the power of 3
sapply(X = numbers, FUN = raise_power, power = 3)
[,1] [,2]
[1,] 1 216
[2,] 8 343
[3,] 27 512
[4,] 64 729
[5,] 125 1000
In the above example, we created a function called raise_power() that takes the arguments
x and power and returns x raised to power.
We then created a list called numbers, which contains two vectors, and used sapply() to
apply the raise_power() function to each element of numbers. We set power = 3 just after
we have specified X = numbers and FUN = raise_power. The output is a \(5 \times 2\)
matrix, where each column represents the elements of the first and second vectors,
respectively, raised to the power of three.
mapply()
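The mapply() function is a multivariate version of sapply(). The example chunk itself is not shown here; a sketch consistent with the description below is:
# create a 3 x 3 matrix using rep() three times
mat <- matrix(c(rep(1, 3), rep(2, 3), rep(3, 3)), nrow = 3)
# create the same matrix using mapply()
mat1 <- mapply(rep, 1:3, 3)
mat1
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    1    2    3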
In this example, we create a matrix mat using the rep() function three times. We then create the
same matrix, mat1, by using mapply(). Basically, mapply() applies the rep() function to a
vector containing the numbers 1 to 3, with the additional argument 3 specifying how many times
each number should be repeated. Another example is:
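(The exact call for this second example is not shown; one call consistent with the output below is mapply(rep, 1:3, (1:3)^2), which repeats 1, 2 and 3 a different number of times and therefore returns a list.)
mapply(rep, 1:3, (1:3)^2)
[[1]]
[1] 1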
[[2]]
[1] 2 2 2 2
[[3]]
[1] 3 3 3 3 3 3 3 3 3
The reshape2 package makes it easy to transform data between wide and long formats.
As you may have already figured out, you may need wide-format data for some types of
analyses and long-format data for others. For example, it is easy to work with ggplot2 (we
will discuss ggplot2 graphics later) using long-format data. Also, most statistical modelling
functions, like lm(), glm() and gam(), work well with long-format data. However, many
people often find wide-format data easier and more intuitive for recording data. Therefore, it
is important to be able to work with both formats and to transform data from and into each of
them.
Wide to Long
To illustrate how to use the melt() function in the reshape2 package, we will consider the
built-in R dataset airquality. This data set consists of daily air quality measurements in
New York, from May to September 1973. This data is in wide format because for each data
point, we have a column for each variable. In reshape2, to convert wide-format data to long-
format data, we use the melt() function:
library("reshape2")
# print out the first few records of the airquality data frame
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
# use the melt function to convert the data into long format
airquality_long <- melt(airquality)
No id variables; using all as measure variables
# print out the first few records of the new airquality_long data frame
head(airquality_long)
variable value
1 Ozone 41
2 Ozone 36
3 Ozone 12
4 Ozone 18
5 Ozone NA
6 Ozone 28
# print out the last few records of the new airquality_long data frame
tail(airquality_long)
variable value
913 Day 25
914 Day 26
915 Day 27
916 Day 28
917 Day 29
918 Day 30
We see that our data are now in long format: we have one column named variable, which
records the name of the variable, and a column named value, which records the value of that
variable. By default, the melt() function assumes that all columns with numeric
values are variables with values. Often, however, we want to keep some identifying columns
as they are. We can specify these with the id.vars argument, which takes the
ID variables, i.e. the variables that identify individual rows of the data. Suppose we want
to have the month and the day as ID variables:
# use the melt function to convert the data into long format with
specifying ID variables Month and Day
airquality_long <- melt(airquality, id.vars = c('Month', 'Day'))
head(airquality_long)
Month Day variable value
1 5 1 Ozone 41
2 5 2 Ozone 36
3 5 3 Ozone 12
4 5 4 Ozone 18
5 5 5 Ozone NA
6 5 6 Ozone 28
We can see that by passing the Month and Day variables to the id.vars argument, we
kept Month and Day as columns and the rest of the data was converted into long format. We
may also wish to control the column names in the long format by passing the desired names
for the variable and value columns to the variable.name and value.name arguments,
respectively:
# use the melt function to convert the data into long format with
specifying ID variables Month and Day
airquality_long <- melt(airquality, id.vars = c('Month', 'Day'),
variable.name = 'climate_var',
value.name = 'climate_value')
head(airquality_long)
Month Day climate_var climate_value
1 5 1 Ozone 41
2 5 2 Ozone 36
3 5 3 Ozone 12
4 5 4 Ozone 18
5 5 5 Ozone NA
6 5 6 Ozone 28
tail(airquality_long)
Month Day climate_var climate_value
607 9 25 Temp 63
608 9 26 Temp 70
609 9 27 Temp 77
610 9 28 Temp 75
611 9 29 Temp 76
612 9 30 Temp 68
Long to Wide
Going from long to wide format can take a bit more thought than going from wide to long
format. In reshape2, there are multiple cast functions. To work with data frames, we use
the dcast() function; there is also the acast() function, which returns a vector, matrix or array.
To illustrate how to use the dcast() function, we work with the airquality_long data
frame we created earlier.
The dcast() function uses a formula to describe the shape of the data. In the formula, the ID
variables go on the left-hand side and the column that describes the measured variables goes on
the right-hand side. By default, dcast() uses the last remaining column as the column that contains
the values, but we can also specify it explicitly via the value.var argument. We can print
the result and see that we recover the original airquality data frame (but with the
columns in a different order):
# use the melt function to convert the data into long format with
specifying ID variables Month and Day
airquality_long <- melt(airquality, id.vars = c('Month', 'Day'),
variable.name = 'climate_var',
value.name = 'climate_value')
head(airquality_long)
Month Day climate_var climate_value
1 5 1 Ozone 41
2 5 2 Ozone 36
3 5 3 Ozone 12
4 5 4 Ozone 18
5 5 5 Ozone NA
6 5 6 Ozone 28
# use the dcast function to convert the data back into wide format
airquality_wide <- dcast(airquality_long,
formula = Month + Day ~ climate_var,
value.var = 'climate_value')
head(airquality_wide)
Month Day Ozone Solar.R Wind Temp
1 5 1 41 190 7.4 67
2 5 2 36 118 8.0 72
3 5 3 12 149 12.6 74
4 5 4 18 313 11.5 62
5 5 5 NA NA 14.3 56
6 5 6 28 NA 14.9 66
A common problem arises when you cast a data set in which there is more than one
value per cell. For instance, consider only using Month as an ID variable:
# use the melt function to convert the data into long format with
specifying ID variables Month and Day
airquality_long <- melt(airquality, id.vars = c('Month', 'Day'),
variable.name = 'climate_var',
value.name = 'climate_value')
head(airquality_long)
Month Day climate_var climate_value
1 5 1 Ozone 41
2 5 2 Ozone 36
3 5 3 Ozone 12
4 5 4 Ozone 18
5 5 5 Ozone NA
6 5 6 Ozone 28
# use the dcast function to convert the data back into wide format
airquality_wide <- dcast(airquality_long,
formula = Month ~ climate_var,
value.var = 'climate_value')
Aggregation function missing: defaulting to length
head(airquality_wide)
Month Ozone Solar.R Wind Temp
1 5 31 31 31 31
2 6 30 30 30 30
3 7 31 31 31 31
4 8 31 31 31 31
5 9 30 30 30 30
Here we got the warning: “Aggregation function missing: defaulting to length.” By only
using Month as the ID variable, we can see here that the cells are now filled with the number
of data rows for each month. This is because we did not pass an aggregation function into the
fun.aggregate argument which tells R how to aggregate the data.
If you cast your data and there are multiple values per cell, you will also need to tell dcast()
how to aggregate the data. Depending on the aim of the analysis, we can use a range of scalar
summaries, like the mean(), median(), sum(), etc. Here, we take the mean using the mean()
function and we also use na.rm = TRUE to remove the NA values:
# use the melt function to convert the data into long format with
specifying ID variables Month and Day
airquality_long <- melt(airquality, id.vars = c('Month', 'Day'),
variable.name = 'climate_var',
value.name = 'climate_value')
head(airquality_long)
Month Day climate_var climate_value
1 5 1 Ozone 41
2 5 2 Ozone 36
3 5 3 Ozone 12
4 5 4 Ozone 18
5 5 5 Ozone NA
6 5 6 Ozone 28
# use the dcast function to convert the data back into wide format
airquality_wide <- dcast(airquality_long,
formula = Month ~ climate_var,
value.var = 'climate_value',
fun.aggregate = mean,
na.rm = TRUE)
head(airquality_wide)
Month Ozone Solar.R Wind Temp
1 5 23.61538 181.2963 11.622581 65.54839
2 6 29.44444 190.1667 10.266667 79.10000
3 7 59.11538 216.4839 8.941935 83.90323
4 8 59.96154 171.8571 8.793548 83.96774
5 9 31.44828 167.4333 10.180000 76.90000
We see here that we now obtained the mean value of the variables for each month.
tidyr Package
There is also another popular package for data reshaping and, more generally, data tidying,
called tidyr. tidyr provides functionality for reshaping data between long and wide formats
(e.g. pivot_longer() and pivot_wider()), for splitting and combining character columns
(e.g. separate() and unite()), and for handling missing values (e.g. drop_na(), fill(), replace_na()).
The Useful Links and Resources section below provides some links covering the basic functionality of
tidyr.
Stacking
The stack() and unstack() functions in R are handy when working with data frames.
Applying stack() to a data frame simply stacks the column vectors on top of each other to
form a single vector, along with a factor indicating which column each observation came from. As
expected, the unstack() function does the complete opposite.
Below is a demonstration using the built-in PlantGrowth dataset. The dataset contains the
weights of 30 plants from 3 different groups, namely ctrl (control), trt1 (treatment 1) and
trt2 (treatment 2). Because of the way stack() works, we first convert the group variable
from factor to character to avoid getting a warning message.
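The demonstration chunk itself is not shown here; a sketch of what it might look like (our reconstruction of the steps described above) is:
pg <- PlantGrowth
# convert group from factor to character to avoid a warning from stack()
pg$group <- as.character(pg$group)
# stack the weight and group columns on top of each other
stacked_pg <- stack(pg)
head(stacked_pg)
# unstack() reverses the operation
unstacked_pg <- unstack(stacked_pg)
head(unstacked_pg)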
Check the help file of stack() for more options and examples.
Subsetting
R has some powerful indexing features that allow you to quickly access object elements. In
this subsection, we will learn how to efficiently select or exclude particular elements of
vectors or data frames.
Subsetting Vectors
Consider a simple numeric vector:
• Using positive integers: We use a vector of positive integers to specify the index of
the elements we want to return / keep.
• Using negative integers: We use a vector of negative integers to specify the index of
the elements we want to exclude.
• Using logical vectors: We use a vector of logical values where elements are selected if
the corresponding logical value is TRUE.
# define x
x <- c(1.4, 5.6, 7.8, 2.6)
# select the 2nd and 4th element
x[c(2,4)]
[1] 5.6 2.6
# exclude the 2nd and 4th element
x[-c(2,4)]
[1] 1.4 7.8
# select the 2nd and 4th element
x[c(FALSE, TRUE, FALSE, TRUE)]
[1] 5.6 2.6
# select the elements that are strictly greater than 5
x[x > 5]
[1] 5.6 7.8
By running the above code, we can see that a very useful way to subset vectors is by using
logical vectors. In the last line, we first evaluated the conditional statement x > 5 and then
we used the result to subset x.
Subsetting Data Frames
To subset a data frame we use square brackets with two indices, as in dataframe[x, y],
where dataframe is a data frame in R, x selects the rows that we want returned, and y selects
the columns that we want returned.
# create dataframe
dataframe <- data.frame(v1 = 1:5, v2 = 6:10, v3 = 11:15)
# select the first 3 rows
dataframe[1:3, ]
v1 v2 v3
1 1 6 11
2 2 7 12
3 3 8 13
# exclude the first 3 rows
dataframe[-(1:3),]
v1 v2 v3
4 4 9 14
5 5 10 15
# select the first 3 rows and first 2 columns
dataframe[1:3, 1:2]
v1 v2
1 1 6
2 2 7
3 3 8
# select the rows using a logical vector (condition: the variable v3 is divisible by 3)
dataframe[dataframe$v3 %% 3 == 0,]
v1 v2 v3
2 2 7 12
5 5 10 15
# select columns v1 and v3
dataframe[, c("v1", "v3")]
v1 v3
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
# select columns v1 and v3 (alternative)
dataframe[c("v1", "v3")]
v1 v3
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
As we have seen in “Data Structures in R,” we can also subset the rows of a data frame using
the subset() function.
dplyr Verbs
One of the most useful packages for manipulating data is dplyr. This package contains the so-
called dplyr verbs, which include mutate(), select(), filter(), arrange() and summarize()
(often combined with group_by()), as well as the *_join() family for combining data sets.
When working with dplyr verbs, we typically want to use the result of a verb applied to a
data frame as input for another verb. In such cases, it is more intuitive to use the pipe %>%
operator from the magrittr R package to chain operations. The magrittr package is loaded
automatically when dplyr is loaded.
The value on the left of the %>% operator is passed as the first argument to the function on the
right of the %>% operator.
We have already encountered some of those verbs and functionality when working with
databases in R. Here, we take a more in-depth view of dplyr’s functionality.
Throughout this section, we use the mtcars data set to illustrate the capabilities of the dplyr
verbs. Here is a preview of the dataset and its structure:
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
mutate()
The variable mpg stands for miles per gallon. Suppose we want to add another variable called
kpg which stands for kilometres per gallon. We know that 1 mile is approximately equal to
1.61 kilometres. We can use mutate() to add the variable kpg:
library("dplyr")

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Adding kpg
mtcars1 <- mutate(mtcars, kpg = 1.61 * mpg)
head(mtcars1)
mpg cyl disp hp drat wt qsec vs am gear carb kpg
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 33.810
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 33.810
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 36.708
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 34.454
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 30.107
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 29.141
# Alternative approach using the pipe %>% operator
mtcars2 <- mtcars %>% mutate(kpg = 1.61 * mpg)
head(mtcars2)
mpg cyl disp hp drat wt qsec vs am gear carb kpg
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 33.810
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 33.810
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 36.708
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 34.454
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 30.107
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 29.141
# mtcars1 is identical to mtcars2
identical(mtcars1, mtcars2)
[1] TRUE
select()
Suppose we only want to select the mpg, cyl, hp, vs and gear columns. We can use the
select() verb to do so.
mtcars %>%
select(mpg, cyl, hp, vs, gear)
mpg cyl hp vs gear
Mazda RX4 21.0 6 110 0 4
Mazda RX4 Wag 21.0 6 110 0 4
Datsun 710 22.8 4 93 1 4
Hornet 4 Drive 21.4 6 110 1 3
Hornet Sportabout 18.7 8 175 0 3
Valiant 18.1 6 105 1 3
Duster 360 14.3 8 245 0 3
Merc 240D 24.4 4 62 1 4
Merc 230 22.8 4 95 1 4
Merc 280 19.2 6 123 1 4
Merc 280C 17.8 6 123 1 4
Merc 450SE 16.4 8 180 0 3
Merc 450SL 17.3 8 180 0 3
Merc 450SLC 15.2 8 180 0 3
Cadillac Fleetwood 10.4 8 205 0 3
Lincoln Continental 10.4 8 215 0 3
Chrysler Imperial 14.7 8 230 0 3
Fiat 128 32.4 4 66 1 4
Honda Civic 30.4 4 52 1 4
Toyota Corolla 33.9 4 65 1 4
Toyota Corona 21.5 4 97 1 3
Dodge Challenger 15.5 8 150 0 3
AMC Javelin 15.2 8 150 0 3
Camaro Z28 13.3 8 245 0 3
Pontiac Firebird 19.2 8 175 0 3
Fiat X1-9 27.3 4 66 1 4
Porsche 914-2 26.0 4 91 0 5
Lotus Europa 30.4 4 113 1 5
Ford Pantera L 15.8 8 264 0 5
Ferrari Dino 19.7 6 175 0 5
Maserati Bora 15.0 8 335 0 5
Volvo 142E 21.4 4 109 1 4
filter()
We can use the filter() function to get only observations that match a condition (or
multiple conditions). filter() is the dplyr equivalent to subset(), and allows us to write
more intuitive code.
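For instance, a minimal sketch (the condition below is illustrative rather than taken from this page), keeping only the four-cylinder cars with more than 90 horsepower:
# keep rows satisfying both conditions
mtcars %>%
filter(cyl == 4, hp > 90)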
arrange()
We can arrange our observations in any convenient way using the arrange() function:
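A minimal sketch (the sorting variables are illustrative), ordering the cars by number of cylinders and then by decreasing horsepower:
mtcars %>%
arrange(cyl, desc(hp))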
summarize()
We can compute several summaries using the summarize() function. For example, if we
want to calculate the maximum value of mpg, the minimum value of disp and the mean of qsec
in mtcars we do:
mtcars %>%
summarize(maximum_mpg = max(mpg),
minimum_disp = min(disp),
average_qsec = mean(qsec))
maximum_mpg minimum_disp average_qsec
1 33.9 71.1 17.84875
Combined with group_by(), summarize() computes the summaries separately for each group.
For example, for the iris data grouped by Species:
iris %>%
group_by(Species) %>%
summarize(average_Sepal_Length = mean(Sepal.Length),
Average_Sepal_Width = mean(Sepal.Width),
average_Petal_Length = mean(Petal.Length),
average_Petal_Width = mean(Petal.Width))
# A tibble: 3 x 5
  Species    average_Sepal_Length Average_Sepal_Width average_Petal_Length average_Petal_Width
  <fct>                     <dbl>               <dbl>                <dbl>               <dbl>
1 setosa                     5.01                3.43                 1.46               0.246
2 versicolor                 5.94                2.77                 4.26               1.33
3 virginica                  6.59                2.97                 5.55               2.03
*_join()
dplyr also provides verbs for joining data sets. Below, we use the band_members and
band_instruments tibbles (which are data.frame structures) to illustrate.
band_members
# A tibble: 3 x 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
band_instruments
# A tibble: 3 x 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
Suppose that we want to find all rows of band_members for which there are matching values
in band_instruments, and return all columns from the two tibbles. We can achieve that
using inner_join:
band_members %>% inner_join(band_instruments)
Joining, by = "name"
# A tibble: 2 x 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
We can also use left_join() to return all rows of band_members and all columns from the
two tibbles, with NA in the new columns if there are no matches in the rows of
band_members and band_instruments:
We can use right_join() to return all rows of band_instruments and all columns from the
two tibbles, with NA in the new columns if there are no matches in the rows of
band_instruments and band_members.
Similarly, we can use full_join() to return all rows and columns of both band_members and
band_instruments, with NA in the new columns wherever there are no matches between
band_members and band_instruments.
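The corresponding calls are sketched below (left_join(), right_join() and full_join() are the dplyr verbs for these three joins; as with inner_join(), the joining column is detected automatically):
# all rows of band_members
band_members %>% left_join(band_instruments)
# all rows of band_instruments
band_members %>% right_join(band_instruments)
# all rows of both tibbles
band_members %>% full_join(band_instruments)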
Notice that the same join (with different ordering of rows) can be obtained by swapping the two
tibbles, i.e. band_instruments %>% full_join(band_members).
All examples above also work with data.frames. More information and examples on dplyr
verbs for joining data sets can be found at dplyr’s join pages.
** Note: The code chunks below should be run in the following order **
Introduction
In this page we will use the same dataset, airquality, from the unit on data wrangling in R. We
first export the data from R to a CSV file so that the dataset is accessible from Python:
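A minimal sketch of that export step in R (the file name matches the one read below):
# in R: export the built-in airquality data frame
write.csv(airquality, "airquality.csv", row.names = FALSE)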
We then load the data to Python, and rearrange the columns so that Month and Day information
are in the first two columns:
import pandas as pd
import numpy as np
airquality = pd.read_csv("airquality.csv")
airquality = airquality[['Month','Day','Ozone','Solar.R','Temp','Wind']]
airquality.head()
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
4 5 5 NaN NaN 56 14.3
There is some missing data in the dataset. The data are in wide format, as there is a column for
each variable (Ozone, Solar.R, Temp, Wind). In each row, we get the values of
Ozone, Solar.R, Temp and Wind for a particular day. For example, the first row has
Ozone = 41.0, Solar.R = 190.0, Temp = 67, Wind = 7.4 for the day 1 May.
In this page we will first show how we can reshape the dataset. We then show how we can select
a subset of the data. Finally we show how we can handle the missing data.
Reshaping Dataset
Python pandas provides a variety of methods/functions for reshaping your data before analysis,
and many of them are very similar to the R functions. We will reshape the data with the
following methods: melt() (wide to long), pivot_table() (long to wide) and stack()/unstack().
Wide to Long
We can convert the data from wide to long by the pandas.melt() function:
pd.melt(airquality)
variable value
0 Month 5.0
1 Month 5.0
2 Month 5.0
3 Month 5.0
4 Month 5.0
.. ... ...
913 Wind 6.9
914 Wind 13.2
915 Wind 14.3
916 Wind 8.0
917 Wind 11.5
Similar to melt() in R, pandas.melt() in Python creates a first column named variable, which
records the name of the variable, and a second column named value, which records the value of
that variable. For this dataset it is not really useful to convert the data to the DataFrame above.
Instead we should keep Month and Day as identifier columns:
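A sketch of that call (the object name airquality_long is an assumption; the column names climate_var and climate_value match those used later in this page):
airquality_long = pd.melt(airquality, id_vars=['Month', 'Day'],
                          var_name='climate_var', value_name='climate_value')
airquality_long.head()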
Long to Wide
We can use the method pivot_table() from pandas.DataFrame to convert data from long to
wide format:
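A sketch of that call (the argument values follow the description below; airquality_long is the assumed long-format object from above):
airquality_wide = airquality_long.pivot_table(index=['Month', 'Day'],
                                              columns=['climate_var'],
                                              values='climate_value')
airquality_wide.head()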
Note that while the resulting table airquality_wide is very similar to the original table
airquality, we now have a MultiIndex which groups Month and Day together for
airquality_wide:
airquality_wide.index
MultiIndex([(5, 1),
(5, 2),
(5, 3),
(5, 4),
(5, 5),
(5, 6),
(5, 7),
(5, 8),
(5, 9),
(5, 10),
...
(9, 21),
(9, 22),
(9, 23),
(9, 24),
(9, 25),
(9, 26),
(9, 27),
(9, 28),
(9, 29),
(9, 30)],
names=['Month', 'Day'], length=153)
This is because we set index=['Month', 'Day'] so that the two columns are used as the key. For the
other arguments, columns=['climate_var'] specifies that the data should be grouped by the
values in the column climate_var (i.e. Ozone, Solar.R, Temp, Wind) to form the columns, and
values='climate_value' specifies that the values from the column climate_value should be
used to fill in the table. In our example, there is only one instance for each combination of row and
column (e.g. Month, Day, Ozone), so the value from climate_value is used directly. For
airquality_wide, each cell in the table represents the reading of an air quality measure (read
from the column name) on a particular day (read from the MultiIndex in the row). For example,
the value 41.0 in the first cell is the reading of Ozone on 1 May.
What if there is more than one instance for the combination of row and column? For example, if
we consider using only Month for the index, then for a given Month and Ozone, we will have
multiple values 41.0, 36.0, etc from different values of Day. What does pivot_table() return in
this case?
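A sketch of the call being discussed (again using the assumed airquality_long object):
# only Month as the key: several daily values map to each (Month, variable) cell
airquality_long.pivot_table(index=['Month'], columns=['climate_var'], values='climate_value')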
It seems that the value inside the table is the average value. We can confirm this with the
following code, in which we explicitly request the mean as the aggregation function via the
argument aggfunc=np.mean:
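A sketch of that call:
airquality_long.pivot_table(index=['Month'], columns=['climate_var'],
                            values='climate_value', aggfunc=np.mean)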
As we can see, the returned values are the same, showing that averaging is indeed used by default
when there are multiple values for a given combination. For the table above, the value in the first cell
(23.6) represents the monthly average ozone reading for Month = 5 (i.e. May).
We can use other aggregating functions, for example np.max to get the maximum value:
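A sketch of that call:
airquality_long.pivot_table(index=['Month'], columns=['climate_var'],
                            values='climate_value', aggfunc=np.max)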
For the table above, the value in the first cell (115.0) represents the maximum ozone reading for
Month = 5 (i.e. May).
We can also reshape with stack() and unstack(): stack() pivots the columns into the row
index (producing a long Series with a MultiIndex), and unstack() reverses the operation.
airquality_stack = airquality_wide.stack()
airquality_stack
Month Day climate_var
5 1 Ozone 41.0
Solar.R 190.0
Temp 67.0
Wind 7.4
2 Ozone 36.0
...
9 29 Wind 8.0
30 Ozone 20.0
Solar.R 223.0
Temp 68.0
Wind 11.5
Length: 568, dtype: float64
airquality_stack.unstack()
climate_var Ozone Solar.R Temp Wind
Month Day
5 1 41.0 190.0 67.0 7.4
2 36.0 118.0 72.0 8.0
3 12.0 149.0 74.0 12.6
4 18.0 313.0 62.0 11.5
5 NaN NaN 56.0 14.3
... ... ... ... ...
9 26 30.0 193.0 70.0 6.9
27 NaN 145.0 77.0 13.2
28 14.0 191.0 75.0 14.3
29 18.0 131.0 76.0 8.0
30 20.0 223.0 68.0 11.5
Subsetting
Like in R, pandas in Python allows users to access a subset of data very easily.
airquality['Ozone']
0 41.0
1 36.0
2 12.0
3 18.0
4 NaN
...
148 30.0
149 NaN
150 14.0
151 18.0
152 20.0
Name: Ozone, Length: 153, dtype: float64
Note that the above code chunk returns a Series (i.e. not a DataFrame).
type(airquality['Ozone'])
<class 'pandas.core.series.Series'>
Using double square brackets instead returns a DataFrame with a single column:
airquality[['Ozone']]
Ozone
0 41.0
1 36.0
2 12.0
3 18.0
4 NaN
.. ...
148 30.0
149 NaN
150 14.0
151 18.0
152 20.0
airquality[['Ozone', 'Temp']]
Ozone Temp
0 41.0 67
1 36.0 72
2 12.0 74
3 18.0 62
4 NaN 56
.. ... ...
148 30.0 70
149 NaN 77
150 14.0 75
151 18.0 76
152 20.0 68
airquality.iloc[:,np.array([2,4])]
Ozone Temp
0 41.0 67
1 36.0 72
2 12.0 74
3 18.0 62
4 NaN 56
.. ... ...
148 30.0 70
149 NaN 77
150 14.0 75
151 18.0 76
152 20.0 68
In the above, inside the square brackets, the rows are specified first (here : means all) and, after
the comma, the columns are specified by np.array([2,4]), which corresponds to the indices of
the Ozone and Temp columns.
airquality[:5]
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
4 5 5 NaN NaN 56 14.3
We can also use .loc[] and .iloc[] to select rows. .loc[] is a label-based way to get specific
rows (or columns as well), whereas .iloc[] is an integer-position-based way to get specific
rows (or columns as well). For example, for the airquality dataset, we can get the first rows as
follows (note that .loc[:5] slices by label and includes the row labelled 5, so it returns six rows,
whereas .iloc[:5] slices by position and returns exactly the first five rows):
airquality.loc[:5]
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
4 5 5 NaN NaN 56 14.3
5 5 6 28.0 NaN 66 14.9
or
airquality.iloc[:5]
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
4 5 5 NaN NaN 56 14.3
However, if we use the same method on airquality_wide instead, the result is different for
.loc[]:
airquality_wide.loc[:5]
climate_var Ozone Solar.R Temp Wind
Month Day
5 1 41.0 190.0 67.0 7.4
2 36.0 118.0 72.0 8.0
3 12.0 149.0 74.0 12.6
4 18.0 313.0 62.0 11.5
5 NaN NaN 56.0 14.3
6 28.0 NaN 66.0 14.9
7 23.0 299.0 65.0 8.6
8 19.0 99.0 59.0 13.8
9 8.0 19.0 61.0 20.1
10 NaN 194.0 69.0 8.6
11 7.0 NaN 74.0 6.9
12 16.0 256.0 69.0 9.7
13 11.0 290.0 66.0 9.2
14 14.0 274.0 68.0 10.9
15 18.0 65.0 58.0 13.2
16 14.0 334.0 64.0 11.5
17 34.0 307.0 66.0 12.0
18 6.0 78.0 57.0 18.4
19 30.0 322.0 68.0 11.5
20 11.0 44.0 62.0 9.7
21 1.0 8.0 59.0 9.7
22 11.0 320.0 73.0 16.6
23 4.0 25.0 61.0 9.7
24 32.0 92.0 61.0 12.0
25 NaN 66.0 57.0 16.6
26 NaN 266.0 58.0 14.9
27 NaN NaN 57.0 8.0
28 23.0 13.0 67.0 12.0
29 45.0 252.0 81.0 14.9
30 115.0 223.0 79.0 5.7
31 37.0 279.0 76.0 7.4
airquality_wide.iloc[:5]
climate_var Ozone Solar.R Temp Wind
Month Day
5 1 41.0 190.0 67.0 7.4
2 36.0 118.0 72.0 8.0
3 12.0 149.0 74.0 12.6
4 18.0 313.0 62.0 11.5
5 NaN NaN 56.0 14.3
Why is this the case? Note that the label of the index for airquality_wide is:
airquality_wide.index
MultiIndex([(5, 1),
(5, 2),
(5, 3),
(5, 4),
(5, 5),
(5, 6),
(5, 7),
(5, 8),
(5, 9),
(5, 10),
...
(9, 21),
(9, 22),
(9, 23),
(9, 24),
(9, 25),
(9, 26),
(9, 27),
(9, 28),
(9, 29),
(9, 30)],
names=['Month', 'Day'], length=153)
so airquality_wide.loc[:5] gets all rows with Month = 5. See here for more details.
We can also get the rows by condition, for example:
airquality.loc[airquality.Day == 3, 'Ozone']
2 12.0
33 NaN
63 32.0
94 16.0
125 73.0
Name: Ozone, dtype: float64
The above code chunk returns the ozone reading on 3 May, 3 June, …, 3 September.
airquality.loc[:5, 'Ozone']
0 41.0
1 36.0
2 12.0
3 18.0
4 NaN
5 28.0
Name: Ozone, dtype: float64
Or the .iloc[]:
airquality.iloc[:5, 2]
0 41.0
1 36.0
2 12.0
3 18.0
4 NaN
Name: Ozone, dtype: float64
Handling Missing Data
As we saw above, the dataset contains some missing values (NaN):
airquality.head(10)
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
4 5 5 NaN NaN 56 14.3
5 5 6 28.0 NaN 66 14.9
6 5 7 23.0 299.0 65 8.6
7 5 8 19.0 99.0 59 13.8
8 5 9 8.0 19.0 61 20.1
9 5 10 NaN 194.0 69 8.6
We can remove them by .dropna(), which removes rows containing any missing data.
airquality_no_na = airquality.dropna()
airquality_no_na.head(10)
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
6 5 7 23.0 299.0 65 8.6
7 5 8 19.0 99.0 59 13.8
8 5 9 8.0 19.0 61 20.1
11 5 12 16.0 256.0 69 9.7
12 5 13 11.0 290.0 66 9.2
13 5 14 14.0 274.0 68 10.9
Sometimes, removing the rows with missing data may not be appropriate. Instead we may want
to fill in the missing values in a systematic way. For example, we can replace each missing value
with the previous (non-missing) value:
airquality_fill_na = airquality.fillna(method='ffill')
airquality_fill_na.head(10)
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67 7.4
1 5 2 36.0 118.0 72 8.0
2 5 3 12.0 149.0 74 12.6
3 5 4 18.0 313.0 62 11.5
4 5 5 18.0 313.0 56 14.3
5 5 6 28.0 313.0 66 14.9
6 5 7 23.0 299.0 65 8.6
7 5 8 19.0 99.0 59 13.8
8 5 9 8.0 19.0 61 20.1
9 5 10 8.0 194.0 69 8.6
There are different settings for the fill rule. See the pandas fillna() method for DataFrame for
more details. Alternatively we can use interpolation to fill the missing values. For example:
airquality_linear = airquality.interpolate(method='linear')
airquality_linear.head(10)
Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.000000 67 7.4
1 5 2 36.0 118.000000 72 8.0
2 5 3 12.0 149.000000 74 12.6
3 5 4 18.0 313.000000 62 11.5
4 5 5 23.0 308.333333 56 14.3
5 5 6 28.0 303.666667 66 14.9
6 5 7 23.0 299.000000 65 8.6
7 5 8 19.0 99.000000 59 13.8
8 5 9 8.0 19.000000 61 20.1
9 5 10 7.5 194.000000 69 8.6
Here we use linear method to interpolate, with the missing ozone data on 5 May calculated as
(18+28)/2 = 23.
We now illustrate how to combine different datasets with pandas. As an example, we load two
CSV files containing daily counts of new COVID-19 cases and deaths:
import pandas as pd
cases_df = pd.read_csv("covid_new_cases.csv", index_col = 0)
deaths_df = pd.read_csv("covid_new_deaths.csv", index_col = 0)
Both DataFrames have date as the index (i.e. the row labels), although cases_df has more
data.
print(cases_df)
new_cases
date
2020-01-30 2
2020-01-31 0
2020-02-01 0
2020-02-02 0
2020-02-03 0
... ...
2020-10-25 15654
2020-10-26 26467
2020-10-27 23757
2020-10-28 22887
2020-10-29 19525
Here we want to join the two datasets together. pandas has fast and full-featured functions for
combining datasets via the method join(), and these joins can be one-to-one, many-to-one or
many-to-many. These merges can be done in a number of ways shown in the table below:
Method Behaviour
left Use calling frame’s index (or column if on is specified)
right Use other’s index
outer Use union of keys from both DataFrame
inner Use intersection of keys from both DataFrame
For example, we can join cases_df and deaths_df by:
df_1 = cases_df.join(deaths_df)
print(df_1)
new_cases new_deaths
date
2020-01-30 2 NaN
2020-01-31 0 NaN
2020-02-01 0 NaN
2020-02-02 0 NaN
2020-02-03 0 NaN
... ... ...
2020-10-25 15654 234.0
2020-10-26 26467 253.0
2020-10-27 23757 227.0
2020-10-28 22887 216.0
2020-10-29 19525 217.0
In the above we use all the row labels from cases_df, as the default is left join. We can see
that the number of rows of the resulting DataFrame is the same as the number of rows in
cases_df, as we use all the row labels from cases_df. We also see that there are some NA data
for the new_deaths column in df_1 , as for the dates like 2020-01-30 there is no data available
from the deaths_df.
If we want to use all the keys of deaths_df instead, we can add the argument how = 'right':
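A sketch of that call (the name df_2 is taken from the text below):
df_2 = cases_df.join(deaths_df, how='right')
print(df_2)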
Now we have the number of rows of df_2 equals to the number of rows in deaths_df, as we use
all the row labels from deaths_df.
If we want to keep only the keys that are found in both cases_df and deaths_df, we can use
how = 'inner':
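A sketch of that call (the name df_3 is taken from the text below):
df_3 = cases_df.join(deaths_df, how='inner')
print(df_3)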
For this particular example, df_2 and df_3 are the same, because the intersection of the dates
(rows) of cases_df and deaths_df is exactly the set of dates in deaths_df.
1. Importing the data, cleaning them and preparing them for use (data wrangling)
2. Get some basic level of understanding on the data by exploration and visualisation.
3. Consider several models as well as combinations thereof to separate signal from noise.
4. Compare models and inform future decisions.
So far we have covered the first step. In this week, as well as the next three weeks that follow,
we will focus on the second of the steps above. The aim is to identify patterns in the data that
can potentially help us take more informed decisions.
• Graphs can be found in the early work of the Scottish engineer and political economist,
William Playfair, who essentially introduced the bar plot. See below for an example
depicting the imports and exports in Scotland in 1781:
Playfair also produced pie charts, see below for one example providing information on the
Turkish empire:
(Images taken from Wikipedia, visit here (bar plot) and here (pie chart) for more information)
• Charles Joseph Minard, a French civil engineer produced several graphs, with one
example being the illustration of Napoleon’s campaign in Russia:
(Image taken from Wikipedia, go here for more information)
• Florence Nightingale, while investigating the causes of deaths among British troops in the
Crimean war, produced the following area diagram:
(Image taken from Wikipedia, go here for more information)
• Finally, John Snow, while researching the 1854 London cholera epidemic, used the
following map:
(Image taken from Wikipedia, see here for more information)
• Identify the key variables of interest in the data. It is essential that these are chosen
after careful thought about the nature of the problem and aims of the project.
• Determine the scale of each of these variables. They could be continuous (measurable),
categorical with ordered categories (ordinal scale) or they could be categorical with
unordered categories (nominal scale). Specific graphs are suitable for each of these
scales.
Once the above have been decided, data visualisation proceeds by looking at each of these
variables individually or checking some of them together, mostly in pairs, to explore potential
associations between them.
Next, we outline the main types of graphs for different types of variables. Note that the list of
graphs mentioned is by no means exhaustive; we will see more in the weeks to follow. Note also
that exploratory analysis can also be done without visualisation, more specifically with numerical
summaries as well as tables. Nevertheless, these are beyond the scope of this course.
We see that people aged between 20 and 30 appear in higher proportions although other ages
are also present. A smoother version of a histogram is offered by a kernel density plot, shown
below for the previous example:
Kernel density plot of the ages of Titanic passengers; Figure 7.3 taken from
Wilke (2017)
Other examples of graphs for continuous variables are boxplots and violin plots
For categorical ordinal variables the standard option is a bar plot. Below is an example depicting
information on the United States annual household income versus age group in 2016, provided
by United States Census Bureau.
Bar plot on US annual household income versus age group in 2016. Figure
6.5 taken from Wilke (2017)
Note that the above appears to be very similar to a histogram. However a key difference is that
the bars are not joined, thus breaking the continuum between the different categories. Actually in
the histogram above the bars are not joined either but this is done for stylistic purposes only; they
are typically meant to be joined. For categorical ordinal variables it is also essential to respect
the order of the categories on the x-axis.
Bar plots can also be used for categorical nominal variables. In this case the order of values in
the x-axis can be determined by their corresponding y-values, so that these are in increasing or
decreasing order. An alternative graphical summary is the pie chart, an example of which was
shown earlier via the work of Playfair.
An interesting special case is when one of the two variables may be viewed as time. In this case
there is a specific temporal order in the data and quite often we want to visualise it. This can be
done using line graphs. Below is an example of a line graph consisting of the number of monthly
submissions to bioRxiv, between November 2014 and April 2018.
In cases of a pair between a continuous variable \(y\) and a categorical variable \(x\), it is useful
to align separate plots summarising the \(y\) variable for each of the categories of the variable
\(x\). A simple option, often adopted in practice, is to provide a single summary of the \(y\)
variable, such as its mean, for each of the \(x\) categories. This can be done via a bar plot, as in
the previous example of the United States annual household income versus age group in 2016.
Another option, providing more information on the entire distribution of \(y\) across the categories
of \(x\), is by separate boxplots. Below, we provide an example on data consisting of mean daily
temperatures recorded in Lincoln, Nebraska (variable \(y\)), for several days of different months
(variable \(x\)):
Mean daily temperatures in Lincoln, Nebraska, visualized as boxplots; Figure
9.3 taken from Wilke (2017)
In recent years boxplots have been losing ground to violin plots; examples of those will be
shown over the next few weeks. It is also possible to produce heat maps or to align kernel
density plots across different categories of the variable \(x\), the so-called ridgeline plots. For the
previous example, a ridgeline plot is shown below:
The figure above actually goes one step further by assigning different colours to each point
depending on the sex of the bird; this is actually a case of three continuous and one categorical
variable. Looking at the plots we can identify that male birds tend to be bigger and heavier.
When two categorical variables are involved, an alternative to barplots that map data values onto
bar heights is to map data values to colours; in other words to use a heat map. Below we see an
example on the percentage of internet users across time, from 1994 to 2016, and 20 countries.
Internet use across countries and time; the colour reflects the percentage of
internet users for each country and year combination; Figure 6.14 taken from
Wilke (2017)
The above graph can provide information for things like identifying the countries where internet
use began early, the countries with high internet use in 2016, etc.
Finally, for situations involving three categorical variables, grouped or stacked barplots may be
used. An example is shown below, examining the market share of five companies
(A, B, C, D, E) from 2015 to 2017.
Market share corresponding to five companies for the years 2015-17; Figure
10.6 taken from Wilke (2017)
From the plot above we can see that companies A and B seem to be increasing their market
share unlike companies D and E that seem to be losing ground. Other options for graphs of
several categorical variables include mosaic plots, treemaps and nested pies.
References
Wilke, C. O. (2017). Fundamentals of data visualization. O’Reilly Media.
Graphics and Data Visualization in
R
Data
We will illustrate in R most of the visualizations introduced this week on the airquality
dataset. The data consist of daily air quality measurements in New York from May to
September 1973 (153 days). The data were obtained from the New York State Department of
Conservation (ozone data) and the National Weather Service (meteorological data).
In each of the 153 days, the following four variables were measured
Variable Description
Ozone    Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R  Solar radiation in Langleys in the frequency band 4000–7700 angstroms from 0800 to 1200 hours at Central Park
Wind     Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp     Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport
The month and the day of the month were also recorded. The data are available each time you
open RStudio. Below we copy them to the data frame df and inspect the first six rows.
df=airquality
head(df)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
For plots over time and other related operations it is useful to create a date variable from the
variables Month and Day. This can be done with the R function ISOdate as shown below:
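A minimal sketch of that step (the year 1973 comes from the description of the data above):
# build a Date variable from the year, month and day of each measurement
df$date <- as.Date(ISOdate(1973, df$Month, df$Day))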
plot(df$date,df$Temp, type='l')
plot(df$date,df$Temp, type='o')
Next we present some graphical parameters that are also relevant for other visualizations such
as colours, legends, titles, axis labels, types of lines and points.
Graphical Parameters
To add colours we recommend using colour-blind friendly palettes; see here for more on
accessibility. One way to do that is via the RColorBrewer package. The code below loads
this package and displays its colour-blind friendly palettes:
library("RColorBrewer")
display.brewer.all(colorblindFriendly = TRUE)
Below is an example of using the Paired palette, which has 12 colours. Each element of the
object colours corresponds to one of these 12 colours.
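A sketch of the chunk defining colours (12 colours taken from the Paired palette, as described above):
colours <- brewer.pal(n = 12, name = "Paired")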
plot(df$date,df$Temp,type='o',ylim=c(1,180),col=colours[8])
lines(df$date,df$Ozone,type='o',col=colours[4])
There are some gaps in the green line above, due to the missing values in the Ozone variable.
Also it is essential to use the ylim argument because the limits in the y-axis set by the plot()
in the first line of code are based on Temp and are therefore too narrow for Ozone. So, without
ylim the Ozone line will go out of bounds.
A legend can also be provided to display more information, i.e. to indicate which colour
corresponds to which variable. The function legend() typically requires the location, the labels (of
the different lines), the colours and the characters corresponding to points. For the points, different
characters are available via the argument pch.
plot(df$date,df$Temp,type='o',ylim=c(1,180),col=colours[8])
lines(df$date,df$Ozone,type='o',col=colours[4])
legend('topright', legend=c('Temperature','Ozone'), col=colours[c(8,4)],
pch=1)
A title in the plot can be added by the argument main. Axis labels can be added with the
arguments xlab and ylab for the x and y axes respectively. The style of the line can be
controlled by the argument lty and its thickness by lwd. Some examples are given below.
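A minimal sketch using these arguments (the specific values are illustrative):
# dashed, thicker line with a title and axis labels
plot(df$date, df$Temp, type = 'l', col = colours[8], lty = 2, lwd = 2,
     main = 'Daily temperature in New York, 1973', xlab = 'Date', ylab = 'Temperature (F)')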
In some cases it may be informative to label points belonging to different categories. For
example, consider a binary variable indicating whether the measurement was taken in May or
June, or afterwards. This can be created with the code below. The table() command
provides the frequencies corresponding to the categories of the binary variable season.
df$season[df$Month>6]<-1
df$season[df$Month<=6]<-2
table(df$season)
1 2
92 61
The following code provides a scatter plot with different colours for the May and June
measurements versus the other months. Note that we now select the colours from the palette
explicitly for the two categories to be the second colour (blue) and the sixth (red).
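A sketch of such a plot (the plotted variables are illustrative, but the colour selection follows the description above):
# colour each point by season: colours[2] (blue) for post-June, colours[6] (red) for May/June
plot(df$date, df$Temp, col = colours[c(2, 6)][df$season], pch = 16,
     xlab = 'Date', ylab = 'Temperature (F)')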
More details about the plot function and graphical parameters in general can be found here.
Histograms and Density Plots
The distribution of a continuous variable can be portrayed with a histogram. Below we
provide the code for constructing one. Using freq=FALSE results in a density histogram;
otherwise a frequency histogram is obtained.
hist(df$Temp,breaks=20,col=colours[2],freq=FALSE)
We can also compare the distributions of two (or more) continuous variables by overlaying
their histograms. In such cases it may be useful to set the limits of the x-axis with xlim.
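A minimal sketch (comparing the temperature distributions of the two seasons defined earlier is an assumption; any pair of comparable variables works):
hist(df$Temp[df$season == 1], breaks = 20, col = colours[2], freq = FALSE,
     xlim = c(50, 100), main = 'Temperature before and after June', xlab = 'Temperature (F)')
# add = TRUE overlays the second histogram on the first
hist(df$Temp[df$season == 2], breaks = 20, col = colours[6], freq = FALSE, add = TRUE)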
A smoother alternative to the histogram is provided by a kernel density plot. This can be done
by using the density() function inside plot() or lines().
d1<-density(df$Temp[df$season==1])
d2<-density(df$Temp[df$season==2])
plot(d1,xlim=c(45,100), col=colours[2],lwd=2, xlab='Temperature (F)',
main='Distribution of Temperature before and after June')
lines(d2, xlim=c(45,100), col=colours[6], lwd=2)
legend('topright', legend=c('Post June','May or June'),
col=colours[c(2,6)], lwd=2, pch=15)
More details can be found here.
Another way to compare distributions across categories is via box plots; see below for an
example. A formula, Temp ~ Month, is used to produce a separate box plot for
each month.
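A sketch of the boxplot call, mirroring the vioplot() call that follows:
boxplot(Temp ~ Month, data = df, col = colours[c(2, 6, 10, 8, 4)],
        names = c('May', 'Jun', 'Jul', 'Aug', 'Sep'))
A violin plot of the same comparison can be produced with the vioplot package: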
library(vioplot)
vioplot(Temp ~ Month, data=df,col=colours[c(2,6,10,8,4)],names =
c('May','Jun','Jul','Aug','Sep'))
For more on violin plots you can read here.
Bar Plots
To illustrate bar plots we need a meaningful categorical variable. For this purpose, we will
convert the temperature variable into a categorical one reflecting the temperature level. More
specifically, we will create the categories '51-60', '61-70', '71-80', '81-90' and '91-100'
using the cut() function:
# Uniform colour
categories<-c('51-60','61-70','71-80','81-90','91-100')
df$Temp_cat<-cut(df$Temp, seq(50,100,10),labels=categories)
Temp_counts<-table(df$Temp_cat)
Temp_counts
A bar plot for the variable Temp_cat can be produced with the code below:
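A sketch of that call (consistent with the '# Uniform colour' comment above):
barplot(height = Temp_counts, names = categories, col = colours[2])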
Note that it is generally not recommended to convert continuous variables into categorical as
information is lost in that case. Here we did it to illustrate the R code.
In the case of two (or more) categorical variables, for example temperature level across months,
grouped or stacked bar plots can be used. These essentially use a cross tabulation between
temperature level and month, shown below:
Temp_counts<-table(df$Month,df$Temp_cat)
Temp_counts
The stacked and grouped (or else side by side) bar plots can be created with the following
code:
barplot(height=Temp_counts, names=categories,
col=colours[c(2,6,10,8,4)],legend = c('May','Jun','Jul','Aug','Sep'))
barplot(height=Temp_counts, names=categories,
col=colours[c(2,6,10,8,4)],legend =
c('May','Jun','Jul','Aug','Sep'),beside=TRUE)
More information on bar plots can be found here.
Heat Maps
Finally, we illustrate the use of a heat map. For this we will use another dataset, called
mtcars; see here for some information on the data.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
For the heat map we will use the Blues palette via the code below. We will expand it to
contain 25 colours (different shades of blue), rather than the default of 8, with the
function colorRampPalette(). We will also need to convert the data in mtcars to a matrix
with as.matrix().
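A sketch of the chunk (the palette construction follows the description above; the call to heatmap() and the column scaling are assumptions):
blues <- colorRampPalette(brewer.pal(8, "Blues"))(25)
heatmap(as.matrix(mtcars), scale = "column", col = blues)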
Exporting Plots
First, you need to decide on the type of file and use one of the jpeg(), png() or pdf() devices.
You may also want to provide the path to the file you want to create. The default is to save
the graphics in the working directory under a default name. You can also provide some
additional, device-specific parameters such as width and height if you want to control the
output more. Below is an example:
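A sketch of such a call (the file name and dimensions are illustrative):
png("my_plot.png", width = 600, height = 400)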
In the second step, one needs to run the code to produce the plot to be exported, for example:
plot(x = 1:10,
y = 1:10)
abline(v = 0) # Additional low-level plotting commands
text(x = 0, y = 1, labels = 'Random text')
Finally, an important step is to run the following line of code to create (or “close”) the file;
this is an essential step, as without it a partial plot may be exported, or no plot at all.
dev.off()
Useful Links and Resources
• Fundamentals of data visualization: An easy-to-read book with many examples of
good practice for producing graphs.
• The R graph gallery: A wide collection of charts made with the R programming language.
• HTML accessibility: A blog post on the accessibility of HTML pages produced from R Markdown.
The Grammar of Graphics and
ggplot2
ggplot2
From ggplot2’s CRAN page:
The Grammar of Graphics was introduced in Wilkinson (1999) and Wilkinson & Wills
(2005) and is a framework for creating graphics following a layered, structured approach.
ggplot2 is a package that evolved from the ideas in The Grammar of Graphics. The full
functionalities of ggplot2 are too extensive to cover in a week, however there are many
tutorials and examples in the links provided to learn from, including the online book ggplot2:
Elegant Graphics for Data Analysis.
In this section, we will cover some basics to get started on creating some plots with ggplot2.
As we have seen, base R graphics work with functions and methods that are dedicated to
particular graphics (e.g. hist(), boxplot(), plot(), etc.). These functions have differences in
their syntax and sometimes differences in what they expect as inputs. ggplot2, instead,
introduces a consistent grammar (a set of rules) to specify the basic components of a graphic
(e.g. data, coordinate systems, scales, aesthetics, etc.), which once put together can result in a
wealth of graphics.
• Data: The data that we want to produce graphics from (typically a data.frame or
tibble object).
• Geometries: The kind of graphic we want to produce, e.g. histogram, scatter plot, bar
plot, box plot, etc. (constructed by the geom_*() family of functions).
• Aesthetics: Aesthetics of the geometric and statistical objects, such as position,
colour, size, shape, and transparency (specified using the aes() function).
• Facets: The arrangement of graphics into grids (specified using the facet_*() family
of functions).
• Visual themes: The overall visual defaults of a plot, such as background, grids, axes,
default typeface, sizes, and colours (specified using the theme_*() family of
functions).
There are other layers that you can add that can greatly improve your plots.
For more extensive tutorials and descriptions on ggplot2, see the recommended resources in
Useful Links and Resources. In particular, RStudio’s “Data visualization with ggplot2” cheat
sheet, which is a good resource to keep handy while working with ggplot2, and the online
book ggplot2: Elegant Graphics for Data Analysis is the definitive resource on the philosophy
and operation of the package. Chapter 22 of Wickham & Grolemund (2017) also provides an
excellent walkthrough of ggplot2’s capabilities.
ggplot()
All ggplot2 graphics begin with a call to ggplot(), where we can supply the data. After that
call, we can keep enhancing the resulting object by adding more layers. Optionally with the
ggplot() function, we can add whatever aesthetics we want to apply to the graphics, for
example, by specifying what variables from the data will form the axes.
As an example, we work with the data in the diamonds dataset in R, which provides prices
and other attributes of over 50000 diamonds. To initialize graphics, we start with the data
layer through the ggplot() function:
library(ggplot2)
# initialize ggplot
ggplot(diamonds)
This does not produce a plot yet. To specify what features of the data we want to look at, we
need to add some aesthetics. We specify aesthetics using the aes() function, and add that
specification as a layer by simply using the + operator.
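For example, we can map carat to the x-axis and add a histogram geometry; a minimal sketch (the object name my_hist is chosen to match its use later on this page):
my_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram()
my_hist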
The help pages of geom_histogram() describe a range of optional arguments that can be used
to customize the histogram further.
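For example, the binwidth argument controls the width of the bins; a sketch with an illustrative value:
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)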
For example, we can enrich my_hist by using the labs() layer to specify titles for the plot
and labels for the axes, and the theme_bw() layer to use a theme with black and white
elements (see ?labs and ?theme_bw for details).
my_hist +
labs(title = "Weight of Diamonds", x = "Carat", y = "Count") +
theme_bw()
Density Plots
Here is another example where we create a graphic with a kernel density estimate with the
geom_density() function:
# create kernel density plot and add title and axis labels
my_kd <- ggplot(diamonds, aes(x = carat)) +
geom_density() +
labs(title = "Weight of Diamonds", x = "Carat", y = "Density")
my_kd
Scatter Plots
A scatter plot is a two-dimensional plot, so ggplot2 expects to pass two arguments into the
aesthetics layer. A look at the structure of the data
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
or a look at the help file (?diamonds) reveals that we have information on the price (in the
price variable) of each diamond and a rating of the quality of the cut (in the cut variable).
So, let's make a scatter plot of price versus carat, coloured by cut. This can be done
easily using geom_point() and enriching the aesthetics a bit:
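A sketch consistent with the my_sc object, the title and the axis labels used later on this page:
my_sc <- ggplot(diamonds) +
  geom_point(aes(x = carat, y = price, colour = cut), alpha = 0.5) +
  labs(title = 'Diamonds are forever...', x = 'Carat', y = 'Price')
my_sc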
In fact, the most convenient place for aesthetics in multi-layered plots with multiple
geometries is the geom_*() functions. For example, for adding a horizontal line at 5000 and a
vertical line at 2 in the above graphic we use geom_hline() and geom_vline(), each with its
own aesthetic:
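A sketch of those two layers added to my_sc:
my_sc +
  geom_hline(aes(yintercept = 5000)) +
  geom_vline(aes(xintercept = 2))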
Suppose now that we want to focus on the diamonds of a single cut quality, say "Ideal". One
option is to filter the data before plotting, using dplyr:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(diamonds %>% filter(cut == "Ideal")) +
geom_point(aes(x = carat, y = price), alpha = 0.5) +
labs(title = 'Diamonds are forever...', x = 'Carat', y = 'Price')
A better way is to use faceting. Using the my_sc object we constructed earlier, this is as
simple as adding a single faceting layer.
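A sketch of such a call, using facet_wrap() (one of the faceting functions referenced below):
my_sc + facet_wrap(~ cut)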
We can facet on more than one variable. For example, if we want the carat histogram for
each combination of colour and cut, we do:
my_hist + facet_grid(color ~ cut)
Take a look at ?facet_grid and ?facet_wrap for more information about faceting, and the
several available arguments for producing a wealth of grids.
Themes
The theme_*() family of functions of ggplot2 allows you to quickly change the look and
feel of your graphics. ggplot2 ships with some default themes. We have already seen
theme_bw(). Here are a few more:
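A sketch using two of ggplot2's built-in themes (the original page may, of course, showcase different ones):
my_sc + theme_minimal()
my_sc + theme_classic()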
There is also the ggthemes R package that provides some extra themes. For example, after
you install ggthemes, you can do
library(ggthemes)
my_sc + theme_wsj()
to get graphics closely resembling the graphics theme used in The Wall Street Journal.
References
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize,
and model data (1st ed.). O’Reilly Media.
Wilkinson, L. (1999). The grammar of graphics (1st ed.). Springer.
Wilkinson, L., & Wills, G. (2005). The grammar of graphics (2nd ed.). Springer.
ST2195 Programming for Data
Science
Graphics and Data Visualisation in
Python
** Note: The code chunks below should be run in the following order **
Plotting in Python
In Python, matplotlib and seaborn are two of the main libraries for plotting. matplotlib is a basic
plotting library. seaborn is a Python data visualization library based on matplotlib, and it provides
some “prettier” and sometimes more interesting plots. We use both matplotlib and seaborn in
this notebook, although we mainly use matplotlib. We import the libraries as follows:
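The aliases plt and sns used throughout this page correspond to the usual imports:
import matplotlib.pyplot as plt
import seaborn as sns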
We will first start with the basics of matplotlib, which includes how to create a space for the plot,
and how to make a basic plot. We will then talk about how to create different plots including:
• Line plot
• Bar plot
• Boxplot
• Scatter plot
• Heat map
using matplotlib and seaborn. Before we introduce how to use the plotting libraries, let us first
load the data for visualisation.
Import and Clean the Data
We use GDP per capita data from DBnomics and wine chemistry composition data from scikit-
learn in this notebook. The data have already been downloaded and stored in three CSV files
(gdp.csv, gdp_wide.csv and wine.csv) and we load them into Python with the following code:
import numpy as np
import pandas as pd
import json
from datetime import datetime
gdp = pd.read_csv("gdp.csv")
# convert the year column from string to datetime object
gdp['year'] = gdp['year'].apply(lambda x : datetime.strptime(x, '%Y-%M-%d'))
gdp_wide = pd.read_csv("gdp_wide.csv")
# set the row labels as countries
gdp_wide.set_index('country', inplace= True)
# set the column labels (years) with the type int
gdp_wide.columns = gdp_wide.columns.astype(int)
wine = pd.read_csv("wine.csv")
gdp_uk, gdp_fr, gdp_it and gdp_de store the GDP per capita time series for the corresponding
country. For example:
print(gdp_uk)
index year value country
0 0 2000-01-01 00:01:00 27130.0 UK
1 1 2001-01-01 00:01:00 27770.0 UK
2 2 2002-01-01 00:01:00 28250.0 UK
3 3 2003-01-01 00:01:00 29060.0 UK
4 4 2004-01-01 00:01:00 29560.0 UK
5 5 2005-01-01 00:01:00 30210.0 UK
6 6 2006-01-01 00:01:00 30810.0 UK
7 7 2007-01-01 00:01:00 31280.0 UK
8 8 2008-01-01 00:01:00 30940.0 UK
9 9 2009-01-01 00:01:00 29460.0 UK
10 10 2010-01-01 00:01:00 29830.0 UK
11 11 2011-01-01 00:01:00 29960.0 UK
12 12 2012-01-01 00:01:00 30190.0 UK
13 13 2013-01-01 00:01:00 30660.0 UK
14 14 2014-01-01 00:01:00 31290.0 UK
15 15 2015-01-01 00:01:00 31780.0 UK
16 16 2016-01-01 00:01:00 32060.0 UK
17 17 2017-01-01 00:01:00 32430.0 UK
18 18 2018-01-01 00:01:00 32640.0 UK
19 19 2019-01-01 00:01:00 32870.0 UK
print(gdp_wide.head())
            2000     2001     2002     2003  ...     2016     2017     2018     2019
country                                      ...
DE       28910.0  29370.0  29290.0  29100.0  ...  34610.0  35380.0  35720.0  35840.0
FR       28930.0  29290.0  29410.0  29440.0  ...  31770.0  32380.0  32860.0  33270.0
IT       27430.0  27950.0  27960.0  27850.0  ...  26020.0  26490.0  26780.0  26920.0
UK       27130.0  27770.0  28250.0  29060.0  ...  32060.0  32430.0  32640.0  32870.0
[4 rows x 20 columns]
wine stores the chemistry composition of the wines in a table, with the last column providing
the information on which class each wine belongs to:
print(wine.head())
alcohol malic_acid ash ... od280/od315_of_diluted_wines proline target
0 14.23 1.71 2.43 ... 3.92 1065.0 class_0
1 13.20 1.78 2.14 ... 3.40 1050.0 class_0
2 13.16 2.36 2.67 ... 3.17 1185.0 class_0
3 14.37 1.95 2.50 ... 3.45 1480.0 class_0
4 13.24 2.59 2.87 ... 2.93 735.0 class_0
[5 rows x 14 columns]
matplotlib Basics
Creating an Empty Plot
We first learn how to create an empty figure. It can be done by:
fig, ax = plt.subplots()
plt.show()
This returns a Figure and an Axes object. By manipulating these objects, we can plot and set the
properties of the plot. plt.show() is used to make sure the plot is shown (although the plot may
still be shown without calling plt.show()).
In this notebook we focus on manipulating the plots using Axes object ax or plt but not the
Figure object fig. If you want to learn how you can use Figure object to manipulate the plots,
please see here.
Plotting Using ax
After the empty plot is created, we can plot the data by using the Axes object ax and the method
plot(). Here we plot the GDP per capita time series data:
fig, ax = plt.subplots()
ax.plot(gdp_uk['year'], gdp_uk['value'])
plt.show()
We can plot more than one series on the same axes by calling plot() repeatedly before
plt.show(). Axis labels and a title can then be added either through plt or through the Axes
object ax. For example, using plt:
fig, ax = plt.subplots()
ax.plot(gdp_uk['year'], gdp_uk['value']) # assumed series, suggested by the title below
ax.plot(gdp_fr['year'], gdp_fr['value']) # assumed series
plt.xlabel('year')
plt.ylabel('GDP per capita (Euro)')
plt.title("UK vs France")
plt.show()
Equivalently, using the methods of the Axes object ax:
fig, ax = plt.subplots()
ax.plot(gdp_uk['year'], gdp_uk['value']) # assumed series
ax.plot(gdp_fr['year'], gdp_fr['value']) # assumed series
ax.set_xlabel('year')
ax.set_ylabel('GDP per capita (Euro)')
ax.title.set_text("UK vs France")
plt.show()
Bar Plot
Here we continue to plot the GDP per capita data, but now we are plotting against different
countries rather than time, using the wide table gdp_wide and the DataFrame plot() method
(the bars can be drawn vertically with kind='bar' or horizontally with kind='barh'):
ax = gdp_wide.plot(kind='bar') # assumed plotting call: one group of bars per country
ax.set_ylim(top = 45000) # make the ylim max larger so that the legend and the bars are not overlapping
ax.set_ylabel('GDP per capita (Euro)')
ax.set_xlabel('countries')
plt.show()
Note that here we did not call fig, ax = plt.subplots() and then pass ax into the DataFrame
plot() method. Instead we used the ax returned by DataFrame plot(). We could of course use
fig, ax = plt.subplots() and then pass ax into DataFrame plot() (as shown below), but we do
not have to.
fig, ax = plt.subplots()
gdp_wide.plot(kind='bar', ax=ax) # assumed call: pass the Axes created above into DataFrame plot()
ax.set_ylim(top = 45000) # make the ylim max larger so that the legend and the bars are not overlapping
ax.set_ylabel('GDP per capita (Euro)')
ax.set_xlabel('countries')
plt.show()
Histogram: Showing the Distribution of a
Variable
Below we create a histogram via plt to show the frequency of observations (here, wines)
that fall into a given ash level interval. You can use ax instead if you want to (although the syntax
for setting labels and the title is not the same).
fig, ax = plt.subplots()
plt.hist(wine['ash']) # assumed call: histogram of the ash level
plt.xlabel('ash')
plt.show()
In order for the histogram to represent a distribution (i.e. the area of the histogram is 1), add the
argument density = 1:
fig, ax = plt.subplots()
plt.hist(wine['ash'], density = 1) # assumed call
plt.show()
Boxplot and Variations: Showing the
Distribution of a Variable
In Python, boxplots (and their variations) can be created via matplotlib or seaborn. We will use
boxplots to compare the chemistry composition of wines with different class labels.
We start by looking at the colour intensity regardless of the class label:
fig, ax = plt.subplots()
ax.boxplot(wine['color_intensity'])
ax.set_ylabel('color_intensity')
ax.set_xticklabels([''])
plt.show()
Side By Side (or grouped) Boxplot
Often, it is more useful to have boxplots side by side to compare the same variables conditional
on some factors. For example, here we want to see the colour intensity of wines given the type of
the wine. With matplotlib, we need to first convert our long wine['color_intensity'] data to
wide, and then create the boxplot:
fig, ax = plt.subplots()
ax.boxplot([wine['color_intensity'][wine['target']=='class_0'],
wine['color_intensity'][wine['target']=='class_1'],
wine['color_intensity'][wine['target']=='class_2']])
ax.set_ylabel('color_intensity')
ax.set_xticklabels(['class_0', 'class_1', 'class_2'])
plt.show()
Side-by-side boxplots can be created more easily via seaborn. With seaborn, we can use the
original wine DataFrame (the 'target' column in the wine DataFrame contains the class label
information):
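A sketch of the call, mirroring the notched version shown further below:
plt.subplots(figsize=(8, 4))
ax = sns.boxplot(data = wine, x = 'target', y = 'color_intensity', width = 0.3)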
ax.set_xlabel("type of wine")
plt.show()
Boxplot Variation
To create a notched boxplot, add the argument notch = True:
plt.subplots(figsize=(8, 4))
ax = sns.boxplot(data = wine, x = 'target', y = 'color_intensity', notch = True, width = 0.3)
ax.set_xlabel("type of wine")
plt.show()
To create a violin plot, we use violinplot(). Below, a boxplot is plotted as well, to show you
how you can have subplots with seaborn:
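A sketch of the set-up (the two-panel arrangement and which panel holds which plot are assumptions):
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.violinplot(data = wine, x = 'target', y = 'color_intensity', ax = ax[0])
sns.boxplot(data = wine, x = 'target', y = 'color_intensity', ax = ax[1])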
ax[0].set_xlabel("type of wine")
plt.show()
Scatter: Showing Relations Between Two
Variables
Below we create a scatter plot via DataFrame plot() to show how the two chemistry
composition levels relate to each other (you can use matplotlib instead if you want to).
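A sketch of the call (the pair of variables shown is an assumption; any two composition columns can be used):
ax = wine.plot.scatter(x = 'alcohol', y = 'color_intensity')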
plt.show()
Scatter Plot With seaborn
With seaborn, we can use relplot() to produce the scatter plot. We can set the optional
arguments hue, style, size, etc. to show an extra dimension of the data.
For example, here we plot color_intensity against alcohol, with the colour of the scattered
points depending on which group the wines belong to, by providing the optional argument hue:
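A sketch of that call:
sns.relplot(data = wine, x = 'alcohol', y = 'color_intensity', hue = 'target')
plt.show()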
Heat Map
To create a heatmap, we can use sns.heatmap(). Here we create a heatmap to show the
correlation of the wine chemistry composition:
sns.set(font_scale=0.5)
ax = sns.heatmap(wine.corr(),
center= 0, vmin = -1, vmax = 1, cmap= "RdYlBu", # this set the co
lour of the heatmap
square=True) # set the heatmap to be square shape
ax.figure.tight_layout() # this makes sure all labels are shown in the plot
plt.show()
The argument cmap= "RdYlBu" sets the palette, with red corresponds to lowest value, yellow
corresponds to the middle value and blue corresponds to the highest value. center = 0 sets 0
as the middle value, and vmax = 1 and vmax = -1 set the maximum and minimum values to be
1 and -1. To set heatmap with different colours, see here.
Exporting Plots
Similar to R, you can save the figure as a separate file. In Python, you can do it by using the
function savefig():
fig, ax = plt.subplots()
ax.plot(gdp_uk['year'], gdp_uk['value']) # assumed series, as in the earlier UK vs France plot
ax.plot(gdp_fr['year'], gdp_fr['value']) # assumed series
ax.set_xlabel('year')
ax.set_ylabel('GDP per capita (Euro)')
ax.title.set_text("UK vs France")
plt.savefig('gdp_uk_france.png') # you can find the file in the folder where you run this line of code
Networks
A network \(G\) consists of a set \(V\) of vertices and a set \(E\) of edges. An edge indicates a
relationship between two vertices. Networks provide standard tools for visualisation in data
science and related fields. Below are some examples of real-world data that can
be represented by networks:
• Social networks: Representing people and their social interactions. Examples include
looking into Facebook, LinkedIn, email exchanges, etc.
• Knowledge networks: Representing entities and their relationships. For example Google
knowledge graph, Bing Satori, Freebase, Yago, Wordnet, etc.
• Collaboration networks: Representing people and their collaborations. Examples include
Co-authorships from dblp, Google Scholar, Microsoft Academic search, etc.
• Product purchase networks: Representing who bought what. Example: Amazon product
purchases
• Reviews: Ratings of products or services provided by users. Example: TripAdvisor
• Road networks, communication networks.
Networks in Python
The standard Python package for the creation, manipulation, and study of the structure of
networks is networkx. A tutorial for it can be found here. To install networkx type the following
in the terminal or Anaconda prompt:
pip install networkx
The networkx package offers several standard layout styles for drawing networks, for example
the spring and circular layouts used later in this page.
We will illustrate the above on a real-world dataset. In the next two subsections, we will explore
and pre-process the data, and then we will sample from the above layout styles.
GitHub Organisations Data
As a working example, we will study the activities of users across different GitHub organizations
by a representation of data as a network. The vertices of the network will represent organisations
and an edge between two vertices will indicate that there is at least one user who initiated an
event to at least one repository in each of the two corresponding organizations. The edge
weights are defined as the number of such users. We will use the data retrieved from GitHub
archive for 2 March 2015 that are contained in the attached file 2015-03-02.csv.
We will need to do a fair amount of pre-processing to bring the data into the desired format. This
can be seen as a good data wrangling exercise. First, we initialise Python and load the data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load the data
fdate = '2015-03-02'
df = pd.read_csv(fdate + '.csv', parse_dates=['created_at'])
df.head()
The dataset contains several variables indicating the user (actor.login) and the GitHub
organisation (org.login) as well as other variables such as date created, ID, name of the
repository and type of event. To check which types of events exist we use the code below:
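A minimal sketch of such a check:
# count how many events of each type occur
df['type'].value_counts()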
Next, we want to transform the data to triplets: (actor, organization, repository). Moreover, we
remove events of type WatchEvent, drop duplicates and also remove missing values from
org.login.
df = df[(df['type']!='WatchEvent')]
df = df[['actor.login', 'org.login', 'repo.name']]
df = df.drop_duplicates()
df = df[df['org.login'].notnull()]
df.head()
Next, we want to focus on the most frequently occurring organisations, say the top 100. We will
do so using the function groupby().
org_actor_df = df.groupby(['org.login','actor.login']).agg('count').reset_index()
toporgs = org_actor_df.groupby(['org.login'])['org.login'].agg('count')
toporgs = toporgs.sort_values(ascending=False)[:100].keys().tolist()
toporgs[:10]
['mozilla', 'apache', 'facebook', 'google', 'angular', 'docker', 'elasticsearch', 'iojs', 'dotnet', 'owncloud']
The first line in the code above takes all possible pairs of org.login and actor.login and
applies the operation .agg('count') that counts the frequency of each pair. The second line
takes the frequency of each organisation and stores them into toporgs. Finally we sort the
organisations in terms of frequency of appearance and print the top 10 of them.
The next step almost takes us to our objective, as we track the organisations where each user
initiated an event and record all possible pairs between them. For example, if a user accessed
three organisations, we will record the three possible pairs between them using the function
combinations() from the Python package itertools. Note also that the line for name, group
in actor_df: loops over all groups (having their name as well) contained in the groupby object
actor_df that aggregates over each user.
from itertools import combinations
actor_df = df.groupby('actor.login')
dfs = pd.DataFrame(columns=['Actor', 'Org1', 'Org2'])
for name, group in actor_df:
    orgs = group['org.login'].unique()
    orgs = [val for val in orgs if val in toporgs]
    if len(orgs)>1:
        actor_edge = pd.DataFrame(data=list(combinations(orgs, 2)), columns=['Org1', 'Org2'])
        actor_edge['Actor'] = name
        dfs = pd.concat([dfs, actor_edge])
dfs.head()
Each row in the dfs data frame corresponds to a user initiating an event to two different
organisations (vertices), in other words it defines an edge to the network. We are almost ready to
proceed; the final steps produce a data frame that allows for a more efficient way to construct the
network by counting the number of times each edge (pair) occurs, also known as the weight of
the edge, and recording it. We also print the top 10 edges in terms of their weight.
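A sketch of those final steps (the object and column names edges_df and Weight are assumptions):
# count how many users generated each (Org1, Org2) pair; this count is the edge weight
edges_df = dfs.groupby(['Org1', 'Org2']).agg('count').reset_index()
edges_df = edges_df.rename(columns={'Actor': 'Weight'})
edges_df.sort_values('Weight', ascending=False).head(10)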
import networkx as nx
G = nx.Graph()
Finally, we can now produce a network with the networkx package. We start with the spring
style after setting some network options regarding its size and colour:
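A sketch of that chunk (the option values are illustrative, and the edges are added from the assumed edges_df object above):
# add the weighted edges to the graph and set some drawing options
G.add_weighted_edges_from(zip(edges_df['Org1'], edges_df['Org2'], edges_df['Weight']))
options = {'node_color': 'lightblue', 'node_size': 300, 'width': 0.3}
plt.subplots(figsize=(10, 10))
pos = nx.spring_layout(G)
nx.draw(G, pos=pos, font_size=9, **options)
nx.draw_networkx_labels(G, pos=pos, font_size=9)
plt.tight_layout()
plt.axis('off')
plt.show()
The circular layout is produced in the same way, replacing nx.spring_layout() with nx.circular_layout():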
plt.subplots(figsize=(10, 10))
pos = nx.circular_layout(G)
nx.draw(G, pos=pos, font_size=9, **options)
nx.draw_networkx_labels(G, pos=pos, font_size=9)
plt.tight_layout()
plt.axis('off')
plt.show()
More Network Layouts From Graphviz
The package networkx has more visualisation options (feel free to explore) and can also be
combined with Graphviz (graph visualization software), which is a package of open-source tools
initiated by AT&T Labs Research for drawing networks specified in DOT language scripts. More
information can be found here. To install Graphviz follow the instructions here. To incorporate
Graphviz into Python, the Pygraphviz interface to the Graphviz package can be used, see here
for more information. To install Pygraphviz, type one of the following in the Anaconda prompt or
terminal:
or
Networks in R
The most widely used package for manipulating network data in R is the igraph R package, which
provides methods to handle and visualize network data, as well as methods for computing key
network summaries (e.g. node centrality and connectivity metrics), methods for community
detection and basic methods for the visualization of network data. igraph and its capabilities are
also available in Python through the python-igraph module.
In addition to the visualization capabilities igraph provides, there is also a range of dedicated R
packages for the visualization of networks. Notable examples are the visNetwork R package, which
can be used to produce interactive network visualizations (see the visNetwork pages for
examples), the ggnetwork package, which provides ggplot2 geometries for network data (see
ggnetwork’s vignettes for examples), and the networkD3 R package.
Katya Ognyanova’s blog post provides a lot of resources about visualizing static and dynamic
networks in R.
The sna R package also provides a range of advanced summaries and data-analytic tools for
social network analysis.
References
Robinson, I., Webber, J., & Eifrem, E. (2015). Graph databases: New opportunities for connected
data (2nd ed.). O’Reilly Media.
ST2195 Programming for Data
Science
Block 9 Machine Learning
Frameworks
VLE Resources
There are extra resources for this block on the VLE – including quizzes and videos. Please make
sure you review them as part of your learning. You can find them here:
https://emfss.elearning.london.ac.uk/course/view.php?id=382#section-9
A typical data science workflow involves steps such as the following:
1. Import the data, clean them and prepare them for use (data wrangling).
2. Get a basic level of understanding of the data through exploration and visualisation.
3. Consider several models, as well as combinations thereof, to separate the signal from the noise.
4. Compare models and inform future decisions.
In this and the next week we will touch upon the last two steps. Note that this essentially
constitutes what is also known as machine learning, one of the most substantial parts of data
science. Here we will briefly mention some key concepts; for more on that consider taking the
course ST3189 Machine Learning.
An important pre-processing decision is how to treat missing values. Indicative options include:
1. For each missing point, impute the mean, median or mode of the variable based on the observed data.
2. For each missing point, impute a randomly sampled value from the observed data.
3. As in 2 above, but rather than sampling with equal probabilities, sample based on a histogram of the observed data.
4. Do not impute; instead, remove the cases with missing values entirely.
From a programming point of view, when data pre-processing steps such as these are merged with
machine learning models, we get a new, more extended model termed a machine learning pipeline.
• Supervised learning: For cases where there is a single variable that we want to predict,
termed the target (or response, or dependent) variable, based on several other ones,
termed features (or covariates, or independent variables).
• Unsupervised learning: For cases when there is no single target or response variable,
where we are interested in exploring the variables together with equal importance on all
of them. In such cases we may be interested in identifying groups of variables that share
something in common, known as factor analysis, or groups of individuals that have
something in common, known as clustering. Analysis of networks also falls into this
category.
In the next two weeks we will focus on supervised learning by working through several examples.
We can further divide supervised learning into two categories depending on the type of the
response variable: when the response is a continuous variable the task is termed regression,
whereas when it is categorical it is termed classification. There are several machine learning
models for these tasks; an indicative list, restricted to the methods that appear later in this block, includes:
• linear regression and penalised variants such as ridge and lasso regression
• random forests
• logistic regression
• gradient boosting
Most of these methods are covered in the course ST3189: Machine Learning. Here we will just
use them.
The most standard cost function used for regression tasks is the mean squared error (MSE). It is
obtained by subtracting each prediction from the corresponding actual value, squaring this
difference and averaging these squares over all cases. For the case of classification, see the notes
of next week.
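In symbols, if \(y_i\) denotes the observed value of the target and \(\hat{y}_i\) the corresponding prediction for each of the \(n\) cases, then
\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 .
\]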
To train and compare models, the data are typically split into the following parts:
1. Train set: Data used to train each model. After training the models we can contrast their
predictions of the target against the actual observations in this set. This provides the
train error, which is a measure of the in-sample performance of each model.
2. Validation set: The trained models from the previous step can also provide predictions for
the validation set points, based on the features of this set. Comparing these predictions
against the actual values of the target provides the test error, which is a measure of the
out-of-sample performance. We typically prefer the model with the lowest test error.
3. Test set: It turns out that the test error calculated on the validation set is good for
choosing models but downward biased if the aim is to estimate the resulting test error.
This is because the choice of the optimal models is made in sight of the supposedly
unseen data of the validation set. This is why we often set aside some entirely unseen
data in order to apply our chosen model in the end and obtain a more reliable value for
the test error.
The above choice often depends on the sample size; if it is not big enough the validation and test
sets are often merged into a single set called test set.
Cross-Validation
A potential issue with having a single train-validation-test split is that we may end up having a
fortunate or unfortunate split for some models due to chance. To limit this possibility, this process
can be done repeatedly, in other words we can use cross-validation. The repeated partitions can
be done in various ways.
One of the most widely used techniques is K-fold cross-validation: The data is randomly split into
K subsamples. Then, each of these subsamples is set as test set in turn, with the remaining K−1
subsamples being used for training purposes. This process is repeated K times and the results
are aggregated, e.g. averaged. See below for a schematic representation.
Diagram of K-fold cross-validation. Source: Wikipedia.
Other examples include repeated random subsampling (randomly split into train and test sets
many times and aggregate), leave-one-out or jackknife cross-validation (K-fold cross-validation
with K equal to the number of observations), etc.
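As an illustration (a minimal base-R sketch, not part of the original notes), K-fold cross-validation with K = 5 can be set up by randomly assigning each observation to one of the folds and looping over them:

set.seed(1)
K <- 5
n <- 100                                     # suppose there are 100 observations
folds <- sample(rep(1:K, length.out = n))    # random fold assignment
for (k in 1:K) {
  test_idx  <- which(folds == k)             # fold k plays the role of the test set
  train_idx <- which(folds != k)             # the remaining K - 1 folds form the train set
  # ... train on train_idx, evaluate on test_idx and store the error ...
}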
Putting everything together, the workflow we will follow involves the following steps:
1. Import and pre-process the data (data wrangling). The decisions made in these steps can
be thought of as model parameters.
2. Consider different models/learners, train them and compare them on the validation set.
3. Obtain predictions from the optimal model/learner chosen in the previous steps and
check its performance.
References
Murphy, K. P. (2013). Machine learning : A probabilistic perspective. MIT Press.
Machine Learning Frameworks in R
** Note: The code chunks below should be run in the following order **
Data
In this page the aim is to illustrate the main concepts of machine learning as well as working with
pipelines. We will do so using the mlr3 R package and work on the Boston dataset, which
consists of the following 14 variables:
Variable Description
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 square feet
indus proportion of non-retail business acres per town.
chas Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio by town
black 1000(Bk - 0.63)^2 where Bk is the proportion of Black ethnicity residents by town
lstat % lower status of the population
medv median value of owner-occupied homes in $1000’s
See Harrison & Rubinfeld (1978) for more information on the purpose for which the data were originally used.
Here we will focus on developing competitive machine learning techniques to predict the value of
an owner-occupied home in Boston, based on the features (variables) above. This is a
supervised learning task.
As a first step, we load and inspect the data from the MASS library.
library('MASS')
df<-Boston
head(df)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
To illustrate the use of a machine learning pipeline, we will also change the first value of the crim
variable into a missing value. The use of pipelines will then be essential. We use the skimr
package to produce a report on the data; the presence of the missing value is confirmed there.
library(skimr)
df$crim[1]<-NA
skim(df)
Data summary
Name df
Number of rows 506
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None
Variable type: numeric
skim_variable  n_missing  complete_rate    mean      sd      p0     p25     p50     p75    p100  hist
crim                   1              1    3.62    8.61    0.01    0.08    0.26    3.68   88.98  ▇▁▁▁▁
zn                     0              1   11.36   23.32    0.00    0.00    0.00   12.50  100.00  ▇▁▁▁▁
indus                  0              1   11.14    6.86    0.46    5.19    9.69   18.10   27.74  ▇▆▁▇▁
chas                   0              1    0.07    0.25    0.00    0.00    0.00    0.00    1.00  ▇▁▁▁▁
nox                    0              1    0.55    0.12    0.38    0.45    0.54    0.62    0.87  ▇▇▆▅▁
rm                     0              1    6.28    0.70    3.56    5.89    6.21    6.62    8.78  ▁▂▇▂▁
age                    0              1   68.57   28.15    2.90   45.02   77.50   94.07  100.00  ▂▂▂▃▇
dis                    0              1    3.80    2.11    1.13    2.10    3.21    5.19   12.13  ▇▅▂▁▁
rad                    0              1    9.55    8.71    1.00    4.00    5.00   24.00   24.00  ▇▂▁▁▃
tax                    0              1  408.24  168.54  187.00  279.00  330.00  666.00  711.00  ▇▇▃▁▇
ptratio                0              1   18.46    2.16   12.60   17.40   19.05   20.20   22.00  ▁▃▅▅▇
black                  0              1  356.67   91.29    0.32  375.38  391.44  396.22  396.90  ▁▁▁▁▇
lstat                  0              1   12.65    7.14    1.73    6.95   11.36   16.96   37.97  ▇▇▅▂▁
medv                   0              1   22.53    9.20    5.00   17.02   21.20   25.00   50.00  ▂▇▅▁▁
Machine Learning Pipelines in mlr3
The mlr3 package provides a general platform to train and evaluate a wide range of machine
learning models. It also allows us to set up machine learning pipelines in a very convenient manner.
Below we demonstrate the key steps for doing so. We first define a regression task on the Boston
data, with medv as the target, and set the mean squared error as the performance measure:
library('mlr3')
task <- TaskRegr$new('boston', backend=df, target = 'medv')
measure <- msr('regr.mse')
Next, a learner needs to be chosen. We start with linear regression, regr.lm, which is essentially
the lm() function in R.
library('mlr3learners')
learner_lm <- lrn('regr.lm')
If we were to fit lm() to the data directly, the cases with missing values would be removed entirely,
which may not be the best way to proceed. Other learners may even return an error. So it is
essential to either remove the cases with missing values or perform some imputation. Below we
go with imputemean, i.e. imputing the mean of the variable’s observed values, and include this
step as part of the learner. Learners based on other imputation options can be constructed and
compared to each other in the same way; for example, see the sketch below.
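Some alternative imputation operators available in mlr3pipelines (a sketch; the exact alternatives compared in the original notes are an assumption) are:

library('mlr3pipelines')
po('imputemedian')   # impute the median of the observed values
po('imputehist')     # impute by sampling from a histogram of the observed values
po('imputesample')   # impute a randomly sampled observed value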
In order to join together the operations of imputing missing values and fitting the linear regression
model, mlr3 creates a pipe for each of these tasks and connects them with the operator %>>%
from mlr3pipelines. The resulting pipeline is termed a graph, denoted by gr_lm in the code
below, and it can be converted into a learner using GraphLearner$new():
library('mlr3pipelines')
gr_lm <- po('imputemean') %>>%
po(learner_lm)
glrn_lm <- GraphLearner$new(gr_lm)
Now we can view the combined learner glrn_lm as a separate machine learning algorithm. To
evaluate its performance we split the data into train and test sets, train the learner on the train
set, and score its predictions on the test set against the observed values of the target. This can
all be done with the code below, which returns the value of the mse.
set.seed(1)
train_set <- sample(task$nrow, 0.7 * task$nrow)
test_set <- setdiff(seq_len(task$nrow), train_set)
glrn_lm$train(task, row_ids = train_set)
glrn_lm$predict(task, row_ids = test_set)$score()
regr.mse
27.2937
One of the advantages of working with pipelines, as above, is that they are quite convenient for
guarding against issues such as data leakage. In other words, the pipeline ensures that the
imputation is based entirely on information contained in the training dataset; this would not be
the case if we had imputed the value at the beginning on the basis of the entire dataset.
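The code chunk that fitted the ridge regression pipeline is not shown above; a minimal sketch of what it might look like, assuming the glmnet learner (regr.glmnet with alpha = 0 for ridge) and the same scaling and imputation operators that appear in the benchmark later, is:

library('mlr3learners')
library('mlr3pipelines')
# Ridge regression with a fixed penalty lambda = 0.03
learner_ridge <- lrn('regr.glmnet', alpha = 0, lambda = 0.03)
gr_ridge <- po('scale') %>>% po('imputemean') %>>% po(learner_ridge)
glrn_ridge <- GraphLearner$new(gr_ridge)
glrn_ridge$train(task, row_ids = train_set)
glrn_ridge$predict(task, row_ids = test_set)$score()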
The above gives a slightly lower mse than before, but it relied on the arbitrary value lambda =
0.03, which may not be the best option. To determine a better value we can try several
different values and choose among them via cross-validation. This process is known as tuning
the hyperparameters.
First we need to create a slightly different version of the previous learner. More specifically, the
value of lambda will not be fixed beforehand; instead it will be a free parameter to be optimised. As
before, we create a pipeline by incorporating imputation and scaling. This gives another learner,
termed glrn_ridge2; see the sketch below.
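The chunk defining glrn_ridge2 is not shown above; a sketch consistent with the description (lambda left free, plus the scaling and imputation steps) is:

learner_ridge2 <- lrn('regr.glmnet', alpha = 0)   # lambda left free, to be tuned
gr_ridge2 <- po('scale') %>>% po('imputemean') %>>% po(learner_ridge2)
glrn_ridge2 <- GraphLearner$new(gr_ridge2)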
The above can be done (in that order) with the following code. In the end we combine everything
into a new pipeline at_ridge which is trained on the training data.
library('mlr3tuning')
library('paradox')
# Set up tuning environment
tune_lambda <- ParamSet$new (list(
ParamDbl$new('regr.glmnet.lambda', lower = 0.001, upper = 2)
))
tuner<-tnr('grid_search')
terminator <- trm('evals', n_evals = 20)
#Put everything together in a new learner
at_ridge <- AutoTuner$new(
learner = glrn_ridge2,
resampling = rsmp('cv', folds = 3),
measure = measure,
search_space = tune_lambda,
terminator = terminator,
tuner = tuner
)
#Train the learner on the training data
at_ridge$train(task, row_ids = train_set)
After training the ridge machine learning pipeline, the next step is to evaluate its performance.
Below we just report its mse. Further inspection of the training, for example finding the optimal
value of lambda, can be done with at_ridge$model.
at_ridge$predict(task, row_ids = test_set)$score()
regr.mse
27.18366
The result is a slightly lower mse, which is not a convincing improvement. A similar learner is
offered by lasso regression and can be fit using the previous code by simply setting alpha = 1;
it is recommended to try this as an exercise.
Random Forests
So, next, we move away from the linear regression setting and try random forests, a very
successful technique in Kaggle competitions. Random forests are known to do very well in
finding non-linear associations as well as synergies between the features. For (entirely optional)
reading on random forests, the James et al. (2014) textbook is a good place to start. To specify a
random forest pipeline the code is almost identical and only differs in the learner (regr.ranger)
and its parameters. Again, the code below assumes that the R package ranger is installed on
your computer.
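The random forest chunk itself is not shown above; a sketch of what it might look like, with the choice of tuned parameter (mtry, the number of features tried at each split) being an assumption, is:

learner_rf <- lrn('regr.ranger')
gr_rf <- po('scale') %>>% po('imputemean') %>>% po(learner_rf)
glrn_rf <- GraphLearner$new(gr_rf)
# Tune mtry via 3-fold cross-validation, as was done for lambda above
tune_rf <- ParamSet$new(list(
  ParamInt$new('regr.ranger.mtry', lower = 2, upper = 8)
))
at_rf <- AutoTuner$new(
  learner = glrn_rf,
  resampling = rsmp('cv', folds = 3),
  measure = measure,
  search_space = tune_rf,
  terminator = trm('evals', n_evals = 20),
  tuner = tnr('grid_search')
)
at_rf$train(task, row_ids = train_set)
at_rf$predict(task, row_ids = test_set)$score()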
We see that the improvement offered by random forests is substantial in this case.
Benchmarking
An alternative and easier-to-code way to compare the different learners is the benchmark()
function, which automates the train/predict steps. It requires a benchmark design as input, which
can be set up with the function benchmark_grid(). The latter tabulates the different comparisons
to be made, consisting of combinations of tasks, resampling schemes and learners. Below, we
stick to the regression task on the Boston data, choose three-fold cross-validation and the four
learners we have checked so far (the code below may take a while to run):
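The benchmarking chunk is not shown above; a sketch that is consistent with the output below (the learner names are those constructed earlier, some of them in the sketches above) is:

set.seed(1)
# Compare the four learners on the Boston task using 3-fold cross-validation
lrn_list <- list(glrn_lm, glrn_ridge, at_ridge, at_rf)
bm_design <- benchmark_grid(tasks = task, learners = lrn_list,
                            resamplings = rsmp('cv', folds = 3))
bmr <- benchmark(bm_design)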
A nice way to visualise the comparisons made is with boxplots. We can also see the overall mse
for each learner using bmr$aggregate(measure).
library('mlr3viz')
library('ggplot2')
autoplot(bmr) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
bmr$aggregate(measure)
nr resample_result task_id learner_id
1: 1 <ResampleResult[21]> boston imputemean.regr.lm
2: 2 <ResampleResult[21]> boston scale.imputemean.regr.glmnet
3: 3 <ResampleResult[21]> boston scale.imputemean.regr.glmnet.tuned
4: 4 <ResampleResult[21]> boston scale.imputemean.regr.ranger.tuned
resampling_id iters regr.mse
1: cv 3 23.81437
2: cv 3 23.79866
3: cv 3 23.83034
4: cv 3 11.32159
The results above confirm the clear dominance of random forests over the regression methods on
the Boston dataset.
Parallelisation
Training pipelines may end up taking a lot of time. One way to reduce this is to split the
procedure into multiple jobs and execute them in parallel, simultaneously. This procedure is
known as parallelisation and allows for significant gains in terms of computing power and a
substantial reduction in execution times.
The package mlr3 is compatible with the future package for parallelisation, thus making it very
easy to parallelise. All you essentially have to do is install the package future and run a line like
the following (here requesting a multisession plan) at the beginning of your code:
future::plan("multisession")
References
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air.
Journal of Environmental Economics and Management, 5, 81–102.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An introduction to statistical learning:
With applications in R. Springer.
Categorical Variables in Machine
Learning
As already mentioned, categorical variables take values in a predetermined finite set of
categories. An example of a categorical variable is whether an email is spam or not. Below we
briefly discuss how to handle such variables in machine learning (mostly supervised learning),
depending on whether they form the target variable or the feature variables.
Categorical Predictions
In order to illustrate the concepts, let us consider binary classification with the target \(Y\) taking
the value 1 if a person will default on their credit card and 0 otherwise. Typically machine
learning models, also called classifiers in this case, provide probabilities of \(Y=1\) for each
individual given their set of features. Often a threshold is selected for these probabilities, above
which we classify the case as 1; otherwise it is classified as 0. Without loss of generality, let us
assume that this threshold is 0.5. In our example, the individuals with probability higher than 0.5
are classified as projected to default, otherwise they are considered safe. Given the data we can
then check the actual outcome, i.e. whether they did default or not. When contrasting projected
classifications and outcomes, there are four different cases: true positives (projected to default
and did default), false positives (projected to default but did not), true negatives (projected safe
and did not default) and false negatives (projected safe but did default).
The rate of true positives correctly identified is also known as sensitivity, whereas the rate of true
negatives correctly identified is also known as specificity. These quantities can be summarised in
the so-called confusion matrix. Below we see an example of a confusion matrix, taken from
James et al. (2014), that corresponds to the threshold of 0.5:

                  True: No   True: Yes   Total
Predicted: No         9644         252    9896
Predicted: Yes          23          81     104
Total                 9667         333   10000
Based on the above matrix we see that the overall classification accuracy is (9644+81)/10000, or
else 97.25%. This may seem high, but if we focus on the 333 individuals who defaulted, only 81
were identified. In other words the sensitivity is 81/333=0.24, which means that 76% were
missed. On the other hand the model is very successful in classifying the negatives, i.e. those
less likely to default, as the specificity is 9644/9667=0.99. Note however that the confusion
matrix depends on the choice of threshold, which may also be determined by the context.
In this case the aim may be to identify individuals that are likely to default and take action
to prevent it. In that sense we may be willing to tolerate some false positives, in other words
triggering false alarms that flag people as being in danger of default when they are not, if this
results in higher chances of identifying the people who are in danger. This can be calibrated by
choosing another threshold. Below is the confusion matrix from the same example for a threshold of 0.2:
ROC curve for the previous example with AUC of 0.95. Figure 4.8 in James et
al. (2014)
In the ROC plot above, the ideal ROC curve would come as close as possible to the top left
corner, which corresponds to a high true positive rate (high sensitivity) and a low false positive
rate (high specificity). The dotted line corresponds to a classifier that is essentially guessing at
random and classifies correctly with 50% chance.
Categorical Features
There are two ways to handle categorical variables as features (inputs) in regression or
classification tasks. The first way is to assign a numerical score to their values and treat them as
continuous. This is clearly suboptimal from an interpretation viewpoint, as the assignment of
numerical scores is arbitrary. In the case of nominal variables (e.g. 1: red, 2: yellow, 3: green) there
is no reason why one of these colours should have a bigger value than the others. But even for
ordinal variables (e.g. 1: low, 2: medium, 3: high) there is no guarantee that the distance between
medium and low is the same as that between high and medium, so we may as well have used
the numbers (1, 3, 7) or any other ordered triplet of real numbers. Despite this, numerical scores
(quantification) are still used, as they are convenient and result in fewer features in the model.
A more meaningful option is to create dummy variable(s) from each categorical variable. Dummy
variables take values of either 0 or 1, indicating the presence (1) or absence (0) of a category.
Typically the number of dummy variables for each categorical variable is equal to the number of
categories (say k) minus 1; one category is treated as the reference category. Let us consider
the (1:red, 2:yellow, 3:green) example, where we can set the green category as reference and
create two dummy variables: one taking value 1 if the colour is red and 0 otherwise, and another
one taking the value 1 if the colour is yellow and 0 otherwise. The reason for creating k-1 dummy
variables, rather than k, is that the k-th dummy variable is redundant in the presence of the
others. Still on the (1:red, 2:yellow, 3:green) example, if we know whether the colour is red (or
not) and whether it is yellow (or not), then we can also infer if it is green or not. In other words, if
the dummy variable for red is 1 or if the dummy variable for yellow is 1, the colour is not green.
But if both of the dummy variables for red and yellow are 0, then the colour is green.
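As an illustration (a minimal base-R sketch, not from the original notes), such dummy variables can be created with model.matrix(), using green as the reference level:

colour <- factor(c("red", "yellow", "green", "red"),
                 levels = c("green", "red", "yellow"))   # green is the reference category
model.matrix(~ colour)   # intercept plus k - 1 = 2 dummy variables (red, yellow)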
Now consider a regression model where the target is \(y\) and the dummy variables for red and
yellow are the features. The coefficient of the red dummy variable will then reflect the mean
difference (in the values of \(y\)) between cases corresponding to red and green colours.
Similarly the coefficient of the yellow dummy variable will reflect the mean difference (in the
values of \(y\)) between cases corresponding to yellow and green colours. The constant in the
regression equation will reflect the mean value of \(y\) when both dummy variables are zero or, in
other words, when the colour is green.
Hence, we can set up a regression model with a categorical feature in that way and the
regression coefficients will reflect the mean values of \(y\) in each of the categories. All we need
to do is construct these dummy variables, use them as inputs, and then proceed as in the case of
continuous inputs. Note also that the choice of the reference category only affects the
interpretation of the regression coefficients. In terms of predictive performance, models with
dummy variables extracted on the basis of different reference categories will be equivalent.
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An introduction to statistical learning:
With applications in R. Springer.
ST2195 Programming for Data
Science
Machine Learning Frameworks in
Python
** Note: The code chunks below should be run in the following order **
Python Setup
We start by activating the necessary Python packages. In addition to the standard use of pandas
and matplotlib, we will be using several features from the scikit-learn or sklearn package
which is the default option for machine learning in Python. The scikit-learn package is an
open-source collection of simple and efficient tools for machine learning, built on numpy, scipy,
and matplotlib. More details about it can be found here.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml #using openml to import data
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer #transform different types
Data
We will be working on the titanic dataset that is available from the OpenML website. The features
and response are stored in separate arrays, X_initial, and y respectively. To view all the data
we combine them into a single pandas data frame called combined_dataset.
To illustrate concepts we will focus on five features: age, fare, embarked, sex and class. The
matrix of these five features to be used in the following examples will be labelled as X.
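The data-loading chunk is not shown above; a minimal sketch of what it could look like (the OpenML version and the exact column names, e.g. pclass for the passenger class, are assumptions) is:

# Fetch the titanic data from OpenML as pandas objects
X_initial, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
combined_dataset = pd.concat([X_initial, y], axis=1)
# Keep the five features used in the examples below
X = X_initial[['age', 'fare', 'embarked', 'sex', 'pclass']]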
For categorical features, imputation is applied as before and we also use the OneHotEncoder()
function to create the appropriate dummy variables (see the notes on categorical variables in this
week’s material). The resulting pipeline is called categorical_transformer; a sketch of both
transformers is given below.
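The chunks defining the two transformers are not shown above; a minimal sketch, assuming age and fare are treated as numerical and embarked, sex and pclass as categorical, is:

numerical_features = ['age', 'fare']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),      # impute missing numerical values
    ('scaler', StandardScaler())])                    # standardise
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),            # impute a constant value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])        # create dummy variables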
Having pre-processed the numerical and categorical features separately we can now merge
them with the function ColumnTransformer(), in which we specify the two pipelines to be
merged using the argument transformers. This creates a new pipeline called
data_transformer.
data_transformer = ColumnTransformer(
transformers=[
('numerical', numerical_transformer, numerical_features),
('categorical', categorical_transformer, categorical_features)])
The data_transformer pipeline now contains all the pre-processing steps. It can be used as the
starting point for different (machine) learners as we illustrate in the next sections.
The next stage is to perform the standard machine learning operations of training the pipeline
and evaluating its predictive performance. We begin by splitting the data into train and test
datasets with the very convenient function train_test_split().
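The corresponding chunks are not shown above; a sketch of the split and of the two pipelines compared later (a logistic regression and a gradient boosting classifier, each preceded by data_transformer) is:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
pipe_lr = Pipeline(steps=[('data_transformer', data_transformer),
                          ('model', LogisticRegression(max_iter=1000))])
pipe_gdb = Pipeline(steps=[('data_transformer', data_transformer),
                           ('model', GradientBoostingClassifier())])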
To tune the pipeline we can use cross-validation for each combination of its hyperparameters. In
other words we can use grid search cross-validation that can be done with the function
GridSearchCV(). This requires setting up a parameter grid, which is a dictionary containing the
names of each hyperparameter followed by its range of values. The hyperparameters to be tuned
below are the imputation strategies used in SimpleImputer(). In the case of numerical variables
there are two options, mean or median. For categorical variables one option is to impute the
most frequent value of each variable (most_frequent), whereas another option is to impute the
missing value to be the reference category (constant with filled value being left to the default of
0). The best possible combination is chosen based on fitting the models on the training data.
param_grid = {
'data_transformer__numerical__imputer__strategy': ['mean', 'median'],
'data_transformer__categorical__imputer__strategy': ['constant','most_frequent'
]
}
grid_lr = GridSearchCV(pipe_lr, param_grid=param_grid)
grid_lr.fit(X_train, y_train);
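The gradient boosting pipeline is tuned and fitted in exactly the same way (a sketch of the chunk, which is not shown above):

grid_gdb = GridSearchCV(pipe_gdb, param_grid=param_grid)
grid_gdb.fit(X_train, y_train);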
Finally let us compare the two machine learning pipelines by overlaying their receiver operating
characteristic (ROC) curves.
ax = plt.gca()
plot_roc_curve(grid_lr, X_test, y_test, ax=ax, name='Logistic Regression')
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x14a0606d8>
plot_roc_curve(grid_gdb, X_test, y_test, ax=ax, name='Gradient Boosting')
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x14a064198>
plt.show()
Gradient boosting offers a small improvement, which could potentially be increased further by
tuning its hyperparameters on the training dataset. This is left as an exercise.
Parallelisation
As in R, it is straightforward to use parallel computing in order to speed up the training time. The
function GridSearchCV() has the argument n_jobs which, if set to any value other than 1, results
in parallel computing. Setting n_jobs=-1 results in using all the available cores. An alternative to
GridSearchCV(), which can be parallelised in the same way, is RandomizedSearchCV(), which
searches randomly over the possible combinations of hyperparameter values.
How the above phases are combined and in what order, and how much time or effort is
invested in each phase is guided by the software development life cycle (SDLC)
methodology that each software development team or organization chooses to adopt.
• Waterfall SDLC model: The phases are organized in a rigid way in the order given
above with strong investment of time and effort in each phase and with no option to
revisit past phases. Most software companies nowadays require more flexibility than
what the waterfall model provides, but it still remains a strong methodology for
mission-critical projects, where mistakes in the solution can result in extreme costs
(e.g. piloting software, software used in nuclear power stations, etc.)
• Iterative SDLC model: The key idea of this model is to develop a system through
repeated iterations of the waterfall model, where at each iteration (also known as a
“mini waterfall”) a sub-problem is solved. Despite the fact that the phases are again
organized in a rigid way, developers can take advantage of what was learned from the
iterations of earlier parts.
• Agile SDLC model: The agile model is in a sense the opposite of the waterfall model.
Instead of organizing the phases in a rigid way, the agile model makes all phases an
ongoing process that requires the involvement of developers, management and
customers. Work is typically organized in two- to four-week segments (known as
“sprints”), in which the responsible teams tackle the major needs of their customers
and perform testing, integration and deployment on the go. The agile model is popular
in start-ups where speed and flexibility (e.g. because of changes in requirements) are
essential.
• V-shaped SDLC model: This is an evolution of the waterfall model, where each
development phase before implementation is associated with a testing phase
(e.g. requirement analysis and planning—acceptance testing, software design—
software testing). The next phase starts only after there is a testing activity
corresponding to it.
• DevOps SDLC models: One of the most recent SDLC models, where development and
operational teams are merged together into a single team. Software developers and
engineers work across the entire SDLC from development to testing, to deployment
and operations, including interaction with customers.
• Spiral SDLC model: The spiral model is a blend of iterative and waterfall approaches,
and allows teams to adopt multiple SDLC models based on the risk patterns of the
project.
Packages
Python and R packages play a key role in data science projects, not only because they are
routinely being used as dependencies or tools during at least some of the software
development phases but also because they are often the key outputs and deliverables of a
project or part of those.
In fact, one of the reasons that Python and R are so pervasive in data science is because they
benefit from diverse and active ecosystems of packages that, as we have seen in this course,
provide ready methods for read and write operations, data wrangling and visualization, and
for developing machine learning pipelines. As we will also see next week, there are even
dedicated packages for testing software.
A package bundles together code, data, documentation, and potentially tests. This makes it
easy to share with others, either by submitting it to one of the official repositories (e.g. CRAN
for R and PyPI for Python) or by hosting it in a repository hosting service like GitHub.
Even if you do not want to share your code, organizing it in a package always pays off.
Python and R packages follow strict rules on how code is organized, documented and tested,
and as a result, packaged code is easier to return to than plain text files, even after several
years.
In the following sections you will learn the basic structure of R and Python packages.
Developing R Packages
Package Skeleton
Suppose that you wrote a function circle() that takes as input the radius of a circle and
returns a list of class "circle" with the radius, and a function area.circle() that takes as
input objects of class "circle" and calculates the area of the circle.
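The chunk with these two functions is not reproduced above; a sketch that is consistent with the description (and with the version of area.circle() that appears later, in the documentation section) is:

circle <- function(radius) {
  # a list of class "circle" holding the radius
  structure(list(radius = radius), class = "circle")
}

#' @export
area.circle <- function(object) {
  pi * object$radius^2
}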
You may notice the commented-out line #' @export before area.circle(). Ignore it for
now; this is special terminology, whose importance will become apparent later in this page.
You immediately realize that there is potential for an R package that provides functions and
methods to compute the area of circles and other 2D shapes, called areacalc. How can you go
about it?
The first step in writing an R package is to set up the basic skeleton for the package
directory. The conventions that should be used for a valid R package are detailed in Section
1.1 of the “Writing R extensions” resource. While you can always go ahead and create the
required directories and files manually, many of those tasks are nowadays abstracted away
through high-level, specialized utilities such as the ones provided by the usethis R package.
Assuming that you have already installed usethis, the following code chunk sets up the
skeleton for areacalc under the directory "~/Repositories/" (feel free to replace this with
any existing directory in your system):
library("usethis")
create_package("~/Repositories/areacalc/")
✔ Creating '~/Repositories/areacalc/'
✔ Setting active project to '~/Repositories/areacalc'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: areacalc
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
* First Last <[email protected]> [aut, cre] (YOUR-ORCID-ID)
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to
pick a license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
✔ Writing 'NAMESPACE'
✔ Changing working directory to '~/Repositories/areacalc/'
The output above details the steps that create_package() took while setting up the package
skeleton for the areacalc R package. Now, the directory ~/Repositories/areacalc looks
like (using the dir_tree() function of fs R package)
library("fs")
dir_tree("~/Repositories/areacalc")
~/Repositories/areacalc
├── DESCRIPTION
├── NAMESPACE
└── R
This is the most basic structure for an R package. There is a DESCRIPTION file, whose contents
were automatically generated by create_package() earlier (see the output above), a
NAMESPACE file which defines the package namespace, and a directory called R, which will
host the R code we will put in the package.
The use_r() function of usethis creates an R script with the given name under R/ and opens it for editing:
use_r("circle")
✓ Setting active project to '~/Repositories/areacalc'
● Modify 'R/circle.R'
● Call `use_test()` to create a matching test file
Go ahead and copy and paste in that script the code chunk where the functions circle() and
area.circle() are defined above and hit save. Feel free to ignore the last piece of advice in
the output about calling use_test() for now; we will come back to this in the following
week.
Congratulations! You have just created your first bare-bones R package! You can use the
load_all() function of the devtools to load your new functions
library("devtools")
Loading required package: usethis
load_all("~/Repositories/areacalc")
Loading areacalc
ci <- circle(1)
area.circle(ci)
[1] 3.141593
Now, as an exercise, include a new script called square.R in areacalc with the following
two functions to compute the area of a square; only area.square() is shown here, and a sketch of
the accompanying square() constructor follows the chunk:
#' @export
area.square <- function(object) {
object$side^2
}
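A possible definition for the constructor (an illustration, assuming it mirrors circle()):

square <- function(side) {
  structure(list(side = side), class = "square")
}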
dir_tree("~/Repositories/areacalc")
~/Repositories/areacalc
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── R
│ ├── area.R
│ ├── circle.R
│ ├── circle.R~
│ └── square.R
└── man
├── area.circle.Rd
└── circle.Rd
load_all("~/Repositories/areacalc")
Loading areacalc
sq <- square(3)
area.square(sq)
[1] 9
Back to areacalc, using the S3 OOP system we can easily implement a generic method called
area() that will calculate the area of a shape, depending on whether the object we pass to the
generic method is of class "circle" (e.g. as constructed by the circle() function) or
"square" (e.g. as constructed by the square() function).
The names we used for the functions area.circle() and area.square() in areacalc were
far from accidental. In fact, the particular naming convention generic_method.class() we
used is S3-specific and makes area.circle() and area.square() methods ready to be used
by a generic method area()!
In order to implement the generic method, we first create a new script under
~/Repositories/areacalc/R/ (I called it area.R) with the following very simple generic
function specification
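The generic itself is not shown above; the standard S3 specification (a sketch) would be:

# generic method: dispatches to area.circle(), area.square(), ... based on the object's class
area <- function(object, ...) {
  UseMethod("area")
}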
Then, we can call the document() function from the devtools R package to update the
package’s name space. document() uses the methods from the roxygen2 package to populate
the NAMESPACE file.
document("~/Repositories/areacalc")
Updating areacalc documentation
Loading areacalc
Writing NAMESPACE
Writing NAMESPACE
The #' @export statements before the definitions of area.circle() and area.square() told
roxygen2 to generate the table of exported methods. Your NAMESPACE file now looks like
S3method(area,circle)
S3method(area,square)
in essence telling R that areacalc provides two S3 methods for the generic area() for objects
of class "circle" and "square".
Now we can do
ci <- circle(1)
sq <- square(1)
area(ci)
area(sq)
Notice here that the generic method area() automatically decides which specific method to
dispatch (area.circle() or area.square()) depending on the class of the object we pass to
it ("circle" or "square", respectively). If you want to add support for a new shape, say a
rectangle, you just need to write the rectangle() and area.rectangle() functions, add
#' @export before the definition of area.rectangle(), and rerun document().
DESCRIPTION File
The DESCRIPTION file is where we specify the package title, version, authors, licence, and
much other information, including other packages that the package depends on. Go ahead and
edit areacalc’s description file to add an informative title and your details. In my copy, I have
edited the title and person specification, set the version to 0.0.1, and used the
use_mit_license("John Doe") function to populate the License attribute with the details
of the MIT licence. After all this, my DESCRIPTION file looks like
Package: areacalc
Title: Methods to Define and Compute the Area of Various Shapes
Version: 0.0.1
Authors@R:
person(given = "John",
family = "Doe",
role = c("aut", "cre"),
email = "[email protected]")
Description: Provides methods to define various shapes in terms of their
defining characteristics and compute their area.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
dir_tree("~/Repositories/areacalc")
~/Repositories/areacalc
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── R
│ ├── area.R
│ ├── circle.R
│ ├── circle.R~
│ └── square.R
└── man
├── area.circle.Rd
└── circle.Rd
Building Packages
If we want to share the package or submit it in a repository we can do
build("~/Repositories/areacalc")
✔ checking for file ‘~/Repositories/areacalc/DESCRIPTION’
─ preparing ‘areacalc’:
✔ checking DESCRIPTION meta-information
─ checking for LF line-endings in source and make files and shell scripts
─ checking for empty or unneeded directories
Removed empty directory ‘areacalc/man’
─ building ‘areacalc_0.0.1.tar.gz’
[1] "~/Repositories/areacalc_0.0.1.tar.gz"
As the output suggests, the package directory structure is now in the compressed file
areacalc_0.0.1.tar.gz, which can be shared as is for installation and use by other users.
The use_github() function of the usethis R package allows you to upload the package to
GitHub. This can also be done “less manually” by using the GUIs that RStudio provides.
Installing
If you want to install areacalc version 0.0.1 and have it ready for use in future R sessions, you
simply do install("~/Repositories/areacalc") if you want to install directly from the
package directory structure, or
install.packages("~/Repositories/areacalc_0.0.1.tar.gz") if you want to install
from the compressed file built earlier.
Publishing
The easiest way to release an R package is using the release() function from the devtools R
package, which will guide you through the steps required for release and make
recommendations about what you are expected to have done prior to release; see the section on
releasing R packages in the “R packages” book. Of course, areacalc is not ready for release
yet. More work is needed to document the package, and it is also useful to incorporate
some tests to make sure that the methods we developed work as expected.
Useful Links and Resources
• R packages book by Hadley Wickham: An authoritative account about how to build
and publish R packages.
• Writing R extensions resource: The definitive guide for creating R packages and
writing R documentation.
Modularization
Modular programming is a design technique that separates the functionality of a program into
smaller independent parts, each responsible for a specific task. As each module is responsible for a
specific task, it is much easier to navigate the code base and locate the relevant code. Different
developers can also work on different parts of the program independently. We have already seen
one tool that promotes code modularization: functions. With the use of functions, we can break
down a complex problem into smaller sub-problems. The use of functions also facilitates code
reuse. In this page, we introduce two other tools for code modularization: modules and packages.
Modules
In Python, a module is a file with the suffix .py that contains Python code. The name of the
module is the same as the name of the file. For example, the module circle is created by writing
the relevant code in the circle.py file.
'''
This module contains functions for circle calculation
'''
pi = 3.14159
def area(radius):
'''
return area of a circle given the radius. Radius is assumed to be a
non-negative number
'''
return pi *(radius**2)
def circumference(radius):
'''
return circumference of a circle given the radius. Radius is assumed to
be a non-negative number
'''
return 2 * pi * radius
Using modules is very similar to using packages, as we have done so far. We first import the
module by import module. For example, to import the module circle we write:
import circle
By using the function help(), we can see how to use the module:
>>> help(circle)
Help on module circle:
NAME
circle - This module contains functions for circle calculation
FUNCTIONS
area(radius)
return area of a circle given the radius. Radius is assumed to be a
non-negative number
circumference(radius)
return circumference of a circle given the radius. Radius is
assumed to be a non-negative number
DATA
pi = 3.14159
The docstring of this module tells us that the module aims to do circle-related calculation.
The help() function also tells us that the module has two functions for calculating the area
and circumference of a circle given the radius, and an object pi. While the help() function
tells us how to use the module and its functions, it does not tell us how the module and its
functions are implemented. This is abstraction - we hide the unnecessary details from the
user.
To access the attributes (which include functions and other objects) of a module, we use
module.attribute. For functions, we call module.func(). For example, if we want to use
the function area() from the module circle, we call it as follows:
circle.area(3)
28.26
Similarly, if you want to retrieve other objects from the module, you need to write
module.obj. For example, if we want to retrieve pi from the module circle, we need to
write:
circle.pi
3.14
We will get an error if we try to call the functions or use the other objects directly, as functions
and other objects live within the module’s namespace.
If you need to use the functions a lot, calling them through the module by module.func()
may not be convenient. Instead you can import the function or other objects from the module by
from module import function. This enables you to use them directly:
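For example (a short sketch):

from circle import area, pi
area(3)   # no module prefix needed
pi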
Now let’s create our own package my_math for calculation. We put the file __init__.py in the
package folder to tell Python this is a package. We also put in the files vector.py and circle.py
for the vector and circle-related calculations. The structure of the files is as follows:
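A layout consistent with this description (a sketch) is:

my_math/
├── __init__.py
├── circle.py
└── vector.py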
We can import the package my_math in the same way we have imported other third-party
packages:
import my_math.circle
my_math.circle.area(3)
28.26
Sub-Packages
A package often has a large number of files, and a hierarchical structure is often imposed by
grouping the module files into different folders and including an __init__.py file in each folder.
Take a look at the source code of NumPy on GitHub and you can see that modules are organized
via folders. For example, the modules related to random sampling are under the sub-folder
numpy/random and the modules related to polynomial calculation are under the sub-folder
numpy/polynomial. For each sub-folder, there is another __init__.py. Each sub-folder is
therefore a sub-package. For NumPy, numpy.random is a sub-package.
Let us create another package, my_math_2, which has the following structure:
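The structure itself is not shown above; assuming my_math_2 is meant to illustrate sub-packages, it might look something like:

my_math_2/
├── __init__.py
├── circle.py
└── vector/
    ├── __init__.py
    └── vector.py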
The Python Package Index (PyPI) is a repository of software for the Python programming
language, where packages are shared and distributed.
PyPI is the official third-party software repository for Python, and it is the default source of
packages for pip, a popular package installer for Python. The name Conda refers to both a
general-purpose package management system and its package repository. Unlike PyPI, Conda
manages software written in any language.
Publishing a Package
To publish the package, see the official Python documentation and PyPI.
Documentation
You have (hopefully) typed help or ? at least twice in this course while studying the
workings of R and Python, and hence can appreciate how important it is for software to be
documented well. Documentation allows both developers and users to get at least an idea of
what a function or package does and how, or of what particular objects are, without necessarily
having to read the code. In fact, it would be almost impossible to write Python or R code
without having access to the documentation of Python or R!
Documenting R Code
R objects are documented in files with the extension .Rd, written in the “R documentation” (Rd)
format. Rd is a simple markup language, much of which closely resembles (La)TeX, and it can
be processed into a variety of formats, including LaTeX, HTML and plain text. Typically,
documentation files are distinct from the script files of an R package and are put under the man/
directory in the package’s directory structure. A clear description of how .Rd files should be
authored and the conventions for their elements is given in the “Writing R extensions” resource.
#' @export
area.circle <- function(object) {
pi * object$radius^2
}
Both circle() and area.circle() are currently not documented; that is, if you type ?circle
or ?area.circle no guidance on how these functions are used comes up. In order to add
some documentation we edit the file to look like the sketch below.
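The edited file is not reproduced above; a sketch of what it might look like, following the conventions described next, is:

#' Create a circle
#'
#' Constructs an object of class "circle" from its radius.
#'
#' @param radius The radius of the circle.
#'
#' @seealso [area.circle()]
#' @examples
#' ci <- circle(1)
#' @export
circle <- function(radius) {
  structure(list(radius = radius), class = "circle")
}

#' Area of a circle
#'
#' Computes the area of an object of class "circle".
#'
#' @param object An object of class "circle", as returned by [circle()].
#'
#' @seealso [circle()]
#' @examples
#' area.circle(circle(1))
#' @export
area.circle <- function(object) {
  pi * object$radius^2
}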
As you notice roxygen2 comments start with #'. In this file we have added documentation
for two functions, with the documentation appearing before the definition of each function.
The first line of the documentation is the title of the help file. The following line is a short
description of what the function does. @param followed by the argument name is used to
document each of the function arguments; for functions with multiple arguments we use
multiple lines starting with @param followed by the argument name. @seealso provides links
to other related objects, and after @examples we have a code chunk that illustrates what the
function does. @export is a statement we have already seen last week, and it is used to tell
document() that we want this function to be visible to the users of the package. Everything
else is just plain Markdown markup. A detailed account of the conventions used and of how
Markdown is translated into .Rd markup can be found on roxygen2’s pages.
Now,
document("~/Repositories/areacalc")
will parse all scripts, and produce appropriate .Rd files under man/. If you install areacalc
and type ?circle you will now see the help page for the function circle().
The basic steps for creating documentation for objects in an R package are therefore: write roxygen2 comments (starting with #') directly above each object’s definition, run document() to generate the .Rd files under man/ and update the NAMESPACE, and check the resulting help pages (e.g. with ? after load_all() or after installing the package).
As an exercise, try documenting all methods and generics in the areacalc package. Make sure
you consult roxygen2’s pages or search in the web if anything does not work for you.
Software Testing
Most probably, you have already been testing things out while writing code, by inspecting the
output of your programs and, if it is not as expected, going back to the code, inspecting the
output again, and so on, until you are happy with the result.
As can be inferred from the various software development models we introduced last week,
testing is not only a vital part of the software development process, but the various SDLC models
(see the “Introduction to Software Development” section) also define when and how software
testing is done. Testing your code ensures that it does what you want it to do.
Wikipedia’s page on software testing provides a good overview of the many software testing
approaches (e.g. white-box testing, black-box testing, grey-box testing) and software testing
levels (unit testing, integration testing, system testing, acceptance testing) that have been
devised.
In this page we will focus on unit testing because of its usefulness and importance when
developing software for data science projects.
Unit Testing
According to Wikipedia’s page on software testing
Unit testing refers to tests that verify the functionality of a specific section of
code, usually at the function level. In an object-oriented environment, this is
usually at the class level, and the minimal unit tests include the constructors
and destructors
Unit testing refers to tests of the components of the software (e.g. functions and classes) to
verify that those and the software itself work as expected. It is typically done using inputs and
testing against an expected output. Remarkably, the most experienced programmers start by
writing out the test before even writing the code itself!
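For instance (a sketch; the exact code used in the original chunk is an assumption), suppose we plan to add a rectangle() constructor to areacalc and we write down the checks we want it to satisfy before writing any code:

re <- rectangle(length = 2, width = 3)        # construct a rectangle
stopifnot(is.list(re))                        # the result should be a list ...
stopifnot(identical(re$length, 2))            # ... holding the length ...
stopifnot(identical(re$width, 3))             # ... and the width ...
stopifnot(identical(class(re), "rectangle"))  # ... and be of class "rectangle"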
Of course, all tests above fail with errors, simply because the code for rectangle() has not
been written yet! This may seem like a waste of time, but it pays off very quickly.
Let’s think for a moment what we have achieved by writing the tests: i) We have determined
that for a well-specified rectangle one needs to specify a length and a width, ii) We thought
about the interface of the rectangle() function, i.e. it should take as input the arguments
length and width, iii) We decided that the result should be a list with elements length and
width and of class “rectangle”, iv) We set ourselves the goal to write code for passing the
tests. So, oftentimes by writing the tests you have almost written the code! In this case, writing
a rectangle() that passes the tests is then almost mechanical; see the sketch below.
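One possible definition (an illustration, not necessarily the code in the original notes):

rectangle <- function(length, width) {
  structure(list(length = length, width = width), class = "rectangle")
}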
Imagine doing the above throughout development and for as many components of your code
as possible, testing results, corner cases, exceptions, and so on. You are not only more
confident that your code does exactly what you want it to do (validation), but you have also
eliminated many potential defects in your code. You also have a list of ready tests to run
whenever you change anything to see if anything breaks, hence improving maintainability
and making code refactoring easier.
It is inevitable that your tests will eventually fail after changing your code! Once they fail,
though, you already know what failed and, if your unit tests are granular enough, you also
know why. Then, you can use the techniques (debugging and error handling) and the
strategies (e.g. defensive programming) we discussed in the “Exceptions, Error Handling and
Debugging in R” section to both fix the issue and improve your code.
Testing Frameworks in R
There is nothing wrong with doing unit testing by incrementally adding tests into a script,
which you then run whenever you make changes or additions to the software.
However, it is best to use one of the specialized dedicated R packages, like testthat and
tinytest, which allow you to unit test in a more effective and structured manner, and also
integrate the unit tests within your R package. For example, with tinytest we can write the
above unit tests, and a few more for areacalc, as
library("devtools")
Loading required package: usethis
load_all("~/Repositories/areacalc")
Loading areacalc
library("tinytest")
re <- rectangle(length = 2, width = 3)
expect_true(is.list(re))
----- PASSED : <-->
call| expect_true(is.list(re))
expect_identical(re$length, 2)
----- PASSED : <-->
call| expect_identical(re$length, 2)
expect_identical(re$width, 3)
----- PASSED : <-->
call| expect_identical(re$width, 3)
expect_identical(class(re), "rectangle")
----- PASSED : <-->
call| expect_identical(class(re), "rectangle")
expect_error(circle(diameter = 3))
----- PASSED : <-->
call| expect_error(circle(diameter = 3))
expect_equal(area(circle(1)), pi)
----- PASSED : <-->
call| expect_equal(area(circle(1)), pi)
Both testthat and tinytest have detailed vignettes and tutorials on how they can be used and
be integrated within an R package; just follow the links on their CRAN pages. As a note of
caution, it is not recommended to mix tinytest and testthat unit testing, or to have them both
loaded in the same R session.