Tidyverse
Presidency University
December, 2024
Introduction
I Tidyverse is a collection of essential R packages for data
science.
I The packages under the tidyverse umbrella help us process
and interact with data.
I There are a whole host of things you can do with your data,
such as subsetting, transforming, visualizing, etc.
Introduction (Contd.)
I Tidyverse was created by the great Hadley Wickham and his
team with the aim of providing all these utilities to clean and
work with data.
I The list of packages includes:
I Data wrangling: dplyr, tidyr, tibble, readr
I Visualization: ggplot2
I List manipulation: purrr
Introduction (Contd.)
If we load the tidyverse package, we can see what’s in there:
options(warn=-1)
library(tidyverse)
## – Attaching core tidyverse packages –––––––––––– tidyverse 2.0.0 –
## v dplyr 1.1.3 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.0
## v ggplot2 3.4.4 v tibble 3.2.1
## v lubridate 1.9.3 v tidyr 1.3.0
## v purrr 1.0.2
## – Conflicts ––––––––––––––––––––– tidyverse_conflicts() –
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Why shall we use tidyverse?
There are several reasons for using the tidyverse:
I Packages have a very consistent API
I Function names and commands follow a focused grammar
I Extremely powerful when working with data frames and lists
(matrices, not so much, yet!)
I Allows pipes (%>% operator) to fluidly glue functionality
together
I Very active developer and user community
I Main advantage: at its best, tidyverse data wrangling code can
be read like a story using the pipe operator!
Common iteration tasks
I Generally in R, we iterate over:
I elements of a list
I dimensions of an array (e.g., rows/columns of a matrix)
I sub data frames induced by one or more factors
I All of this is possible in base R, using the apply family of
functions: lapply(), sapply(), apply(), tapply(), etc. So
why look anywhere else?
I Answer: because some alternatives offer better consistency:
I With the apply family of functions, there are some
inconsistencies in both the interfaces to the functions, as well
as their outputs
I This can both slow down learning and also lead to inefficiencies
in practice (frequent checking and post-processing of results)
I However, the world isn’t black-and-white: base R still has its
advantages, and the best thing you can do is to be informed
and well-versed in using all the major options!
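To make the inconsistency point concrete, here is a small base R sketch (the toy inputs are made up for illustration). The same sapply() call shape returns a vector, a matrix, or a list depending on what the function happens to return, whereas vapply() fails loudly when a result does not match the declared shape.

```r
# sapply() chooses its return type from the results it gets back
as_vector <- sapply(1:3, function(i) i^2)          # numeric vector of length 3
as_matrix <- sapply(1:3, function(i) c(i, i^2))    # 2 x 3 matrix
as_list   <- sapply(1:3, function(i) seq_len(i))   # list, because lengths differ
as_empty  <- sapply(integer(0), function(i) i^2)   # empty list, not numeric(0)

# vapply() declares the expected result up front and errors otherwise
strict <- vapply(1:3, function(i) i^2, FUN.VALUE = numeric(1))
failed <- tryCatch(
  vapply(1:3, function(i) seq_len(i), FUN.VALUE = numeric(1)),
  error = function(e) "error"   # reached because seq_len(2) has length 2
)
```

This is the kind of "frequent checking and post-processing" the bullet above refers to: code that consumes an sapply() result often has to test which shape it received.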
Why not “plyr”?
I The plyr package used to be one of the most popular (most
downloaded) R packages of all time, especially in the late
2000s and early 2010s.
I It is no longer under active development; that development
is now happening elsewhere (mainly in the tidyverse).
However, we may still use it.
What is “purrr”?
I “purrr” is a package that is part of the tidyverse.
I It offers a family of functions for iterating (mainly over lists)
that can be seen as alternatives to base R’s family of apply
functions
I Compared to base R, they are more consistent
I Compared to “plyr”, they can often be faster
The map family
I purrr offers a family of map functions, which allow you to
apply a function across different chunks of data (primarily used
with lists).
I It offers an alternative to base R’s apply functions.
I Summary of functions:
I map(): apply a function across elements of a list or vector
I map_dbl(), map_lgl(), map_chr(): same, but return a vector
of a particular data type
I map_dfr(), map_dfc(): same, but return a data frame
map(): list in, list out
I The map() function is an alternative to lapply().
I It has the following simple form: map(x, f), where x is a list
or vector, and f is a function. It always returns a list.
my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
map(my.list, length)
## $nums
## [1] 6
##
## $chars
## [1] 12
##
## $bools
## [1] 6
I Base R is just as easy
lapply(my.list, length)
map_dbl(): list in, numeric out
I The map_dbl() function is an alternative to sapply().
I It has the form: map_dbl(x, f), where x is a list or vector,
and f is a function that returns a numeric value (when
applied to each element of x)
I Similarly:
I map_int() returns an integer vector
I map_lgl() returns a logical vector
I map_chr() returns a character vector
Example
map_dbl(my.list, length)
## nums chars bools
## 6 12 6
map_chr(my.list, length)
## nums chars bools
## "6" "12" "6"
# Base R is a bit more complicated
as.numeric(sapply(my.list, length))
## [1] 6 12 6
as.numeric(unlist(lapply(my.list, length)))
## [1] 6 12 6
vapply(my.list, FUN=length, FUN.VALUE=numeric(1))
## nums chars bools
## 6 12 6
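The typed variants can be sketched on a small made-up list (scores is a hypothetical example, not data from these slides). Each call either returns a vector of the promised type or stops with an error, which is the consistency argument made earlier.

```r
library(purrr)

# Hypothetical toy data: three named numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

means   <- map_dbl(scores, mean)                         # named numeric vector
is_long <- map_lgl(scores, function(v) length(v) > 1)    # named logical vector

# range() returns two values per element, so map_dbl() refuses the result
too_long <- tryCatch(
  map_dbl(scores, range),
  error = function(e) "error"
)
```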
Applying a custom function
library(repurrrsive) # Load Game of Thrones data set
class(got_chars)
## [1] "list"
class(got_chars[[1]])
## [1] "list"
Contd.
names(got_chars[[1]])
## [1] "url" "id" "name" "gender" "culture"
## [6] "born" "died" "alive" "titles" "aliases"
## [11] "father" "mother" "spouse" "allegiances" "books"
## [16] "povBooks" "tvSeries" "playedBy"
map_chr(got_chars, function(x) { return(x$name) })
## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
Extracting elements
I Handily, the map functions all allow the second argument to
be an integer or string, and treat this internally as an
appropriate extractor function
map_chr(got_chars, "name")
## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
Contd.
map_lgl(got_chars, "alive")
## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
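The extractor shorthand can be tried without the repurrrsive data. The sketch below uses a hypothetical two-record list (chars) and also shows the .default argument, which supplies a fallback when a field is missing.

```r
library(purrr)

# Hypothetical toy records; the second one has no "house" field
chars <- list(
  list(name = "Arya", alive = TRUE, house = "Stark"),
  list(name = "Will", alive = FALSE)
)

by_name  <- map_chr(chars, "name")    # extract by string
by_index <- map_chr(chars, 1)         # extract by position (the first field)
alive    <- map_lgl(chars, "alive")

# Without a fallback, the missing "house" field would make map_chr() fail
house <- map_chr(chars, "house", .default = NA_character_)
```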
Contd.
I Interestingly, we can actually do the following in base R: `[`()
and `[[`() are functions that act in the following way for a
vector or list x and index i
I `[`(x, i) is equivalent to x[i]
I `[[`(x, i) is equivalent to x[[i]] (this works whether i is an
integer or a string)
sapply(got_chars, '[[', "name")
## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
contd.
sapply(got_chars, '[[', "alive")
## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
map_dfr() and map_dfc(): list in, data frame out
I The map_dfr() and map_dfc() functions iterate a function
call over a list or vector, but automatically combine the results
into a data frame. They differ in whether that data frame is
formed by row-binding or column-binding
map_dfr(got_chars, `[`, c("name", "alive"))
## # A tibble: 30 x 2
## name alive
## <chr> <lgl>
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## # i 20 more rows
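In purrr 1.0.0 and later, map_dfr() and map_dfc() are marked as superseded in favour of map() followed by list_rbind() or list_cbind(). A minimal sketch of the row-binding form, assuming purrr >= 1.0.0 and a made-up two-record list (people):

```r
library(purrr)

# Hypothetical toy records standing in for got_chars
people <- list(
  list(name = "Theon", alive = TRUE,  id = 1),
  list(name = "Will",  alive = FALSE, id = 2)
)

# Step 1: turn each record into a one-row data frame with the wanted fields
rows <- map(people, function(p) as.data.frame(p[c("name", "alive")]))

# Step 2: bind the one-row data frames together by row
combined <- list_rbind(rows)
```

The older map_dfr() call shown above still works; the two-step form only makes the "map, then combine" structure explicit.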
Contd.
# Base R is much less convenient
data.frame(name = sapply(got_chars, `[[`, "name"),
alive = sapply(got_chars, `[[`, "alive"))
## name alive
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## 11 Arya Stark TRUE
## 12 Arys Oakheart FALSE
## 13 Asha Greyjoy TRUE
## 14 Barristan Selmy TRUE
## 15 Varamyr FALSE
## 16 Brandon Stark TRUE
## 17 Brienne of Tarth TRUE
## 18 Catelyn Stark FALSE
## 19 Cersei Lannister TRUE
## 20 Eddard Stark FALSE
## 21 Jaime Lannister TRUE
## 22 Jon Connington TRUE
## 23 Jon Snow TRUE
## 24 Aeron Greyjoy TRUE
## 25 Kevan Lannister FALSE
## 26 Melisandre TRUE
## 27 Merrett Frey FALSE
## 28 Quentyn Martell FALSE
## 29 Samwell Tarly TRUE
Data wrangling the tidy way
I dplyr and tidyr are going to be our main workhorses for data
wrangling
I The main structure these packages use is the data frame (or
tibble, but we won’t go there)
I Two keys to getting started:
I learn about pipes %>%
I learn the dplyr verbs
I dplyr functions are analogous to SQL counterparts, so learn
dplyr and get SQL for free.
piping operator
I Fundamentally, piping takes one return value and automatically
feeds it in as an input to another function, to form a flow of
results. The pipe operator looks like this: %>%.
I This operator actually comes from the magrittr package
(automatically included in tidyverse)
I So it can be used on its own, completely independently of the
tidyverse.
I However tidyverse functions are at their best when composed
together using the pipe operator.
Single argument
While passing a single argument through pipes, we interpret
something like:
x %>% f %>% g %>% h
as h(g(f(x))). This means that whenever we see %>%, we can read
it as “and then”.
Example
I We can write log(1) with pipes as 1 %>% log(), and
exp(sin(π)) as pi %>% sin() %>% exp()
exp(1)
## [1] 2.718282
1 %>% exp()
## [1] 2.718282
1 %>% exp() %>% log()
## [1] 1
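As a quick check that a pipe chain and the nested call are the same computation, here is a small sketch on a made-up vector (x is an illustrative input):

```r
library(magrittr)

x <- c(1, 4, 9)

nested <- sum(sqrt(x))             # read inside-out: sqrt first, then sum
piped  <- x %>% sqrt() %>% sum()   # read left to right: x, and then sqrt, and then sum
```

Both give the same value; the piped form just lists the steps in the order they happen.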
Multiple arguments
When we have multi-argument functions, we interpret something
like:
x %>% f(y)
as f(x, y).
Example
Recall that what we used to write as
head(mtcars,4)
can be alternatively written as
mtcars %>% head(4)
## mpg cyl disp hp drat wt qsec vs am g
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
Alternative
Alternatively we can write the command x %>% f(y) using dot
notation as:
x %>% f(., y)
Alternative (Contd.)
I What’s the advantage of using dots?
I Sometimes you want to pass in a variable as the second or
third (say, not first) argument to a function, with a pipe.
I For example, x %>% f(y, .) is equivalent to f(y, x).
Example
state_df <- data.frame(state.x77)
state.region %>%
tolower %>%
tapply(state_df$Income, ., summary)
## $`north central`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4167 4466 4594 4611 4694 5107
##
## $northeast
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3694 4281 4558 4570 4903 5348
##
## $south
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3098 3622 3848 4012 4316 5299
##
## $west
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3601 4347 4660 4703 4963 6315
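A smaller sketch of the same idea, using rep() on made-up values: without a dot the left-hand side becomes the first argument, and with a dot as a direct argument it goes exactly where the dot is placed.

```r
library(magrittr)

first_arg  <- 3 %>% rep(2)         # same as rep(3, 2)
second_arg <- 3 %>% rep("a", .)    # same as rep("a", 3)
```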
Example I
x <- seq(-2*pi,2*pi,len=1000)
x %>% sin %>% abs %>% plot(x, ., type="l")
[Figure: line plot of abs(sin(x)) against x; the x-axis runs from about −6 to 6 and the y-axis from 0.0 to 1.0]
dplyr verbs
I We shall start learning dplyr with the following verbs
(functions):
I slice(): subset rows based on integer indexing
I filter(): subset rows based on logical criteria
I select(): select certain columns
I pull(): pull out an individual column
I arrange(): order rows by value of a column
I rename(): rename columns
I mutate(): create new columns
I mutate_at(): apply a function to given columns
I The idea is we can think of data frames as nouns and dplyr
verbs as actions that we can apply to manipulate
them—especially natural when using pipes.
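Before looking at the verbs one by one, here is a short sketch of how several of them read when chained. It uses the built-in mtcars data; the particular question (the three most fuel-efficient four-cylinder cars) is only an illustration.

```r
library(dplyr)

top_4cyl <- mtcars %>%
  filter(cyl == 4) %>%        # keep four-cylinder cars
  select(mpg, wt) %>%         # keep two columns
  arrange(desc(mpg)) %>%      # highest mpg first
  slice(1:3)                  # first three rows of the sorted result
```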
slice
I We shall use slice() when we want to indicate certain row
numbers need to be kept:
mtcars %>% slice(c(7,8,14:15))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
I This is the same as the old way in base R:
mtcars[c(7,8,14:15),]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
slice (Contd.)
I We can also do negative slicing:
mtcars %>% slice(-c(1:2,19:23)) %>% nrow()
## [1] 25
I In base R, we would write:
nrow(mtcars[-c(1:2,19:23),])
## [1] 25
filter
I We shall use filter() when we want to subset rows based on
logical conditions:
mtcars %>% filter((mpg >= 14 & disp >= 200)|(drat <= 3)) %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# note: row names are preserved in the output above
I In base R, we would write either of the following:
head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)
head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3), ], 2)
select
Use select() when you want to pick out certain columns:
mtcars %>% select(cyl, disp, hp) %>% head(2)
## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110
# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)
Handy select() helpers
mtcars %>% select(starts_with("d")) %>% head(2)
## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9
# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
mtcars %>% select(ends_with('t')) %>% head(2)
## drat wt
## Mazda RX4 3.9 2.620
## Mazda RX4 Wag 3.9 2.875
mtcars %>% select(ends_with('yl')) %>% head(2)
## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6
mtcars %>% select(contains('ar')) %>% head(2)
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
Additional, less important function: pull()
I You can grab a single column from a data frame and get it
back as a vector if you use pull
I select preserves column structure even with a single column
mtcars %>% pull(mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
# Same as: mtcars$mpg
mtcars %>% select(mpg)
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
arrange
I Use arrange() to order rows by values of a column:
mtcars %>%
arrange(desc(disp)) %>%
select(mpg, disp, drat) %>%
head(2)
## mpg disp drat
## Cadillac Fleetwood 10.4 472 2.93
## Lincoln Continental 10.4 460 3.00
# Base R:
disp_inds <- order(mtcars$disp, decreasing = TRUE)
head(mtcars[disp_inds, c("mpg", "disp", "drat")], 2)
Contd.
I We can order by multiple columns too:
mtcars %>%
arrange(desc(gear), desc(hp)) %>%
select(gear, hp, everything()) %>%
head(8)
## gear hp mpg cyl disp drat wt qsec vs am carb
## Maserati Bora 5 335 15.0 8 301.0 3.54 3.570 14.60 0 1 8
## Ford Pantera L 5 264 15.8 8 351.0 4.22 3.170 14.50 0 1 4
## Ferrari Dino 5 175 19.7 6 145.0 3.62 2.770 15.50 0 1 6
## Lotus Europa 5 113 30.4 4 95.1 3.77 1.513 16.90 1 1 2
## Porsche 914-2 5 91 26.0 4 120.3 4.43 2.140 16.70 0 1 2
## Merc 280 4 123 19.2 6 167.6 3.92 3.440 18.30 1 0 4
## Merc 280C 4 123 17.8 6 167.6 3.92 3.440 18.90 1 0 4
## Mazda RX4 4 110 21.0 6 160.0 3.90 2.620 16.46 0 1 4
mutate()
I Use mutate() when you want to create one or several columns:
mtcars <- mtcars %>%
mutate(hp_wt = hp/wt,
mpg_wt = mpg/wt)
# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wt
I mutate() returns a new data frame with updated/added columns
mtcars <- mtcars %>%
mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
# Base R:
mtcars$hp_wt <- 1
I Newly created variables are usable immediately
mtcars <- mtcars %>%
mutate(hp_wt_correct = hp/wt,
hp_wt_cyl = hp_wt_correct/cyl)
mtcars %>% head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
## hp_wt_correct hp_wt_cyl
## Mazda RX4 41.98473 6.997455
## Mazda RX4 Wag 38.26087 6.376812
# Base R:
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cyl
mutate_at()
I Use mutate_at() when you want to apply a function to one or
several columns:
# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)
mtcars <- mtcars %>%
mutate_at(c("hp_wt", "mpg_wt"), log)
# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)
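In dplyr 1.0.0 and later, the scoped variants such as mutate_at() are marked as superseded in favour of across() inside mutate(). A minimal sketch of the same log transform, assuming dplyr >= 1.0.0 and working on a copy (cars) so that mtcars itself is left alone:

```r
library(dplyr)

cars <- mtcars %>%
  mutate(hp_wt = hp / wt,
         mpg_wt = mpg / wt) %>%
  mutate(across(c(hp_wt, mpg_wt), log))   # apply log() to both columns
```

mutate_at() still works as shown above; across() is simply the form that current dplyr documentation recommends.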
rename()
I Use rename() to easily rename columns:
mtcars %>%
rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
head(2)
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199
## mpg_wt_log hp_wt_correct hp_wt_cyl
## Mazda RX4 0.7330158 41.98473 6.997455
## Mazda RX4 Wag 0.6873654 38.26087 6.376812
# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)
Important note
I Calling dplyr verbs always outputs a new data frame; it does
not alter the existing data frame
I So to keep the changes, we have to reassign the data frame to
be the output of the pipe!
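The point can be checked directly. In this sketch, cars is a copy of mtcars and mpg_double is a made-up column used only for illustration:

```r
library(dplyr)

cars <- mtcars

# Without reassignment the result is computed, shown, and then lost
cars %>% mutate(mpg_double = 2 * mpg) %>% head(2)
unchanged <- !("mpg_double" %in% names(cars))   # cars itself was not altered

# Reassign the output of the pipe to keep the change
cars <- cars %>% mutate(mpg_double = 2 * mpg)
kept <- "mpg_double" %in% names(cars)
```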
tidyr verbs
I Our tidyr journey starts off with learning the following verbs
(functions):
I pivot_longer(): make “wide” data longer
I pivot_wider(): make “long” data wider
I separate(): split a single column into multiple columns
I unite(): combine multiple columns into a single column
I Key takeaway: as with dplyr, think of data frames as nouns
and tidyr verbs as actions that you apply to manipulate
them—especially natural when using pipes
pivot_longer()
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets
##
## Attaching package: ’EDAWR’
## The following object is masked from ’package:dplyr’:
##
## storms
## The following objects are masked from ’package:tidyr’:
##
## population, who
EDAWR::cases %>%
head(3)
## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
EDAWR::cases %>%
pivot_longer(names_to = "year", values_to = "n", cols = 2:4) %>%
head(5)
## # A tibble: 5 x 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
I Here we transposed columns 2:4 into a year column
I We put the corresponding count values into a column called n
I Note tidyr did all the heavy lifting of the transposing work
I We just had to declaratively specify the output
Different approach
# Different approach to do the same thing
EDAWR::cases %>%
pivot_longer(names_to = "year", values_to = "n", -country) %>%
head(5)
## # A tibble: 5 x 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
#Could also do:
# EDAWR::cases %>%
# pivot_longer(names_to = "year", values_to = "n", c(`2011`, `2012`, `2013`))
# or:
# EDAWR::cases %>%
# pivot_longer(names_to = "year", values_to = "n", `2011`:`2013`)
pivot_wider()
EDAWR::pollution %>%
head(5)
## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
pivot_wider(names_from = "size",
values_from = "amount")
## # A tibble: 3 x 3
## city large small
## <chr> <dbl> <dbl>
## 1 New York 23 14
## 2 London 22 16
## 3 Beijing 121 56
I Here we transposed to a wide format by size
I We tabulated the corresponding amount for each size
I Note tidyr did all the heavy lifting again
I We just had to declaratively specify the output
I Note that pivot_wider() and pivot_longer() are inverses
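The inverse relationship can be sketched without installing EDAWR. The data frame below re-creates the three rows of cases printed earlier by hand, pivots it longer, and then pivots it back.

```r
library(tidyr)

# Hand-built copy of the three rows of EDAWR::cases shown earlier
cases <- data.frame(
  country = c("FR", "DE", "US"),
  `2011` = c(7000, 5800, 15000),
  `2012` = c(6900, 6000, 14000),
  `2013` = c(7000, 6200, 13000),
  check.names = FALSE            # keep the year column names as they are
)

long <- cases %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "n")

wide <- long %>%
  pivot_wider(names_from = year, values_from = n)   # back to one column per year
```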
When could I use these?
I To visualize things like matrices in ggplot (pivot_longer)
I To make prettier / more readable “tables” (pivot_wider)
I Additionally, if you find yourself getting stuck (in more nuanced
situations), there are more complicated functions like
pivot_wider_spec(), etc. for these cases (see Manual Specs)
separate()
Use separate() to split a single column into multiple ones:
EDAWR::storms %>%
head(3)
## storm wind pressure date
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
storms2 <- EDAWR::storms %>%
separate(date, c("y", "m", "d")) # sep = "-"
unite()
Use unite() to combine multiple columns into a single column:
storms2 %>%
unite(date, y, m, d, sep = "-")
## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
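A self-contained sketch of the same round trip, using two of the storm rows shown above typed in by hand (with the date stored as a character string, which is an assumption about the original column type):

```r
library(tidyr)

storms_small <- data.frame(
  storm = c("Alberto", "Alex"),
  date  = c("2000-08-03", "1998-07-27")
)

# Split the date into year, month, and day (all kept as character columns)
parts <- storms_small %>%
  separate(date, c("y", "m", "d"), sep = "-")

# Glue the three pieces back into a single date column
back <- parts %>%
  unite(date, y, m, d, sep = "-")
```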
group_by()
Use group_by() to define a grouping of rows based on a column:
mtcars %>%
group_by(cyl) %>%
head(4)
## # A tibble: 4 x 15
## # Groups: cyl [2]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1.32 0.733
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1.29 0.687
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1.31 0.826
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1.26 0.640
## # i 2 more variables: hp_wt_correct <dbl>, hp_wt_cyl <dbl>
mtcars %>%
group_by(cyl) %>%
head(4) %>% class
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Note
I This doesn’t actually change anything about the way the df
looks
I Only difference is that when it prints, we’re told about the
groups
I But it will play a big role in how other dplyr verbs act
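One way to see the effect on other verbs is a grouped mutate(). In this sketch (mpg_dev is a made-up column name), mean(mpg) is evaluated within each cylinder group rather than over the whole data frame:

```r
library(dplyr)

centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_dev = mpg - mean(mpg)) %>%   # deviation from the group's own mean
  ungroup()

# Within each group the deviations average out to (numerically) zero
group_means <- tapply(centered$mpg_dev, centered$cyl, mean)
```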
summarize()
I Use summarise() (or summarize() ) to apply functions to
rows—ungrouped or grouped—of a data frame:
# Ungrouped
mtcars %>%
summarize(mpg = mean(mpg),
hp = mean(hp))
## mpg hp
## 1 20.09062 146.6875
# Grouped by number of cylinders
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg),
hp = mean(hp))
## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
contd.
mtcars %>%
group_by(cyl) %>%
summarize(mpg_mean = mean(mpg),
mpg_max = max(mpg),
hp_mean = mean(hp),
hp_max = max(hp))
## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()
Use ungroup() to remove the grouping structure from a data frame:
mtcars %>%
group_by(cyl) %>%
ungroup() %>%
summarize(hp = mean(hp),
mpg = mean(mpg))
## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
Beyond summarize()
mtcars %>%
pull(hp) %>% tapply(INDEX = mtcars$cyl, FUN = summary)
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.00 65.50 91.00 82.64 96.00 113.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105.0 110.0 110.0 122.3 123.0 175.0
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150.0 176.2 192.5 209.2 241.2 335.0
Contd.
mtcars %>%
group_by(cyl) %>%
nest() %>% # creates a list-column holding each group's subset of the data
mutate(sum = purrr::map(data, function(df) summary(df$hp)),
sum_df = purrr::map(sum, broom::tidy)) %>% # unravel things to be data.frames
select(cyl, sum_df) %>%
unnest(cols = sum_df)
## # A tibble: 3 x 7
## # Groups: cyl [3]
## cyl minimum q1 median mean q3 maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 105 110 110 122. 123 175
## 2 4 52 65.5 91 82.6 96 113
## 3 8 150 176. 192. 209. 241. 335
Join operations
I A “join” operation in database terminology is, for us, a merging of two
data frames. There are 4 types of joins:
I Inner join (or just join): retain just the rows of each table that
match the condition
I Left outer join (or just left join): retain all rows in the first
table, and just the rows in the second table that match the
condition
I Right outer join (or just right join): retain just the rows in the
first table that match the condition, and all rows in the second
table
I Full outer join (or just full join): retain all rows in both tables
I Column values that cannot be filled in are assigned NA values
Join
Two toy data frames
has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"),
children = c(3, 1, 3),
stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
age = c(38, 54, 34),
stringsAsFactors = FALSE)
has_kids_tab1
## name children
## 1 Robert Downey, Jr 3
## 2 Scarlett Johansson 1
## 3 Chris Hemsworth 3
americans_tab2
## name age
## 1 Chris Evans 38
## 2 Robert Downey, Jr 54
## 3 Scarlett Johansson 34
inner_join()
Suppose we want to join tab1 and tab2 by name, but keep only
actors in intersection (aka in both tables):
inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
left_join()
Suppose we want to join tab1 and tab2 by name, but keep all
actors from tab1:
left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
right_join()
Suppose we want to join tab1 and tab2 by name, but keep all
actors from tab2:
right_join(x = has_kids_tab1, y = americans_tab2, by = "name")
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Evans NA 38
full_join()
Finally, suppose we want to join tab1 and tab2 by name, and keep
all actors from both:
full_join(x = has_kids_tab1, y = americans_tab2, by = "name")
## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
## 4 Chris Evans NA 38
More nuanced structure
my_peeps <- data.frame(pol = factor(sample(c("R", "D"),
10, replace = TRUE)),
gender = factor(sample(c("F", "M"),
10, replace = TRUE)),
state = factor(sample(c("AZ", "PA"), 10,
replace = TRUE)),
IQ = round(rnorm(n = 10, mean = 100, sd = 10)))
politics <- data.frame(senator = c(
"Kyrsten Sinema", "Martha McSally",
"Pat Toomey","Boy Casey Jr."),
pol = c("D", "R", "R", "D"),
gender = c("F", "F", "M", "M"),
STATE = c("AZ", "AZ", "PA", "PA")
)
More nuanced structure
my_peeps %T>% print() %>% dim()
## pol gender state IQ
## 1 R F AZ 93
## 2 R F AZ 116
## 3 R F PA 90
## 4 R F AZ 109
## 5 R F PA 127
## 6 D M AZ 93
## 7 D F PA 103
## 8 D M PA 84
## 9 D F PA 114
## 10 R M PA 91
## [1] 10 4
contd.
politics %T>% print() %>% dim()
## senator pol gender STATE
## 1 Kyrsten Sinema D F AZ
## 2 Martha McSally R F AZ
## 3 Pat Toomey R M PA
## 4 Boy Casey Jr. D M PA
## [1] 4 4
contd.
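I %T>% is a special “tee” pipe from the magrittr package that passes my_peeps into print() as
a “side-effect” and then also passes my_peeps onto the rest of
the chain (which in this case is just dim()). library(tidyverse)
attaches only %>%, so load magrittr itself to use %T>%.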
my_peeps %>% left_join(politics,by = c("state" = "STATE",
"pol" = "pol")) %>% head(6)
## pol gender.x state IQ senator gender.y
## 1 R F AZ 93 Martha McSally F
## 2 R F AZ 116 Martha McSally F
## 3 R F PA 90 Pat Toomey M
## 4 R F AZ 109 Martha McSally F
## 5 R F PA 127 Pat Toomey M
## 6 D M AZ 93 Kyrsten Sinema F
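The gender.x and gender.y columns appear because gender exists in both tables but is not a join key. A hedged sketch of two options available in dplyr 1.1.0 and later: join_by() to spell out keys whose names differ, and suffix to give the clashing columns more readable names. The tables and senator names below are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables with the same column layout as above
people <- data.frame(
  pol    = c("R", "D"),
  gender = c("M", "M"),
  state  = c("AZ", "PA")
)
senators <- data.frame(
  senator = c("Senator A", "Senator B"),
  pol     = c("R", "D"),
  gender  = c("F", "M"),
  STATE   = c("AZ", "PA")
)

joined <- people %>%
  left_join(senators,
            by = join_by(state == STATE, pol),    # keys: state/STATE and pol
            suffix = c("_person", "_senator"))    # replaces .x and .y
```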