Tidyverse

Presidency University

December, 2024
Introduction

• Tidyverse is a collection of essential R packages for data science.
• The packages under the tidyverse umbrella help us work with and interact with data.
• There are a whole host of things you can do with your data, such as subsetting, transforming, visualizing, etc.
Introduction (Contd.)

• Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.
• The list of packages includes:
  • Data wrangling: dplyr, tidyr, tibble, readr
  • Visualization: ggplot2
  • List manipulation: purrr
Introduction (Contd.)

If we load the tidyverse package, we can see what’s in there:


options(warn=-1)
library(tidyverse)

## – Attaching core tidyverse packages –––––––––––– tidyverse 2.0.0 –
## v dplyr 1.1.3 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.0
## v ggplot2 3.4.4 v tibble 3.2.1
## v lubridate 1.9.3 v tidyr 1.3.0
## v purrr 1.0.2
## – Conflicts ––––––––––––––––––––– tidyverse_conflicts() –
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to
## become errors
Why should we use tidyverse?

There are several reasons for using tidyverse:

• Packages have a very consistent API
• Function names and commands follow a focused grammar
• Extremely powerful when working with data frames and lists (matrices, not so much, yet!)
• Allows pipes (the %>% operator) to fluidly glue functionality together
• Very active developer and user community
• Main advantage: at its best, tidyverse data wrangling code can be read like a story using the pipe operator!
Common iteration tasks
• Generally in R, we iterate over:
  • elements of a list
  • dimensions of an array (e.g., rows/columns of a matrix)
  • sub data frames induced by one or more factors
• All of this is possible in base R, using the apply family of functions: lapply(), sapply(), apply(), tapply(), etc. So why look anywhere else?
• Answer: because some alternatives offer better consistency:
  • With the apply family of functions, there are some inconsistencies in both the interfaces to the functions and their outputs
  • This can both slow down learning and lead to inefficiencies in practice (frequent checking and post-processing of results)
• However, the world isn’t black-and-white: base R still has its advantages, and the best thing you can do is to be informed and well-versed in using all the major options!
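
To make the point about inconsistent outputs concrete, here is a small base R sketch. The list `scores` is made up for illustration. It shows sapply() returning a vector, a matrix, or a list depending on what the function returns and on the input, while vapply() states the expected result type up front.

```r
# Illustrative data; `scores` is a hypothetical example list
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# One value per element: sapply() simplifies to a named numeric vector
means <- sapply(scores, mean)

# Two values per element: sapply() now simplifies to a 2 x 2 matrix
ranges <- sapply(scores, range)

# Empty input: sapply() falls back to an empty list
empty_result <- sapply(list(), length)

# vapply() fixes the return type in advance, so empty input still gives a vector
empty_checked <- vapply(list(), length, FUN.VALUE = integer(1))

print(class(means))
print(dim(ranges))
print(class(empty_result))
print(class(empty_checked))
```

This is the kind of "frequent checking" the bullets refer to: code that consumes the result of sapply() has to be ready for more than one shape.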
Why not “plyr”?

• The plyr package used to be one of the most popular (most downloaded) R packages of all time. It was at its most popular in the late 2000s and early 2010s.
• It is no longer under active development; that development is now happening elsewhere (mainly in the tidyverse). However, we may still use it.
What is “purrr”?

• “purrr” is a package that is part of the tidyverse.
• It offers a family of functions for iterating (mainly over lists) that can be seen as alternatives to base R’s family of apply functions
  • Compared to base R, they are more consistent
  • Compared to “plyr”, they can often be faster
The map family

• purrr offers a family of map functions, which allow you to apply a function across different chunks of data (primarily used with lists).
• Offers an alternative to base R’s apply functions.
• Summary of functions:
  • map(): apply a function across elements of a list or vector
  • map_dbl(), map_lgl(), map_chr(): same, but return a vector of a particular data type
  • map_dfr(), map_dfc(): same, but return a data frame
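
The summary above can be sketched on a tiny example. The list `words` is hypothetical. The last call shows what "a particular data type" means in practice: a typed variant raises an error rather than silently returning something else.

```r
library(purrr)

# Hypothetical example data
words <- list(first = c("a", "b"), second = c("c", "d", "e"))

lens_list <- map(words, length)                 # always a list
lens_int  <- map_int(words, length)             # integer vector
lens_dbl  <- map_dbl(words, length)             # double vector
firsts    <- map_chr(words, function(w) w[1])   # character vector

# Asking for logicals from character data is an error, not a silent coercion
bad <- tryCatch(map_lgl(words, function(w) w[1]), error = function(e) e)

print(lens_int)
print(inherits(bad, "error"))
```

Recent purrr releases also appear to warn when map_chr() is asked to turn numbers into strings, so wrapping the result in as.character() yourself is the safer habit.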
map(): list in, list out

• The map() function is an alternative to lapply().
• It has the following simple form: map(x, f), where x is a list or vector, and f is a function. It always returns a list.

my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
map(my.list, length)

## $nums
## [1] 6
##
## $chars
## [1] 12
##
## $bools
## [1] 6

• Base R is just as easy:

lapply(my.list, length)
map_dbl(): list in, numeric out

• The map_dbl() function is an alternative to sapply().
• It has the form: map_dbl(x, f), where x is a list or vector, and f is a function that returns a numeric value (when applied to each element of x)
• Similarly:
  • map_int() returns an integer vector
  • map_lgl() returns a logical vector
  • map_chr() returns a character vector
Example
map_dbl(my.list, length)

## nums chars bools
## 6 12 6

map_chr(my.list, length)

## nums chars bools
## "6" "12" "6"

# Base R is a bit more complicated
as.numeric(sapply(my.list, length))

## [1] 6 12 6

as.numeric(unlist(lapply(my.list, length)))

## [1] 6 12 6

vapply(my.list, FUN=length, FUN.VALUE=numeric(1))

## nums chars bools
## 6 12 6
Applying a custom function

library(repurrrsive) # Load Game of Thrones data set
class(got_chars)

## [1] "list"

class(got_chars[[1]])

## [1] "list"
Contd.

names(got_chars[[1]])

## [1] "url" "id" "name" "gender" "culture"
## [6] "born" "died" "alive" "titles" "aliases"
## [11] "father" "mother" "spouse" "allegiances" "books"
## [16] "povBooks" "tvSeries" "playedBy"

map_chr(got_chars, function(x) { return(x$name) })

## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
Extracting elements

• Handily, the map functions all allow the second argument to be an integer or string, and treat this internally as an appropriate extractor function

map_chr(got_chars, "name")

## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
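
The same shortcut also works with an integer position. A minimal sketch on a made-up list `people` (a stand-in for got_chars, so it runs without repurrrsive):

```r
library(purrr)

# Hypothetical toy list with the same shape as got_chars
people <- list(
  list(name = "Ann", alive = TRUE),
  list(name = "Bob", alive = FALSE)
)

by_name     <- map_chr(people, "name")                  # element called "name"
by_position <- map_chr(people, 1)                       # first element of each
by_function <- map_chr(people, function(p) p$name)      # explicit equivalent

print(by_name)
```

All three calls give the same character vector here, because "name" happens to be the first element of every entry.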
Contd.

map_lgl(got_chars, "alive")

## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
Contd.

• Interestingly, we can actually do the following in base R: '['() and '[['() are functions that act in the following way for an object x and index i
  • '['(x, i) is equivalent to x[i]
  • '[['(x, i) is equivalent to x[[i]] (this works whether i is an integer or a string)

sapply(got_chars, '[[', "name")

## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
contd.

sapply(got_chars, '[[', "alive")

## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
map_dfr() and map_dfc(): list in, data frame out

• The map_dfr() and map_dfc() functions iterate a function call over a list or vector, but automatically combine the results into a data frame. They differ in whether that data frame is formed by row-binding or column-binding.

map_dfr(got_chars, `[`, c("name", "alive"))

## # A tibble: 30 x 2
## name alive
## <chr> <lgl>
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## # i 20 more rows
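
In purrr 1.0.0 and later, map_dfr() and map_dfc() are marked as superseded in favour of map() followed by list_rbind() or list_cbind(). A minimal sketch of that style, assuming purrr >= 1.0.0 and using a made-up list `people` rather than got_chars:

```r
library(purrr)

# Hypothetical toy list
people <- list(
  list(name = "Ann", alive = TRUE),
  list(name = "Bob", alive = FALSE)
)

# Turn each element into a one-row data frame, then bind the rows together
rows <- map(people, function(p) data.frame(name = p$name, alive = p$alive))
people_df <- list_rbind(rows)

print(people_df)
```

The older functions still work, so this is a matter of style rather than a required change.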
Contd.
# Base R is much less convenient
data.frame(name = sapply(got_chars, `[[`, "name"),
alive = sapply(got_chars, `[[`, "alive"))

## name alive
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## 11 Arya Stark TRUE
## 12 Arys Oakheart FALSE
## 13 Asha Greyjoy TRUE
## 14 Barristan Selmy TRUE
## 15 Varamyr FALSE
## 16 Brandon Stark TRUE
## 17 Brienne of Tarth TRUE
## 18 Catelyn Stark FALSE
## 19 Cersei Lannister TRUE
## 20 Eddard Stark FALSE
## 21 Jaime Lannister TRUE
## 22 Jon Connington TRUE
## 23 Jon Snow TRUE
## 24 Aeron Greyjoy TRUE
## 25 Kevan Lannister FALSE
## 26 Melisandre TRUE
## 27 Merrett Frey FALSE
## 28 Quentyn Martell FALSE
## 29 Samwell Tarly TRUE
Data wrangling the tidy way

• dplyr and tidyr are going to be our main workhorses for data wrangling
• The main structure these packages use is the data frame (or tibble, but we won’t go there)
• Two keys to getting started:
  • learn about pipes %>%
  • learn the dplyr verbs
• dplyr functions are analogous to SQL counterparts, so learn dplyr and get SQL for free.
piping operator

• Fundamentally, piping takes one return value and automatically feeds it in as an input to another function, to form a flow of results. The operator looks like this: %>%.
• This operator actually comes from the magrittr package (automatically included in tidyverse)
• So it can be used on its own, completely independently of the tidyverse.
• However, tidyverse functions are at their best when composed together using the pipe operator.
Single argument

When passing a single argument through pipes, we interpret something like:

x %>% f %>% g %>% h

as h(g(f(x))). This means that whenever we see %>%, we should read it as “and then”.
Example

• We can write log(1) with pipes as 1 %>% log(), and exp(sin(π)) as pi %>% sin() %>% exp()

exp(1)

## [1] 2.718282

1 %>% exp()

## [1] 2.718282

1 %>% exp() %>% log()

## [1] 1
Multiple arguments

When we have functions of multiple arguments, we interpret something like:

x %>% f(y)

as f(x, y).
Example

Recall that what we used to write as

head(mtcars,4)

can be alternatively written as

mtcars %>% head(4)

## mpg cyl disp hp drat wt qsec vs am g
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
Alternative

Alternatively, we can write the command x %>% f(y) using dot notation as:

x %>% f(., y)
Alternative (Contd.)

• What’s the advantage of using dots?
• Sometimes you want to pass in a variable as the second or third (say, not first) argument to a function, with a pipe.
• For example, x %>% f(y, .) is equivalent to f(y, x).
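
A quick sketch of the dot placeholder with everyday functions. It loads magrittr directly so that it runs without the rest of the tidyverse; the variable names are made up for illustration.

```r
library(magrittr)

# Default: the left-hand side becomes the first argument
first_three <- letters %>% head(3)        # head(letters, 3)

# With a dot, the left-hand side goes wherever the dot is
also_three <- 3 %>% head(letters, .)      # head(letters, 3)

# The dot can also fill a named argument
repeated <- 2 %>% rep("ab", times = .)    # rep("ab", times = 2)

print(first_three)
print(repeated)
```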
Example

state_df <- data.frame(state.x77)
state.region %>%
  tolower %>%
  tapply(state_df$Income, ., summary)

## $`north central`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4167 4466 4594 4611 4694 5107
##
## $northeast
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3694 4281 4558 4570 4903 5348
##
## $south
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3098 3622 3848 4012 4316 5299
##
## $west
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3601 4347 4660 4703 4963 6315
Example I

x <- seq(-2*pi,2*pi,len=1000)
x %>% sin %>% abs %>% plot(x, ., type="l")
[Figure: line plot of |sin(x)| for x between −2π and 2π; x-axis labelled x, y-axis running from 0.0 to 1.0]
dplyr verbs

• We shall start learning dplyr with the following verbs (functions):
  • slice(): subset rows based on integer indexing
  • filter(): subset rows based on logical criteria
  • select(): select certain columns
  • pull(): pull out an individual column
  • arrange(): order rows by value of a column
  • rename(): rename columns
  • mutate(): create new columns
  • mutate_at(): apply a function to given columns
• The idea is that we can think of data frames as nouns and dplyr verbs as actions that we can apply to manipulate them, which is especially natural when using pipes.
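
Before going through the verbs one at a time, here is a sketch of how several of them chain together on the built-in mtcars data. The name `top_small_cars` is made up for illustration.

```r
library(dplyr)

# Take the cars, keep the 4-cylinder ones, keep two columns,
# sort by fuel efficiency (highest first), and then take the top three
top_small_cars <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg)) %>%
  slice(1:3)

print(top_small_cars)
```

Reading each %>% as "and then" gives the sentence in the comment, which is the sense in which the code reads like a story.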
slice

• We shall use slice() when we want to indicate that certain row numbers need to be kept:

mtcars %>% slice(c(7,8,14:15))

## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4

• This is the same as doing it the old way with base R:

mtcars[c(7,8,14:15),]

## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
slice (Contd.)

• We can also do negative slicing:

mtcars %>% slice(-c(1:2,19:23)) %>% nrow()

## [1] 25

• In base R, we would write:

nrow(mtcars[-c(1:2,19:23),])

## [1] 25
filter

• We shall use filter() when we want to subset rows based on logical conditions:

mtcars %>% filter((mpg >= 14 & disp >= 200)|(drat <= 3)) %>% head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

# note rownames are silenced

• In base R, we would write either of:

head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)
head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3), ], 2)
select

Use select() when you want to pick out certain columns:


mtcars %>% select(cyl, disp, hp) %>% head(2)

## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110

# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)
Handy select() helpers

mtcars %>% select(starts_with("d")) %>% head(2)

## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9

# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
mtcars %>% select(ends_with('t')) %>% head(2)

## drat wt
## Mazda RX4 3.9 2.620
## Mazda RX4 Wag 3.9 2.875

mtcars %>% select(ends_with('yl')) %>% head(2)

## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6

mtcars %>% select(contains('ar')) %>% head(2)

## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
Additional, less important function: pull()
• You can grab a single column from a data frame and get it back as a vector if you use pull()
• select() preserves the data frame structure even with a single column
mtcars %>% pull(mpg)

## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

# Same as: mtcars$mpg

mtcars %>% select(mpg)

## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
arrange

• Use arrange() to order rows by values of a column:

mtcars %>%
arrange(desc(disp)) %>%
select(mpg, disp, drat) %>%
head(2)

## mpg disp drat
## Cadillac Fleetwood 10.4 472 2.93
## Lincoln Continental 10.4 460 3.00

# Base R:
disp_inds <- order(mtcars$disp, decreasing = TRUE)
head(mtcars[disp_inds, c("mpg", "disp", "drat")], 2)
Contd.

• We can order by multiple columns too:

mtcars %>%
arrange(desc(gear), desc(hp)) %>%
select(gear, hp, everything()) %>%
head(8)

## gear hp mpg cyl disp drat wt qsec vs am carb
## Maserati Bora 5 335 15.0 8 301.0 3.54 3.570 14.60 0 1 8
## Ford Pantera L 5 264 15.8 8 351.0 4.22 3.170 14.50 0 1 4
## Ferrari Dino 5 175 19.7 6 145.0 3.62 2.770 15.50 0 1 6
## Lotus Europa 5 113 30.4 4 95.1 3.77 1.513 16.90 1 1 2
## Porsche 914-2 5 91 26.0 4 120.3 4.43 2.140 16.70 0 1 2
## Merc 280 4 123 19.2 6 167.6 3.92 3.440 18.30 1 0 4
## Merc 280C 4 123 17.8 6 167.6 3.92 3.440 18.90 1 0 4
## Mazda RX4 4 110 21.0 6 160.0 3.90 2.620 16.46 0 1 4
mutate()

• Use mutate() when you want to create one or several columns:

mtcars <- mtcars %>%
  mutate(hp_wt = hp/wt,
         mpg_wt = mpg/wt)

# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wt
• mutate() creates a new data.frame with updated/added columns

mtcars <- mtcars %>%
  mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348

# Base R:
mtcars$hp_wt <- 1
• Newly created variables are usable immediately

mtcars <- mtcars %>%
  mutate(hp_wt_correct = hp/wt,
         hp_wt_cyl = hp_wt_correct/cyl)
mtcars %>% head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
## hp_wt_correct hp_wt_cyl
## Mazda RX4 41.98473 6.997455
## Mazda RX4 Wag 38.26087 6.376812

# Base R:
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cyl
mutate_at()

• Use mutate_at() when you want to apply a function to one or several columns:

# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)

mtcars <- mtcars %>%
  mutate_at(c("hp_wt", "mpg_wt"), log)

# Base R:
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)
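
Since dplyr 1.0.0, mutate_at() and the other scoped verbs are marked as superseded in favour of across() inside mutate(). A minimal sketch of the same log transform in that style, assuming dplyr >= 1.0.0. It works on a copy called `cars_df` (a made-up name) so the built-in mtcars is left alone.

```r
library(dplyr)

# Fresh copy with the two ratio columns added
cars_df <- mtcars %>%
  mutate(hp_wt = hp / wt,
         mpg_wt = mpg / wt)

# across() picks the columns; log is applied to each of them in place
cars_log <- cars_df %>%
  mutate(across(c(hp_wt, mpg_wt), log))

print(head(cars_log[, c("hp_wt", "mpg_wt")], 2))
```

mutate_at() still works, so existing code does not need to change.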
rename()

• Use rename() to easily rename columns:

mtcars %>%
rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199
## mpg_wt_log hp_wt_correct hp_wt_cyl
## Mazda RX4 0.7330158 41.98473 6.997455
## Mazda RX4 Wag 0.6873654 38.26087 6.376812

# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)
Important note

• Calling dplyr verbs always outputs a new data frame; it does not alter the existing data frame
• So to keep the changes, we have to reassign the data frame to be the output of the pipe!
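
A small sketch of this note, using a made-up data frame `df`:

```r
library(dplyr)

df <- data.frame(x = 1:3)

# The result is printed, but df itself is untouched
df %>% mutate(y = x * 2)
names_before <- names(df)

# Reassign to keep the change
df <- df %>% mutate(y = x * 2)
names_after <- names(df)

print(names_before)
print(names_after)
```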
tidyr verbs

• Our tidyr journey starts off with learning the following verbs (functions):
  • pivot_longer(): make “wide” data longer
  • pivot_wider(): make “long” data wider
  • separate(): split a single column into multiple columns
  • unite(): combine multiple columns into a single column
• Key takeaway: as with dplyr, think of data frames as nouns and tidyr verbs as actions that you apply to manipulate them, which is especially natural when using pipes
pivot_longer()
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets

##
## Attaching package: ’EDAWR’
## The following object is masked from ’package:dplyr’:
##
## storms
## The following objects are masked from ’package:tidyr’:
##
## population, who

EDAWR::cases %>%
head(3)

## country 2011 2012 2013
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000

EDAWR::cases %>%
pivot_longer(names_to = "year", values_to = "n", cols = 2:4) %>%
head(5)

## # A tibble: 5 x 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
• Here we transposed columns 2:4 into a year column
• We put the corresponding count values into a column called n
• Note tidyr did all the heavy lifting of the transposing work
• We just had to declaratively specify the output
Different approach

# Different approach to do the same thing
EDAWR::cases %>%
  pivot_longer(names_to = "year", values_to = "n", -country) %>%
  head(5)

## # A tibble: 5 x 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
#Could also do:
# EDAWR::cases %>%
# pivot_longer(names_to = "year", values_to = "n", c(`2011`, `2012`, `2013`))
# or:
# EDAWR::cases %>%
# pivot_longer(names_to = "year", values_to = "n", `2011`:`2013`)
pivot_wider()

EDAWR::pollution %>%
head(5)

## city size amount
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
pivot_wider(names_from = "size",
values_from = "amount")

## # A tibble: 3 x 3
## city large small
## <chr> <dbl> <dbl>
## 1 New York 23 14
## 2 London 22 16
## 3 Beijing 121 56
• Here we transposed to a wide format by size
• We tabulated the corresponding amount for each size
• Note tidyr did all the heavy lifting again
• We just had to declaratively specify the output
• Note that pivot_wider() and pivot_longer() are inverses
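
The "inverses" remark can be checked with a round trip. This sketch rebuilds the first three rows of the cases table by hand as `cases_wide` (a made-up name), so that it runs without installing EDAWR.

```r
library(tidyr)

# Hand-typed copy of the three rows of the cases table shown earlier
cases_wide <- data.frame(country = c("FR", "DE", "US"),
                         `2011` = c(7000, 5800, 15000),
                         `2012` = c(6900, 6000, 14000),
                         `2013` = c(7000, 6200, 13000),
                         check.names = FALSE)

# Wide -> long: one row per country and year
cases_long <- pivot_longer(cases_wide, cols = -country,
                           names_to = "year", values_to = "n")

# Long -> wide: back to one column per year
cases_back <- pivot_wider(cases_long, names_from = year, values_from = n)

print(cases_long)
print(cases_back)
```

The result comes back as a tibble rather than a plain data frame, but the columns and values match the starting table.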
When could I use these?

• To visualize things like matrices in ggplot (pivot_longer)
• To make prettier / more readable “tables” (pivot_wider)
• Additionally, if you find yourself getting stuck (with nuanced situations), there are more complicated functions like pivot_wider_spec, etc. for these cases (see Manual Specs)
separate()

Use separate() to split a single column into multiple ones:


EDAWR::storms %>%
head(3)

## storm wind pressure date
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03

storms2 <- EDAWR::storms %>%
  separate(date, c("y", "m", "d")) # sep = "-"
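
The slide does not show what the split table looks like, so here is a sketch on a toy table. `storms_toy` is made up, with the dates stored as text, so that it runs without EDAWR.

```r
library(tidyr)

# Toy version of the storms table
storms_toy <- data.frame(storm = c("Alberto", "Alex"),
                         date = c("2000-08-03", "1998-07-27"))

# Split date into three character columns
storms_split <- separate(storms_toy, date, c("y", "m", "d"), sep = "-")
print(storms_split)

# convert = TRUE asks separate() to guess the column types (integers here)
storms_int <- separate(storms_toy, date, c("y", "m", "d"),
                       sep = "-", convert = TRUE)
print(storms_int)
```

By default the new columns are character, which is why "08" keeps its leading zero in the first result and loses it in the second.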
unite()

Use unite() to combine multiple columns into a single column:


storms2 %>%
unite(date, y, m, d, sep = "-")

## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
group_by()

Use group_by() to define a grouping of rows based on a column:


mtcars %>%
group_by(cyl) %>%
head(4)

## # A tibble: 4 x 15
## # Groups: cyl [2]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1.32 0.733
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1.29 0.687
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1.31 0.826
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1.26 0.640
## # i 2 more variables: hp_wt_correct <dbl>, hp_wt_cyl <dbl>

mtcars %>%
group_by(cyl) %>%
head(4) %>% class

## [1] "grouped_df" "tbl_df" "tbl" "data.frame"


Note

• This doesn’t actually change anything about the way the data frame looks
• The only difference is that when it prints, we’re told about the groups
• But it will play a big role in how other dplyr verbs act
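
As a sketch of that last point, the same mutate() call gives different results with and without a grouping. The column name `mpg_centered` is made up for illustration.

```r
library(dplyr)

# Grouped: mean(mpg) is computed separately within each cyl group
centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  ungroup()

# Ungrouped: the same call subtracts the overall mean instead
overall <- mtcars %>%
  mutate(mpg_centered = mpg - mean(mpg))

print(head(centered$mpg_centered, 3))
print(head(overall$mpg_centered, 3))
```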
summarize()

• Use summarise() (or summarize()) to apply functions to rows, ungrouped or grouped, of a data frame:

# Ungrouped
mtcars %>%
summarize(mpg = mean(mpg),
hp = mean(hp))

## mpg hp
## 1 20.09062 146.6875

# Grouped by number of cylinders
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg),
hp = mean(hp))

## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
contd.

mtcars %>%
group_by(cyl) %>%
summarize(mpg_mean = mean(mpg),
mpg_max = max(mpg),
hp_mean = mean(hp),
hp_max = max(hp))

## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()

Use ungroup() to remove the grouping structure from a data frame:


mtcars %>%
group_by(cyl) %>%
ungroup() %>%
summarize(hp = mean(hp),
mpg = mean(mpg))

## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
Beyond summarize()

mtcars %>%
pull(hp) %>% tapply(INDEX = mtcars$cyl, FUN = summary)

## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.00 65.50 91.00 82.64 96.00 113.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105.0 110.0 110.0 122.3 123.0 175.0
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150.0 176.2 192.5 209.2 241.2 335.0
Contd.

mtcars %>%
  group_by(cyl) %>%
  nest() %>% # creates a list-column "data" holding the sub data frame for each group
  mutate(sum = purrr::map(data, function(df) summary(df$hp)),
         sum_df = purrr::map(sum, broom::tidy)) %>% # unravel things to be data.frames
  select(cyl, sum_df) %>%
  unnest(cols = sum_df)

## # A tibble: 3 x 7
## # Groups: cyl [3]
## cyl minimum q1 median mean q3 maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 105 110 110 122. 123 175
## 2 4 52 65.5 91 82.6 96 113
## 3 8 150 176. 192. 209. 241. 335
Join operations

• A “join” operation in database terminology is, for us, a merging of two data frames. There are 4 types of joins:
  • Inner join (or just join): retain just the rows in each table that match the condition
  • Left outer join (or just left join): retain all rows in the first table, and just the rows in the second table that match the condition
  • Right outer join (or just right join): retain just the rows in the first table that match the condition, and all rows in the second table
  • Full outer join (or just full join): retain all rows in both tables
• Column values that cannot be filled in are assigned NA values
Join
Two toy data frames

has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"),
children = c(3, 1, 3),
stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
age = c(38, 54, 34),
stringsAsFactors = FALSE)
has_kids_tab1

## name children
## 1 Robert Downey, Jr 3
## 2 Scarlett Johansson 1
## 3 Chris Hemsworth 3

americans_tab2

## name age
## 1 Chris Evans 38
## 2 Robert Downey, Jr 54
## 3 Scarlett Johansson 34
inner_join()

Suppose we want to join tab1 and tab2 by name, but keep only the
actors in the intersection (i.e., in both tables):
inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")

## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
left_join()

Suppose we want to join tab1 and tab2 by name, but keep all
actors from tab1:
left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))

## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
right_join()

Suppose we want to join tab1 and tab2 by name, but keep all
actors from tab2:
right_join(x = has_kids_tab1, y = americans_tab2, by = "name")

## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Evans NA 38
full_join()

Finally, suppose we want to join tab1 and tab2 by name, and keep
all actors from both:
full_join(x = has_kids_tab1, y = americans_tab2, by = "name")

## name children age
## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
## 4 Chris Evans NA 38
More nuanced structure

my_peeps <- data.frame(pol = factor(sample(c("R", "D"),
                                           10, replace = TRUE)),
                       gender = factor(sample(c("F", "M"),
                                              10, replace = TRUE)),
                       state = factor(sample(c("AZ", "PA"), 10,
                                             replace = TRUE)),
                       IQ = round(rnorm(n = 10, mean = 100, sd = 10)))

politics <- data.frame(senator = c(
                         "Kyrsten Sinema", "Martha McSally",
                         "Pat Toomey", "Boy Casey Jr."),
                       pol = c("D", "R", "R", "D"),
                       gender = c("F", "F", "M", "M"),
                       STATE = c("AZ", "AZ", "PA", "PA")
)
More nuanced structure

my_peeps %T>% print() %>% dim()

## pol gender state IQ
## 1 R F AZ 93
## 2 R F AZ 116
## 3 R F PA 90
## 4 R F AZ 109
## 5 R F PA 127
## 6 D M AZ 93
## 7 D F PA 103
## 8 D M PA 84
## 9 D F PA 114
## 10 R M PA 91
## [1] 10 4
contd.

politics %T>% print() %>% dim()

## senator pol gender STATE
## 1 Kyrsten Sinema D F AZ
## 2 Martha McSally R F AZ
## 3 Pat Toomey R M PA
## 4 Boy Casey Jr. D M PA
## [1] 4 4
contd.

• %T>% is a special pipe that passes my_peeps into print() as a “side-effect” and then also passes my_peeps on to the rest of the chain (which in this case is just dim())
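
A minimal sketch of the difference between the two pipes. Note that %T>% is exported by magrittr and is not attached by library(tidyverse), so the sketch loads magrittr explicitly. It uses cat() rather than print() as the side-effect function because cat() returns NULL, which makes the difference visible; the variable names are made up.

```r
library(magrittr)

# Tee pipe: cat() runs for its side effect, and the original vector moves on to sum()
tee_total <- c(1, 2, 3) %T>% cat("\n") %>% sum()

# Ordinary pipe: sum() receives whatever cat() returned, which is NULL
plain_total <- c(1, 2, 3) %>% cat("\n") %>% sum()

print(tee_total)
print(plain_total)
```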

my_peeps %>% left_join(politics, by = c("state" = "STATE",
                                        "pol" = "pol")) %>% head(6)

## pol gender.x state IQ senator gender.y
## 1 R F AZ 93 Martha McSally F
## 2 R F AZ 116 Martha McSally F
## 3 R F PA 90 Pat Toomey M
## 4 R F AZ 109 Martha McSally F
## 5 R F PA 127 Pat Toomey M
## 6 D M AZ 93 Kyrsten Sinema F
