Tidy Verse

Tidyverse is a collection of R packages designed for data science, created by Hadley Wickham and his team, facilitating data manipulation, visualization, and interaction. Key packages include dplyr for data wrangling, ggplot2 for visualization, and purrr for list manipulation, all of which provide a consistent API and allow for fluid data processing using the pipe operator. The document also discusses the advantages of using Tidyverse over base R functions and provides examples of using the map family of functions for data iteration.

Uploaded by

Soumyadeep Majumdar

Tidyverse

Presidency University

December, 2024
Introduction

I Tidyverse is a collection of essential R packages for data science.
I The packages under the tidyverse umbrella help us work with and interact with data.
I There are a whole host of things you can do with your data,
such as subsetting, transforming, visualizing, etc.
Introduction (Contd.)

I Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.
I The list of packages includes:
I Data wrangling: dplyr, tidyr, tibble, readr
I Visualization: ggplot2
I List manipulation: purrr
Introduction (Contd.)

If we load the tidyverse package, we can see what’s in there:

options(warn=-1)
library(tidyverse)

## – Attaching core tidyverse packages –––––––––––– tidyverse 2.0.0 –

## v dplyr 1.1.3 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.0
## v ggplot2 3.4.4 v tibble 3.2.1
## v lubridate 1.9.3 v tidyr 1.3.0
## v purrr 1.0.2
## – Conflicts ––––––––––––––––––––– tidyverse_conflicts() –
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Why shall we use tidyverse?

There are several reasons for using tidyverse:

I Packages have a very consistent API
I Function names and commands follow a focused grammar
I Extremely powerful when working with data frames and lists
(matrices, not so much, yet!)
I Allows pipes (%>% operator) to fluidly glue functionality
together
I Very active developer and user community
I Main advantage: at its best, tidyverse data wrangling code can
be read like a story using the pipe operator!
Common iteration tasks
I Generally in R, we iterate over:
I elements of a list
I dimensions of an array (e.g., rows/columns of a matrix)
I sub data frames induced by one or more factors
I All of this is possible in base R, using the apply family of
functions: ‘lapply()‘, ‘sapply()‘, ‘apply()‘, ‘tapply()‘, etc. So
why look anywhere else?
I Answer: because some alternatives offer better consistency:
I With the apply family of functions, there are some
inconsistencies in both the interfaces to the functions, as well
as their outputs
I This can both slow down learning and also lead to inefficiencies
in practice (frequent checking and post-processing of results)
I However, the world isn’t black-and-white: base R still has its
advantages, and the best thing you can do is to be informed
and well-versed in using all the major options!
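A small illustration of the output inconsistency mentioned above, using only base R. The two input lists are made up for this sketch. The point is that the shape of what sapply() returns depends on the data it happens to receive, whereas vapply() states the expected shape explicitly.

```r
# Hypothetical inputs, not part of the slides
same_len  <- list(a = 1:3, b = 4:6)
mixed_len <- list(a = 1:3, b = 4:5)

r1 <- sapply(same_len, range)   # every result has length 2, so we get a matrix
r2 <- sapply(mixed_len, rev)    # results differ in length, so we get a list
r3 <- sapply(list(), length)    # empty input gives an empty list

is.matrix(r1)
is.list(r2)
is.list(r3)

# vapply() declares the expected result (one number per element)
# and stops with an error if a result does not fit
r4 <- vapply(same_len, function(v) sum(v), FUN.VALUE = numeric(1))
r4
```

Code written against sapply() therefore often needs a check on the result type before it can continue, which is the post-processing cost referred to above.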
Why not “plyr”?

I The ‘plyr‘ package used to be one of the most popular (most downloaded) R packages of all time. It was most popular in the late 2000s and early 2010s.
I It is no longer under active development; that development is now happening elsewhere (mainly in the tidyverse). However, we may still use it.
What is “purrr”?

I “purrr” is a package that is part of the tidyverse.

I It offers a family of functions for iterating (mainly over lists)
that can be seen as alternatives to base R’s family of apply
functions
I Compared to base R, they are more consistent
I Compared to “plyr”, they can often be faster
The map family

I ‘purrr‘ offers a family of map functions, which allow you to apply a function across different chunks of data (primarily used with lists).
I Offers an alternative to base R’s apply functions.
I Summary of functions:
I map(): apply a function across elements of a list or vector
I map_dbl(), map_lgl(), map_chr(): same, but return a vector
of a particular data type
I map_dfr(), map_dfc(): same, but return a data frame
map(): list in, list out

I The map() function is an alternative to lapply().
I It has the following simple form: map(x, f), where ‘x‘ is a list or vector, and ‘f‘ is a function. It always returns a list.

my.list = list(nums=seq(0.1,0.6,by=0.1), chars=letters[1:12],
               bools=sample(c(TRUE,FALSE), 6, replace=TRUE))
map(my.list, length)

## $nums
## [1] 6
##
## $chars
## [1] 12
##
## $bools
## [1] 6

I Base R is just as easy

lapply(my.list, length)
map_dbl(): list in, numeric out

I The map_dbl() function is an alternative to sapply().

I It has the form: map_dbl(x, f), where ‘x‘ is a list or vector,
and ‘f‘ is a function that returns a numeric value (when
applied to each element of ‘x‘)
I Similarly:
I map_int() returns an integer vector
I map_lgl() returns a logical vector
I map_chr() returns a character vector
Example
map_dbl(my.list, length)

## nums chars bools

## 6 12 6

map_chr(my.list, length)

## nums chars bools

## "6" "12" "6"

# Base R is a bit more complicated

as.numeric(sapply(my.list, length))

## [1] 6 12 6

as.numeric(unlist(lapply(my.list, length)))

## [1] 6 12 6

vapply(my.list, FUN=length, FUN.VALUE=numeric(1))

## nums chars bools

## 6 12 6
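The typed variants also act as a check: if the function does not return a single value of the requested type, the call stops with an error instead of quietly returning something else. A minimal sketch with a made-up list (scores is not part of the slides):

```r
library(purrr)

# Hypothetical input
scores <- list(a = c(1, 2, 3), b = c(10, 20))

means <- map_dbl(scores, mean)     # named double vector
sizes <- map_int(scores, length)   # named integer vector

# range() returns two values per element, so map_dbl() refuses it;
# tryCatch() turns the error into a plain string for display
res <- tryCatch(map_dbl(scores, range),
                error = function(e) "error")
res
```

With sapply() the same call would have returned a 2 x 2 matrix without complaint.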
Applying a custom function

library(repurrrsive) # Load Game of Thrones data set

class(got_chars)

## [1] "list"

class(got_chars[[1]])

## [1] "list"
Contd.

names(got_chars[[1]])

## [1] "url" "id" "name" "gender" "culture"

## [6] "born" "died" "alive" "titles" "aliases"
## [11] "father" "mother" "spouse" "allegiances" "books"
## [16] "povBooks" "tvSeries" "playedBy"

map_chr(got_chars, function(x) { return(x$name) })

## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"

## [4] "Will" "Areo Hotah" "Chett"
## [7] "Cressen" "Arianne Martell" "Daenerys Targaryen"
## [10] "Davos Seaworth" "Arya Stark" "Arys Oakheart"
## [13] "Asha Greyjoy" "Barristan Selmy" "Varamyr"
## [16] "Brandon Stark" "Brienne of Tarth" "Catelyn Stark"
## [19] "Cersei Lannister" "Eddard Stark" "Jaime Lannister"
## [22] "Jon Connington" "Jon Snow" "Aeron Greyjoy"
## [25] "Kevan Lannister" "Melisandre" "Merrett Frey"
## [28] "Quentyn Martell" "Samwell Tarly" "Sansa Stark"
Extracting elements

I Handily, the map functions all allow the second argument to be an integer or string, and treat this internally as an appropriate extractor function

map_chr(got_chars, "name")

## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"

map_lgl(got_chars, "alive")

## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
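To make the shorthand concrete, here is a sketch on a tiny made-up list (people is hypothetical and only stands in for got_chars). A string extracts by name, an integer extracts by position, and both give the same result as writing the extractor function out by hand.

```r
library(purrr)

# Hypothetical records standing in for got_chars
people <- list(
  list(name = "Ann", age = 31L, alive = TRUE),
  list(name = "Bob", age = 45L, alive = FALSE)
)

by_name  <- map_chr(people, "name")                    # string: extract by name
by_pos   <- map_int(people, 2)                         # integer: extract the 2nd component
explicit <- map_chr(people, function(p) p[["name"]])   # the long form of the first call

identical(by_name, explicit)
```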
Contd.

I Interestingly, we can do the same in base R: `[`() and `[[`() are functions that act in the following way for a list or vector x and index i
I `[`(x, i) is equivalent to x[i]
I `[[`(x, i) is equivalent to x[[i]] (this works whether i is an integer or a string)

sapply(got_chars, '[[', "name")

## [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"

sapply(got_chars, '[[', "alive")

## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [13] TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
## [25] FALSE TRUE FALSE FALSE TRUE TRUE
map_dfr() and map_dfc(): list in, data frame out

I The map_dfr() and map_dfc() functions iterate a function call over a list or vector, but automatically combine the results into a data frame. They differ in whether that data frame is formed by row-binding or column-binding

map_dfr(got_chars, `[`, c("name", "alive"))

## # A tibble: 30 x 2
## name alive
## <chr> <lgl>
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## # i 20 more rows
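Since purrr 1.0.0 the package documentation marks map_dfr() and map_dfc() as superseded and suggests map() followed by list_rbind() or list_cbind() instead. A sketch of the same row-binding in that style, on a small made-up list (chars is hypothetical and stands in for got_chars):

```r
library(purrr)

# Hypothetical records standing in for got_chars
chars <- list(
  list(name = "Ann", alive = TRUE,  titles = c("x", "y")),
  list(name = "Bob", alive = FALSE, titles = character(0))
)

# Step 1: build one single-row data frame per element
rows <- map(chars, function(ch) as.data.frame(ch[c("name", "alive")]))

# Step 2: bind the rows into one data frame
tab <- list_rbind(rows)
tab
```

The older functions still work, so the slide code above remains valid; the two-step form only makes the binding step explicit.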
Contd.
# Base R is much less convenient
data.frame(name = sapply(got_chars, `[[`, "name"),
alive = sapply(got_chars, `[[`, "alive"))

## name alive
## 1 Theon Greyjoy TRUE
## 2 Tyrion Lannister TRUE
## 3 Victarion Greyjoy TRUE
## 4 Will FALSE
## 5 Areo Hotah TRUE
## 6 Chett FALSE
## 7 Cressen FALSE
## 8 Arianne Martell TRUE
## 9 Daenerys Targaryen TRUE
## 10 Davos Seaworth TRUE
## 11 Arya Stark TRUE
## 12 Arys Oakheart FALSE
## 13 Asha Greyjoy TRUE
## 14 Barristan Selmy TRUE
## 15 Varamyr FALSE
## 16 Brandon Stark TRUE
## 17 Brienne of Tarth TRUE
## 18 Catelyn Stark FALSE
## 19 Cersei Lannister TRUE
## 20 Eddard Stark FALSE
## 21 Jaime Lannister TRUE
## 22 Jon Connington TRUE
## 23 Jon Snow TRUE
## 24 Aeron Greyjoy TRUE
## 25 Kevan Lannister FALSE
## 26 Melisandre TRUE
## 27 Merrett Frey FALSE
## 28 Quentyn Martell FALSE
## 29 Samwell Tarly TRUE
Data wrangling the tidy way

I dplyr and tidyr are going to be our main workhorses for data
wrangling
I The main structure these packages use is the data frame (or
tibble, but we won’t go there)
I Two keys to getting started:
I learn about pipes %>%
I learn the dplyr verbs
I dplyr functions are analogous to SQL counterparts, so learn
dplyr and get SQL for free.
piping operator

I Fundamentally, piping takes the return value of one function and automatically feeds it in as an input to another function, to form a flow of results. The operator looks like this: %>%.
I This operator actually comes from the magrittr package
(automatically included in tidyverse)
I So it can be used on its own, completely independently of the
tidyverse.
I However tidyverse functions are at their best when composed
together using the pipe operator.
Single argument

While passing a single argument through pipes, we interpret something like:

x %>% f %>% g %>% h

as h(g(f(x))). This means that whenever we see %>% we can read it as “and then”.
Example

I We can write log(1) with pipes as 1 %>% log(), and exp(sin(π)) as pi %>% sin() %>% exp()

exp(1)

## [1] 2.718282

1 %>% exp()

## [1] 2.718282

1 %>% exp() %>% log()

## [1] 1
Multiple argument

When we have multi-argument functions, we interpret something like:

x %>% f(y)

as f(x, y).
Example

Recall that what we used to write as

head(mtcars,4)

can be alternatively written as

mtcars %>% head(4)

##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Alternative

Alternatively we can write the command x %>% f(y) using dot

notation as:

x %>% f(., y)
Alternative (Contd.)

I What’s the advantage of using dots?

I Sometimes you want to pass in a variable as the second or
third (say, not first) argument to a function, with a pipe.
I For example, x %>% f(y, .) is equivalent to f(y, x).
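A short sketch of the dot rule with ordinary base R functions (the inputs are made up). As the magrittr documentation describes it, a dot used directly as an argument decides where the left-hand side goes, while a dot that appears only inside a nested call leaves the usual first-argument placement in force.

```r
library(magrittr)

# Dot used directly as an argument: the left-hand side goes there
a <- "banana" %>% sub("an", "AN", .)    # same as sub("an", "AN", "banana")
b <- "-" %>% paste("x", "y", sep = .)   # same as paste("x", "y", sep = "-")

# Dot only inside a nested call: the left-hand side is still the first argument
d <- c(3, 1, 2) %>% head(length(.) - 1) # same as head(c(3, 1, 2), 2)

a
b
d
```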
Example

state_df <- data.frame(state.x77)

state.region %>%
tolower %>%
tapply(state_df$Income, ., summary)

## $`north central`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4167 4466 4594 4611 4694 5107
##
## $northeast
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3694 4281 4558 4570 4903 5348
##
## $south
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3098 3622 3848 4012 4316 5299
##
## $west
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3601 4347 4660 4703 4963 6315
Example I

x <- seq(-2*pi,2*pi,len=1000)
x %>% sin %>% abs %>% plot(x, ., type="l")
[Figure: line plot of |sin(x)| against x, for x from −2π to 2π; the y-axis runs from 0.0 to 1.0]
dplyr verbs

I We shall start learning dplyr with the following verbs (functions):
I slice(): subset rows based on integer indexing
I filter(): subset rows based on logical criteria
I select(): select certain columns
I pull(): pull out an individual column
I arrange(): order rows by value of a column
I rename(): rename columns
I mutate(): create new columns
I mutate_at(): apply a function to given columns
I The idea is we can think of data frames as nouns and dplyr
verbs as actions that we can apply to manipulate
them—especially natural when using pipes.
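Before going through the verbs one by one, here is a sketch of how several of them chain together on mtcars. The particular filter, the column choice and the names light_fours and mpg_per_wt are only for illustration.

```r
library(dplyr)

light_fours <- mtcars %>%
  filter(cyl == 4) %>%                # keep the 4-cylinder cars
  select(mpg, wt) %>%                 # keep two columns
  mutate(mpg_per_wt = mpg / wt) %>%   # add a derived column
  arrange(desc(mpg)) %>%              # highest mpg first
  slice(1:3)                          # keep the first three rows

light_fours
```

Read from top to bottom, each line is one "and then" step applied to the data frame produced by the line before it.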
slice

I We shall use slice() when we want to indicate that certain row numbers need to be kept:

mtcars %>% slice(c(7,8,14:15))

## mpg cyl disp hp drat wt qsec vs am gear carb

## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
## Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4

I This is the same as the old way with base R:

mtcars[c(7,8,14:15),]

## mpg cyl disp hp drat wt qsec vs am gear carb

I We can also do negative slicing:

mtcars %>% slice(-c(1:2,19:23)) %>% nrow()

## [1] 25

I In base R we would write:

nrow(mtcars[-c(1:2,19:23),])

## [1] 25
filter

I We shall use filter() when we want to subset rows based on logical conditions:

mtcars %>% filter((mpg >= 14 & disp >= 200)|(drat <= 3)) %>% head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb

## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2

# note: row names are kept here, as the output above shows

I If we do it in base R we shall write

head(subset(mtcars, (mpg >= 14 & disp >= 200) | (drat <= 3)), 2)
head(mtcars[(mtcars$mpg >= 14 & mtcars$disp >= 200) | (mtcars$drat <= 3), ], 2)
select

Use select() when you want to pick out certain columns:

mtcars %>% select(cyl, disp, hp) %>% head(2)

## cyl disp hp
## Mazda RX4 6 160 110
## Mazda RX4 Wag 6 160 110

# Base R:
head(mtcars[, c("cyl", "disp", "hp")], 2)
Handy select() helpers

mtcars %>% select(starts_with("d")) %>% head(2)

## disp drat
## Mazda RX4 160 3.9
## Mazda RX4 Wag 160 3.9

# Base R (yikes!):
d_colnames <- grep(x = colnames(mtcars), pattern = "^d")
head(mtcars[, d_colnames], 2)
mtcars %>% select(ends_with('t')) %>% head(2)

## drat wt
## Mazda RX4 3.9 2.620
## Mazda RX4 Wag 3.9 2.875

mtcars %>% select(ends_with('yl')) %>% head(2)

## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6

mtcars %>% select(contains('ar')) %>% head(2)

## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
Additional, less important function: pull()
I You can grab a single column from a data frame and get it
back as a vector if you use pull
I select preserves column structure even with a single column
mtcars %>% pull(mpg)

## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

# Same as: mtcars$mpg

mtcars %>% select(mpg)

## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
## Duster 360 14.3
## Merc 240D 24.4
## Merc 230 22.8
## Merc 280 19.2
## Merc 280C 17.8
## Merc 450SE 16.4
## Merc 450SL 17.3
## Merc 450SLC 15.2
## Cadillac Fleetwood 10.4
arrange

I Use arrange() to order rows by values of a column:

mtcars %>%
arrange(desc(disp)) %>%
select(mpg, disp, drat) %>%
head(2)

## mpg disp drat

## Cadillac Fleetwood 10.4 472 2.93
## Lincoln Continental 10.4 460 3.00

# Base R:
disp_inds <- order(mtcars$disp, decreasing = TRUE)
head(mtcars[disp_inds, c("mpg", "disp", "drat")], 2)
Contd.

I We can order by multiple columns too:

mtcars %>%
arrange(desc(gear), desc(hp)) %>%
select(gear, hp, everything()) %>%
head(8)

## gear hp mpg cyl disp drat wt qsec vs am carb

## Maserati Bora 5 335 15.0 8 301.0 3.54 3.570 14.60 0 1 8
## Ford Pantera L 5 264 15.8 8 351.0 4.22 3.170 14.50 0 1 4
## Ferrari Dino 5 175 19.7 6 145.0 3.62 2.770 15.50 0 1 6
## Lotus Europa 5 113 30.4 4 95.1 3.77 1.513 16.90 1 1 2
## Porsche 914-2 5 91 26.0 4 120.3 4.43 2.140 16.70 0 1 2
## Merc 280 4 123 19.2 6 167.6 3.92 3.440 18.30 1 0 4
## Merc 280C 4 123 17.8 6 167.6 3.92 3.440 18.90 1 0 4
## Mazda RX4 4 110 21.0 6 160.0 3.90 2.620 16.46 0 1 4
mutate()

I Use mutate() when you want to create one or several columns:

mtcars <- mtcars %>%

mutate(hp_wt = hp/wt,
mpg_wt = mpg/wt)

# Base R:
mtcars$hp_wt <- mtcars$hp/mtcars$wt
mtcars$mpg_wt <- mtcars$mpg/mtcars$wt
I mutate() creates a new data frame with updated/added columns

mtcars <- mtcars %>%

mutate(hp_wt = 1) # update hp_wt to just the one value
mtcars %>% head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt

## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348

# Base R:
mtcars$hp_wt <- 1
I Newly created variables are usable immediately, within the same mutate() call

mtcars <- mtcars %>%

mutate(hp_wt_correct = hp/wt,
hp_wt_cyl = hp_wt_correct/cyl)
mtcars %>% head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt

## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1 8.015267
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1 7.304348
## hp_wt_correct hp_wt_cyl
## Mazda RX4 41.98473 6.997455
## Mazda RX4 Wag 38.26087 6.376812

# Base R:
mtcars$hp_wt_correct <- mtcars$hp/mtcars$wt
mtcars$hp_wt_cyl <- mtcars$hp_wt_correct/mtcars$cyl
mutate_at()

I Use mutate_at() when you want to apply a function to one or several columns:

# correction
mtcars <- mtcars %>% mutate(hp_wt = hp_wt_correct)

mtcars <- mtcars %>%

mutate_at(c("hp_wt", "mpg_wt"), log)

# Base R (use instead of the call above; running both applies log twice):
mtcars$hp_wt <- log(mtcars$hp_wt)
mtcars$mpg_wt <- log(mtcars$mpg_wt)
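Since dplyr 1.0.0 the package documentation lists mutate_at() as superseded by across() used inside mutate(). A sketch of the same log transform in that style; it works on a fresh copy of the data (called cars here, a name chosen for this example) so that it does not depend on the earlier changes to mtcars.

```r
library(dplyr)

# Fresh copy with the two ratio columns from the earlier slides
cars <- datasets::mtcars %>%
  mutate(hp_wt = hp / wt, mpg_wt = mpg / wt)

# across() picks the columns; log is applied to each of them in place
logged <- cars %>%
  mutate(across(c(hp_wt, mpg_wt), log))

head(logged[, c("hp_wt", "mpg_wt")], 2)
```

mutate_at() still works, so this is an alternative spelling and not a required change.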
rename()

I Use rename() to easily rename columns:

mtcars %>%
rename(hp_wt_log = hp_wt, mpg_wt_log = mpg_wt) %>%
head(2)

## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt_log

## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 1.318365
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4 1.293199
## mpg_wt_log hp_wt_correct hp_wt_cyl
## Mazda RX4 0.7330158 41.98473 6.997455
## Mazda RX4 Wag 0.6873654 38.26087 6.376812

# Base R:
colnames(mtcars)[colnames(mtcars) == "hp_wt"] <- "hp_wt_log"
colnames(mtcars)[colnames(mtcars) == "mpg_wt"] <- "mpg_wt_log"
head(mtcars, 2)
Important note

I Calling a dplyr verb always outputs a new data frame; it does not alter the existing data frame
I So to keep the changes, we have to reassign the data frame to be the output of the pipe!
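A two-step sketch of this point on a throwaway data frame (df and its columns are made up for the example):

```r
library(dplyr)

df <- data.frame(x = 1:3)

df %>% mutate(y = x * 2)         # the result is printed and then discarded
names_before <- names(df)        # df itself still has only the column x

df <- df %>% mutate(y = x * 2)   # reassigning keeps the change
names_after <- names(df)         # now df has the columns x and y

names_before
names_after
```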
tidyr verbs

I Our tidyr journey starts off with learning the following verbs (functions):
I pivot_longer(): make “wide” data longer
I pivot_wider(): make “long” data wider
I separate(): split a single column into multiple columns
I unite(): combine multiple columns into a single column
I Key takeaway: as with dplyr, think of data frames as nouns
and tidyr verbs as actions that you apply to manipulate
them—especially natural when using pipes
pivot_longer()
# devtools::install_github("rstudio/EDAWR")
library(EDAWR) # Load some nice data sets

##
## Attaching package: ’EDAWR’
## The following object is masked from ’package:dplyr’:
##
## storms
## The following objects are masked from ’package:tidyr’:
##
## population, who

EDAWR::cases %>%
head(3)

## country 2011 2012 2013

## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000

EDAWR::cases %>%
pivot_longer(names_to = "year", values_to = "n", cols = 2:4) %>%
head(5)

## # A tibble: 5 x 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
I Here we transposed columns 2:4 into a year column
I We put the corresponding count values into a column called n
I Note tidyr did all the heavy lifting of the transposing work
I We just had to declaratively specify the output
Different approach

# Different approach to do the same thing

EDAWR::cases %>%
pivot_longer(names_to = "year", values_to = "n", -country) %>%
head(5)

## # A tibble: 5 x 3
## country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
#Could also do:
# EDAWR::cases %>%
# pivot_longer(names_to = "year", values_to = "n", c(`2011`, `2012`, `2013`))
# or:
# EDAWR::cases %>%
# pivot_longer(names_to = "year", values_to = "n", `2011`:`2013`)
pivot_wider()

EDAWR::pollution %>%
head(5)

## city size amount

## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
EDAWR::pollution %>%
pivot_wider(names_from = "size",
values_from = "amount")

## # A tibble: 3 x 3
## city large small
## <chr> <dbl> <dbl>
## 1 New York 23 14
## 2 London 22 16
## 3 Beijing 121 56
I Here we transposed to a wide format by size
I We tabulated the corresponding amount for each size
I Note tidyr did all the heavy lifting again
I We just had to declaratively specify the output
I Note that pivot_wider() and pivot_longer() are inverses
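The inverse relationship can be checked with a round trip. This sketch rebuilds the first three rows of the cases table by hand (values copied from the output shown earlier), so that it runs without EDAWR installed. Column types can change on the way (year becomes a character column in the long form), so "inverse" is meant in the sense of getting the same table layout and values back.

```r
library(tidyr)

# Stand-in for the first rows of EDAWR::cases
cases <- data.frame(country = c("FR", "DE", "US"),
                    `2011` = c(7000, 5800, 15000),
                    `2012` = c(6900, 6000, 14000),
                    `2013` = c(7000, 6200, 13000),
                    check.names = FALSE)

# Wide to long: one row per country and year
long <- pivot_longer(cases, cols = -country,
                     names_to = "year", values_to = "n")

# Long back to wide: one column per year again
wide <- pivot_wider(long, names_from = year, values_from = n)

dim(long)
dim(wide)
```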
When could I use these?

I To visualize things like matrices in ggplot (pivot_longer)
I To make prettier / more readable “tables” (pivot_wider)
I Additionally, if you find yourself getting stuck (with nuanced situations), there are more complicated functions like pivot_wider_spec, etc. for these cases (see Manual Specs)
separate()

Use separate() to split a single column into multiple ones:

EDAWR::storms %>%
head(3)

## storm wind pressure date

## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03

storms2 <- EDAWR::storms %>%

separate(date, c("y", "m", "d")) # sep = "-"
unite()

Use unite() to combine multiple columns into a single column:

storms2 %>%
unite(date, y, m, d, sep = "-")

## # A tibble: 6 x 4
## storm wind pressure date
## <chr> <int> <int> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
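separate() and unite() undo each other when the separator is the same. A sketch on a two-row stand-in for the storms table, with the date stored as text (storms_small is made up for this example). Recent tidyr releases (1.3.0 onward, as far as I know) mark separate() as superseded by separate_wider_delim(); the call below still runs.

```r
library(tidyr)

# Stand-in for EDAWR::storms, with the date stored as text
storms_small <- data.frame(storm = c("Alberto", "Alex"),
                           date  = c("2000-08-03", "1998-07-27"))

# Split the date into three character columns
parts <- separate(storms_small, date, c("y", "m", "d"), sep = "-")

# Glue them back together with the same separator
back <- unite(parts, date, y, m, d, sep = "-")

parts
back
```

Note that the pieces come back as character columns; convert them with as.integer() if numbers are needed.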
group_by()

Use group_by() to define a grouping of rows based on a column:

mtcars %>%
group_by(cyl) %>%
head(4)

## # A tibble: 4 x 15
## # Groups: cyl [2]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_wt mpg_wt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1.32 0.733
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 1.29 0.687
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 1.31 0.826
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 1.26 0.640
## # i 2 more variables: hp_wt_correct <dbl>, hp_wt_cyl <dbl>

mtcars %>%
group_by(cyl) %>%
head(4) %>% class

## [1] "grouped_df" "tbl_df" "tbl" "data.frame"

Note

I This doesn’t actually change anything about the values in the data frame
I The only difference is that when it prints, we’re told about the groups
I But it will play a big role in how other dplyr verbs act
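A sketch of what "how other dplyr verbs act" means in practice, using mutate() and slice() on a grouped copy of the original mtcars (the names by_cyl, centered and firsts are chosen for this example):

```r
library(dplyr)

by_cyl <- datasets::mtcars %>% group_by(cyl)

# On grouped data, mean(mpg) is computed within each cyl group
centered <- by_cyl %>%
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  ungroup()

# On grouped data, slice(1) takes the first row of each group
firsts <- by_cyl %>% slice(1)

nrow(firsts)   # one row per distinct value of cyl
```

Without group_by(), the same mutate() would subtract the overall mean and the same slice(1) would return a single row.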
summarize()

I Use summarise() (or summarize()) to apply functions to rows (ungrouped or grouped) of a data frame:

# Ungrouped
mtcars %>%
summarize(mpg = mean(mpg),
hp = mean(hp))

## mpg hp
## 1 20.09062 146.6875

# Grouped by number of cylinders

mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg),
hp = mean(hp))

## # A tibble: 3 x 3
## cyl mpg hp
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
contd.

mtcars %>%
group_by(cyl) %>%
summarize(mpg_mean = mean(mpg),
mpg_max = max(mpg),
hp_mean = mean(hp),
hp_max = max(hp))

## # A tibble: 3 x 5
## cyl mpg_mean mpg_max hp_mean hp_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 26.7 33.9 82.6 113
## 2 6 19.7 21.4 122. 175
## 3 8 15.1 19.2 209. 335
ungroup()

Use ungroup() to remove the grouping structure from a data frame:

mtcars %>%
group_by(cyl) %>%
ungroup() %>%
summarize(hp = mean(hp),
mpg = mean(mpg))

## # A tibble: 1 x 2
## hp mpg
## <dbl> <dbl>
## 1 147. 20.1
Beyond summarize()

mtcars %>%
pull(hp) %>% tapply(INDEX = mtcars$cyl, FUN = summary)

## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.00 65.50 91.00 82.64 96.00 113.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105.0 110.0 110.0 122.3 123.0 175.0
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150.0 176.2 192.5 209.2 241.2 335.0
Contd.

mtcars %>%
group_by(cyl) %>%
nest() %>% # creates a column with the data conditional of subset
mutate(sum = purrr::map(data, function(df) summary(df$hp)),
sum_df = purrr::map(sum, broom::tidy)) %>% # unravel things to be data.frames
select(cyl, sum_df) %>%
unnest(cols = sum_df)

## # A tibble: 3 x 7
## # Groups: cyl [3]
## cyl minimum q1 median mean q3 maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 105 110 110 122. 123 175
## 2 4 52 65.5 91 82.6 96 113
## 3 8 150 176. 192. 209. 241. 335
Join operations

I A “join” operation in database terminology is, for us, a merging of two data frames. There are 4 types of joins:
I Inner join (or just join): retain just the rows of each table that match the condition
I Left outer join (or just left join): retain all rows in the first
table, and just the rows in the second table that match the
condition
I Right outer join (or just right join): retain just the rows in the
first table that match the condition, and all rows in the second
table
I Full outer join (or just full join): retain all rows in both tables
I Column values that cannot be filled in are assigned NA values
Join
Two toy data frames

has_kids_tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"),
children = c(3, 1, 3),
stringsAsFactors = FALSE)
americans_tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
age = c(38, 54, 34),
stringsAsFactors = FALSE)
has_kids_tab1

## name children
## 1 Robert Downey, Jr 3
## 2 Scarlett Johansson 1
## 3 Chris Hemsworth 3

americans_tab2

## name age
## 1 Chris Evans 38
## 2 Robert Downey, Jr 54
## 3 Scarlett Johansson 34
inner_join()

Suppose we want to join tab1 and tab2 by name, but keep only
actors in intersection (aka in both tables):
inner_join(x = has_kids_tab1, y = americans_tab2, by = "name")

## name children age

## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
left_join()

Suppose we want to join tab1 and tab2 by name, but keep all
actors from tab1:
left_join(x = has_kids_tab1, y = americans_tab2, by = c("name" = "name"))

## name children age

## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
right_join()

Suppose we want to join tab1 and tab2 by name, but keep all
actors from tab2:
right_join(x = has_kids_tab1, y = americans_tab2, by = "name")

## name children age

## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Evans NA 38
full_join()

Finally, suppose we want to join tab1 and tab2 by name, and keep
all actors from both:
full_join(x = has_kids_tab1, y = americans_tab2, by = "name")

## name children age

## 1 Robert Downey, Jr 3 54
## 2 Scarlett Johansson 1 34
## 3 Chris Hemsworth 3 NA
## 4 Chris Evans NA 38
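For comparison with base R, as in the earlier sections: the four joins correspond to merge() with different all arguments. The sketch rebuilds the two toy tables under shorter names (tab1 and tab2) so that it runs on its own. One difference to keep in mind is that merge() sorts the result by the key column by default, while the dplyr joins keep the row order of the first table.

```r
tab1 <- data.frame(name = c("Robert Downey, Jr", "Scarlett Johansson", "Chris Hemsworth"),
                   children = c(3, 1, 3))
tab2 <- data.frame(name = c("Chris Evans", "Robert Downey, Jr", "Scarlett Johansson"),
                   age = c(38, 54, 34))

inner <- merge(tab1, tab2, by = "name")                 # like inner_join()
left  <- merge(tab1, tab2, by = "name", all.x = TRUE)   # like left_join()
right <- merge(tab1, tab2, by = "name", all.y = TRUE)   # like right_join()
full  <- merge(tab1, tab2, by = "name", all = TRUE)     # like full_join()

# Number of rows kept by each kind of join
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```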
More nuanced structure

my_peeps <- data.frame(pol = factor(sample(c("R", "D"), 10, replace = T)),
                       gender = factor(sample(c("F", "M"), 10, replace = T)),
                       state = factor(sample(c("AZ", "PA"), 10, replace = T)),
                       IQ = round(rnorm(n = 10, mean = 100, sd = 10)))

politics <- data.frame(senator = c("Kyrsten Sinema", "Martha McSally",
                                   "Pat Toomey", "Boy Casey Jr."),
                       pol = c("D", "R", "R", "D"),
                       gender = c("F", "F", "M", "M"),
                       STATE = c("AZ", "AZ", "PA", "PA"))
More nuanced structure

my_peeps %T>% print() %>% dim()

## pol gender state IQ

## 1 R F AZ 93
## 2 R F AZ 116
## 3 R F PA 90
## 4 R F AZ 109
## 5 R F PA 127
## 6 D M AZ 93
## 7 D F PA 103
## 8 D M PA 84
## 9 D F PA 114
## 10 R M PA 91
## [1] 10 4
contd.

politics %T>% print() %>% dim()

## senator pol gender STATE

## 1 Kyrsten Sinema D F AZ
## 2 Martha McSally R F AZ
## 3 Pat Toomey R M PA
## 4 Boy Casey Jr. D M PA
## [1] 4 4
contd.

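I %T>% is a special pipe (the “tee” pipe from magrittr) that passes my_peeps into print() as a “side-effect” and then also passes my_peeps on to the rest of the chain (which in this case is just dim()). It is not attached by library(tidyverse), so it needs library(magrittr)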

my_peeps %>% left_join(politics, by = c("state" = "STATE",
                                        "pol" = "pol")) %>% head(6)

## pol gender.x state IQ senator gender.y

## 1 R F AZ 93 Martha McSally F
## 2 R F AZ 116 Martha McSally F
## 3 R F PA 90 Pat Toomey M
## 4 R F AZ 109 Martha McSally F
## 5 R F PA 127 Pat Toomey M
## 6 D M AZ 93 Kyrsten Sinema F
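The gender.x and gender.y columns appear because gender exists in both tables but is not part of the key. A sketch on miniature, made-up versions of the two tables (peeps, sens and the senator labels S1 to S4 are hypothetical) showing what changes when gender is added to the key:

```r
library(dplyr)

# Hypothetical miniature versions of my_peeps and politics
peeps <- data.frame(pol = c("R", "D"), gender = c("F", "M"),
                    state = c("AZ", "PA"))
sens  <- data.frame(senator = c("S1", "S2", "S3", "S4"),
                    pol     = c("D", "R", "R", "D"),
                    gender  = c("F", "F", "M", "M"),
                    STATE   = c("AZ", "AZ", "PA", "PA"))

# gender is shared but not a key: both copies are kept, with suffixes
two_keys <- left_join(peeps, sens,
                      by = c("state" = "STATE", "pol" = "pol"))
names(two_keys)

# gender is part of the key: rows must also agree on gender
three_keys <- left_join(peeps, sens,
                        by = c("state" = "STATE", "pol" = "pol",
                               "gender" = "gender"))
names(three_keys)
```

Which version is right depends on the question: the two-key join matches each person to the senator of their party and state, while the three-key join only keeps a match when the gender agrees as well.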
