Https Tutorials Iq Harvard Edu R RDataManagement RDataManagement HTML
R Data Management
Prerequisites and Preparation
This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we
recommend the IQSS R intro.
The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is
provided separately for boys and girls and is stored in Excel spreadsheets.
I have downloaded all the Excel files containing boys' names data from
[Link]
and made them available at [Link]
Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several
things that make this challenging.
Exercise 0
1. Download and extract the data from [Link]
## You can download the file using a web browser, and extract using your file browser.
## For bonus points you can try doing it using R, but this is not required.
2. Locate the file named 1996boys_tcm77-[Link] and open it in a spreadsheet. (If you don’t have a spreadsheet
program installed on your computer you can download one from [Link].) What
issues can you identify that might make working with these data more difficult?
3. Locate the file named [Link] and open it in a spreadsheet. In what ways is the format different than
the format of 1996boys_tcm77-[Link] ? How might these differences make it more difficult to work with these
data?
Exercise 0 solution
1. Download and extract the data from
[Link]
The data does not start on row one. Headers are on row 7, followed by a blank line, followed by the actual data.
The data is stored in an inconvenient way, with ranks 1-50 in the first set of columns and ranks 51-100 in a separate set of
columns.
The worksheet containing the data of interest is in different positions and has different names from one year to the next. However,
it always includes “Table 1” in the worksheet name.
These differences will make it more difficult to automate re-arranging the data since we have to write code that can handle different
input formats.
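For the bonus part of step 1, the download and extraction can be sketched in base R. This is a minimal sketch, not the tutorial's own solution: `fetch_and_extract` is a hypothetical helper name, and the archive URL is left as a placeholder because it must be taken from the exercise text.

```r
# Hypothetical helper (not from the tutorial): download a zip archive
# and extract it into `exdir`, returning the extracted file paths.
fetch_and_extract <- function(url, exdir = "data") {
  zip_path <- tempfile(fileext = ".zip")
  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = exdir)
}

# Usage, with the archive URL from the exercise substituted in:
# extracted <- fetch_and_extract("<archive URL>", exdir = "data")
```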
library(tidyverse)
Now that we’ve told R the names of the data files we can start working with them. For example, the first file is
[Link][[1]]
## [1] "data/boys/1996boys_tcm77-[Link]"
and we can use the excel_sheets function from the readxl package to list the worksheet names from this file.
library(readxl)
excel_sheets([Link][[1]])
excel_sheets([Link][[1]])
excel_sheets([Link][[2]])
## ...
excel_sheets([Link][[20]])
This is not a terrible idea for a small number of files, but it is more convenient to let R do the iteration for us. We could use a for
loop, or sapply, but the map family of functions from the purrr package gives us a more consistent alternative, so we’ll use that.
library(purrr)
map([Link], excel_sheets)
## [[1]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
##
## [[2]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[3]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[4]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[5]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[6]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[7]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[8]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[9]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[10]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[11]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[12]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[13]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[14]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[15]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[16]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[17]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[18]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[19]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[20]]
## [1] "Contents" "Metadata" "Terms and Conditions"
## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"
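To make the comparison between a for loop and map concrete, here is a small sketch on made-up inputs. The file names are invented stand-ins and `nchar` replaces `excel_sheets` so that the example runs without any Excel files.

```r
library(purrr)

# Made-up inputs standing in for the vector of file names.
inputs <- c("a.xlsx", "bb.xlsx", "ccc.xlsx")

# Explicit loop: pre-allocate a list, then fill it element by element.
loop_result <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
  loop_result[[i]] <- nchar(inputs[[i]])
}

# map() performs the same iteration in one call and always returns a list.
map_result <- map(inputs, nchar)

identical(loop_result, map_result)
```

Both versions produce a list with one element per input; the map version simply removes the bookkeeping.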
The stringr package provides functions to detect, locate, extract, match, replace, combine and split strings (among other things).
Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the str_subset
function. The first argument to str_subset is the character vector we want to search in. The second argument is a regular
expression matching the pattern we want to retain.
If you are not familiar with regular expressions, [Link] is a good place to start.
Now that we know how to filter character vectors using str_subset we can identify the correct sheet in a particular Excel file.
For example,
library(stringr)
str_subset(excel_sheets([Link][[1]]), "Table 1")
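As a self-contained illustration, the sketch below applies str_subset to the sheet names listed for the first file, then wraps the lookup in a small function. `find_sheet` is a hypothetical name chosen for this example, and the demonstration uses the example workbook bundled with readxl rather than the baby names files.

```r
library(stringr)
library(readxl)

# Sheet names copied from the listing for the first file.
sheets <- c("Contents", "Table 1 - Top 100 boys, E&W",
            "Table 2-Top 10 boys by month", "Table 3 - Boys names - E&W")
table1 <- str_subset(sheets, "Table 1")
table1

# Hypothetical wrapper: names of the sheets in `file` that match `pattern`.
find_sheet <- function(file, pattern) {
  str_subset(excel_sheets(file), pattern)
}

# Tried on the example workbook that ships with readxl.
find_sheet(readxl_example("datasets.xlsx"), "^mt")
```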
Now we can map this new function over our vector of file names.
map([Link],
[Link],
pattern = "Table 1")
## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[20]]
## [1] "Table 1"
We’ll start by reading the data from the first file, just to check that it works. Recall that the actual data starts on row 7, so we want
to skip the first 6 rows.
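library(dplyr, quietly=TRUE)
## Read the "Table 1" sheet of the first file, skipping the 6 rows above the header
tmp <- read_excel([Link][[1]], sheet = str_subset(excel_sheets([Link][[1]]), "Table 1"), skip = 6)
glimpse(tmp)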
## Observations: 59
## Variables: 7
## $ X__1 <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3 <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1 <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...
Exercise 1
1. Write a function that takes a file name as an argument and reads the worksheet containing “Table 1” from that file. Don’t
forget to skip the first 6 rows.
2. Test your function by using it to read one of the boys names Excel files.
3. Use the map function to read data from all the Excel files, using the function you wrote in step 1.
Exercise 1 solution
## 3. Use the `map` function to read data from all the Excel files,
## using the function you wrote in step 1.
boysNames <- map([Link], [Link])
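One possible shape for the reading function in step 1 is sketched below. `read_matching_sheet` is a hypothetical name, not the tutorial's own function, and the demonstration again uses the example workbook bundled with readxl, which has no preamble rows, so skip is set to 0 there.

```r
library(readxl)
library(stringr)
library(purrr)

# Hypothetical helper: read the sheet whose name matches `pattern`,
# skipping `skip` rows above the header row.
read_matching_sheet <- function(file, pattern = "Table 1", skip = 6) {
  sheet <- str_subset(excel_sheets(file), pattern)
  # Stop early if the match is missing or ambiguous.
  stopifnot(length(sheet) == 1)
  read_excel(file, sheet = sheet, skip = skip)
}

# Demonstrated on readxl's bundled example workbook.
example_file <- readxl_example("datasets.xlsx")
mtcars_sheet <- read_matching_sheet(example_file, pattern = "mtcars", skip = 0)

# The same function can be mapped over several files.
all_sheets <- map(c(example_file, example_file), read_matching_sheet,
                  pattern = "mtcars", skip = 0)
```

Checking that exactly one sheet matches guards against a workbook where the pattern is missing or appears in more than one sheet name.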
Data cleanup
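Now that we’ve read in the data we still have some cleanup to do. Specifically, we need to fix the column names, keep only the
name and count columns, drop rows with missing values, and stack the two halves of each table into a single pair of Name and
Count columns.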
There are many ways to do this kind of data manipulation in R. We’re going to use the dplyr and tidyr packages to make our lives
easier. (Both packages were installed as dependencies of the tidyverse package.)
names(boysNames[[1]])
So we need to a) make sure each column has a name, and b) distinguish between the first and second occurrences of “Name”
and “Count”. We could do this step-by-step, but there is a handy function in R called [Link] that will do it for us.
names(boysNames[[1]])
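Assuming the function the text has in mind is base R's `make.names`, its effect can be seen on a small vector of column names. The names below are illustrative, chosen to have the same kinds of problems (blank and repeated names), and are not read from the actual files.

```r
# Column names similar in shape to a raw sheet header: some blank,
# some repeated (illustrative values only).
raw_names <- c("", "Name", "Count", "", "", "Name", "Count")

# make.names() replaces empty or invalid names with syntactically
# valid ones; unique = TRUE also disambiguates repeats.
fixed_names <- make.names(raw_names, unique = TRUE)
fixed_names
```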
Selecting columns
Next we want to retain just the Name, Name__1, Count, and Count__1 columns. We can do that using the select function:
boysNames[[1]]
## # A tibble: 59 x 7
## X__1 Name Count X__2 X__3 Name__1 Count__1
## <chr> <chr> <dbl> <lgl> <dbl> <chr> <dbl>
## 1 <NA> <NA> NA NA NA <NA> NA
## 2 1 JACK 10779 NA 51 DOMINIC 1519
## 3 2 DANIEL 10338 NA 52 NICHOLAS 1385
## 4 3 THOMAS 9603 NA 53 BRANDON 1337
## 5 4 JAMES 9385 NA 54 RHYS 1259
## 6 5 JOSHUA 7887 NA 55 MARK 1222
## 7 6 MATTHEW 7426 NA 56 MAX 1192
## 8 7 RYAN 6496 NA 57 DYLAN 1186
## 9 8 JOSEPH 6193 NA 58 HENRY 1135
## 10 9 SAMUEL 6161 NA 59 PETER 1128
## # ... with 49 more rows
boysNames[[1]]
## # A tibble: 59 x 4
## Name Name__1 Count Count__1
## <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> NA NA
## 2 JACK DOMINIC 10779 1519
## 3 DANIEL NICHOLAS 10338 1385
## 4 THOMAS BRANDON 9603 1337
## 5 JAMES RHYS 9385 1259
## 6 JOSHUA MARK 7887 1222
## 7 MATTHEW MAX 7426 1192
## 8 RYAN DYLAN 6496 1186
## 9 JOSEPH HENRY 6193 1135
## 10 SAMUEL PETER 6161 1128
## # ... with 49 more rows
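The column selection that produces the four-column table can be sketched on a two-row toy table. The column names and values are copied from the printout; the object names `toy` and `kept` are made up for this example.

```r
library(dplyr)

# Toy table with the same column names as the sheet shown earlier
# (two rows only, values copied from that printout).
toy <- data.frame(
  X__1 = c("1", "2"), Name = c("JACK", "DANIEL"), Count = c(10779, 10338),
  X__2 = c(NA, NA), X__3 = c(51, 52),
  Name__1 = c("DOMINIC", "NICHOLAS"), Count__1 = c(1519, 1385),
  stringsAsFactors = FALSE
)

# Keep only the name and count columns, in the order listed.
kept <- select(toy, Name, Name__1, Count, Count__1)
names(kept)
```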
Finally, we will want to filter out rows with missing values and do this for all the elements in boysNames, a task I leave to you.
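As a hint for the filtering step, here is a minimal sketch on toy data. `drop_missing_rows` is a hypothetical helper, and the two toy tables imitate the blank first row seen in the real sheets.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: drop rows where any of the four columns is missing.
drop_missing_rows <- function(x) {
  filter(x, !is.na(Name), !is.na(Name__1), !is.na(Count), !is.na(Count__1))
}

# Two toy "years", each with a blank first row like the real sheets.
year_a <- data.frame(Name = c(NA, "JACK"), Name__1 = c(NA, "DOMINIC"),
                     Count = c(NA, 10779), Count__1 = c(NA, 1519),
                     stringsAsFactors = FALSE)
year_b <- year_a

# Apply the same cleanup to every table in the list.
cleaned <- map(list(year_a, year_b), drop_missing_rows)
```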
Exercise 2
1. Write a function that takes a [Link] as an argument and returns a modified version including only columns named
“Name”, “Name__1”, “Count”, or “Count__1”.
2. Test your function on one of the data frames in boysNames.
3. Use the map function to apply the function you wrote in step 1 to every data frame in boysNames.
Exercise 2 solution
## 3. Use the `map` function to apply the function you wrote in step 1
## to every data frame in boysNames.
[Link] <- map([Link], [Link])
boysNames[[1]]
## # A tibble: 50 x 4
## Name Name__1 Count Count__1
## <chr> <chr> <dbl> <dbl>
## 1 JACK DOMINIC 10779 1519
## 2 DANIEL NICHOLAS 10338 1385
## 3 THOMAS BRANDON 9603 1337
## 4 JAMES RHYS 9385 1259
## 5 JOSHUA MARK 7887 1222
## 6 MATTHEW MAX 7426 1192
## 7 RYAN DYLAN 6496 1186
## 8 JOSEPH HENRY 6193 1135
## 9 SAMUEL PETER 6161 1128
## 10 LIAM STEPHEN 5802 1122
## # ... with 40 more rows
## # A tibble: 100 x 2
## Name Count
## <chr> <dbl>
## 1 JACK 10779
## 2 DANIEL 10338
## 3 THOMAS 9603
## 4 JAMES 9385
## 5 JOSHUA 7887
## 6 MATTHEW 7426
## 7 RYAN 6496
## 8 JOSEPH 6193
## 9 SAMUEL 6161
## 10 LIAM 5802
## # ... with 90 more rows
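One way to go from the four-column layout to the two-column table is to stack the second pair of columns under the first. The sketch below assumes the column names shown in the printouts; `stack_halves` is a hypothetical helper name.

```r
library(dplyr)

# Toy four-column table (values copied from the first rows printed earlier).
toy <- data.frame(
  Name = c("JACK", "DANIEL"), Name__1 = c("DOMINIC", "NICHOLAS"),
  Count = c(10779, 10338), Count__1 = c(1519, 1385),
  stringsAsFactors = FALSE
)

# Hypothetical helper: put ranks 51-100 underneath ranks 1-50.
stack_halves <- function(x) {
  first  <- select(x, Name, Count)
  # select() can rename while selecting: new_name = old_name.
  second <- select(x, Name = Name__1, Count = Count__1)
  bind_rows(first, second)
}

stacked <- stack_halves(toy)
```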
Exercise prototype
There are different ways you can go about it. Here is one:
Right now we have a list of tables, one for each year. This is not a bad way to go. It has the advantage of making it easy to work
with individual years without needing to load data from other years. We can store the data organized by year in .csv files, .rds
files, or in database tables. For now let’s store these data in .csv files and then see how easy it is to work with them.
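Writing one .csv file per year can be sketched with walk2 from purrr and write_csv from readr. The tables, counts, and file names below are made up for illustration, and the files go to a temporary directory rather than a project folder.

```r
library(purrr)
library(readr)

# Toy per-year tables (made-up names and counts), named by year.
by_year <- list(
  "1996" = data.frame(Name = c("A", "B"), Count = c(2, 1)),
  "1997" = data.frame(Name = c("A", "B"), Count = c(3, 2))
)

# Write into a temporary directory here; a real run might use a
# directory such as data/byyear instead.
out_dir <- file.path(tempdir(), "byyear")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
paths <- file.path(out_dir, paste0("boys_names_", names(by_year), ".csv"))

# walk2() pairs each table with its path and calls write_csv() for the
# side effect of writing the file.
walk2(by_year, paths, write_csv)
```

walk2 is used instead of map2 because only the side effect (the written file) matters, not a return value.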
——————-
Exercise prototype
Number one is easy, number two is harder:
## # A tibble: 5 x 4
## Name Count Length Year
## <chr> <int> <int> <int>
## 1 OLIVER 6949 6 2013
## 2 JACK 6212 4 2013
## 3 HARRY 5888 5 2013
## 4 JACOB 5126 5 2013
## 5 CHARLIE 5039 7 2013
## 2. How has the popularity of the name "ANDREW" changed over time?
boysNames <- map([Link]("./data/byyear", [Link] = TRUE),
read_csv)
Create a directory under data/all and write the data to a .csv file.
Finally, repeat the previous exercise, this time working with the data in one big table.
[Link]("data/all")
boysNames <- bind_rows(boysNames)  ## combine the list of yearly tables into one table
write_csv(boysNames, "data/all/boys_names.csv")
## # A tibble: 5 x 4
## Name Count Length Year
## <chr> <dbl> <int> <int>
## 1 OLIVER 6949 6 2013
## 2 JACK 6212 4 2013
## 3 HARRY 5888 5 2013
## 4 JACOB 5126 5 2013
## 5 CHARLIE 5039 7 2013
## How has the popularity of the name "ANDREW" changed over time?
andrew <- filter(boysNames, Name == "ANDREW")
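With the data in one table, the trend for a single name can be plotted directly. The sketch below uses made-up counts purely to show the shape of the code; it assumes the Name, Count, and Year columns shown in the printouts and uses ggplot2, which is part of the tidyverse.

```r
library(dplyr)
library(ggplot2)

# Made-up counts, only to show the shape of the code.
toy <- data.frame(
  Name  = rep(c("ANDREW", "JACK"), each = 3),
  Count = c(30, 20, 10, 50, 55, 60),
  Year  = rep(c(1996L, 1997L, 1998L), times = 2),
  stringsAsFactors = FALSE
)

andrew_toy <- filter(toy, Name == "ANDREW")

# One point per year, joined by a line.
p <- ggplot(andrew_toy, aes(x = Year, y = Count)) +
  geom_line() +
  geom_point()
```

Printing `p` draws the plot; the same call applied to the real `andrew` table would show the actual trend.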