
R Data Management

Workshop description
Prerequisites and Preparation
Example project overview
Problems with the data
Useful data manipulation packages
Working with Excel worksheets
Reading Excel data files
Data cleanup
Adding derived columns
Data organization and storage
Additional reading and resources

Workshop description

Data scientists are known and celebrated for modeling and visually displaying information, but down in the data science engine room there is a lot of less glamorous work to be done. Before data can be used effectively it must often be cleaned, corrected, and reformatted. This workshop introduces the basic tools needed to make your data behave, including data reshaping, regular expressions, and other text manipulation tools.

Prerequisites and Preparation

Prior to the workshop you should:

install R from [Link]
install RStudio from [Link]
install the tidyverse package in R with install.packages("tidyverse")

The lesson notes are available at [Link]

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we recommend the IQSS R intro.

Example project overview


It is common for data to be made available on a website somewhere, whether by a government agency, research group, or other organization. Often the data you want is spread over many files, and retrieving it one file at a time is tedious and time-consuming. Such is the case with the baby names data we will be using today.

The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is
provided separately for boys and girls and is stored in Excel spreadsheets.

I have downloaded all the Excel files containing boys names data from
[Link]
and made them available at [Link]

Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several
things that make this challenging.

Problems with the data


While it was good of the UK Office for National Statistics to provide baby name data, they were not very diligent about arranging it
in a convenient or consistent format.

Exercise 0
1. Download and extract the data from [Link]

## You can download the file using a web browser, and extract using your file browser.
## For bonus points you can try doing it using R, but this is not required.

2. Locate the file named 1996boys_tcm77-[Link] and open it in a spreadsheet. (If you don’t have a spreadsheet
program installed on your computer you can download one from [Link].) What
issues can you identify that might make working with these data more difficult?

3. Locate the file named [Link] and open it in a spreadsheet. In what ways is the format different from
the format of 1996boys_tcm77-[Link]? How might these differences make it more difficult to work with these
data?

Exercise 0 solution
1. Download and extract the data from
[Link]

2. Locate the file named 1996boys_tcm77-[Link] and open it in a spreadsheet. (If
you don’t have a spreadsheet program installed on your computer you can download one
from [Link].) What issues can you identify that
might make working with these data more difficult?

The data does not start on row one. Headers are on row 7, followed by a blank line, followed by the actual data.

The data is stored in an inconvenient way, with ranks 1-50 in the first set of columns and ranks 51-100 in a separate set of
columns.

There are notes below the data.

3. Locate the file named [Link] and open it in a spreadsheet. In what
ways is the format different from the format of 1996boys_tcm77-[Link]? How
might these differences make it more difficult to work with these data?

The worksheet containing the data of interest is in different positions and has different names from one year to the next. However,
it always includes “Table 1” in the worksheet name.

Some years include columns for “changes in rank”, others do not.

These differences will make it more difficult to automate re-arranging the data since we have to write code that can handle different
input formats.



Useful data manipulation packages
As you can see, the data is in quite a messy state. Note that this is not a contrived example; this is exactly the way the data came
to us from the UK government website! Let’s start cleaning and organizing it. The tidyverse suite of packages provides many
modern conveniences that will make this job easier.

library(tidyverse)

Working with Excel worksheets


Each Excel file contains a worksheet with the baby names data we want. Each file also contains additional supplemental
worksheets that we are not currently interested in. As noted above, the worksheet of interest differs from year to year, but always
has “Table 1” in the sheet name.

The first step is to get a vector of file names.

boy.file.names <- list.files("data/boys", full.names = TRUE)

Now that we’ve told R the names of the data files we can start working with them. For example, the first file is

boy.file.names[[1]]

## [1] "data/boys/1996boys_tcm77-[Link]"

and we can use the excel_sheets function from the readxl package to list the worksheet names from this file.

library(readxl)

excel_sheets(boy.file.names[[1]])

## [1] "Contents" "Table 1 - Top 100 boys, E&W"


## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"

Iterating over file names with map


Now that we know how to retrieve the names of the worksheets in an Excel file we could start writing code to extract the sheet
names from each file, e.g.,

excel_sheets(boy.file.names[[1]])



## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"

excel_sheets(boy.file.names[[2]])

## [1] "Contents" "Table 1 - Top 100 boys, E&W"


## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"

## ...
excel_sheets(boy.file.names[[20]])

## [1] "Contents" "Metadata" "Terms and Conditions"


## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"

This is not a terrible idea for a small number of files, but it is more convenient to let R do the iteration for us. We could use a for
loop or sapply, but the map family of functions from the purrr package gives us a more consistent alternative, so we’ll use that.
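Before pointing map at our files, here is its behavior on a toy numeric vector (unrelated to the baby names data): it applies the function to each element and always returns a list, one element per input.

```r
library(purrr)

## map() applies a function to each element of its input and
## always returns a list, one element per input
map(c(4, 9, 16), sqrt)
## a list of three elements: 2, 3, 4
```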

library(purrr)
map(boy.file.names, excel_sheets)

## [[1]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
##
## [[2]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[3]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[4]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[5]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[6]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[7]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[8]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[9]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[10]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[11]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[12]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[13]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[14]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[15]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[16]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[17]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[18]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[19]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[20]]
## [1] "Contents" "Metadata" "Terms and Conditions"
## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"

Filtering strings using regular expressions


In order to extract the correct worksheet names we need a way to find strings containing “Table 1”. Base R provides some string
manipulation capabilities (see ?regex, ?sub and ?grep), but we will use the stringr package because it is more user-friendly.

The stringr package provides functions to detect, locate, extract, match, replace, combine and split strings (among other things).

Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the str_subset
function. The first argument to str_subset is the character vector we want to search in. The second argument is a regular
expression matching the pattern we want to retain.

If you are not familiar with regular expressions, [Link] is a good place to start.
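As a quick, self-contained illustration (using a made-up vector of sheet names rather than the real files), str_detect reports which elements match, while str_subset keeps only the matching elements:

```r
library(stringr)

## a made-up vector of worksheet names, for illustration only
sheets <- c("Contents", "Table 1 - Top 100 boys, E&W", "Table 2")

str_detect(sheets, "Table 1")  # FALSE TRUE FALSE
str_subset(sheets, "Table 1")  # "Table 1 - Top 100 boys, E&W"
```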

Now that we know how to filter character vectors using str_subset we can identify the correct sheet in a particular Excel file.
For example,

library(stringr)
str_subset(excel_sheets(boy.file.names[[1]]), "Table 1")

## [1] "Table 1 - Top 100 boys, E&W"

Writing your own functions


The map* functions are useful when you want to apply a function to each element of a list or vector and collect the return values. This is
very convenient when a function already exists that does exactly what you want. In the examples above we mapped the
excel_sheets function over the elements of a vector containing file names. But in this case no existing function both retrieves
worksheet names and subsets them. Fortunately, writing functions in R is easy.

get.data.sheet.name <- function(file, pattern) {
  str_subset(excel_sheets(file), pattern)
}

Now we can map this new function over our vector of file names.

map(boy.file.names,
    get.data.sheet.name,
    pattern = "Table 1")

## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[20]]
## [1] "Table 1"

Reading Excel data files


Now that we know the correct worksheet from each file we can actually read those data into R. We can do that using the
read_excel function.

We’ll start by reading the data from the first file, just to check that it works. Recall that the actual data starts on row 7, so we want
to skip the first 6 rows.

tmp <- read_excel(
  boy.file.names[1],
  sheet = get.data.sheet.name(boy.file.names[1],
                              pattern = "Table 1"),
  skip = 6)

library(dplyr, quietly=TRUE)
glimpse(tmp)

## Observations: 59
## Variables: 7
## $ X__1 <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3 <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1 <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...

Exercise 1
1. Write a function that takes a file name as an argument and reads the worksheet containing “Table 1” from that file. Don’t
forget to skip the first 6 rows.

2. Test your function by using it to read one of the boys names Excel files.

3. Use the map function to read data from all the Excel files, using the function you wrote in step 1.



Exercise 1 solution

## 1. Write a function that takes a file name as an argument and reads
## the worksheet containing "Table 1" from that file.
read.boys.names <- function(file) {
  sheet.name <- str_subset(excel_sheets(file), "Table 1")
  read_excel(file, sheet = sheet.name, skip = 6)
}

## 2. Test your function by using it to read *one* of the boys names
## Excel files.
glimpse(read.boys.names(boy.file.names[1]))

## 3. Use the `map` function to read data from all the Excel files,
## using the function you wrote in step 1.
boysNames <- map(boy.file.names, read.boys.names)

Data cleanup
Now that we’ve read in the data we still have some cleanup to do. Specifically, we need to:

1. fix column names
2. get rid of the blank row at the top and the notes at the bottom
3. get rid of extraneous “changes in rank” columns if they exist
4. transform the side-by-side tables layout into a single table.

In short, we want to go from this:



messy

to this:

tidy

There are many ways to do this kind of data manipulation in R. We’re going to use the dplyr and tidyr packages to make our lives
easier. (Both packages were installed as dependencies of the tidyverse package.)

Fixing column names


The column names are in bad shape. In R we need column names to a) start with a letter, b) contain only letters, numbers,
underscores and periods, and c) uniquely identify each column.

The actual column names look like this:

names(boysNames[[1]])

## [1] "X__1" "Name" "Count" "X__2" "X__3" "Name__1"


## [7] "Count__1"

So we need to a) make sure each column has a name, and b) distinguish between the first and second occurrences of “Name”
and “Count”. We could do this step-by-step, but there is a handy function in R called make.names that will do it for us.
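To see what make.names does, here it is applied to a small vector that mimics our problem: empty names and duplicated names (the vector itself is made up for illustration):

```r
## empty names get a placeholder; with unique = TRUE, duplicated
## names get a numeric suffix
make.names(c("", "Name", "Count", "Name", "Count"), unique = TRUE)
## "X" "Name" "Count" "Name.1" "Count.1"
```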

boysNames[[1]] <- setNames(
  boysNames[[1]],
  make.names(names(boysNames[[1]]),
             unique = TRUE))

names(boysNames[[1]])

## [1] "X__1" "Name" "Count" "X__2" "X__3" "Name__1"


## [7] "Count__1"

Fixing all the names


Of course we need to iterate over each data.frame in the boysNames list and do this for each one. Fortunately the map function
makes this easy.



boysNames <- map(
  boysNames,
  function(x) {
    setNames(x,
             make.names(names(x),
                        unique = TRUE))
  }
)

Selecting columns
Next we want to retain just the Name , Name__1 and Count , Count__1 columns. We can do that using the select function:

boysNames[[1]]

## # A tibble: 59 x 7
## X__1 Name Count X__2 X__3 Name__1 Count__1
## <chr> <chr> <dbl> <lgl> <dbl> <chr> <dbl>
## 1 <NA> <NA> NA NA NA <NA> NA
## 2 1 JACK 10779 NA 51 DOMINIC 1519
## 3 2 DANIEL 10338 NA 52 NICHOLAS 1385
## 4 3 THOMAS 9603 NA 53 BRANDON 1337
## 5 4 JAMES 9385 NA 54 RHYS 1259
## 6 5 JOSHUA 7887 NA 55 MARK 1222
## 7 6 MATTHEW 7426 NA 56 MAX 1192
## 8 7 RYAN 6496 NA 57 DYLAN 1186
## 9 8 JOSEPH 6193 NA 58 HENRY 1135
## 10 9 SAMUEL 6161 NA 59 PETER 1128
## # ... with 49 more rows

boysNames[[1]] <- select(boysNames[[1]], Name, Name__1, Count, Count__1)


boysNames[[1]]



## # A tibble: 59 x 4
## Name Name__1 Count Count__1
## <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> NA NA
## 2 JACK DOMINIC 10779 1519
## 3 DANIEL NICHOLAS 10338 1385
## 4 THOMAS BRANDON 9603 1337
## 5 JAMES RHYS 9385 1259
## 6 JOSHUA MARK 7887 1222
## 7 MATTHEW MAX 7426 1192
## 8 RYAN DYLAN 6496 1186
## 9 JOSEPH HENRY 6193 1135
## 10 SAMUEL PETER 6161 1128
## # ... with 49 more rows

Dropping missing values


Next we want to remove blank rows and rows used for notes. An easy way to do that is to use drop_na to remove rows with
missing values.
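Here is drop_na at work on a small made-up tibble; any row containing at least one NA is removed:

```r
library(tidyr)
library(tibble)

## a made-up tibble with one blank row, for illustration only
toy <- tibble(Name  = c(NA, "JACK", "DANIEL"),
              Count = c(NA, 10779, 10338))

drop_na(toy)  # only the two complete rows remain
```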

boysNames[[1]]

## # A tibble: 59 x 4
## Name Name__1 Count Count__1
## <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> NA NA
## 2 JACK DOMINIC 10779 1519
## 3 DANIEL NICHOLAS 10338 1385
## 4 THOMAS BRANDON 9603 1337
## 5 JAMES RHYS 9385 1259
## 6 JOSHUA MARK 7887 1222
## 7 MATTHEW MAX 7426 1192
## 8 RYAN DYLAN 6496 1186
## 9 JOSEPH HENRY 6193 1135
## 10 SAMUEL PETER 6161 1128
## # ... with 49 more rows

boysNames[[1]] <- drop_na(boysNames[[1]])


boysNames[[1]]



## # A tibble: 50 x 4
## Name Name__1 Count Count__1
## <chr> <chr> <dbl> <dbl>
## 1 JACK DOMINIC 10779 1519
## 2 DANIEL NICHOLAS 10338 1385
## 3 THOMAS BRANDON 9603 1337
## 4 JAMES RHYS 9385 1259
## 5 JOSHUA MARK 7887 1222
## 6 MATTHEW MAX 7426 1192
## 7 RYAN DYLAN 6496 1186
## 8 JOSEPH HENRY 6193 1135
## 9 SAMUEL PETER 6161 1128
## 10 LIAM STEPHEN 5802 1122
## # ... with 40 more rows

Finally, we will want to do this for all the elements in boysNames, a task I leave to you.

Exercise 2
1. Write a function that takes a data.frame as an argument and returns a modified version including only columns named
“Name”, “Name__1”, “Count”, or “Count__1”.

2. Test your function by using it on *one* of the data.frames in boysNames.

3. Use the map function to select the desired columns from all the data.frames, using the function you wrote in step 1.

Exercise 2 solution

## 1. Write a function that takes a data.frame as an argument and
## returns only the columns named "Name", "Name__1", "Count", or
## "Count__1".
keep.name.count <- function(data) {
  select(data, Name, Name__1, Count, Count__1)
}

## 2. Test your function by using it on *one* of the data.frames in
## boysNames.
glimpse(keep.name.count(boysNames[[1]]))

## 3. Use the `map` function to select the desired columns from all
## the data.frames, using the function you wrote in step 1.
boysNames <- map(boysNames, keep.name.count)

Re-arranging into a single table


Our final task is to re-arrange the data so that it is all in a single table instead of in two side-by-side tables. For many similar tasks
the gather function in the tidyr package is useful, but in this case we will be better off using a combination of select and
bind_rows.
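On a tiny made-up version of the side-by-side layout, the trick is to select each pair of columns (renaming the second pair as we go) and then stack the results with bind_rows:

```r
library(dplyr)
library(tibble)

## a made-up two-row version of the side-by-side layout
side.by.side <- tibble(
  Name     = c("JACK", "DANIEL"),
  Count    = c(10779, 10338),
  Name__1  = c("DOMINIC", "NICHOLAS"),
  Count__1 = c(1519, 1385))

## select each pair, renaming the second pair, then stack
bind_rows(select(side.by.side, Name, Count),
          select(side.by.side, Name = Name__1, Count = Count__1))
```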

boysNames[[1]]

## # A tibble: 50 x 4
## Name Name__1 Count Count__1
## <chr> <chr> <dbl> <dbl>
## 1 JACK DOMINIC 10779 1519
## 2 DANIEL NICHOLAS 10338 1385
## 3 THOMAS BRANDON 9603 1337
## 4 JAMES RHYS 9385 1259
## 5 JOSHUA MARK 7887 1222
## 6 MATTHEW MAX 7426 1192
## 7 RYAN DYLAN 6496 1186
## 8 JOSEPH HENRY 6193 1135
## 9 SAMUEL PETER 6161 1128
## 10 LIAM STEPHEN 5802 1122
## # ... with 40 more rows

bind_rows(select(boysNames[[1]], Name, Count),


select(boysNames[[1]], Name = Name__1, Count = Count__1))

## # A tibble: 100 x 2
## Name Count
## <chr> <dbl>
## 1 JACK 10779
## 2 DANIEL 10338
## 3 THOMAS 9603
## 4 JAMES 9385
## 5 JOSHUA 7887
## 6 MATTHEW 7426
## 7 RYAN 6496
## 8 JOSEPH 6193
## 9 SAMUEL 6161
## 10 LIAM 5802
## # ... with 90 more rows

Exercise 3: Cleanup all the data


In the previous examples we learned how to drop empty rows with filter, select only the relevant columns with select, and re-arrange
our data with select and bind_rows. In each case we applied the changes only to the first element of our
boysNames list.



Your task now is to use the map function to apply each of these transformations to all the elements in boysNames .

Exercise prototype
There are different ways you can go about it. Here is one:

## write a function that does all the cleanup


cleanupNamesData <- function(x) {
  filtered <- filter(x, !is.na(Name))  # drop rows with no Name value
  selected <- select(filtered, Name, Count, Name__1, Count__1)  # keep just the Name and Count columns

  bind_rows(select(selected, Name, Count),  # re-arrange into two columns
            select(selected, Name = Name__1, Count = Count__1))
}

## test it out on the second data.frame in the list


glimpse(boysNames[[2]]) # before cleanup
glimpse(cleanupNamesData(boysNames[[2]])) # after cleanup

## apply the cleanup function to all the data.frames in the list


boysNames <- map(boysNames, cleanupNamesData)

Adding derived columns


It is often useful to add columns that are derived from one or more existing columns. For example, we may wish to add a column
to store the length of each name:

boysNames <- map(boysNames, mutate, Length = str_count(Name))
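The same mutate call works on any data.frame with a Name column; here it is on a made-up tibble:

```r
library(dplyr)
library(stringr)
library(tibble)

## a made-up tibble, for illustration only
toy <- tibble(Name = c("JACK", "OLIVER"))

## str_count() with no pattern counts characters, so Length holds
## the number of letters in each name
mutate(toy, Length = str_count(Name))
```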

Exercise 4: Add a year column


We originally read the data from each file listed in boy.file.names, and the data is still in that order. Use the information
contained in boy.file.names to add a Year column to each table in boysNames. (Hint: see ?map2.)
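If map2 is new to you, here is its behavior on toy inputs: it walks over two inputs of the same length in parallel, passing matching elements to the function.

```r
library(purrr)

## map2() pairs up matching elements of its two inputs
map2(list(1, 2, 3), list(10, 20, 30), function(x, y) x + y)
## a list of three elements: 11, 22, 33
```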

Exercise prototype
There are different ways you can go about it. Here is one:



## Extract years
years <- as.integer(str_extract(boy.file.names, "[0-9]{4}"))

## Insert year column in each table


boysNames <- map2(boysNames, years, function(x, y) mutate(x, Year = y))

Data organization and storage


Now that we have the data cleaned up and augmented, we can turn our attention to organizing and storing the data.


One table for each year


Right now we have a list of tables, one for each year. This is not a bad way to go. It has the advantage of making it easy to work
with individual years without needing to load data from other years. It has the disadvantage of making it more difficult to examine
questions that require data from multiple years.

We can store the data organized by year in .csv files, .rds files, or in database tables. For now let’s store these data in .csv
files and then see how easy it is to work with them.
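As a self-contained sketch of the write-then-read round trip (using a temporary file and a made-up table instead of the real data directory):

```r
library(readr)
library(tibble)

## a made-up one-year table, for illustration only
toy <- tibble(Name  = c("JACK", "DANIEL"),
              Count = c(10779, 10338),
              Year  = c(1996, 1996))

path <- file.path(tempdir(), "boys_names_1996.csv")
write_csv(toy, path)      # store one year as a .csv file
reread <- read_csv(path)  # ...and read it back
```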

## make a directory to store the data
dir.create("./data/byyear", recursive = TRUE)

## Warning in dir.create("./data/byyear", recursive = TRUE): './data/byyear'
## already exists

## extract the years


years <- map_int(boysNames, function(x) unique(x$Year))
## construct paths
paths <- str_c("data/byyear/boys_names_", years, ".csv", sep = "")
## write out the data
walk2(boysNames, paths, write_csv)

## clear our workspace


rm(list = ls())

Exercise: work with tables organized by year


1. What were the five most popular names in 2013?



2. How has the popularity of the name “ANDREW” changed over time?

Exercise prototype
Number one is easy, number two is harder:

## 1. What were the five most popular names in 2013?


boys2013 <- read_csv("./data/byyear/boys_names_2013.csv")

## Parsed with column specification:


## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )

slice(arrange(boys2013, desc(Count)), 1:5)

## # A tibble: 5 x 4
## Name Count Length Year
## <chr> <int> <int> <int>
## 1 OLIVER 6949 6 2013
## 2 JACK 6212 4 2013
## 3 HARRY 5888 5 2013
## 4 JACOB 5126 5 2013
## 5 CHARLIE 5039 7 2013

## 2. How has the popularity of the name "ANDREW" changed over time?
boysNames <- map(list.files("./data/byyear", full.names = TRUE),
                 read_csv)



## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )

## Parsed with column specification:


## cols(
## Name = col_character(),
## Count = col_double(),
## Length = col_integer(),
## Year = col_integer()
## )

## Parsed with column specification:


## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )



andrew <- map(boysNames, filter, Name == "ANDREW")
andrew <- bind_rows(andrew)

ggplot(andrew, aes(x = Year, y = Count)) +
  geom_line() +
  ggtitle("Popularity of \"Andrew\", over time")

One big table


By far the easiest approach is to store the data in one big table. We’ve already seen how we can combine a list of tables into one
big one.

Exercise: Make one big table


Turn the list of boys names data.frames into a single table.

Create a directory under data/all and write the data to a .csv file.

Finally, repeat the previous exercise, this time working with the data in one big table.

Exercise prototype
Working with the data in one big table is often easier.

boysNames <- bind_rows(boysNames)

dir.create("data/all")

## Warning in dir.create("data/all"): 'data/all' already exists

write_csv(boysNames, "data/all/boys_names.csv")

## What were the five most popular names in 2013?


slice(arrange(filter(boysNames, Year == 2013),
desc(Count)),
1:5)

## # A tibble: 5 x 4
## Name Count Length Year
## <chr> <dbl> <int> <int>
## 1 OLIVER 6949 6 2013
## 2 JACK 6212 4 2013
## 3 HARRY 5888 5 2013
## 4 JACOB 5126 5 2013
## 5 CHARLIE 5039 7 2013

## How has the popularity of the name "ANDREW" changed over time?
andrew <- filter(boysNames, Name == "ANDREW")

ggplot(andrew, aes(x = Year, y = Count)) +
  geom_line() +
  ggtitle("Popularity of \"Andrew\", over time")



Additional reading and resources
Learn from the best: [Link] [Link]
R documentation: [Link]
Collection of R tutorials: [Link]

R for Programmers (by Norman Matloff, UC–Davis)

[Link]

Calling C and Fortran from R (by Charles Geyer, UMinn)

[Link]

State of the Art in Parallel Computing with R (Schmidberger et al.)

[Link]

Institute for Quantitative Social Science: [Link]


IQSS Data Science Services: [Link]
