Https Tutorials Iq Harvard Edu R RDataManagement RDataManagement HTML

This workshop introduces tools for cleaning, correcting, and reformatting data before modeling or visualization. The data comes from UK birth records spanning 1996-2015, stored across many Excel files with inconsistent formats. Key challenges are extracting the relevant "Table 1" worksheet from each file and handling differences in column names and formats across years. Modern data manipulation packages like tidyverse will help organize the messy data into a consistent format for analysis.

R Data Management

Workshop description

Data scientists are known and celebrated for modeling and visually displaying information, but down in the data science engine room there is a lot of less glamorous work to be done. Before data can be used effectively it must often be cleaned, corrected, and reformatted. This workshop introduces the basic tools needed to make your data behave, including data reshaping, regular expressions and other text manipulation tools.

Prerequisites and Preparation

Prior to the workshop you should:

install R from [Link]
install RStudio from [Link]
install the tidyverse package in R with [Link]("tidyverse")

The lesson notes are available at [Link]

This is an intermediate/advanced R course appropriate for those with basic knowledge of R. If you need a refresher we
recommend the IQSS R intro.

Example project overview

It is common for data to be made available on a website somewhere, either by a government agency, research group, or other
organizations and entities. Often the data you want is spread over many files, and retrieving it all one file at a time is tedious and
time consuming. Such is the case with the baby names data we will be using today.

The UK Office for National Statistics provides yearly data on the most popular baby names going back to 1996. The data is
provided separately for boys and girls and is stored in Excel spreadsheets.

I have downloaded all the Excel files containing boys names data from
[Link]
and made them available at [Link]

Our mission is to extract and graph the top 100 boys names in England and Wales for every year since 1996. There are several
things that make this challenging.

Problems with the data

While it was good of the UK Office for National Statistics to provide baby name data, they were not very diligent about arranging it
in a convenient or consistent format.

Exercise 0
1. Download and extract the data from [Link]

## You can download the file using a web browser, and extract using your file browser.
## For bonus points you can try doing it using R, but this is not required.

2. Locate the file named 1996boys_tcm77-[Link] and open it in a spreadsheet. (If you don’t have a spreadsheet
program installed on your computer you can download one from [Link].) What
issues can you identify that might make working with these data more difficult?

3. Locate the file named [Link] and open it in a spreadsheet. In what ways is the format different than
the format of 1996boys_tcm77-[Link] ? How might these differences make it more difficult to work with these
data?

Exercise 0 solution
1. Download and extract the data from
[Link]

2. Locate the file named 1996boys_tcm77-[Link] and open it in a spreadsheet. (If
you don’t have a spreadsheet program installed on your computer you can download one
from [Link].) What issues can you identify that
might make working with these data more difficult?

The data does not start on row one. Headers are on row 7, followed by a blank line, followed by the actual data.

The data is stored in an inconvenient way, with ranks 1-50 in the first set of columns and ranks 51-100 in a separate set of
columns.

There are notes below the data.

3. Locate the file named [Link] and open it in a spreadsheet. In what
ways is the format different than the format of 1996boys_tcm77-[Link]? How
might these differences make it more difficult to work with these data?

The worksheet containing the data of interest is in different positions and has different names from one year to the next. However,
it always includes “Table 1” in the worksheet name.

Some years include columns for “changes in rank”, others do not.

These differences will make it more difficult to automate re-arranging the data since we have to write code that can handle different
input formats.
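For the optional bonus in step 1, here is a minimal sketch of downloading and extracting an archive from within R. The helper name fetch_and_extract and its arguments are hypothetical, and the archive address is elided in the text, so the call is shown commented out with a placeholder. download.file and unzip are base R functions.

```r
# Hypothetical helper: download a zip archive and extract it into dest_dir.
fetch_and_extract <- function(url, dest_dir = "data") {
  zip_path <- tempfile(fileext = ".zip")
  # mode = "wb" keeps the binary archive intact on Windows
  download.file(url, destfile = zip_path, mode = "wb")
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  # unzip() returns the paths of the extracted files
  unzip(zip_path, exdir = dest_dir)
}

# Usage (substitute the real archive address from step 1):
# extracted <- fetch_and_extract("<archive URL>", dest_dir = "data")
```

Downloading to a temporary file keeps the archive out of the project directory, so only the extracted spreadsheets end up under data.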


Useful data manipulation packages
As you can see, the data is in quite a messy state. Note that this is not a contrived example; this is exactly the way the data came
to us from the UK government website! Let’s start cleaning and organizing it. The tidyverse suite of packages provides many
modern conveniences that will make this job easier.

library(tidyverse)

Working with Excel worksheets

Each Excel file contains a worksheet with the baby names data we want. Each file also contains additional supplemental
worksheets that we are not currently interested in. As noted above, the worksheet of interest differs from year to year, but always
has “Table 1” in the sheet name.

The first step is to get a vector of file names.

[Link] <- [Link]("data/boys", [Link] = TRUE)
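Some identifiers in the listing above did not survive extraction. As a rough sketch of the same step, assuming base R's list.files with full.names = TRUE is what is intended, the following runs against a throwaway directory. The directory, the file names and the variable name boy_files are all made up for illustration.

```r
# Build a throwaway directory with three placeholder files (names are made up).
demo_dir <- file.path(tempdir(), "boys_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("1996boys.xls", "1997boys.xls", "1998boys.xls")))

# full.names = TRUE returns paths that include the directory,
# so each element can be passed straight to a reader function.
boy_files <- list.files(demo_dir, full.names = TRUE)

basename(boy_files)  # file names only, in alphabetical order
boy_files[[1]]       # full path to the first file
```

Without full.names = TRUE the result would contain bare file names, which only resolve if the working directory happens to be the data directory.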

Now that we’ve told R the names of the data files we can start working with them. For example, the first file is

[Link][[1]]

## [1] "data/boys/1996boys_tcm77-[Link]"

and we can use the excel_sheets function from the readxl package to list the worksheet names from this file.

library(readxl)

excel_sheets([Link][[1]])

## [1] "Contents" "Table 1 - Top 100 boys, E&W"

## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"

Iterating over file names with map

Now that we know how to retrieve the names of the worksheets in an Excel file we could start writing code to extract the sheet
names from each file, e.g.,

excel_sheets([Link][[1]])


## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"

excel_sheets([Link][[2]])

## [1] "Contents" "Table 1 - Top 100 boys, E&W"

## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"

## ...
excel_sheets([Link][[20]])

## [1] "Contents" "Metadata" "Terms and Conditions"

## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"

This is not a terrible idea for a small number of files, but it is more convenient to let R do the iteration for us. We could use a for
loop, or sapply , but the map family of functions from the purrr package gives us a more consistent alternative, so we’ll use that.

library(purrr)
map([Link], excel_sheets)

## [[1]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2-Top 10 boys by month" "Table 3 - Boys names - E&W"
##
## [[2]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[3]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[4]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[5]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[6]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[7]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[8]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[9]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[10]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[11]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[12]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[13]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[14]]
## [1] "Contents" "Table 1 - Top 100 boys' names"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[15]]
## [1] "Contents" "Table 1 - Top 100 boys, E&W"
## [3] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [5] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [7] "Table 6 - Boys names - E&W"
##
## [[16]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[17]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[18]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[19]]
## [1] "Contents" "Metadata"
## [3] "Terms and Conditions" "Table 1 - Top 100 boys, E&W"
## [5] "Table 2 - Top 100 boys, England" "Table 3 - Top 100 boys, Wales"
## [7] "Table 4 - Top 10 boys by region" "Table 5 - Top 10 boys by month"
## [9] "Table 6 - Boys names - E&W" "Related Publications"
##
## [[20]]
## [1] "Contents" "Metadata" "Terms and Conditions"
## [4] "Table 1" "Table 2" "Table 3"
## [7] "Table 4" "Table 5" "Table 6"
## [10] "Related Publications"
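To make the comparison between a for loop, sapply and map concrete, here is a small sketch that counts worksheets per file. It uses a made-up stand-in list (sheets_by_file) instead of real Excel files so that it runs anywhere.

```r
library(purrr)

# Stand-in for the per-file sheet names (made-up, shortened values).
sheets_by_file <- list(
  f1996 = c("Contents", "Table 1", "Table 2"),
  f2015 = c("Contents", "Metadata", "Table 1", "Table 2")
)

# 1. for loop: pre-allocate the result, then fill it in.
n_loop <- integer(length(sheets_by_file))
for (i in seq_along(sheets_by_file)) {
  n_loop[i] <- length(sheets_by_file[[i]])
}

# 2. sapply: shorter, but the type of the result depends on what comes back.
n_sapply <- sapply(sheets_by_file, length)

# 3. map always returns a list; map_int returns an integer vector or fails.
n_map <- map(sheets_by_file, length)
n_map_int <- map_int(sheets_by_file, length)
```

The typed variants (map_int, map_chr and friends) are what make the purrr family predictable: the shape of the result is fixed by the function you call, not by the data.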

Filtering strings using regular expressions

In order to extract the correct worksheet names we need a way to extract strings containing “Table 1”. Base R provides some string
manipulation capabilities (see ?regex , ?sub and ?grep ), but we will use the stringr package because it is more user-friendly.

The stringr package provides functions to detect, locate, extract, match, replace, combine and split strings (among other things).

Here we want to detect the pattern “Table 1”, and only return elements with this pattern. We can do that using the str_subset
function. The first argument to str_subset is the character vector we want to search in. The second argument is a regular
expression matching the pattern we want to retain.

If you are not familiar with regular expressions, [Link] is a good place to start.

Now that we know how to filter character vectors using str_subset we can identify the correct sheet in a particular Excel file.
For example,

library(stringr)
str_subset(excel_sheets([Link][[1]]), "Table 1")

## [1] "Table 1 - Top 100 boys, E&W"
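One thing to keep in mind is that the pattern is a regular expression matched anywhere in the string, so “Table 1” would also match a sheet called “Table 10”. None of the files shown here has such a sheet; the fourth sheet name below is made up to illustrate the point. This sketch shows a stricter pattern using an anchor and a word boundary.

```r
library(stringr)

sheets <- c("Contents",
            "Table 1 - Top 100 boys, E&W",
            "Table 2-Top 10 boys by month",
            "Table 10 - made-up extra sheet")

# Plain pattern: matches "Table 1" anywhere, including inside "Table 10".
loose <- str_subset(sheets, "Table 1")

# Anchored pattern: must start with "Table 1" followed by a word boundary.
strict <- str_subset(sheets, "^Table 1\\b")

# str_detect returns TRUE/FALSE per element instead of the matching strings.
flags <- str_detect(sheets, "Table 1")
```

For the workbooks listed above the plain pattern is enough, but the anchored form is a cheap safeguard if later files add more tables.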

Writing your own functions

The map* functions are useful when you want to apply a function to a list or vector of inputs and obtain the return values. This is
very convenient when a function already exists that does exactly what you want. In the examples above we mapped the
excel_sheets function to the elements of a vector containing file names. But now there is no function that both retrieves
worksheet names and subsets them. Fortunately, writing functions in R is easy.

[Link] <- function(file, pattern) {

str_subset(excel_sheets(file), pattern)
}

Now we can map this new function over our vector of file names.

map([Link],
[Link],
pattern = "Table 1")

## [[1]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[2]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[3]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[4]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[5]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[6]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[7]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[8]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[9]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[10]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[11]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[12]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[13]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[14]]
## [1] "Table 1 - Top 100 boys' names"
##
## [[15]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[16]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[17]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[18]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[19]]
## [1] "Table 1 - Top 100 boys, E&W"
##
## [[20]]
## [1] "Table 1"
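Since each file yields exactly one matching sheet name, a flat character vector may be more convenient than a list. A minimal sketch, using made-up stand-in data and a hypothetical helper name (pick_sheet) in place of the real files:

```r
library(purrr)
library(stringr)

# Stand-in for the sheet names of two files (shortened for illustration).
sheet_lists <- list(
  c("Contents", "Table 1 - Top 100 boys, E&W"),
  c("Contents", "Metadata", "Table 1")
)

# Hypothetical helper: keep the sheet names matching a pattern.
pick_sheet <- function(sheets, pattern) {
  str_subset(sheets, pattern)
}

# Extra arguments after the function are passed on to every call.
as_list <- map(sheet_lists, pick_sheet, pattern = "Table 1")

# map_chr insists on exactly one string per element and returns a vector.
as_vector <- map_chr(sheet_lists, pick_sheet, pattern = "Table 1")
```

map_chr doubles as a sanity check here: it raises an error if any file has zero or several matching sheets, which is usually something you want to know early.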

Reading Excel data files

Now that we know the correct worksheet from each file we can actually read those data into R. We can do that using the
read_excel function.

We’ll start by reading the data from the first file, just to check that it works. Recall that the column headers are on row 7, so we want
to skip the first 6 rows.

tmp <- read_excel(

[Link][1],
sheet = [Link]([Link][1],
pattern = "Table 1"),
skip = 6)

library(dplyr, quietly=TRUE)
glimpse(tmp)

## Observations: 59
## Variables: 7
## $ X__1 <chr> NA, "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"...
## $ Name <chr> NA, "JACK", "DANIEL", "THOMAS", "JAMES", "JOSHUA", "M...
## $ Count <dbl> NA, 10779, 10338, 9603, 9385, 7887, 7426, 6496, 6193,...
## $ X__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ X__3 <dbl> NA, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 6...
## $ Name__1 <chr> NA, "DOMINIC", "NICHOLAS", "BRANDON", "RHYS", "MARK",...
## $ Count__1 <dbl> NA, 1519, 1385, 1337, 1259, 1222, 1192, 1186, 1135, 1...
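If you do not have the baby names files at hand, the same pattern can be tried on the example workbook that ships with readxl. This is only a sketch: the workbook (datasets.xlsx) and its mtcars sheet stand in for the real data, and skipping 6 rows there has no special meaning beyond showing the effect of skip.

```r
library(readxl)
library(stringr)

# readxl ships a small example workbook with one sheet per dataset.
path <- readxl_example("datasets.xlsx")

# Pick the sheet by pattern rather than by position.
sheet <- str_subset(excel_sheets(path), "mtcars")

# Read the whole sheet; the first row supplies the column names.
full <- read_excel(path, sheet = sheet)

# Skip the first 6 rows of the sheet and treat what remains as data.
skipped <- read_excel(path, sheet = sheet, skip = 6, col_names = FALSE)

dim(full)
nrow(skipped)
```

Automatically generated column names such as X__1 depend on the readxl version in use; more recent releases may label unnamed or repeated columns differently, so check names() on your own output before relying on them.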

Exercise 1
1. Write a function that takes a file name as an argument and reads the worksheet containing “Table 1” from that file. Don’t
forget to skip the first 6 rows.

2. Test your function by using it to read one of the boys names Excel files.

3. Use the map function to read data from all the Excel files, using the function you wrote in step 1.


Exercise 1 solution

## 1. Write a function that takes a file name as an argument and reads

## the worksheet containing "Table 1" from that file.
[Link] <- function(file) {
[Link] <- str_subset(excel_sheets(file), "Table 1")
read_excel(file, sheet = [Link], skip = 6)
}

## 2. Test your function by using it to read one of the boys names

## Excel files.
glimpse([Link]([Link][1]))

## 3. Use the `map` function to read data from all the Excel files,
## using the function you wrote in step 1.
boysNames <- map([Link], [Link])

Data cleanup
Now that we’ve read in the data we still have some cleanup to do. Specifically, we need to:

1. fix column names

2. get rid of the blank row at the top and the notes at the bottom
3. get rid of extraneous “changes in rank” columns if they exist
4. transform the side-by-side tables layout to a single table.

In short, we want to go from this:

[Figure: messy]

to this:

[Figure: tidy]

There are many ways to do this kind of data manipulation in R. We’re going to use the dplyr and tidyr packages to make our lives
easier. (Both packages were installed as dependencies of the tidyverse package.)

Fixing column names

The column names are in bad shape. In R we need column names to a) start with a letter, b) contain only letters, numbers,
underscores and periods, and c) uniquely identify each column.

The actual column names look like this:

names(boysNames[[1]])

## [1] "X__1" "Name" "Count" "X__2" "X__3" "Name__1"

## [7] "Count__1"

So we need to a) make sure each column has a name, and b) distinguish between the first and second occurrences of “Name”
and “Count”. We could do this step-by-step, but there is a handy function in R called [Link] that will do it for us.

boysNames[[1]] <- setNames(

boysNames[[1]],
[Link](names(boysNames[[1]]),
unique = TRUE))

names(boysNames[[1]])

## [1] "X__1" "Name" "Count" "X__2" "X__3" "Name__1"

## [7] "Count__1"

Fixing all the names

Of course we need to iterate over each [Link] in the boysNames list and do this for each one. Fortunately the map function
makes this easy.
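A minimal sketch of that iteration, assuming base R's make.names is the name-fixing function and using two tiny made-up tables with blank and repeated column names. The helper name fix_names is hypothetical.

```r
library(purrr)

# Blank and duplicated names, as you might get from a side-by-side layout.
raw_names <- c("", "Name", "Count", "", "Name", "Count")

# Two tiny tables with made-up values.
tables <- list(
  setNames(data.frame(1, "A", 10, 51, "B", 5), raw_names),
  setNames(data.frame(1, "C", 20, 51, "D", 8), raw_names)
)

# Hypothetical helper: make names syntactically valid and unique.
fix_names <- function(x) {
  setNames(x, make.names(names(x), unique = TRUE))
}

# Apply the same fix to every table in the list.
fixed <- map(tables, fix_names)
names(fixed[[1]])
```

Note that make.names disambiguates repeats with a dot suffix (Name.1), so the result here differs from the Name__1 style names that readxl produced for the real data; those were already valid and unique.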

Selecting columns
Next we want to retain just the Name , Name__1 and Count , Count__1 columns. We can do that using the select function:

boysNames[[1]]

## # A tibble: 59 x 7
## X__1 Name Count X__2 X__3 Name__1 Count__1
## <chr> <chr> <dbl> <lgl> <dbl> <chr> <dbl>
## 1 <NA> <NA> NA NA NA <NA> NA
## 2 1 JACK 10779 NA 51 DOMINIC 1519
## 3 2 DANIEL 10338 NA 52 NICHOLAS 1385
## 4 3 THOMAS 9603 NA 53 BRANDON 1337
## 5 4 JAMES 9385 NA 54 RHYS 1259
## 6 5 JOSHUA 7887 NA 55 MARK 1222
## 7 6 MATTHEW 7426 NA 56 MAX 1192
## 8 7 RYAN 6496 NA 57 DYLAN 1186
## 9 8 JOSEPH 6193 NA 58 HENRY 1135
## 10 9 SAMUEL 6161 NA 59 PETER 1128
## # ... with 49 more rows

boysNames[[1]] <- select(boysNames[[1]], Name, Name__1, Count, Count__1)

boysNames[[1]]
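Columns can also be selected by pattern rather than by listing each one. The sketch below uses a two-row stand-in table (values taken from the output above) and the matches helper from dplyr; whether a pattern is preferable to explicit names is a judgment call.

```r
library(dplyr)

# Two-row stand-in with the same column layout as the table above.
tbl <- tibble(
  X__1     = c("1", "2"),
  Name     = c("JACK", "DANIEL"),
  Count    = c(10779, 10338),
  X__2     = c(NA, NA),
  Name__1  = c("DOMINIC", "NICHOLAS"),
  Count__1 = c(1519, 1385)
)

# Explicit names: columns come back in the order you list them.
by_name <- select(tbl, Name, Name__1, Count, Count__1)

# Pattern: keeps every column starting with Name or Count, in table order.
by_pattern <- select(tbl, matches("^(Name|Count)"))

names(by_name)
names(by_pattern)
```

Explicit names fail loudly when a column is missing, while a pattern silently returns whatever matches, so the explicit form is the safer default when the input format varies from year to year.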

Dropping missing values

Next we want to remove blank rows and rows used for notes. An easy way to do that is to use drop_na to remove rows with
missing values.

boysNames[[1]]

boysNames[[1]] <- drop_na(boysNames[[1]])

boysNames[[1]]

Finally, we will want to do this for all the elements in boysNames, a task I leave to you.
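It is worth being aware of how broadly drop_na works: with no columns named, it removes a row if any column is missing. A small sketch with a made-up table, where the last row mimics a footnote:

```r
library(dplyr)
library(tidyr)

# Made-up table: a blank first row and a footnote-like last row.
tbl <- tibble(
  Name     = c(NA, "JACK", "DANIEL", "Note: made-up footnote"),
  Count    = c(NA, 10779, 10338, NA),
  Name__1  = c(NA, "DOMINIC", "NICHOLAS", NA),
  Count__1 = c(NA, 1519, 1385, NA)
)

# No columns named: a row is dropped if ANY column is missing.
all_complete <- drop_na(tbl)

# Columns named: only missing values in those columns matter.
name_present <- drop_na(tbl, Name)

nrow(all_complete)
nrow(name_present)
```

Dropping on every column is what removes both the blank row and the notes in one step, but it would also discard a legitimate row that happens to have a single missing cell, so inspect the result before moving on.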

Exercise 2
1. Write a function that takes a [Link] as an argument and returns a modified version including only columns named
“Name”, “Name__1”, “Count”, or “Count__1”.

2. Test your function on one of the tables in boysNames.

3. Use the map function to apply the function you wrote in step 1 to every table in boysNames.

Exercise 2 solution

## 1. Write a function that takes a [Link] as an argument and returns
## a modified version including only the Name and Count columns.
## (The function name selectNameCount is illustrative.)
selectNameCount <- function(x) {
  select(x, Name, Name__1, Count, Count__1)
}

## 2. Test your function on one of the tables in boysNames.
glimpse(selectNameCount(boysNames[[1]]))

## 3. Use the `map` function to apply the function from step 1
## to every table in boysNames.
boysNames <- map(boysNames, selectNameCount)

Re-arranging into a single table

Our final task is to re-arrange the data so that it is all in a single table instead of in two side-by-side tables. For many similar tasks
the gather function in the tidyr package is useful, but in this case we will be better off using a combination of select and
bind_rows.

boysNames[[1]]

bind_rows(select(boysNames[[1]], Name, Count),

select(boysNames[[1]], Name = Name__1, Count = Count__1))

## # A tibble: 100 x 2
## Name Count
## <chr> <dbl>
## 1 JACK 10779
## 2 DANIEL 10338
## 3 THOMAS 9603
## 4 JAMES 9385
## 5 JOSHUA 7887
## 6 MATTHEW 7426
## 7 RYAN 6496
## 8 JOSEPH 6193
## 9 SAMUEL 6161
## 10 LIAM 5802
## # ... with 90 more rows

Exercise 3: Cleanup all the data

In the previous examples we learned how to drop empty rows with drop_na, select only relevant columns with select, and
re-arrange our data with select and bind_rows. In each case we applied the changes only to the first element of our
boysNames list.


Your task now is to use the map function to apply each of these transformations to all the elements in boysNames .

Exercise prototype
There are different ways you can go about it. Here is one:

## write a function that does all the cleanup

cleanupNamesData <- function(x) {
filtered <- filter(x, ![Link](Name)) # drop rows with no Name value
selected <- select(filtered, Name, Count, Name__1, Count__1) # select just Name and Count columns

bind_rows(select(selected, Name, Count), # re-arrange into two columns

select(selected, Name = Name__1, Count = Count__1))
}

## test it out on the second [Link] in the list

glimpse(boysNames[[2]]) # before cleanup
glimpse(cleanupNamesData(boysNames[[2]])) # after cleanup

## apply the cleanup function to all the [Link] in the list

boysNames <- map(boysNames, cleanupNamesData)

Adding derived columns

It is often useful to add columns that are derived from one or more existing columns. For example, we may wish to add a column
to store the length of each name:

boysNames <- map(boysNames, mutate, Length = str_count(Name))
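The same step can be written with a small named function, which some readers find easier to follow than passing mutate and its arguments through map. This sketch uses two made-up stand-in tables and str_length, which returns the number of characters in each string; the helper name add_length is hypothetical.

```r
library(dplyr)
library(purrr)
library(stringr)

# Two small stand-in tables.
tables <- list(
  tibble(Name = c("JACK", "DANIEL"), Count = c(10779, 10338)),
  tibble(Name = "OLIVER", Count = 6949)
)

# Hypothetical helper: add the number of characters in each name.
add_length <- function(x) {
  mutate(x, Length = str_length(Name))
}

tables <- map(tables, add_length)
tables[[1]]
```

Wrapping the mutate call in a function also gives you one place to add further derived columns later.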

Exercise 4: Add a year column

We originally read the data from each file listed in [Link] , and the data is still in that order. Use the information
contained in [Link] to add a Year column to each table in boysNames . (Hint: see ?map2 .)

Exercise prototype
There are different ways you can go about it. Here is one:


## Extract years
years <- [Link](str_extract([Link], "[0-9]{4}"))

## Insert year column in each table

boysNames <- map2(boysNames, years, function(x, y) mutate(x, Year = y))
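Here is a runnable sketch of the same idea on made-up inputs, assuming as.integer is the conversion applied to the extracted digits. The file paths, the table contents and the variable names are illustrative only.

```r
library(dplyr)
library(purrr)
library(stringr)

# Made-up file paths and matching one-row tables (values are illustrative).
files <- c("data/boys/1996boys.xls", "data/boys/1997boys.xls")
tables <- list(
  tibble(Name = "A", Count = 10),
  tibble(Name = "B", Count = 20)
)

# str_extract returns the FIRST run of four digits in each path.
years <- as.integer(str_extract(files, "[0-9]{4}"))

# map2 walks over both inputs in parallel: table i gets year i.
tables <- map2(tables, years, function(x, y) mutate(x, Year = y))
tables[[2]]
```

Because the first four-digit run is used, this relies on the year appearing before any other long number in the path, which holds for names like 1996boys_tcm77-[Link] but is worth checking for other naming schemes.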

Data organization and storage

Now that we have the data cleaned up and augmented, we can turn our attention to organizing and storing the data.


One table for each year

Right now we have a list of tables, one for each year. This is not a bad way to go. It has the advantage of making it easy to work
with individual years without needing to load data from other years. It has the disadvantage of making it more difficult to examine
questions that require data from multiple years.

We can store the data organized by year in .csv files, .rds files, or in database tables. For now let’s store these data in .csv
files and then see how easy it is to work with them.

## make directory to store the data

[Link]("./data/byyear", recursive = TRUE)

## Warning in [Link]("./data/byyear", recursive = TRUE): './data/byyear'

## already exists

## extract the years

years <- map_int(boysNames, function(x) unique(x$Year))
## construct paths
paths <- str_c("data/byyear/boys_names_", years, ".csv", sep = "")
## write out the data
walk2(boysNames, paths, write_csv)

## clear our workspace

rm(list = ls())
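The write-out step can be tried safely in a temporary directory. This is a sketch with made-up tables, assuming dir.create is the directory function used above; walk2 is like map2 but is called for its side effects (here, writing files) rather than its return value.

```r
library(dplyr)
library(purrr)
library(readr)
library(stringr)

# Made-up per-year tables.
tables <- list(
  tibble(Name = "A", Count = 2, Year = 1996L),
  tibble(Name = "B", Count = 3, Year = 1997L)
)

# Write into a temporary directory so nothing in the project is touched.
out_dir <- file.path(tempdir(), "byyear_demo")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# map_int fails unless each table contains exactly one integer year.
years <- map_int(tables, function(x) unique(x$Year))
paths <- file.path(out_dir, str_c("boys_names_", years, ".csv"))

# Write table i to path i.
walk2(tables, paths, write_csv)

# Read one file back to confirm the round trip.
round_trip <- read_csv(paths[[1]], col_types = cols())
```

Passing showWarnings = FALSE to dir.create avoids the "already exists" warning when the code is run a second time.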

Exercise: work with tables organized by year

1. What were the five most popular names in 2013?

2. How has the popularity of the name “ANDREW” changed over time?
Exercise prototype
Number one is easy, number two is harder:

## 1. What were the five most popular names in 2013?

boys2013 <- read_csv("./data/byyear/boys_names_2013.csv")

## Parsed with column specification:

## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )

slice(arrange(boys2013, desc(Count)), 1:5)

## # A tibble: 5 x 4
## Name Count Length Year
## <chr> <int> <int> <int>
## 1 OLIVER 6949 6 2013
## 2 JACK 6212 4 2013
## 3 HARRY 5888 5 2013
## 4 JACOB 5126 5 2013
## 5 CHARLIE 5039 7 2013

## 2. How has the popularity of the name "ANDREW" changed over time?
boysNames <- map([Link]("./data/byyear", [Link] = TRUE),
read_csv)

## Parsed with column specification:

## cols(
## Name = col_character(),
## Count = col_double(),
## Length = col_integer(),
## Year = col_integer()
## )

## Parsed with column specification:

## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )
## Parsed with column specification:
## cols(
## Name = col_character(),
## Count = col_integer(),
## Length = col_integer(),
## Year = col_integer()
## )


andrew <- map(boysNames, filter, Name == "ANDREW")
andrew <- bind_rows(andrew)

ggplot(andrew, aes(x = Year, y = Count)) +

geom_line() +
ggtitle("Popularity of \"Andrew\", over time")

One big table

By far the easiest approach is to store the data in one big table. We’ve already seen how we can combine a list of tables into one
big one.

Exercise: Make one big table

Turn the list of boys names [Link] into a single table.

Create a directory under data/all and write the data to a .csv file.

Finally, repeat the previous exercise, this time working with the data in one big table.

Exercise prototype
Working with the data in one big table is often easier.

boysNames <- bind_rows(boysNames)

[Link]("data/all")

## Warning in [Link]("data/all"): 'data/all' already exists

write_csv(boysNames, "data/all/boys_names.csv")

## What were the five most popular names in 2013?

slice(arrange(filter(boysNames, Year == 2013),
desc(Count)),
1:5)

## # A tibble: 5 x 4
## Name Count Length Year
## <chr> <dbl> <int> <int>
## 1 OLIVER 6949 6 2013
## 2 JACK 6212 4 2013
## 3 HARRY 5888 5 2013
## 4 JACOB 5126 5 2013
## 5 CHARLIE 5039 7 2013

## How has the popularity of the name "ANDREW" changed over time?
andrew <- filter(boysNames, Name == "ANDREW")

ggplot(andrew, aes(x = Year, y = Count)) +

geom_line() +
ggtitle("Popularity of \"Andrew\", over time")
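The same two questions can be sketched end to end on a tiny made-up data set, which makes it easy to check the logic by eye. The names, counts and years below are invented for illustration.

```r
library(dplyr)

# Two made-up yearly tables.
tables <- list(
  tibble(Name = c("A", "B", "C"), Count = c(30, 20, 10), Year = 2012L),
  tibble(Name = c("A", "B", "C"), Count = c(5, 25, 15), Year = 2013L)
)

# Stack the yearly tables into one big table.
all_names <- bind_rows(tables)

# Top two names in 2013: filter, sort by Count descending, keep two rows.
top2_2013 <- all_names %>%
  filter(Year == 2013) %>%
  arrange(desc(Count)) %>%
  slice(1:2)

# One name across all years, ready for plotting Count against Year.
trend_a <- filter(all_names, Name == "A")
```

With everything in one table, a question about a single year and a question about a trend over years both reduce to a filter on a different column.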


Additional reading and resources
Learn from the best: [Link] [Link]
R documentation: [Link]
Collection of R tutorials: [Link]

R for Programmers (by Norman Matloff, UC–Davis)

[Link]

Calling C and Fortran from R (by Charles Geyer, UMinn)

[Link]

State of the Art in Parallel Computing with R (Schmidberger et al.)

[Link]

Institute for Quantitative Social Science: [Link]

IQSS Data Science Services: [Link]

Convertido de web en PDF a [Link] con el api html a pdf

