Subsetting Data in R
John Muschelli
January 5, 2016
Overview
- https://cran.rstudio.com/web/packages/dplyr/vignettes/
- https://stat545-ubc.github.io/block009_dplyr-intro.html
- https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial
Select specific elements using an index
Often you only want to look at subsets of a data set at any given
time. As a review, elements of an R object are selected using the
brackets ([ and ]).
For example, x is a vector of numbers and we can select the second
element of x using the brackets and an index (2):
x = c(1, 4, 2, 8, 10)
x[2]
[1] 4
Select specific elements using an index
x = c(1, 2, 4, 8, 10)
x[5]
[1] 10
x[c(2,5)]
[1] 2 10
Subsetting by deletion of entries
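You can put a minus (-) before integers inside brackets to remove
the elements at those indices from the result.
x[-2]
[1] 1 4 8 10
Note that you have to be careful with this syntax when dropping
more than one element:
x[-c(1, 2, 3)]
[1] 8 10
x[-(1:3)] # parentheses are needed: -1:3 would mean the sequence from -1 to 3
[1] 8 10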
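To see why the parentheses matter, the short sketch below builds the index that -1:3 actually produces and captures the error R raises when it is used as a subscript. The names idx, msg and kept are illustrative only.

```r
x = c(1, 2, 4, 8, 10)

# Unary minus binds tighter than the colon, so -1:3 is the sequence -1, 0, 1, 2, 3
idx = -1:3
print(idx)

# Mixing negative and positive subscripts is an error; capture the message
msg = tryCatch(x[idx], error = function(e) conditionMessage(e))
print(msg)

# Wrapping the range in parentheses negates every index
kept = x[-(1:3)]
print(kept)
```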
Select specific elements using logical operators
What about selecting elements based on their values?
We use logical statements. Here we select only elements of x
greater than 2:
x
[1] 1 2 4 8 10
x > 2
[1] FALSE FALSE  TRUE  TRUE  TRUE
x[ x > 2 ]
[1] 4 8 10
Select specific elements using logical operators
You can combine multiple logical conditions using the following operators:
- & : AND
- | : OR
x[ x > 2 & x < 5 ]
[1] 4
x[ x > 5 | x == 2 ]
[1] 2 8 10
which function
The which function takes a logical vector and returns the indices
of the elements where the logical value is TRUE.
which(x > 5 | x == 2)
[1] 2 4 5
x[ which(x > 5 | x == 2) ]
[1] 2 8 10
x[ x > 5 | x == 2 ]
[1] 2 8 10
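The two forms above give the same result here, but they can differ when the vector contains missing values. The sketch below uses a hypothetical vector v (not from the slides) to show the difference.

```r
# Hypothetical vector with a missing value (not from the slides)
v = c(1, NA, 8, 10)

# Logical subsetting keeps a placeholder NA where the comparison is NA
by_logical = v[v > 5]
print(by_logical)

# which() returns only the positions that are TRUE, so the NA is dropped
by_which = v[which(v > 5)]
print(by_which)
```

Which behavior is preferable depends on whether missing values should stay visible in the result.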
Creating a data.frame to work with
set.seed(2016) # reproducibility
df = data.frame(x = c(1, 2, 4, 10, 10),
                x2 = rpois(5, 10),
                y = rnorm(5),
                z = rpois(5, 6)
                )
Renaming Columns of a data.frame: base R
We can use the colnames function to directly reassign column
names of df:
cn = colnames(df)
cn[ cn == "x2"] = "X"
colnames(df) = cn
head(df)
x X y z
1 1 7 -0.2707606 6
2 2 6 -1.1179372 4
3 4 10 -1.3473558 7
4 10 13 0.4832675 10
5 10 13 0.1523950 5
colnames(df)[ colnames(df) == "X" ] = "x2" # reset so later examples can refer to x2
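Loading dplyr
library(dplyr)
When dplyr is loaded, R reports that some objects are "masked": dplyr
defines functions, such as filter, whose names already exist in other
packages. R uses the version from the package loaded last. If we print
filter, the last line of the output shows namespace:dplyr, so typing
filter now calls the dplyr version:
filter
A different filter function exists in the stats package. To make sure
you use that one, use the PackageName::function syntax with the ::
operator:
head(stats::filter, 2)
This is important when you load many packages and some of their
function names conflict.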
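As a small illustration of the :: syntax, the sketch below calls the stats version of filter explicitly, which works whether or not dplyr is loaded. The input vector and the equal weights are made up for the example; conflicts() is a base R helper that lists names defined in more than one attached package.

```r
# stats::filter applies a linear filter; equal weights give a moving average
ma = stats::filter(c(1, 2, 3, 4, 5), rep(1/3, 3))
print(as.numeric(ma))

# conflicts() lists names that are defined in more than one attached package
masked = conflicts()
print(head(masked))
```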
Renaming Columns of a data.frame: dplyr
df = dplyr::rename(df, X = x2)
head(df)
x X y z
1 1 7 -0.2707606 6
2 2 6 -1.1179372 4
3 4 10 -1.3473558 7
4 10 13 0.4832675 10
5 10 13 0.1523950 5
df = dplyr::rename(df, x2 = X) # reset
Subset columns of a data.frame:
We can grab the x column using the $ operator:
df$x
[1] 1 2 4 10 10
Subset columns of a data.frame:
We can also subset a data.frame using bracket ([, ]) subsetting.
For data.frames and matrices (2-dimensional objects), the
brackets take the form [rows, columns]. We can grab the x
column using the index of the column or the column name ("x"):
df[, 1]
[1] 1 2 4 10 10
df[, "x"]
[1] 1 2 4 10 10
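Note that selecting a single column with [, ] returns a plain vector rather than a data.frame. A minimal sketch, using a small hypothetical data.frame d in place of df, shows how the drop argument changes this:

```r
# Small hypothetical data.frame standing in for df
d = data.frame(x = c(1, 2, 4), y = c(10, 20, 30))

# A single column selected with [, ] is simplified to a plain vector
as_vector = d[, "x"]
print(class(as_vector))

# drop = FALSE keeps the result as a one-column data.frame
as_df = d[, "x", drop = FALSE]
print(class(as_df))
print(dim(as_df))
```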
Subset columns of a data.frame:
We can select multiple columns using a vector of column names:
df[, c("x", "y")]
x y
1 1 -0.2707606
2 2 -1.1179372
3 4 -1.3473558
4 10 0.4832675
5 10 0.1523950
Subset columns of a data.frame: dplyr
select(df, x)
x
1 1
2 2
3 4
4 10
5 10
Select columns of a data.frame: dplyr
The select command from dplyr allows you to subset columns of
a data.frame:
select(df, x, x2)
x x2
1 1 7
2 2 6
3 4 10
4 10 13
5 10 13
select(df, starts_with("x"))
x x2
1 1 7
2 2 6
3 4 10
4 10 13
5 10 13
Subset rows of a data.frame with indices:
Let's select rows 1 and 3 from df using brackets:
df[ c(1, 3), ]
x x2 y z
1 1 7 -0.2707606 6
3 4 10 -1.3473558 7
Subset rows of a data.frame:
Let's select the rows of df where x > 5 or x == 2 using brackets:
df[ df$x > 5 | df$x == 2, ]
x x2 y z
2 2 6 -1.1179372 4
4 10 13 0.4832675 10
5 10 13 0.1523950 5
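Subset rows of a data.frame:
We can subset rows and columns at the same time, here keeping only y and z:
df[ df$x > 5 | df$x == 2, c("y", "z") ]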
y z
2 -1.1179372 4
4 0.4832675 10
5 0.1523950 5
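Subset rows of a data.frame: dplyr
The dplyr command for subsetting rows is filter. No $ is needed:
filter knows that x refers to a column of df.
filter(df, x > 5 | x == 2)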
x x2 y z
1 2 6 -1.1179372 4
2 10 13 0.4832675 10
3 10 13 0.1523950 5
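Conditions separated by commas are combined with AND (&), so the
following two calls are equivalent:
filter(df, x > 2 & y < 0)
x x2 y z
1 4 10 -1.347356 7
filter(df, x > 2, y < 0)
x x2 y z
1 4 10 -1.347356 7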
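One practical difference between bracket subsetting and filter shows up with missing values. The sketch below uses a hypothetical data.frame d with an NA in x (not the df from the slides) and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data with a missing value in x (not the df from the slides)
d = data.frame(x = c(1, NA, 8), y = c("a", "b", "c"))

# Bracket subsetting returns an all-NA row where the condition is NA
by_bracket = d[d$x > 5, ]
print(by_bracket)

# filter() keeps only rows where the condition is TRUE
by_filter = filter(d, x > 5)
print(by_filter)
```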
Combining filter and select
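You can combine filter and select to subset the rows and
columns, respectively, of a data.frame:
select(filter(df, x > 2 & y < 0), y, z)
y z
1 -1.347356 7
The same operation can be written with the pipe operator (%>%), which
passes the result on its left as the first argument of the function on
its right:
df %>% filter(x > 2 & y < 0) %>% select(y, z)
y z
1 -1.347356 7
It is read: "take df, then filter the rows, and then select y and z".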
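A quick way to convince yourself that the nested and piped forms do the same thing is to compare their results directly. This is a sketch on a hypothetical data.frame d, assuming dplyr is installed; the names nested and piped are illustrative.

```r
library(dplyr)

# Hypothetical data.frame standing in for df
d = data.frame(x = c(1, 4, 10), y = c(-0.5, -1.2, 0.3), z = c(6, 7, 5))

# Nested calls are read from the inside out
nested = select(filter(d, x > 2 & y < 0), y, z)

# The pipe passes the left-hand side as the first argument of the next call
piped = d %>% filter(x > 2 & y < 0) %>% select(y, z)

# Compare the two results
same = isTRUE(all.equal(nested, piped))
print(same)
print(piped)
```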
Adding new columns to a data.frame: base R
You can add a new column called newcol to df using the $
operator:
df$newcol = 5:1
df$newcol = df$x + 2 # overwrites the previous values of newcol
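Removing columns from a data.frame: base R
You can remove a column by assigning NULL to it:
df$newcol = NULL
Adding new columns to a data.frame: base R
You can also "column bind" a data.frame with a vector (or series
of vectors) using the cbind command:
cbind(df, newcol = 5:1)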
x x2 y z newcol
1 1 7 -0.2707606 6 5
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 3
4 10 13 0.4832675 10 2
5 10 13 0.1523950 5 1
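Unlike the $ assignment, cbind does not change the original object; it returns a new data.frame. A minimal sketch with a hypothetical data.frame d:

```r
d = data.frame(x = c(1, 2, 4))

# cbind() returns a new data.frame; d itself is unchanged
bound = cbind(d, newcol = 3:1)
cols_before = colnames(d)
print(cols_before)
print(colnames(bound))

# Assign the result back to keep the new column
d = cbind(d, newcol = 3:1)
print(colnames(d))
```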
Adding columns to a data.frame: dplyr
The mutate function in dplyr allows you to add or replace columns
of a data.frame:
mutate(df, newcol = 5:1)
x x2 y z newcol
1 1 7 -0.2707606 6 5
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 3
4 10 13 0.4832675 10 2
5 10 13 0.1523950 5 1
df = mutate(df, newcol = x + 2)
head(df, 3)
x x2 y z newcol
1 1 7 -0.2707606 6 3
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 6
Removing columns from a data.frame: dplyr
A minus sign in select drops a column:
select(df, -newcol)
x x2 y z
1 1 7 -0.2707606 6
2 2 6 -1.1179372 4
3 4 10 -1.3473558 7
4 10 13 0.4832675 10
5 10 13 0.1523950 5
Removing columns from a data.frame: dplyr
To drop several columns, negate each one:
select(df, -newcol, -y)
x x2 z
1 1 7 6
2 2 6 4
3 4 10 7
4 10 13 10
5 10 13 5
Ordering the columns of a data.frame: base R
cn = colnames(df)
df[, c("newcol", cn[cn != "newcol"]) ]
newcol x x2 y z
1 3 1 7 -0.2707606 6
2 4 2 6 -1.1179372 4
3 6 4 10 -1.3473558 7
4 12 10 13 0.4832675 10
5 12 10 13 0.1523950 5
Ordering the columns of a data.frame: dplyr
The select function can reorder columns. Put newcol first, then
select the rest of the columns with everything():
select(df, newcol, everything())
newcol x x2 y z
1 3 1 7 -0.2707606 6
2 4 2 6 -1.1179372 4
3 6 4 10 -1.3473558 7
4 12 10 13 0.4832675 10
5 12 10 13 0.1523950 5
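Ordering the rows of a data.frame: base R
The order function returns the indices that sort a vector in increasing
order, which we can use as row indices: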
df[ order(df$x), ]
x x2 y z newcol
1 1 7 -0.2707606 6 3
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 6
4 10 13 0.4832675 10 12
5 10 13 0.1523950 5 12
Ordering the rows of a data.frame: base R
Use decreasing = TRUE to sort from largest to smallest:
df[ order(df$x, decreasing = TRUE), ]
x x2 y z newcol
4 10 13 0.4832675 10 12
5 10 13 0.1523950 5 12
3 4 10 -1.3473558 7 6
2 2 6 -1.1179372 4 4
1 1 7 -0.2707606 6 3
Ordering the rows of a data.frame: base R
You can pass multiple vectors to order. To mix increasing and
decreasing orderings, negate a numeric vector with - (here we sort
increasing on x and decreasing on y):
df[ order(df$x, -df$y), ]
x x2 y z newcol
1 1 7 -0.2707606 6 3
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 6
4 10 13 0.4832675 10 12
5 10 13 0.1523950 5 12
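The minus trick only works for numeric vectors. For a character column, two base R alternatives are sketched below on a hypothetical data.frame d with a character column g; both sort decreasing on g and increasing on y.

```r
# Hypothetical data with a character column g
d = data.frame(g = c("b", "a", "b", "a"), y = c(1, 2, 3, 4))

# -d$g would fail because g is not numeric; xtfrm() gives a numeric proxy
by_xtfrm = d[order(-xtfrm(d$g), d$y), ]
print(by_xtfrm)

# With method = "radix", decreasing takes one value per sort key
by_radix = d[order(d$g, d$y, decreasing = c(TRUE, FALSE), method = "radix"), ]
print(by_radix)
```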
Ordering the rows of a data.frame: dplyr
arrange(df, x)
x x2 y z newcol
1 1 7 -0.2707606 6 3
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 6
4 10 13 0.4832675 10 12
5 10 13 0.1523950 5 12
Ordering the rows of a data.frame: dplyr
arrange(df, desc(x))
x x2 y z newcol
1 10 13 0.4832675 10 12
2 10 13 0.1523950 5 12
3 4 10 -1.3473558 7 6
4 2 6 -1.1179372 4 4
5 1 7 -0.2707606 6 3
Ordering the rows of a data.frame: dplyr
arrange(df, x, desc(y))
x x2 y z newcol
1 1 7 -0.2707606 6 3
2 2 6 -1.1179372 4 4
3 4 10 -1.3473558 7 6
4 10 13 0.4832675 10 12
5 10 13 0.1523950 5 12
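Transmutation
The transmute function in dplyr combines mutate and select: it creates
new columns and keeps only the columns you list: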
transmute(df, newcol2 = x * 3, x, y)
newcol2 x y
1 3 1 -0.2707606
2 6 2 -1.1179372
3 12 4 -1.3473558
4 30 10 0.4832675
5 30 10 0.1523950
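To make the relationship to mutate and select concrete, the sketch below builds the same result in one step and in two steps on a hypothetical data.frame d. It assumes dplyr is installed; the names direct and two_step are illustrative.

```r
library(dplyr)

# Hypothetical data.frame standing in for df
d = data.frame(x = c(1, 2, 4), y = c(0.5, -1, 2), z = c(6, 4, 7))

# transmute() creates newcol2 and keeps only the listed columns
direct = transmute(d, newcol2 = x * 3, x, y)

# The same result in two steps: add the column, then pick and order columns
two_step = select(mutate(d, newcol2 = x * 3), newcol2, x, y)

print(direct)
same = isTRUE(all.equal(direct, two_step))
print(same)
```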