Week3 2020
Strings can be printed using print (or in the console just the variable name).
string
## [1] FALSE
However, character strings containing numbers are compared in an unexpected way:
"120" > "5"
## [1] FALSE
In lexicographic ordering “5” comes after “120”: the strings are compared character by character, and the first character “1” already comes before “5”.
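A minimal sketch of how to avoid this pitfall: converting the strings to numbers with as.numeric before comparing them gives the numeric ordering. The variable names a and b are chosen for illustration only.

```r
a <- "120"
b <- "5"

# Comparing the character strings uses lexicographic ordering
a > b                          # FALSE

# Converting to numbers first gives the numeric comparison
as.numeric(a) > as.numeric(b)  # TRUE
```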
Factors
Factors are a variant on that theme and designed for use with categorical variables. The main difference
between a factor and a character vector is that factors are only allowed to take a pre-defined set of values
(“levels”).
Any vector can be converted to a factor using the function factor. It has the additional (optional) arguments
levels, which sets the allowed values, and labels, which sets the labels printed when the factor is displayed.
x <- c(1, 4, 2, 4, 1, 3, 1, 2, 4)
X <- factor(x, levels=1:5, labels=c("one", "two", "three", "four", "five"))
X
## [1] one four two four one three one two four
## Levels: one two three four five
In our example, the vector X is only allowed to take the values "one", "two", "three", "four" and "five"
(the level "five" is currently not being used). Thus we can set for example
X[1] <- "five"
X
## [1] five four two four one three one two four
## Levels: one two three four five
We cannot set the first entry to "six". "six" is not in the set of allowed labels.
X[1] <- "six"
X
## [1] <NA> four two four one three one two four
## Levels: one two three four five
In order to be able to set the first entry to "six" we need to first expand the set of levels.
levels(X) <- c(levels(X), "six")
X[1] <- "six"
X
## [1] six four two four one three one two four
## Levels: one two three four five six
To turn X back into a numerical format we can use the function unclass, which returns the integer codes of the levels.
unclass(X)
## [1] 6 4 2 4 1 3 1 2 4
## attr(,"levels")
## [1] "one" "two" "three" "four" "five" "six"
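The internal codes are positions in the set of levels, so they coincide with the original values only when the levels were set up that way, as in the example above. A minimal sketch with a made-up factor f whose levels are numbers stored as text:

```r
f <- factor(c("10", "20", "10", "30"))

# Internal codes: positions in levels(f), not the original numbers
as.integer(f)                 # 1 2 1 3

# To recover the numbers, go via the labels
as.numeric(as.character(f))   # 10 20 10 30

# table counts how often each level occurs
table(f)
```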
Converting between data types
You can convert between different data types by using as.<target-datatype>. For example, you can convert a number to a character string and back:
x <- pi # x is numeric
x <- as.character(x) # now x is a character string
x <- as.numeric(x) # x is numeric again (but we lost some digits)
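The following sketch checks how close the round trip gets to the original value and shows two common cases of automatic conversion. The variable names y and z are chosen for illustration only.

```r
x <- pi
y <- as.numeric(as.character(x))

# The round trip is close to the original value, but need not be identical
abs(y - x) < 1e-12       # TRUE

# Automatic conversion: combining a number and a string gives strings
z <- c(1, "a")
class(z)                 # "character"

# Logical values are converted to 0/1 in arithmetic
TRUE + TRUE              # 2
```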
Often R will convert between different types automatically. The most common data types are
Missing values
R uses the special value NA to indicate that a value is missing. It is different from NaN,
which is “not a number”, i.e. a value for which the calculations have gone wrong.
You can set a value to NA by simply assigning it the value NA.
x <- 1:4
x[4] <- NA
x
## [1] 1 2 3 NA
Calculations (arithmetic, summary functions, etc.) involving NAs have the result set to NA as well: the result
depends on the value that is not available.
mean(x)
## [1] NA
If you want R to omit the missing values you can either use
mean(na.omit(x))
## [1] 2
or
mean(x, na.rm=TRUE)
## [1] 2
The former is more generic, as not every function supports the additional argument na.rm=TRUE.
Use the function is.na to test whether a value is missing.
is.na(x)
## [1] FALSE FALSE FALSE TRUE
Testing for missing values using x==NA does not work.
x==NA
## [1] NA NA NA NA
The comparison just results in NA.
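A short sketch of the difference between NA and NaN, using a made-up vector v: is.na is TRUE for both, whereas is.nan is TRUE only for NaN.

```r
v <- c(1, NA, 0/0)     # 0/0 gives NaN

is.na(v)               # FALSE TRUE TRUE
is.nan(v)              # FALSE FALSE TRUE

# Count the missing values and find their positions
sum(is.na(v))          # 2
which(is.na(v))        # 2 3
```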
Lists
Lists
https://youtu.be/-MumGJIIrTI
Duration: 6m37s
Creating lists
You can create a list using the list command.
example1 <- list(1:3, c(TRUE,FALSE), 7)
example1
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] TRUE FALSE
##
## [[3]]
## [1] 7
Lists can be nested within each other.
example2 <- list(list(pi, 1:2), list("some text", 1:3))
example2
## [[1]]
## [[1]][[1]]
## [1] 3.141593
##
## [[1]][[2]]
## [1] 1 2
##
##
## [[2]]
## [[2]][[1]]
## [1] "some text"
##
## [[2]][[2]]
## [1] 1 2 3
Just like vectors, lists can be concatenated using the function c.
c(example1, example2)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] TRUE FALSE
##
## [[3]]
## [1] 7
##
## [[4]]
## [[4]][[1]]
## [1] 3.141593
##
## [[4]][[2]]
## [1] 1 2
##
##
## [[5]]
## [[5]][[1]]
## [1] "some text"
##
## [[5]][[2]]
## [1] 1 2 3
The first three entries of the resulting list come from example1, the remaining fourth and fifth entries from
example2.
It is good programming style to name the elements of a list (if possible). This can be done using names:
names(example1) <- c("a", "b", "c")
example1
## $a
## [1] 1 2 3
##
## $b
## [1] TRUE FALSE
##
## $c
## [1] 7
Alternatively, you can specify the names directly in the list command:
example1 <- list(a=1:3, b=c(TRUE,FALSE), c=7)
Accessing elements
Elements of a list can be accessed using either double square brackets or, if the list has names, $. The
following three lines of R code all return the first entry of the list example1 (using either its position or its
name).
example1[[1]]
## [1] 1 2 3
example1[["a"]]
## [1] 1 2 3
example1$a
## [1] 1 2 3
If you want to subset a list (in the sense of extracting more than one entry) use single square brackets:
example1[1:2]
## $a
## [1] 1 2 3
##
## $b
## [1] TRUE FALSE
If we extract an individual entry using single square brackets we obtain a list containing that entry rather
than the entry itself.
example1[1]
## $a
## [1] 1 2 3
##
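The difference between single and double square brackets can be checked directly. The sketch below recreates example1 and example2 so that it is self-contained.

```r
example1 <- list(a=1:3, b=c(TRUE,FALSE), c=7)

is.list(example1[1])      # TRUE: a list of length one
is.list(example1[[1]])    # FALSE: the integer vector itself

# Arithmetic works on the entry, not on the one-element list
sum(example1[[1]])        # 6

# Entries of nested lists are accessed by chaining double square brackets
example2 <- list(list(pi, 1:2), list("some text", 1:3))
example2[[2]][[1]]        # "some text"
```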
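Many R functions return their results as lists. Consider for example the function lm, which fits a linear model (here to the built-in data set cars).
fit <- lm(dist ~ speed, data=cars)
fit
##
## Call: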
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
The output object fit is in fact a list.
is.list(fit)
## [1] TRUE
Let’s look at its entries
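names(fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
We can extract the coefficients using
fit$coefficients
## (Intercept) speed
## -17.579095 3.932409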
Actually, we do not have to fully spell out the name of the entry after the $, we only have to use the first few
characters until the name is uniquely determined. So we could have also used
fit$coef
## (Intercept) speed
## -17.579095 3.932409
We could not have used
fit$c
## NULL
as there is more than one element with a name starting with a “c” ("call" and "coefficients").
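A small sketch contrasting the partial matching of $ with double square brackets, which require the exact name unless the argument exact=FALSE is given. It refits the model on the built-in data set cars so that it is self-contained.

```r
fit <- lm(dist ~ speed, data=cars)

# $ allows partial matching, as long as the abbreviation is unique
identical(fit$coef, fit$coefficients)   # TRUE

# Double square brackets require the exact name by default ...
is.null(fit[["coef"]])                  # TRUE
# ... unless partial matching is requested explicitly
fit[["coef", exact=FALSE]]
```

In scripts, spelling out the full name is the safer choice, as an abbreviation can become ambiguous when an object gains further entries.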
Data frames
Data frame basics
Data frames
https://youtu.be/SSei0Sf_nu0
Duration: 10m57s
Last week we learned how to create matrices. In principle, we could use matrices to store data sets.
However, matrices (and vectors) have one important constraint: all entries of a vector or a matrix have to be
of the same data type. Many data sets we work with have both numerical and factor variables, so we would
need to store our data using different data types for the different columns. We could in theory use lists to
manage such data. However, lists do not enforce the constraint that each variable has the same number of
observations, so if we used lists our data would soon get messy.
Let’s look at an example illustrating this limitation of matrices.
Consider a matrix kids containing age, weight and height of two children.
kids <- rbind(c( 4, 15, 101),
c(11, 28, 132))
colnames(kids) <- c("age", "weight", "height")
rownames(kids) <- c("Sarah", "John")
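kids
##       age weight height
## Sarah   4     15    101
## John   11     28    132
We can check that John is older than Sarah.
kids["John", "age"] > kids["Sarah", "age"]
## [1] TRUE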
Let’s now try adding a column gender.
kids2 <- cbind(kids, gender=c("f", "m"))
kids2
kids2["John", "age"] > kids2["Sarah", "age"]
## [1] FALSE
Not any more. How come? We have added the variable gender, which is a character vector. As all the data
in the matrix needs to be of the same data type, R had to convert all the columns, including age, to character
strings, and character strings are not compared in the same way as numbers: in a dictionary, 11 would come before 4.
We can see from the quotes around the numerical variables that these have been converted to characters
(accidental conversion to factors is slightly more difficult to spot, as R prints factors without quotes).
Because of situations like this one it is better to use data frames to store data sets.
## [1] TRUE
A data frame can handle columns of different data types, so a numeric column is not converted when adding
a character (or factor) column.
Data frames can be created using the function data.frame, so we could have created the data set using
kids <- data.frame(age=c(4,11), weight=c(15,28), height=c(101,132), gender=c("f", "m"))
rownames(kids) <- c("Sarah", "John")
or
kids <- rbind(Sarah=data.frame(age=4, weight=15, height=101, gender="f"),
John=data.frame(age=11, weight=28, height=132, gender="m"))
A data frame has row names and column names like a matrix and these can be set in the same way as for
matrices. Data frames can also be subset like matrices. We will come back to this later on.
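As a quick check, the sketch below recreates the data frame kids and confirms that the numeric columns have kept their type, so the comparison of ages works as intended.

```r
kids <- data.frame(age=c(4,11), weight=c(15,28), height=c(101,132),
                   gender=c("f", "m"))
rownames(kids) <- c("Sarah", "John")

# Which columns are numeric? (age, weight and height are, gender is not)
sapply(kids, is.numeric)

# The ages are compared as numbers
kids["John", "age"] > kids["Sarah", "age"]   # TRUE
```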
## [1] 4 11
So for a data frame, the following five lines of R commands are all equivalent
kids$age
## [1] 4 11
kids[,"age"]
## [1] 4 11
kids[,1]
## [1] 4 11
kids[["age"]]
## [1] 4 11
kids[[1]]
## [1] 4 11
As mentioned previously, it is always better to refer to columns by their name rather than their index. The
latter is too likely to change as you work with the data.
We can use the same notation to set values in the data frame. Suppose it was John’s birthday and we want
to change his age to 12. We can use any of the lines below (and there are even more possible ways).
kids["John", "age"] <- 12
kids$age[2] <- 12
kids[2,1] <- 12
kids[[1]][2] <- 12
Again, the command at the top is probably best, because it is the most “human-readable”.
Data manipulation
In this section we now look at various R functions for manipulating data frames. We work with a toy dataset
called chol, which you can load into R using
load(url("http://www.stats.gla.ac.uk/~levers/rp/chol.RData"))
The data set contains (simulated) blood fat measurements for a small number of patients.
Transforming/adding variables
https://youtu.be/XJE9x9qW0zk
Duration: 3m57s
Suppose we want to add to our dataset a new column called log.hdl.ldl which contains the logarithm of
the ratio of HDL and LDL cholesterol.
Using what we have seen so far we could use
chol <- cbind(chol, log.hdl.ldl=log(chol[,"hdl"]/chol[,"ldl"]))
or
chol[,"log.hdl.ldl"] <- log(chol[,"hdl"]/chol[,"ldl"])
or
chol$log.hdl.ldl <- log(chol$hdl/chol$ldl)
chol
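The function transform offers another way of adding a column, without having to repeat the name of the data frame inside the expression. A minimal sketch, assuming a toy data frame chol.toy (a hypothetical name) that holds only the ldl and hdl values of the first two patients:

```r
# Toy data: ldl and hdl of the first two patients
chol.toy <- data.frame(ldl=c(175, 196), hdl=c(25, 36))

# Inside transform, columns can be referred to by their name alone
chol.toy <- transform(chol.toy, log.hdl.ldl=log(hdl/ldl))
chol.toy
```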
Removing columns
You can use the subsetting techniques for matrices to remove columns. For example,
chol <- chol[,-3]
removes the column trig (third column). Alternatively, you could use list-style syntax:
chol[[3]] <- NULL
or
chol$trig <- NULL
The null object in R is NULL. You can test whether an object is null by using is.null(object). However, the
comparison object==NULL does not work.
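The following sketch shows why the comparison is not useful: it returns an empty logical vector rather than TRUE or FALSE. The name obj is chosen for illustration only.

```r
obj <- NULL

is.null(obj)          # TRUE

# Comparing with NULL gives an empty logical vector
obj == NULL           # logical(0)
length(obj == NULL)   # 0
```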
## 13 91 52 69 female current -0.5596158
Because there is a missing value in the smoking status, the new data set starts with a row of missing values.
This is due to the fact that chol$smoke!="no" has NA as its third entry.
chol$smoke!="no"
## [1] FALSE FALSE NA TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [12] FALSE TRUE
For every NA in the condition, R will put a row of NAs into the resulting data set.
We then have to remove the row(s) with missing values, which is best done using the function na.omit:
chol.smoked <- na.omit(chol.smoked)
We can subset the data frame in one go when we use the function subset. Just like transform it provides a
cleaner solution because it does not require us to use chol$ in the condition.
chol.smoked <- subset(chol, smoke!="no")
chol.smoked
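The behaviour can be reproduced on a small made-up data frame d (the values are illustrative, not taken from chol). Logical indexing keeps a row of NAs for the missing condition, whereas subset, or wrapping the condition in which, drops it.

```r
d <- data.frame(id=1:4, smoke=c("no", NA, "current", "ex-smoker"))

# Logical indexing: the NA in the condition produces a row of NAs
nrow(d[d$smoke != "no", ])          # 3

# which only returns the positions where the condition is TRUE
nrow(d[which(d$smoke != "no"), ])   # 2

# subset treats NA in the condition as FALSE
nrow(subset(d, smoke != "no"))      # 2
```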
Sorting
The function sort(x) sorts the vector x. The function order(x) returns the permutation required to sort
the vector x. Consider the vector
x <- c(11, 7, 3, 9, 4)
sort(x)
## [1] 3 4 7 9 11
So what does the function order do?
p <- order(x)
p
## [1] 3 5 2 4 1
The first entry of the result p is 3. The third entry of x is the smallest: if we want to sort x we have to put
the third entry first, or, in other words, the first entry of the sorted vector would be the third entry of x.
The second entry of p is 5, because the second smallest entry of x is its fifth entry.
We can obtain the sorted vector by applying the permutation obtained from order to x:
x[p]
## [1] 3 4 7 9 11
This trick can be used to sort an entire dataset by one column. The data frame chol can be sorted by the age
of the patient using
permut <- order(chol$age) # Create the permutation ordering the age
chol <- chol[permut,] # Apply this permutation to the entire data set
chol
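The function order can also sort in decreasing order and break ties using further columns. A short sketch with made-up data (the names v and d are chosen for illustration only):

```r
v <- c(11, 7, 3, 9, 4)

# Sort in decreasing order
v[order(v, decreasing=TRUE)]    # 11 9 7 4 3

# Sort a data frame by gender first and by age within each gender
d <- data.frame(gender=c("m", "f", "m", "f"), age=c(42, 39, 30, 65))
d.sorted <- d[order(d$gender, d$age), ]
d.sorted
```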
Attaching datasets
To access the column height in the data frame kids we have to use kids$height (or one of the above
equivalent statements). Sometimes it would be easier just to refer to it as height (without the kids$) as we
have done inside transform or subset. This can be done using the function attach. After calling
attach(kids)
we can access the columns of kids as if they were variables, i.e. we can use
weight / (height/100)^2
However, attach behaves in an unexpected way if you try to change any of the variables of a data frame you
have already attached. The problem with attach’ed data is that it exists in R twice: once inside the data
frame (where it was before you called attach) and once as a variable in your current environment. However,
R does not link those two. If you change one of them, the other one does not get updated automatically.
Suppose we now want to change the weight from kgs to pounds, i.e. divide it by 0.45359237. If you now use
weight <- weight / 0.45359237
to change the unit, you have changed the unit only for the attach’ed variable.
weight
## [1] 33.06934 61.72943
kids$weight
## [1] 15 28
weight contains the weight in pounds, whereas kids$weight still contains the “old” weight in kgs.
If we had used
kids$weight <- kids$weight / 0.45359237
the opposite would have happened. weight would have contained the old version (in kgs), and only
kids$weight would have contained the new version (in pounds). In other words, whatever we do, we will
end up with inconsistent data.
Thus, it is probably a good idea to avoid using attach. If you do use attach, remember this important rule: Never
manipulate attach’ed data!
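The function with is a safer alternative to attach. It evaluates an expression using the columns of a data
frame without creating a second copy of the data. We could use
with(kids, weight / (height/100)^2)
to carry out the calculation from above. Note that assignments made inside with are not written back to the
data frame. To change a column we can use the function within, which returns a modified copy of the data frame:
kids <- within( kids, {
weight <- weight / 0.45359237
} )
This transforms the column weight from kgs to pounds. In scripts, using with or within is actually clearer than
attach. However, in the interactive console they are slightly more awkward to use. Note that in this example it
would have been easiest to use transform.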
kids <- transform(kids, weight=weight / 0.45359237)
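The sketch below illustrates, on a toy copy kids.toy (a hypothetical name), that an assignment made inside with does not change the data frame, whereas within returns a modified copy that can be assigned back.

```r
kids.toy <- data.frame(weight=c(15, 28), height=c(101, 132))

# with evaluates an expression using the columns; kids.toy is unchanged
bmi <- with(kids.toy, weight / (height/100)^2)

# An assignment inside with is not written back to the data frame
with(kids.toy, { weight <- weight / 0.45359237 })
kids.toy$weight      # still 15 28

# within returns a modified copy, which we assign back
kids.toy <- within(kids.toy, { weight <- weight / 0.45359237 })
kids.toy$weight      # now in pounds
```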
## 2 280 19 TRUE
## 3 1082 25 TRUE
## 4 1280 31 TRUE
## 5 1580 35 TRUE
## 6 1680 28 TRUE
There are good reasons for storing the data in this format. This way less redundant information is stored and
the information about each class is stored exactly once. This makes it easier to change class properties such
as the number of pupils.
However, for the analysis of the data it is necessary that we “copy” all information from the data frame
classes into the data frame children: for every child we need to look up which class it belongs to and copy
the information about that class into the row belonging to that child. Of course we cannot simply use cbind:
the child in the second row of the data frame children attended class “180”, which is described in the first
row of classes.
The function merge can be used for such a task. By default, merge merges datasets using the columns both
data frames have in common (in our case class)
data <- merge(children, classes)
It is typically better to explicitly specify which column(s) are to be used for merging the data frames. This
prevents columns that happen to have the same name in both data frames, but are unrelated, from being used in
the merge. The common key(s) to be used for merging can be specified using the argument by, provided
they have the same names in both data frames.
data <- merge(children, classes, by="class")
The arguments by.x and by.y allow merging data frames using columns which do not have the same name
in both data frames (in our case they of course do have the same name).
data <- merge(children, classes, by.x="class", by.y="class")
For each child merge has looked up the information about the class and added it to the row in the resulting
data frame data.
head(data)
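A self-contained sketch of merge on made-up data: the column names child and pupils and all values are assumptions for illustration, only the key column class follows the text. The argument all.x=TRUE keeps rows of the first data frame that have no match in the second.

```r
# Hypothetical toy data
classes <- data.frame(class=c(180, 280), pupils=c(22, 19))
children <- data.frame(child=1:3, class=c(280, 180, 280))

# Look up the class information for every child
merged <- merge(children, classes, by="class")
merged

# A child whose class is not listed in classes is dropped by default;
# all.x=TRUE keeps it and fills the class information with NA
children2 <- rbind(children, data.frame(child=4, class=999))
merged2 <- merge(children2, classes, by="class", all.x=TRUE)
merged2
```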
head(patients)
If you want to save more than one object (say x and y) you can use
save(x, y, file="MyXY.RData")
If you want to save all objects in your workspace to a file (MyVariables.RData), you can use
save.image(file="MyVariables.RData")
When saving the workspace at the end of a session, R simply stores all objects in your workspace in a file
.RData in the current working directory.
R’s internal format is a good idea if you only work with R. However, very few other software products support
it.
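A short round trip through R’s internal format, written to a temporary file so that the sketch does not touch your working directory. The objects x and y are made up for illustration.

```r
x <- 1:3
y <- "some text"

f <- tempfile(fileext=".RData")
save(x, y, file=f)

# Remove the objects and restore them from the file
rm(x, y)
loaded <- load(f)
loaded          # names of the restored objects: "x" "y"
```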
Path names in Windows typically contain backslashes (\). In R you need to either use a forward slash instead
(for example c:/Users/Ludger/data.RData) or escape the backslash, i.e. use a double \\ (for example
c:\\Users\\Ludger\\data.RData).
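The sketch below shows that the doubled backslash stands for a single character in the string, and that file.path assembles a path with forward slashes. The path is the example from the text and need not exist on your computer.

```r
p1 <- "c:/Users/Ludger/data.RData"
p2 <- "c:\\Users\\Ludger\\data.RData"

# Both strings have the same number of characters
nchar(p1) == nchar(p2)     # TRUE

# file.path joins the components using forward slashes
p3 <- file.path("c:", "Users", "Ludger", "data.RData")
identical(p3, p1)          # TRUE
```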
Text and CSV files
Reading in data
https://youtu.be/OWR6DJKpz3A
Duration: 5m55s
chol <- read.table("chol.txt")
head(chol)
## V1 V2 V3 V4 V5 V6
## 1 175 25 148 39 female no
## 2 196 36 92 32 female no
## 3 139 65 NA 42 male <NA>
## 4 162 37 139 30 female ex-smoker
## 5 140 117 59 42 female ex-smoker
## 6 147 51 126 65 female ex-smoker
It is always worth looking at the first few lines of the data you have read in to make sure it was read in
correctly.
If, like in our example, the data file does not contain variable names, it is a good idea to set them right after
you have read in the data.
colnames(chol) <- c("ldl", "hdl", "trig", "age", "gender", "smoke")
R tries to guess the type of each column (variable), but might not always get it right. It is also worth
checking that each column was read in as the data type you had intended. This can be done using
sapply(chol, class)
or
str(chol)
chol <- read.csv("chol.csv", na.strings=".")
read.csv is a sibling of read.table with the main difference that it assumes by default that the data is
comma-separated and that the first line contains the variable names. In other words, you do not need to
specify sep="," and header=TRUE when using read.csv.
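To try read.csv without a data file, the argument text can be used to read from a character string. The sketch below uses a small made-up extract (csv.text is a hypothetical name) in which a dot marks a missing value.

```r
csv.text <- "ldl,hdl,smoke
175,25,no
139,65,.
"

# The first line is used as variable names; "." is read as NA
d <- read.csv(text=csv.text, na.strings=".")
d

# Check the data type of each column
sapply(d, class)
is.na(d$smoke[2])   # TRUE
```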
Task 4 Read the data files cars.csv and ships.txt into R.
The arguments col.names and row.names of write.table can be used to choose whether the column names and row names
should be exported as well. The function write.csv can be used instead of specifying the additional argument sep=",".
write.csv(chol, file="chol.csv", row.names=FALSE)
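A minimal round trip, writing a made-up data frame d to a temporary CSV file and reading it back in (the file name is generated by tempfile):

```r
d <- data.frame(ldl=c(175, 196), hdl=c(25, 36))

f <- tempfile(fileext=".csv")
write.csv(d, file=f, row.names=FALSE)

# Look at the raw lines of the file
readLines(f)

# Reading the file back in recovers the data
d2 <- read.csv(f)
all(d2 == d)        # TRUE
```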
Solutions to the tasks
Task 1
chol.lowdl <- subset(chol,hdl<40)
chol.lowdl
## $ service : int 127 63 NA 1095 1512 3353 0 2244 44882 17176 ...
## $ incidents: int 0 0 3 4 6 18 0 11 39 29 ...