Week3 2020

Uploaded by

shuaiwu365
Module 3 - Data Structure and Management

More Data types in R


Character strings
Character strings in R can be defined using double or single quotes.
string <- "A string in double quotes"
another.string <- 'This time defined using single quotes'

Strings can be printed using print (or in the console just the variable name).
string

## [1] "A string in double quotes"


Use the function cat to print the string as it is, without the quotes and the index prefix.
cat(string)

## A string in double quotes


R has a variety of built-in functions for manipulating strings. It is however easier to use the functions from
the package stringr.
For example, two strings can be concatenated using the function str_c (the equivalent base R function is
paste).
library(stringr)
str_c("Two strings", "put together", sep=" - ")

## [1] "Two strings - put together"
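If stringr is not installed, the base R function paste mentioned above gives the same result. A minimal sketch (the variable name joined is only used for illustration):

```r
# Base R equivalent of str_c("Two strings", "put together", sep=" - ")
joined <- paste("Two strings", "put together", sep = " - ")
joined

# paste0 is shorthand for paste(..., sep = "") and is vectorised
paste0("file", 1:3, ".txt")
```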


Note that you cannot use character vectors for any sort of arithmetic. For example
"120" + "5"

## Error in "120" + "5": non-numeric argument to binary operator


You are unlikely to do this deliberately, but you might come across this error if R treats a supposedly numerical variable
as a character string because of at least one non-numerical entry.
Comparisons between character strings use lexicographic ordering, i.e. they are compared based on where
they would be in a dictionary. For example
"apple">"pear"

## [1] FALSE
However, character strings containing numbers are compared in an unexpected way:
"120" > "5"

## [1] FALSE
In lexicographic ordering “5” comes after “120”: strings are compared character by character, and the first character “1” of “120” comes before “5”.
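If numbers stored as character strings need to be compared or sorted by value, one option is to convert them first. A minimal sketch (the vector codes is made up for illustration):

```r
# Hypothetical vector of numbers that were read in as character strings
codes <- c("120", "5", "13")

# Sorting the strings compares them character by character
sorted.as.text <- sort(codes)

# Converting to numeric first gives the numerical ordering
sorted.as.numbers <- sort(as.numeric(codes))

# The comparison from above, this time made on numbers
as.numeric("120") > as.numeric("5")
```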

Factors
Factors are a variant of character vectors designed for use with categorical variables. The main difference
between a factor and a character vector is that factors are only allowed to take a pre-defined set of values
(“levels”).
Any vector can be converted to a factor using the function factor. It has the additional (optional) arguments
levels, which sets the values the factor is allowed to take, and labels, which sets the labels printed for these levels when the factor is displayed.
x <- c(1, 4, 2, 4, 1, 3, 1, 2, 4)
X <- factor(x, levels=1:5, labels=c("one", "two", "three", "four", "five"))
X

## [1] one four two four one three one two four
## Levels: one two three four five
In our example, the vector X is only allowed to take the values "one", "two", "three", "four" and "five"
(the level "five" is currently not being used). Thus we can set for example
X[1] <- "five"
X

## [1] five four two four one three one two four
## Levels: one two three four five
We cannot set the first entry to "six", because "six" is not in the set of allowed levels.
X[1] <- "six"

## Warning in `[<-.factor`(`*tmp*`, 1, value = "six"): invalid factor level,
## NA generated
X

## [1] <NA> four two four one three one two four
## Levels: one two three four five
In order to be able to set the first entry to "six" we need to first expand the set of levels.
levels(X) <- c(levels(X), "six")
X[1] <- "six"
X

## [1] six four two four one three one two four
## Levels: one two three four five six
To turn X back into a numerical format we can use the function unclass, which returns the integer codes that R stores internally for each level.
unclass(X)

## [1] 6 4 2 4 1 3 1 2 4
## attr(,"levels")
## [1] "one" "two" "three" "four" "five" "six"
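The integer codes coincide with the original values here only because the levels were defined as 1:5. When the levels themselves look like numbers, the codes and the values differ. A small sketch with a made-up factor f:

```r
# Hypothetical factor whose levels look like numbers
f <- factor(c("10", "20", "10", "30"))

# The internal codes are the positions of the levels: 1 2 1 3
codes <- as.integer(f)

# To recover the numbers shown in the labels, go via character
values <- as.numeric(as.character(f))

# An equivalent way that converts each level only once
values2 <- as.numeric(levels(f))[f]
```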

Converting between data types
You can convert between different data types by using as.<target-datatype>. For example:
x <- pi # x is numeric
x <- as.character(x) # now x is a character string
x <- as.numeric(x) # x is numeric again (but we lost some digits)

Often R will convert between different types automatically. The most common data types are

Data type    Conversion function    Description
numeric      as.numeric             Floating point numbers
integer      as.integer             Integer numbers
logical      as.logical             TRUE or FALSE
character    as.character           Character string
factor       as.factor              Factor
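Not every conversion succeeds or behaves as one might guess. As a small illustration (the values are chosen arbitrarily), strings that cannot be interpreted as numbers become NA, and as.integer truncates rather than rounds:

```r
ok  <- as.numeric("3.5")                     # a string holding a number converts cleanly
bad <- suppressWarnings(as.numeric("abc"))   # NA; without suppressWarnings R issues a warning
int <- as.integer(3.9)                       # the fractional part is dropped, not rounded
lgl <- as.logical(c("TRUE", "false", "yes")) # "yes" is not recognised and becomes NA
num <- as.numeric(TRUE)                      # TRUE converts to 1
```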

Missing values
R uses the special value NA to indicate that a value is missing. It is different from NaN,
which is “not a number”, i.e. a value for which the calculations have gone wrong.
You can set a value to NA by simply assigning it the value NA.
x <- 1:4
x[4] <- NA
x

## [1] 1 2 3 NA
Calculations (arithmetic, summary functions, etc.) involving NAs generally have the result set to NA as well, because the result
depends on a value that is not available.
mean(x)

## [1] NA
If you want R to omit the missing values you can either use
mean(na.omit(x))

## [1] 2
or
mean(x, na.rm=TRUE)

## [1] 2
The former is more generic, as not every function supports the additional argument na.rm=TRUE.
Use the function is.na to test whether a value is missing.
is.na(x)

## [1] FALSE FALSE FALSE TRUE


You cannot use == to test whether a value is missing
x==NA

## [1] NA NA NA NA
The comparison just results in NA.
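The difference between NA and NaN, and a few common ways of counting and dropping missing values, can be sketched as follows (the vector v is made up for illustration):

```r
# Hypothetical vector containing both a missing value and a NaN
v <- c(1, NA, 3, NaN)

na.flags  <- is.na(v)    # TRUE for NA and also for NaN
nan.flags <- is.nan(v)   # TRUE only for NaN

n.missing   <- sum(is.na(v))   # number of missing entries
any.missing <- anyNA(v)        # quick check whether anything is missing
complete    <- v[!is.na(v)]    # keep only the non-missing entries
```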

Lists

Lists

https://youtu.be/-MumGJIIrTI

Duration: 6m37s

Creating lists
You can create a list using the list command.
example1 <- list(1:3, c(TRUE,FALSE), 7)
example1

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] TRUE FALSE
##
## [[3]]
## [1] 7
Lists can be nested within each other.
example2 <- list(list(pi, 1:2), list("some text", 1:3))
example2

## [[1]]
## [[1]][[1]]
## [1] 3.141593
##
## [[1]][[2]]
## [1] 1 2
##
##
## [[2]]
## [[2]][[1]]
## [1] "some text"
##
## [[2]][[2]]
## [1] 1 2 3
Just like vectors, lists can be concatenated using the function c.
c(example1, example2)

## [[1]]
## [1] 1 2 3
##
## [[2]]

## [1] TRUE FALSE
##
## [[3]]
## [1] 7
##
## [[4]]
## [[4]][[1]]
## [1] 3.141593
##
## [[4]][[2]]
## [1] 1 2
##
##
## [[5]]
## [[5]][[1]]
## [1] "some text"
##
## [[5]][[2]]
## [1] 1 2 3
The first three entries of the resulting list come from example1, the remaining fourth and fifth entries from
example2.
It is good programming style to name the elements of a list (if possible). This can be done using names:
names(example1) <- c("a", "b", "c")
example1

## $a
## [1] 1 2 3
##
## $b
## [1] TRUE FALSE
##
## $c
## [1] 7
Alternatively, you can specify the names directly in the list command:
example1 <- list(a=1:3, b=c(TRUE,FALSE), c=7)

Accessing elements
Elements of a list can be accessed using either double square brackets or, if the list has names, $. The
following three lines of R code all return the first entry of the list example1 (using either its position or its
name).
example1[[1]]

## [1] 1 2 3
example1[["a"]]

## [1] 1 2 3
example1$a

## [1] 1 2 3

If you want to subset a list (in the sense of extracting more than one entry) use single square brackets:
example1[1:2]

## $a
## [1] 1 2 3
##
## $b
## [1] TRUE FALSE
If we extract an individual entry using single square brackets we obtain a list containing that entry rather
than the entry itself.
example1[1]

## $a
## [1] 1 2 3
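The distinction matters as soon as the extracted entry is passed on to another function. A short sketch, recreating example1 so that the code is self-contained:

```r
example1 <- list(a = 1:3, b = c(TRUE, FALSE), c = 7)

single <- example1["a"]     # a list of length one
double <- example1[["a"]]   # the integer vector itself

is.list(single)   # TRUE
is.list(double)   # FALSE

# Arithmetic needs the entry itself; sum(single) would fail because single is a list
total <- sum(double)
```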

Lists are everywhere . . .


Many R functions return lists. Take the lm function as an example. The function lm fits a linear regression
model, which is essentially a straight line fit to the data. First of all we fit a linear model:
fit <- lm(dist~speed, data=cars) # Fit a linear model
fit # Print the model

##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
The output object fit is in fact a list.
is.list(fit)

## [1] TRUE
Let’s look at its entries
names(fit)

## [1] "coefficients"  "residuals"     "effects"       "rank"
## [5] "fitted.values" "assign"        "qr"            "df.residual"
## [9] "xlevels"       "call"          "terms"         "model"
So, if we for example want to retrieve the regression coefficients we can use
fit$coefficients

## (Intercept) speed
## -17.579095 3.932409
Actually, we do not have to fully spell out the name of the entry after the $, we only have to use the first few
characters until the name is uniquely determined. So we could have also used
fit$coef

## (Intercept) speed
## -17.579095 3.932409

We could not have used
fit$c

## NULL
as there is more than one element with a name starting with a “c” ("call" and "coefficients").
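Partial matching is convenient at the console but fragile in scripts, because an abbreviation can become ambiguous when a list gains new entries. The sketch below contrasts it with two stricter alternatives; it assumes the built-in cars data used above:

```r
fit <- lm(dist ~ speed, data = cars)

# Double square brackets match names exactly by default, so this is NULL
exact <- fit[["coef"]]

# Partial matching has to be requested explicitly
partial <- fit[["coef", exact = FALSE]]

# For model objects the extractor function coef is the most robust choice
extracted <- coef(fit)
```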

Data frames
Data frame basics

Data frames

https://youtu.be/SSei0Sf_nu0

Duration: 10m57s

Last week we learned how to create matrices. In principle, we could use matrices to store data sets.
However, matrices (and vectors) have one important constraint: all entries of a vector or a matrix have to be
of the same data type. Many data sets we work with have both numerical and factor variables, so we would
need to store our data using different data types for the different columns. We could in theory use lists to
manage such data. However, lists do not enforce the constraint that each variable has the same number of
observations, so if we used lists our data would soon get messy.
Let’s look at an example illustrating this limitation of matrices.
Consider a matrix kids containing age, weight and height of two children.
kids <- rbind(c( 4, 15, 101),
c(11, 28, 132))
colnames(kids) <- c("age", "weight", "height")
rownames(kids) <- c("Sarah", "John")
kids

##       age weight height
## Sarah   4     15    101
## John   11     28    132
We can for example see that John is older than Sarah.
kids["John", "age"] > kids["Sarah", "age"]

## [1] TRUE
Let’s now try adding a column gender.
kids2 <- cbind(kids, gender=c("f", "m"))
kids2

##       age  weight height gender
## Sarah "4"  "15"   "101"  "f"
## John  "11" "28"   "132"  "m"
Let’s see whether John is still older than Sarah.

kids2["John", "age"] > kids2["Sarah", "age"]

## [1] FALSE
Not any more. How come? We have added the variable gender, which is a character vector. As all the data
in the matrix needs to be of the same data type, R had to convert all the columns, including age, to character
strings. Character strings are not compared in the same way as numbers: in a dictionary “11” would come before “4”.
We can see from the quotes around the numerical variables that these have been converted to characters
(accidental conversion to factors is slightly more difficult to spot, as R prints factors without quotes).
Because of situations like this one it is better to use data frames to store data sets.

Creating data frames


A matrix can be converted to a data frame using as.data.frame and we can convert a data frame back to a
matrix using as.matrix.
kids <- as.data.frame(kids)
kids <- cbind(kids, gender=c("f","m"))
kids

##       age weight height gender
## Sarah   4     15    101      f
## John   11     28    132      m
kids["John", "age"] > kids["Sarah", "age"]

## [1] TRUE
A data frame can handle columns of different data types, so the numeric columns are not converted when adding
a character (or factor) column.
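A quick way to see the difference is to inspect the storage types directly. The following sketch rebuilds the kids data under the names kids.df and kids.matrix (both chosen here for illustration) and checks what happened to the age column:

```r
kids.df <- data.frame(age = c(4, 11), weight = c(15, 28), height = c(101, 132),
                      row.names = c("Sarah", "John"))

# Adding a character column to the matrix version converts everything
kids.matrix <- cbind(as.matrix(kids.df), gender = c("f", "m"))
typeof(kids.matrix)        # "character"

# Adding the same column to the data frame leaves the other columns alone
kids.df <- cbind(kids.df, gender = c("f", "m"))
sapply(kids.df, class)     # one class per column
is.numeric(kids.df$age)    # TRUE
```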
Data frames can be created using the function data.frame, so we could have created the data set using
kids <- data.frame(age=c(4,11), weight=c(15,28), height=c(101,132), gender=c("f", "m"))
rownames(kids) <- c("Sarah", "John")

or
kids <- rbind(Sarah=data.frame(age=4, weight=15, height=101, gender="f"),
John=data.frame(age=11, weight=28, height=132, gender="m"))

A data frame has row names and column names like a matrix and these can be set in the same way as for
matrices. Data frames can also be subset like matrices. We will come back to this later on.

Data frames also behave like lists


Data frames do not only behave like matrices, they also behave like lists (with the columns being the entries).
We can use $ to access columns:
kids$age

## [1] 4 11
So for a data frame, the following five R commands are all equivalent
kids$age

## [1] 4 11
kids[,"age"]

## [1] 4 11
kids[,1]

## [1] 4 11
kids[["age"]]

## [1] 4 11
kids[[1]]

## [1] 4 11
As mentioned previously, it is always better to refer to columns by their name rather than their index. The
latter is too likely to change as you work with the data.
We can use the same notation to set values in the data frame. Suppose it was John’s birthday and we want
to change his age to 12. We can use any of the lines below (and there are even more possible ways).
kids["John", "age"] <- 12
kids$age[2] <- 12
kids[2,1] <- 12
kids[[1]][2] <- 12

Again, the command at the top is probably best, because it is the most “human-readable”.

Data manipulation
In this section we now look at various R functions for manipulating data frames. We work with a toy dataset
called chol, which you can load into R using
load(url("http://www.stats.gla.ac.uk/~levers/rp/chol.RData"))

The data set contains (simulated) blood fat measurements for a small number of patients.

Adding new columns

Transforming/adding variables

https://youtu.be/XJE9x9qW0zk

Duration: 3m57s

Suppose we want to add to our dataset a new column called log.hdl.ldl which contains the logarithm of
the ratio of HDL and LDL cholesterol.
Using what we have seen so far we could use
chol <- cbind(chol, log.hdl.ldl=log(chol[,"hdl"]/chol[,"ldl"]))

or
chol[,"log.hdl.ldl"] <- log(chol[,"hdl"]/chol[,"ldl"])

or

chol$log.hdl.ldl <- log(chol$hdl/chol$ldl)
chol

## ldl hdl trig age gender smoke log.hdl.ldl


## 1 175 25 148 39 female no -1.94591015
## 2 196 36 92 32 female no -1.69459572
## 3 139 65 NA 42 male <NA> -0.76008666
## 4 162 37 139 30 female ex-smoker -1.47667842
## 5 140 117 59 42 female ex-smoker -0.17946849
## 6 147 51 126 65 female ex-smoker -1.05860695
## 7 82 81 NA 57 male no -0.01227009
## 8 165 63 120 48 male current -0.96281075
## 9 149 49 NA 32 female no -1.11212601
## 10 95 54 157 55 female ex-smoker -0.56489285
## 11 169 59 67 48 female no -1.05236127
## 12 174 117 168 41 female no -0.39688136
## 13 91 52 146 69 female current -0.55961579
The first command works for data frames and matrices, whereas the other two only work for data frames (a matrix cannot be
extended by assigning to a column name that does not exist yet, and $ is not defined for matrices).
All of the above commands look slightly messy and use the name of the dataset more than once, which is a
potential source of mistakes when you want to change the name of the dataset in the future.
It is generally better to use the function transform. Its main advantage is that we do not need to put chol$
everywhere.
chol <- transform(chol, log.hdl.ldl=log(hdl/ldl))

Removing columns
You can use the subsetting techniques for matrices to remove columns. For example,
chol <- chol[,-3]

removes the column trig (third column). Alternatively, you could use list-style syntax:
chol[[3]] <- NULL

or
chol$trig <- NULL

The null object in R is NULL. You can test whether an object is null by using is.null(object). The comparison object==NULL
does however not work: it returns an empty logical vector rather than TRUE or FALSE.
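What goes wrong with the comparison can be seen in a short sketch (obj is a placeholder name):

```r
obj <- NULL

is.null(obj)                # TRUE: the reliable test

comparison <- obj == NULL   # an empty logical vector, neither TRUE nor FALSE
length(comparison)          # 0

# Using the comparison inside if() would therefore stop with an error,
# whereas is.null() can be used safely
if (is.null(obj)) message("obj is NULL")
```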

Subsetting data sets


Data frames can be subset just like matrices. For example, to remove all patients who have never smoked
and store the result in a data frame called chol.smoked we can use
chol.smoked <- chol[chol$smoke!="no",]
chol.smoked

## ldl hdl age gender smoke log.hdl.ldl


## NA NA NA NA <NA> <NA> NA
## 4 162 37 30 female ex-smoker -1.4766784
## 5 140 117 42 female ex-smoker -0.1794685
## 6 147 51 65 female ex-smoker -1.0586070
## 8 165 63 48 male current -0.9628107
## 10 95 54 55 female ex-smoker -0.5648928

## 13 91 52 69 female current -0.5596158
Because there is a missing value in the smoking status, the new data set starts with a row of missing values.
This is due to the fact that chol$smoke!="no" has as third entry NA.
chol$smoke!="no"

## [1] FALSE FALSE NA TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [12] FALSE TRUE
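For every NA in the condition, R inserts a row of NAs at the corresponding position in the result. Here the NA is the third entry of the condition and the first two entries are FALSE, so the row of NAs ends up at the top.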
We then have to remove the row(s) with missing values, which is best done using the function na.omit
chol.smoked <- na.omit(chol.smoked)
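Instead of removing the rows of NAs afterwards, one can also prevent them from being created. The sketch below uses a small made-up data frame toy rather than chol; which keeps only the positions where the condition is TRUE and therefore drops the NA:

```r
# Hypothetical data with one missing smoking status
toy <- data.frame(id = 1:5,
                  smoke = c("no", "current", NA, "ex-smoker", "no"))

cond <- toy$smoke != "no"     # FALSE TRUE NA TRUE FALSE

with.na.row <- toy[cond, ]                 # contains a row of NAs
via.which   <- toy[which(cond), ]          # which() ignores the NA
via.is.na   <- toy[!is.na(cond) & cond, ]  # the same, written out explicitly
```

Note that na.omit removes every row containing any missing value, so it would also drop smokers with a missing value in another column; the two approaches above only look at the condition.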

We can subset the data frame in one go when we use the function subset. Just like transform it provides a
cleaner solution because it does not require us to use chol$ in the condition.
chol.smoked <- subset(chol, smoke!="no")
chol.smoked

## ldl hdl age gender smoke log.hdl.ldl


## 4 162 37 30 female ex-smoker -1.4766784
## 5 140 117 42 female ex-smoker -0.1794685
## 6 147 51 65 female ex-smoker -1.0586070
## 8 165 63 48 male current -0.9628107
## 10 95 54 55 female ex-smoker -0.5648928
## 13 91 52 69 female current -0.5596158
Task 1 Create a data frame chol.lowhdl containing the data for patients with a HDL cholesterol of less
than 40 mg/dl.

Sorting
The function sort(x) sorts the vector x. The function order(x) returns the permutation required to sort
the vector x. Consider the vector
x <- c(11, 7, 3, 9, 4)

We can sort x using


sort(x)

## [1] 3 4 7 9 11
So what does the function order do?
p <- order(x)
p

## [1] 3 5 2 4 1
The first entry of the result p is 3. The third entry of x is the smallest: if we want to sort x we have to put
the third entry first, or, in other words, the first entry of the sorted vector would be the third entry of x.
The second entry of p is 5, because the second smallest entry of x is its fifth entry.
We can obtain the sorted vector by applying the permutation obtained from order to x
x[p]

## [1] 3 4 7 9 11

This trick can be used to sort an entire dataset by one column. The data frame chol can be sorted by the age
of the patient using
permut <- order(chol$age) # Create the permutation ordering the age
chol <- chol[permut,] # Apply this permutation to the entire data set
chol

## ldl hdl age gender smoke log.hdl.ldl


## 4 162 37 30 female ex-smoker -1.47667842
## 2 196 36 32 female no -1.69459572
## 9 149 49 32 female no -1.11212601
## 1 175 25 39 female no -1.94591015
## 12 174 117 41 female no -0.39688136
## 3 139 65 42 male <NA> -0.76008666
## 5 140 117 42 female ex-smoker -0.17946849
## 8 165 63 48 male current -0.96281075
## 11 169 59 48 female no -1.05236127
## 10 95 54 55 female ex-smoker -0.56489285
## 7 82 81 57 male no -0.01227009
## 6 147 51 65 female ex-smoker -1.05860695
## 13 91 52 69 female current -0.55961579
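The same idea extends to sorting in decreasing order and to breaking ties with a second column. A minimal sketch on a made-up data frame toy (its columns only mimic those of chol):

```r
toy <- data.frame(name = c("a", "b", "c", "d"),
                  age  = c(42, 30, 42, 35),
                  ldl  = c(140, 162, 139, 150))

# Oldest patients first
by.age.desc <- toy[order(toy$age, decreasing = TRUE), ]

# Sort by age; patients of the same age are then sorted by ldl
by.age.then.ldl <- toy[order(toy$age, toy$ldl), ]
```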

Attaching datasets
To access the column height in the data frame kids we have to use kids$height (or one of the above
equivalent statements). Sometimes it would be easier just to refer to it as height (without the kids$) as we
have done inside transform or subset. This can be done using the function attach. After calling
attach(kids)

we can access the columns of kids as if they were variables, i.e. we can use
weight / (height/100)^2

## [1] 14.70444 16.06979


instead of
kids$weight / (kids$height/100)^2

## [1] 14.70444 16.06979


To undo the effects of attach simply use
detach(kids)

However, attach behaves in an unexpected way if you try to change any of the variables of a data frame you
have already attached. The problem with attach’ed data is that it exists in R twice: once inside the data
frame (where it was before you called attach) and once as a copy that R finds when you use the variable name on its own. R
does however not link those two. If you change one of them, the other one does not get updated automatically.
Suppose we now want to change the weight from kgs to pounds, i.e. divide it by 0.45359237. If you now use
weight <- weight / 0.45359237

to change the unit, you have changed the unit only for the attach’ed variable.
weight

## [1] 33.06934 61.72943

kids$weight

## [1] 15 28
weight contains the weight in pounds, whereas kids$weight still contains the “old” weight in kgs.
If we had used
kids$weight <- kids$weight / 0.45359237

the opposite would have happened. weight would have contained the old version (in kgs), and only
kids$weight would have contained the new version (in pounds). In other words, whatever we do, we will
end up with inconsistent data.
Thus, it is probably a good idea to avoid using attach. If you use attach remember this important rule: Never
manipulate attach’ed data!
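The function with is a safer alternative to attach. It evaluates an expression with the columns of the data frame
available as variables. Unlike attach it leaves nothing behind, but it also does not modify the data frame: any
assignment made inside with is lost, so the result has to be assigned back explicitly. We could use
kids$weight <- with(kids, weight / 0.45359237)

to transform the column weight from kgs to pounds. In scripts using with is actually clearer than attach.
However in the interactive console, with is slightly more awkward to use. Note that in this example it would
have been easiest to use transform.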
kids <- transform(kids, weight=weight / 0.45359237)
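The difference between computing with the columns and changing them can be sketched as follows. The example recreates kids so that it runs on its own, and uses the base R function within, which is not covered above, as a further option; the names bmi, kids.lb and weight.lb are chosen for illustration:

```r
kids <- data.frame(age = c(4, 11), weight = c(15, 28), height = c(101, 132),
                   row.names = c("Sarah", "John"))

# with() evaluates an expression using the columns and returns the result;
# kids itself is left unchanged
bmi <- with(kids, weight / (height / 100)^2)

# within() evaluates assignments and returns a modified copy of the data frame
kids.lb <- within(kids, weight.lb <- weight / 0.45359237)
```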

Merging data sets


Often, the data required for a certain analysis is stored in more than one data frame. Many transactional
databases are designed using the so-called “third-normal form” design principle, which, despite being optimal
from a transactional point of view, requires combining many tables before being able to analyse the data. R
has a built-in command merge, which allows for merging tables using common keys.
Consider the following example. In a study of 2,287 eighth-grade pupils (aged about 11) a language test score
and the verbal IQ were determined. The question of interest is whether the socio-economic status (SES) of
the parents and various characteristics of the class influence the language test score and the verbal IQ. The
data is stored in two data frames, one containing the data about the children (children) and one containing
the data about the different classes (classes).
The first few rows of children are:
load(url("http://www.stats.gla.ac.uk/~levers/rp/children_classes.RData"))
head(children)

## lang IQ SES class


## 1 46 15.0 23 180
## 2 45 14.5 10 180
## 3 33 9.5 15 180
## 4 46 11.0 23 180
## 5 20 8.0 10 180
## 6 30 9.5 10 180
The first few rows of classes are:
head(classes)

## class size combined


## 1 180 29 TRUE

## 2 280 19 TRUE
## 3 1082 25 TRUE
## 4 1280 31 TRUE
## 5 1580 35 TRUE
## 6 1680 28 TRUE
There are good reasons for storing the data in this format. This way less redundant information is stored and
the information about each class is stored exactly once. This makes it easier to change class properties like
for example the number of pupils.
However, for the analysis of the data it is necessary that we “copy” all information from the data frame
classes into the data frame children: for every child we need to look up which class it belongs to and copy
the information about that class into the row belonging to that child. Of course we cannot simply use cbind:
the child in the second row of the data frame children attended class “180”, which is described in the first
row of classes.
The function merge can be used for such a task. By default, merge merges datasets using the columns both
data frames have in common (in our case class)
data <- merge(children, classes)

It is typically better to explicitly specify which column(s) are to be used for merging the data frames. This
prevents columns that happen to have the same name in both data frames, but are unrelated, from being used in
the merge. The common key(s) to be used for merging can be specified using the argument by, provided
they have the same names in both data frames.
data <- merge(children, classes, by="class")

The arguments by.x and by.y allow merging data frames using columns which do not have the same name
in both data frames (in our case they of course do have the same name).
data <- merge(children, classes, by.x="class", by.y="class")

For each child merge has looked up the information about the class and added it to the row in the resulting
data frame data.
head(data)

## class lang IQ SES size combined


## 1 180 46 15.0 23 29 TRUE
## 2 180 45 14.5 10 29 TRUE
## 3 180 33 9.5 15 29 TRUE
## 4 180 46 11.0 23 29 TRUE
## 5 180 20 8.0 10 29 TRUE
## 6 180 30 9.5 10 29 TRUE
If there were children in a class for which there is no entry in classes R would by default remove these from
the resulting data frame. Similarly, data from classes for which there are no pupils in children will also not
appear in the resulting table.
If you want the resulting data frame to contain all cases from the first data frame even if there is no matching
entry in the second data frame, you need to specify the additional argument (all.x=TRUE). If you want the
resulting data frame to contain all cases from the second data frame even if there is no matching entry in the
first data frame, you need to specify the additional argument (all.y=TRUE).
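The effect of these arguments is easiest to see on a tiny example. The data frames pupils and rooms below are made up for illustration; class 999 has no entry in rooms and class 1082 has no pupils:

```r
pupils <- data.frame(pupil = 1:4, class = c(180, 180, 280, 999))
rooms  <- data.frame(class = c(180, 280, 1082), size = c(29, 19, 25))

# Default: only classes present in both data frames are kept
inner <- merge(pupils, rooms, by = "class")

# Keep all pupils; size is NA for the pupil in class 999
left <- merge(pupils, rooms, by = "class", all.x = TRUE)

# all=TRUE is shorthand for all.x=TRUE together with all.y=TRUE
full <- merge(pupils, rooms, by = "class", all = TRUE)
```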
Task 2 Consider two data frames patients and weights, which you can load into R using
load(url("http://www.stats.gla.ac.uk/~levers/rp/patients_weights.RData"))

The first few rows of the data sets are

head(patients)

## PatientID Gender Age Smoke


## 1 1 male 33 no
## 2 2 female 32 no
## 3 3 male 67 ex
## 4 4 male 36 current
## 5 5 female 47 current
head(weights)

## PatientID Week Weight


## 1 1 1 72
## 2 1 2 74
## 3 1 3 71
## 4 2 1 54
## 5 2 3 54
## 6 3 1 96
Merge the data sets such that there is information about the patient for each weighing.

Importing and exporting data from R


IMPORTANT: whenever using Rstudio in the labs and opening or saving R script files, navigate to either
the H: or K: drive and open/save from there, but don’t go directly or navigate into the Documents or Desktop
folder.

Native .RData files


R has an internal binary data format (“.RData files”), which can be used to store one or more objects.
Typically .RData or .rda is used as the file extension. The key advantage of using R’s internal format is that
it stores objects exactly as they are in R. If you load the objects back in you are guaranteed that they are
exactly reproduced.
If you want to save an object x to a file you can use
save(x, file="MyX.RData")

If you want to save more than one object (say x and y) you can use
save(x, y, file="MyXY.RData")

If you want to save all objects in your workspace to a file (MyVariables.RData), you can use
save.image(file="MyVariables.RData")

To load the data you have saved back into R use


load(file="MyVariables.RData")

When saving the workspace at the end of a session, R simply stores all objects in your workspace in a file
.RData in the current working directory.
R’s internal format is a good idea if you only work with R. However, very few other software products support
it.
Path names in Windows typically contain backslashes (\). In R you need to either use a forward slash instead
(for example c:/Users/Ludger/data.RData) or escape the backslash, i.e. use a double \\ (for example
c:\\Users\\Ludger\\data.RData).
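A round trip through an .RData file can be sketched as follows. The file is written to a temporary directory so that the example does not touch your own files; the file name and the environment restored are assumptions made for illustration:

```r
x <- 1:4
y <- c(a = "some", b = "text")

# file.path builds the path from its parts, so no backslashes need to be typed
path <- file.path(tempdir(), "MyXY.RData")
save(x, y, file = path)

# Loading into a separate environment avoids overwriting objects of the same
# name in the workspace; load returns the names of the restored objects
restored <- new.env()
loaded.names <- load(path, envir = restored)

identical(restored$x, x)   # the objects come back exactly as they were saved
```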

Text and CSV files

Reading in data

https://youtu.be/OWR6DJKpz3A

Duration: 5m55s

Importing data into R


In most cases data is stored in a table or spreadsheet format. The easiest way of reading external data into R
is to use delimited text files. If the data at hand is say an Excel spreadsheet it is typically easier to open it in
Excel first and convert it to a text (or CSV) file in Excel, and only then open the text file in R.
Before reading a text file into R, it is always a good idea to look at the file first using a raw text editor and
determine the following information, which is easy for you to determine, but hard for R to guess automatically:
• Column names: Does the first line of the file contain data or does it contain the names of the columns
(“variables”)?
• Delimiter: What character is used to delimit the columns? This is typically white space, tabulator, ,
or ;.
• Missing values: Determine how missing values are encoded (if there are any). R uses NA, but * and .
are common as well.
Below are the first few lines of the file chol.txt.
175 25 148 39 female no
196 36 92 32 female no
139 65 NA 42 male NA
We can see that . . .
• The first line contains data and not the names of the columns.
• The columns are delimited using white space.
• The data set uses NA to encode missing values.
Such white space (or tab) delimited files can be read in using the R function read.table. It assumes by
default that the first line of the file does not contain the column names (i.e. the first line already contains
data), that white space is used as delimiter, and that missing values are encoded as NA. If this is not the case,
you need to use the following additional arguments:
• header: Use header=TRUE if the first line of the file contains the names of the columns.
• sep: If the delimiter of the columns is not white space, but another character, you need to use the
additional argument sep. For comma-separated data, use sep=",". The latter is better read in using
the function read.csv.
• na.strings: If missing values are encoded using strings other than “NA”, you need to use the
additional argument na.strings. If for example “*” is used to denote missing values, you would use
na.strings="*". The argument na.strings can be a character vector if more than one string is used
to denote missing values. You do not need to use this argument if your data set does not contain any
missing values.
• dec: You can use the additional option dec to set the decimal separator (e.g. dec=',')
The file chol.txt uses no column names, white space as a separator, and uses “NA” for missing values. Thus
we do not need to use any additional arguments to read it into R.

chol <- read.table("chol.txt")
head(chol)

## V1 V2 V3 V4 V5 V6
## 1 175 25 148 39 female no
## 2 196 36 92 32 female no
## 3 139 65 NA 42 male <NA>
## 4 162 37 139 30 female ex-smoker
## 5 140 117 59 42 female ex-smoker
## 6 147 51 126 65 female ex-smoker
It is always worth looking at the first few lines of the data you have read in to make sure it was read in
correctly.
If, like in our example, the data file does not contain variable names, it is a good idea to set them right after you
have read in the data.
colnames(chol) <- c("ldl", "hdl", "trig", "age", "gender", "smoke")

R tries to guess what type each column (variable) is, but might not always get it right. It is also worth
checking that each column was read in as the data type you had intended. This can be done using
sapply(chol, class)

##       ldl       hdl      trig       age    gender     smoke
## "integer" "integer" "integer" "integer"  "factor"  "factor"
Alternatively we could use
str(chol)

## 'data.frame': 13 obs. of 6 variables:


## $ ldl : int 175 196 139 162 140 147 82 165 149 95 ...
## $ hdl : int 25 36 65 37 117 51 81 63 49 54 ...
## $ trig : int 148 92 NA 139 59 126 NA 120 NA 157 ...
## $ age : int 39 32 42 30 42 65 57 48 32 55 ...
## $ gender: Factor w/ 2 levels "female","male": 1 1 2 1 1 1 2 2 1 1 ...
## $ smoke : Factor w/ 3 levels "current","ex-smoker",..: 3 3 NA 2 2 2 3 1 3 2 ...
If a variable which you had intended to be numeric (or an integer) shows up as a factor (or, in R 4.0.0 and later, as a
character vector), it is likely that the data contained missing values (or other error codes) that were not read in correctly.
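This effect can be reproduced without any file by passing the data as text. In the sketch below the string raw is made up and uses a dot to encode a missing value:

```r
raw <- "ldl,hdl,trig
175,25,148
139,65,.
162,37,139"

# Without na.strings the dot is treated as text, so trig is not numeric
wrong <- read.csv(text = raw)
class(wrong$trig)   # "character" in R 4.0.0 and later, "factor" in earlier versions

# Declaring the dot as missing value lets R recognise the column as integer
right <- read.csv(text = raw, na.strings = ".")
class(right$trig)   # "integer"
```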
The file chol.csv contains the same data, however in comma-separated (“CSV”) format. The first few lines of
this file are
ldl,hdl,trig,age,gender,smoke
175,25,148,39,female,no
196,36,92,32,female,no
139,65,.,42,male,.
We can see that . . .
• The first line of the file are the column names.
• The columns are delimited using commas.
• The missing values are coded using a dot (.).
Thus we can read the file into R using
chol <- read.table("chol.csv", header=TRUE, sep=",", na.strings=".")

or

chol <- read.csv("chol.csv", na.strings=".")

read.csv is a sibling of read.table with the main difference that it assumes by default that the data is
comma-separated and that the first line contains the variable names. In other words, you do not need to
specify sep="," and header=TRUE when using read.csv.
Task 3 Read the data files cars.csv and ships.txt into R.

Exporting data from R


Text files can also be used to export data from R, using the function write.table:
write.table(chol, file="chol.csv", sep=",", col.names=TRUE, row.names=FALSE)

The arguments col.names and row.names can be used to choose whether the column names and row names
should be exported as well. The function write.csv can be used instead of the additional argument sep=",".
write.csv(chol, file="chol.csv", row.names=FALSE)
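To check that an export worked it can help to read the file straight back in. A minimal sketch, using a made-up data frame toy and a temporary file instead of chol.csv:

```r
toy <- data.frame(ldl = c(175, 196), gender = c("female", "male"))

# Write to a temporary file so that no existing file is overwritten
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# The first line of the file holds the column names
header.line <- readLines(path)[1]

# Reading the file back in should reproduce the columns and rows
back <- read.csv(path)
```

Without row.names=FALSE the row names would be written as an additional first column, which then shows up as an extra variable when the file is read back in.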

Other file formats


There are many R packages that allow reading in data stored in file formats used by other software products.
In most cases it is best to first convert the file into a text file using the proprietary software the file format
corresponds to and then open the exported text file in R. However, there are R packages which allow opening
various different file formats.

Package    Functions                Formats that can be read in
readxl     read_excel               Excel spreadsheets (.xls and .xlsx)
xlsx       read.xlsx, write.xlsx    Excel spreadsheets (only .xlsx)
foreign    read.dta, write.dta      Stata binary files
foreign    read.spss                SPSS files

Reminder: Installing R packages


Before you can load an R package with library(packagename) you need to install it, either using the
RStudio user interface (click on the tab Packages in the bottom right pane and then click on Install) or
from the command line using the command install.packages("packagename").

Solutions to the tasks
Task 1
chol.lowhdl <- subset(chol,hdl<40)
chol.lowhdl

## ldl hdl trig age gender smoke


## 1 175 25 148 39 female no
## 2 196 36 92 32 female no
## 4 162 37 139 30 female ex-smoker
Task 2
weights.all <- merge(patients,weights,by="PatientID")
head(weights.all)

## PatientID Gender Age Smoke Week Weight


## 1 1 male 33 no 1 72
## 2 1 male 33 no 2 74
## 3 1 male 33 no 3 71
## 4 2 female 32 no 1 54
## 5 2 female 32 no 3 54
## 6 3 male 67 ex 1 96
Task 3 The first line of the file cars.csv contains the variable names and the fields are separated by commas.
Missing values are recorded as asterisks.
cars <- read.csv("cars.csv",na.strings ="*")
str(cars)

## 'data.frame': 20 obs. of 5 variables:


## $ Manufacturer: Factor w/ 10 levels "Cadillac","Chevrolet",..: 2 8 3 2 2 10 3 10 1 6 ...
## $ Model : Factor w/ 20 levels "Achieva","Astro",..: 3 1 17 2 9 8 18 12 10 11 ...
## $ MPG : int 19 NA 22 NA 25 18 18 25 16 29 ...
## $ Displacement: num 3.4 2.3 2.5 4.3 2.2 2.8 3 1.8 4.9 1.5 ...
## $ Horsepower : int 160 155 100 165 110 178 300 81 200 81 ...
or alternatively
cars <- read.table("cars.csv", header = TRUE, sep = ",", na.strings = "*")
str(cars)

## 'data.frame': 20 obs. of 5 variables:


## $ Manufacturer: Factor w/ 10 levels "Cadillac","Chevrolet",..: 2 8 3 2 2 10 3 10 1 6 ...
## $ Model : Factor w/ 20 levels "Achieva","Astro",..: 3 1 17 2 9 8 18 12 10 11 ...
## $ MPG : int 19 NA 22 NA 25 18 18 25 16 29 ...
## $ Displacement: num 3.4 2.3 2.5 4.3 2.2 2.8 3 1.8 4.9 1.5 ...
## $ Horsepower : int 160 155 100 165 110 178 300 81 200 81 ...
The first line of the file ships.txt contains the variable names and the fields are separated by whitespace.
Missing values are encoded as dots.
ships <- read.table("ships.txt", header= TRUE, na.strings= ".")
str(ships)

## 'data.frame': 40 obs. of 5 variables:


## $ type : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ year : int 60 60 65 65 70 70 75 75 60 60 ...
## $ period : int 60 75 60 75 60 75 60 75 60 75 ...

## $ service : int 127 63 NA 1095 1512 3353 0 2244 44882 17176 ...
## $ incidents: int 0 0 3 4 6 18 0 11 39 29 ...
