365 Data Science R Course Notes
Course notes
R fundamentals | Data types | Vectors | Matrices | Data frames
Words of welcome
You are here because you want to work better with data. Becoming comfortable with using a programming language
for your data manipulations is a gigantic step forward. This way you will be able to automate, reproduce, and
communicate your manipulations faster and better.
R is (perhaps) the best language to start with. It is open source and heavily tailored for handling data, from data
wrangling through data visualization to machine learning. R’s functionality is huge and, kind of like our
universe, R’s universe is always expanding.
Packages are the main source of functionality in R: they comprise different families of functions intended for a specific
manipulation or field. The majority of R’s packages are stored on the CRAN project website https://cran.r-project.org/,
but there are CRAN-external sources, too.
The RStudio interface
Objects are named data structures inside of which you store data. An object can be a single digit, a character, a Boolean value,
a sentence, a data frame, a list of data frames, a multi-dimensional structure, and so on. You use functions to create objects.
x <- y()
The object x is created from the function y.
Integers
An integer is a whole number; any number that doesn’t need anything after the decimal point is an integer. R doesn’t usually jump to creating an integer vector when you pass numbers. Often, you need to explicitly tell it to do so. You do that by passing the letter L after each number in the integer object you’re creating.

Doubles
Doubles store regular numbers: they can be large, small, positive, negative, with digits after the decimal or without. Because R is heavily used for data analysis and most operations typically either involve or result in a double, this is R’s default way of saving numerical data.

Character vectors
Character vectors store text data. A character element can be a single letter, number, or a symbol, or a longer string, like a sentence or even a paragraph. To create a character string you must pass the value you want stored as a string in quotation marks.

Logical vectors
Logical vectors store Boolean data; these are TRUE and FALSE values. TRUE and FALSE and T and F can be used interchangeably as R recognises both. It is better, however, to use TRUE and FALSE because these are reserved words which cannot be overwritten by the user. TRUE and FALSE values must be inputted in capital letters, and without quotation marks.
typeof(object.name) returns the basic data type of the object you pass
is.integer(object.name) returns a Boolean signifying whether the object you pass is an integer or not; an
analogous command exists for the other data types, too
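A minimal sketch of the four basic types and the two inspection functions described above. The object names (whole, decimal, text, flag) are hypothetical and chosen only for illustration.

```r
# Hypothetical object names, used for illustration only
whole   <- 5L      # the L suffix asks R to store an integer
decimal <- 5       # without L, R stores a double
text    <- "five"  # quotation marks create a character string
flag    <- TRUE    # logical value: capital letters, no quotation marks

# typeof() returns the basic data type of each object
types <- c(typeof(whole), typeof(decimal), typeof(text), typeof(flag))
print(types)  # "integer" "double" "character" "logical"

# The is.*() functions return a single TRUE or FALSE
print(is.integer(whole))    # TRUE
print(is.integer(decimal))  # FALSE
print(is.character(text))   # TRUE
print(is.logical(flag))     # TRUE
```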
Vector - a sequence of data elements that are of the same type
Coercion rules
R has ways to prevent certain mistakes from happening. For example, if you are trying to create an object, and you are passing the wrong type of
data as an argument, then R will convert the value to the correct type, so you can end up creating your object. The correct type is typically the
simplest type necessary to represent all the information.
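As a small illustration of coercion, the sketch below mixes types inside c(). R converts along the order logical, integer, double, character, so each vector ends up with the simplest type that can hold every value. The object names are hypothetical.

```r
# Hypothetical object names, used for illustration only
mixed.num <- c(1L, 2.5)       # integer + double
mixed.lgl <- c(TRUE, 2)       # logical + double
mixed.chr <- c(1, "a", TRUE)  # anything + character

print(typeof(mixed.num))  # "double"
print(typeof(mixed.lgl))  # "double"
print(typeof(mixed.chr))  # "character"

print(mixed.lgl)  # 1 2 (TRUE became 1)
print(mixed.chr)  # "1" "a" "TRUE"
```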
Basics
v3 <- v1 * v2
Multiplying two vectors happens element by element. If the vectors differ in length, R repeats (recycles) the shorter vector.
Note that the shorter vector is not saved in its repeated form.
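A short sketch of element-by-element multiplication and recycling. The values and the names short and v4 are assumptions made up for the example.

```r
v1 <- c(1, 2, 3, 4)
v2 <- c(10, 20, 30, 40)

# Same length: element 1 times element 1, element 2 times element 2, ...
v3 <- v1 * v2
print(v3)  # 10 40 90 160

# Different lengths: the shorter vector is reused as 1 2 1 2
short <- c(1, 2)   # hypothetical shorter vector
v4 <- v1 * short
print(v4)  # 1 4 3 8

# The shorter vector itself is unchanged
print(length(short))  # 2
```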
Names
You can give each element in a vector a name value. When printed, R will display the values with their names above them.
Use the same command to change the names if needed.
Remove names by setting the names values to NULL:
names(object) <- NULL

Dimensions
Changing the dimensions of a vector enables you to transform your 1-dimensional vector into an object with as many dimensions as you want.
Notice that R fills out the matrix column by column.
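The sketch below walks through naming, renaming, and un-naming a vector with names(), and then reshapes a vector with dim(). The vector contents and names are hypothetical.

```r
# Names: assign, change, and remove (hypothetical values)
ages <- c(53, 19, 34)
names(ages) <- c("Flipper", "Bromley", "Nox")
print(ages)                       # values printed with names above them
names(ages) <- c("F", "B", "N")   # the same command changes the names
names(ages) <- NULL               # remove the names again

# Dimensions: turn a 1-dimensional vector into a 2 x 3 matrix
v <- 1:6
dim(v) <- c(2, 3)
print(v)
# R fills column by column, so the first row is 1 3 5
first.row <- v[1, ]
print(first.row)
```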
Vector slicing and recycling
• Indexing refers to selecting and extracting a single element from a structure (in this case, a vector)
• Slicing is to select and extract a sequence of elements
• R’s notation for selecting a value is object[i], where i is the index to which your value corresponds
name:   a  b  c  d  e  f  g  h  i
value: 11 13 15 17 19 21 23 25 27
i:      1  2  3  4  5  6  7  8  9
Note that indexing with a negative integer works best if you are keeping most of the values and only want to get
rid of some. You can also subset using TRUE and FALSE values, but that isn’t very useful at this stage of learning.
For the curious ones, it looks like this:
x <- 1:5
x[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
## 1 3 4
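A brief sketch of indexing, slicing, and negative indexing, using a named vector with the values 11 to 27 and the names a to i. The result names (third, first.three, no.ends, by.name) are hypothetical.

```r
# A named vector: values 11, 13, ..., 27 with names a to i
x <- seq(11, 27, by = 2)
names(x) <- letters[1:9]

third       <- x[3]         # indexing: a single element (15, named "c")
first.three <- x[1:3]       # slicing: a sequence of elements
no.ends     <- x[-c(1, 9)]  # negative index: keep everything except these
by.name     <- x["d"]       # elements can also be selected by name

print(third)
print(first.three)
print(no.ends)
print(by.name)
```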
Getting HELP with R
m3 <- m1 + m2

Example: multiplying a matrix with a vector
m2 <- m1 * v1

v1: 1 2 3 4

m1:
1 4 7
2 5 8
3 6 9

v1, repeated and filled column by column to match m1:
1 4 3
2 1 4
3 2 1

m2:
1 16 21
4  5 32
9 12  9
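The matrix example can be reproduced with the sketch below. Because the length of m1 (9) is not a multiple of the length of v1 (4), R still recycles v1 but emits a warning, which is silenced here with suppressWarnings() to keep the output clean.

```r
m1 <- matrix(1:9, nrow = 3)  # filled column by column: 1 2 3 | 4 5 6 | 7 8 9
v1 <- 1:4

# v1 is recycled down the columns of m1, then multiplied element by element
m2 <- suppressWarnings(m1 * v1)
print(m2)
#      [,1] [,2] [,3]
# [1,]    1   16   21
# [2,]    4    5   32
# [3,]    9   12    9

# Adding two matrices of the same size is also element by element
m3 <- m1 + m2
print(m3)
```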
Relational and Logical Operators
Relational operators (comparison operators) are for evaluating R objects in relation to one another.
• "==" equal to;
• "!=" not equal to;
• "<" less than;
• ">" more than;
• "<=" less than or equal to;
• ">=" more than or equal to.
Can be used with all data types and larger structures like matrices and data frames.

Logical operators, also called Boolean operators, are useful when you want to combine the results of two or more comparisons.
• AND "&" – if all comparison outcomes are TRUE, the result is TRUE (an exclusive operator);
• OR "|" – if there is a TRUE value anywhere in the expression, R returns TRUE (an inclusive operator);
• NOT "!" – flips the result of a logical test.

Also note:
• A single equals sign "=" is not a relational operator; this is used to assign information to an object;
• %in% operator – tests if an object is in a group of objects;
• double AND "&&" – does a single logical test and outputs a single value, instead of doing a logical test on all available data. In older versions of R it examined only the first element of each vector; since R 4.3.0 it raises an error if either side has more than one element.

Comparing two vectors: R compares the first element of the vector on the right with the first element of the vector on the
left, then it proceeds to the second elements, and so on.
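A compact sketch of the operators listed above, applied to two made-up vectors a and b. All object names are hypothetical.

```r
a <- c(1, 5, 10)
b <- c(1, 7, 3)

# Relational operators compare element by element
equal   <- a == b  # TRUE FALSE FALSE
greater <- a > b   # FALSE FALSE TRUE
at.most <- a <= b  # TRUE TRUE FALSE

# Logical operators combine comparison results
both    <- (a > 0) & (b > 2)  # FALSE TRUE TRUE
either  <- (a > 6) | (b > 6)  # FALSE TRUE TRUE
negated <- !(a == b)          # FALSE TRUE TRUE

# %in% tests membership; && works on single values only
found  <- 5 %in% a                      # TRUE
single <- (a[1] == 1) && (b[1] == 1)    # TRUE

print(equal); print(both); print(either); print(found); print(single)
```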
If and Else Statements
The general syntax of an else if statement:

if(A){            # A: condition to be checked
  Scenario 1      # command to be executed if the condition is met
} else if(B){     # B: new condition to be checked in case the previous condition isn’t met
  Scenario 2      # command to be executed if the condition is met
} else if(C){     # C: new condition to be checked in case the previous condition isn’t met
  Scenario 3      # command to be executed if the condition is met
} else{
  Scenario 4      # command to be executed if none of the conditions is met
}

• Else if statements tell R to check if a different condition is met in case the previous one is not; typically used when there are more than two mutually exclusive cases or when you want to specify a special case in which R does something;
• There is no limit on the number of else if statements you can string together.

Important
• All conditions must evaluate to a single logical value; it cannot be a vector of TRUEs and FALSEs. In older versions of R, a vector condition made R look only at the first element and execute accordingly; since R 4.2.0 it is an error;
• An if statement needs only a single condition to evaluate to TRUE in order to stop its search and execute the code for that condition. This is especially relevant for situations in which two or more conditions are not mutually exclusive and there could be more than one TRUE condition.
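A minimal sketch of an else if chain. The function classify() and its weight thresholds are hypothetical, invented only to show that R stops at the first condition that is TRUE even when later conditions would also be TRUE.

```r
# Hypothetical helper: label an animal by weight (thresholds are made up)
classify <- function(weight) {
  if (weight > 30) {
    "large"
  } else if (weight > 10) {   # checked only if the first condition is FALSE
    "medium"
  } else if (weight > 5) {
    "small"
  } else {                    # runs when no condition above is TRUE
    "extra small"
  }
}

# 36 is also greater than 10 and 5, but only the first TRUE branch runs
print(classify(36))  # "large"
print(classify(21))  # "medium"
print(classify(7))   # "small"
print(classify(2))   # "extra small"
```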
Creating a data frame
• Data frames are list structures that can store variables of different basic types, like numeric and string values
• Data types can differ between, but not within, columns
  • A row can be comprised of cells with different data types
  • The cells of a column (variable) can only be of a single data type
• You can create a data frame with the data.frame() function:
  ✓ my.df <- data.frame(var.a, var.b, var.c, stringsAsFactors = FALSE)
  ✓ the stringsAsFactors = argument specifies whether the character variables in your data should be coded as factors or not; it is often better to set it to FALSE (the default since R 4.0.0), and encode the variables which are factors manually with the as.factor() function.
• Naming the columns of a data frame:
  ✓ With the names() function: this will name your columns
  ✓ rownames() will name your rows if needed
  ✓ By defining the names as you create your data frame
  ✓ my.df <- data.frame("A" = var.a, "B" = var.b)

Example data frame with named rows and columns. Each column holds one data type; each row holds multiple data types.

          Months old   Size          Weight   Breed
Flipper   53           medium        21       dog
Bromley   19           small         8        dog
Nox       34           medium        4        cat
Orion     41           large         6        cat
Dagger    84           small         7        dog
Zizi      140          extra small   2        cat
Carrie    109          large         36       dog
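The steps for creating and naming a data frame can be sketched as follows, using the first three animals from the example table. The vector names (months.old, size, weight, breed) are hypothetical.

```r
# Hypothetical input vectors, one per variable
months.old <- c(53, 19, 34)
size       <- c("medium", "small", "medium")
weight     <- c(21, 8, 4)
breed      <- c("dog", "dog", "cat")

# Keep character columns as plain text rather than factors
my.df <- data.frame(months.old, size, weight, breed,
                    stringsAsFactors = FALSE)

# Name the columns and the rows
names(my.df)    <- c("Months.old", "Size", "Weight", "Breed")
rownames(my.df) <- c("Flipper", "Bromley", "Nox")

# Encode one variable as a factor manually
my.df$Breed <- as.factor(my.df$Breed)

str(my.df)  # shows one data type per column
```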
Importing and exporting data frames
Importing is the primary way of getting your data into R. You can import data from text files (CSV, tab-delimited), Excel,
SAS, SPSS, Stata, and so on.
read.table("file/path", sep = "", header = TRUE, …) belongs to the general-purpose family of functions for reading text data into R
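A hedged round-trip sketch: it exports a small data frame with write.csv() and reads it back with read.table() and with read.csv(), which is a member of the same family with CSV-friendly defaults. The data frame pets and the temporary file path are assumptions made for the example.

```r
# Hypothetical data to export
pets <- data.frame(Name = c("Flipper", "Nox"),
                   Weight = c(21, 4),
                   stringsAsFactors = FALSE)

# Export to a temporary CSV file (no row names column)
path <- tempfile(fileext = ".csv")
write.csv(pets, path, row.names = FALSE)

# Import with the general-purpose reader: comma separator, first line is a header
imported <- read.table(path, sep = ",", header = TRUE,
                       stringsAsFactors = FALSE)

# read.csv() presets sep = "," and header = TRUE
same <- read.csv(path, stringsAsFactors = FALSE)

print(imported)
```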