
365 Data Science R Course Notes

R is a programming language tailored for handling data from wrangling to visualization to machine learning. It has a large collection of packages that add additional functionality. The RStudio interface provides an environment to write code, view outputs, and access help documentation. Objects are named data structures that store data of various types like integers, doubles, characters, and logicals. Functions take arguments as inputs and perform operations to create, manipulate, and analyze objects.

Uploaded by

Gopi Sahane
Copyright
© © All Rights Reserved

R PROGRAMMING

Course notes
R fundamentals | Data types | Vectors | Matrices | Data frames
Words of welcome


You are here because you want to work better with data. Becoming comfortable with using a programming language
for your data manipulations is a gigantic step forward. This way you will be able to automate, reproduce, and
communicate your manipulations faster and better.

R is (perhaps) the best language to start with: it is an open-source language that is heavily tailored for handling data:
from data wrangling through data visualization to machine learning. R’s functionality is huge and, kind of like our
universe, R’s universe is always expanding.

Packages are the main source of functionality in R: they comprise different families of functions intended for a specific
manipulation or field. The majority of R’s packages are stored on the CRAN project website https://cran.r-project.org/,
but there are CRAN-external sources, too.
The RStudio interface

Environment: where the variables, data structures, functions, etc., you create are displayed

History: stores the commands run during the current session

Script pane: stores the code you type, allowing you to keep a record of your work and organize it. Produces a .R file.

Files: displays all the files in your workspace (wd).

Plots: displays & stores the data visualizations you create

Packages: contains a list of all the packages you have on your machine/user library

Console: the place where the output of your commands appears. Code can also be typed in here for quick operations.

Help: the directory storing R’s documentation; searchable
Packages

Preloaded packages: R has decent functionality before you install any new packages. However, R’s {base} packages are often outdated, or cumbersome to use.

Installing packages: installing a package saves the package in our User Library for future use. A package needs to be installed only once. You do that with the install.packages("package.name") function.

Loading packages: if you want to use the functions in a specific package, you must tell R to load it for you. You only need to load a package once per session. You do that with the library(package.name) function.

Distinguishing between functions: some packages have functions whose names overlap. You can recognise which package a function belongs to by checking the {package.name} next to it during autocompletion.

Removing packages: you can uninstall a package with the remove.packages("package.name") function.
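The install, load, and remove steps above can be sketched as follows. This is a minimal illustration, not part of the original notes: "dplyr" is only a placeholder package name, the install and remove calls are commented out because they change your library and need an internet connection, and the `package::function` prefix is an addition that shows one way to say explicitly which package a function comes from.

```r
# install.packages("dplyr")   # once per machine; placeholder package name
library(stats)                # once per session; stats ships with R

# The package::function form names the package explicitly,
# which helps when two loaded packages share a function name
spread <- stats::sd(c(2, 4, 6))
print(spread)

# remove.packages("dplyr")    # uninstalls the placeholder package
```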
Creating an object

Objects are named data structures inside of which you store data. An object can be a single digit, a character, a Boolean value,
a sentence, a data frame, a list of data frames, a multi-dimensional structure, and so on. You use functions to create objects.

x <- y()
The object x is created from the function y.

The object name:

• Must begin with a letter (or a dot that is not followed by a digit); starting with a lowercase letter is a common convention rather than a requirement.
• Longer names can be created by separating the individual words with a dot (.), an underscore (_), or by capitalizing the
first letter of every new word (objectName).
• Once you select a notation, stick with it.
Once you store data inside an object, you can use the object name to call that data and do operations with it. An
operation will be carried out on each element of the data structure, systematically.
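As a small sketch of creating an object and then operating on it by name (the object names and temperature values here are made up for illustration):

```r
# Create an object: the result of c() is stored under a name
temperatures_c <- c(18.5, 21.0, 19.5)

# Use the name to reuse the data; the arithmetic runs on every element
temperatures_f <- temperatures_c * 9 / 5 + 32
print(temperatures_f)
```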
Data types

Integer
An integer is a whole number; any number that doesn’t need anything after the decimal point is an integer. R doesn’t usually jump to creating an integer vector when you pass numbers. Often, you need to explicitly tell it to do so. You do that by passing the letter L after each number in the integer object you’re creating.

Double
Doubles store regular numbers: they can be large, small, positive, negative, with digits after the decimal or without. Because R is heavily used for data analysis and most operations typically either involve or result in a double, this is R’s default way of saving numerical data.

Character
Character vectors store text data. A character element can be a single letter, number, or a symbol, or a longer string, like a sentence or even a paragraph. To create a character string you must pass the value you want stored as a string in quotation marks.

Logical
Logical vectors store Boolean data; these are TRUE and FALSE values. TRUE and FALSE and T and F can be used interchangeably as R recognises both. It is better, however, to use TRUE and FALSE because these are reserved words which cannot be overwritten by the user. TRUE and FALSE values must be inputted in capital letters, and without quotation marks.

typeof(object.name) returns the basic data type of the object you pass

is.integer(object.name) returns a Boolean signifying whether the object you pass is an integer or not; an
analogous command exists for the other data types, too
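A short sketch of the four basic types and the two checking functions described above; the object names are illustrative only:

```r
whole  <- 5L       # the L suffix asks for an integer
number <- 5        # without L, R stores a double
text   <- "five"   # quotation marks create a character string
flag   <- TRUE     # logical: capital letters, no quotation marks

types <- c(typeof(whole), typeof(number), typeof(text), typeof(flag))
print(types)

# is.integer() answers with a single TRUE or FALSE
print(is.integer(whole))
print(is.integer(number))
```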
Vector - a sequence of data elements that are of the same type
Coercion rules
R has ways to prevent certain mistakes from happening. For example, if you are trying to create an object, and you are passing the wrong type of
data as an argument, then R will convert the value to the correct type, so you can end up creating your object. The correct type is typically the
simplest type necessary to represent all the information.

Coercion order: LOGICAL → NUMERIC → CHARACTER

• When a numeric and a logical element are present together, the logical value is converted to a numerical one: TRUE (T) is encoded as 1 and FALSE (F) is encoded as 0.
• When a numeric and a character value are present together, the numeric value is converted to a character.
• When a logical and a character value are present together, the logical element is coerced to a character.
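The three coercion rules can be checked with c(), which has to settle on one type for the whole vector. A minimal sketch with made-up values:

```r
num_and_lgl <- c(1.5, TRUE, FALSE)  # logicals become 1 and 0
num_and_chr <- c(1.5, "a")          # the number becomes the string "1.5"
lgl_and_chr <- c(TRUE, "a")         # the logical becomes the string "TRUE"

print(typeof(num_and_lgl))
print(typeof(num_and_chr))
print(typeof(lgl_and_chr))
```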
Functions and arguments
Think of a function as any other type of object in R. It has a name, stores information (a set of statements organized in a
particular way so as to perform a specific task), and can be called on when needed.

function.name(argument1 = , argument2 = , … argumentN = NULL)
some arguments can have default values (e.g., argument = NULL)

function.name <- function(){
  # body of code
  return(value)
}
function.name <- function() creates the function; the body of code holds the statements the function executes when called; return(value) is what the function returns.

Basics

• Functions take arguments: the arguments of a function can be data, additional instructions on how to carry out the operation, or other functions (called nesting)
• To run a function, call the function and pass in parentheses the data you want the function to operate on; this is an argument
• To save the result of a function into an object for further use, use the object-creating formula (object <- function(data))
• The arguments of a function have an inherent order. When passing values, you can either explicitly specify the argument they are for (and thus not follow the inherent order), or you can omit the argument name but keep to the inherent argument order.

Intermediate

• A function is composed of three parts: the function name, the body of code (containing the statements), and arguments;
• Sometimes a fourth component can be included in the basic structure of a function, the return value; it specifies what the function returns once executed and is written last in the body
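The points above can be tried out with a small user-defined function. This is only a sketch: scale_values and its multiplier argument are hypothetical names invented for the example.

```r
# A hypothetical function with one required argument and one default value
scale_values <- function(values, multiplier = 2) {
  scaled <- values * multiplier   # body of code
  return(scaled)                  # return value, written last
}

by_position  <- scale_values(c(1, 2, 3), 10)                       # inherent order
by_name      <- scale_values(multiplier = 10, values = c(1, 2, 3)) # named, any order
with_default <- scale_values(c(1, 2, 3))                           # multiplier stays 2
nested       <- sum(scale_values(c(1, 2, 3)))                      # a call inside a call

print(by_position)
print(with_default)
print(nested)
```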
Vector operations
Vector – a sequence of data elements that are of the same type
Vector operations happen in an element-wise fashion
Vector recycling happens when the two vectors you are doing operations with are different in length; in that case, R repeats the shorter
vector until it matches the length of the longer one.

Example: operation with same-length vectors

v3 <- v1 * v2

Multiplying two vectors happens element-by-element.

v1   v2   v3
 2 ×  1 =  2
 4 ×  3 = 12
 6 ×  5 = 30
 8 ×  7 = 56

Note that the first element of v1 is multiplied by the first element of v2, the second element of v1 by the second element of v2, and so on. The resulting v3 is the same length as v1 and v2.

Example: vector recycling

v3 <- v1 * v2

R repeats the shorter vector (note that the vector is not then saved this way!)

v1   v1 (recycled)   v2   v3
 2        2        ×  1 =  2
 4        4        ×  3 = 12
          2        ×  5 = 10
          4        ×  7 = 28

The resulting v3 is the same length as the longer vector, v2. If the length of the longer vector is not a multiple of the length of the shorter one, R will issue a warning, but it will still carry out the operation.
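Both examples can be reproduced with the values from the figures. In this sketch the shorter vector is given its own name, v_short, so that both cases fit in one script:

```r
v1 <- c(2, 4, 6, 8)
v2 <- c(1, 3, 5, 7)

# Same-length vectors: element-by-element multiplication
same_length <- v1 * v2
print(same_length)

# Recycling: the two-element vector is reused as 2, 4, 2, 4
v_short  <- c(2, 4)
recycled <- v_short * v2
print(recycled)
print(v_short)   # the shorter vector itself is unchanged
```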
Vector attributes
Attributes are additional information about an object stored in the object; they do not affect the values of the object, instead they provide
us with extra functionality if a function is designed to take into account whether or not an object has a specific attribute
attributes(object.name) checks whether the object has any attributes; an output NULL means the attributes object is empty

Names
You can give each element in a vector a name value. When printed, R will display the values with their names above them.

Assign names by creating a character vector and using the names(object) <- c("name.1", "name.2" … "name.k") command.

Use the same command to change the names if needed.

Remove names by setting the names values to NULL: names(object) <- NULL

Dimensions
Changing the dimensions of a vector enables you to transform your 1-dimensional vector into an object with as many dimensions as you want.

You can create a 2-dimensional matrix from a vector by specifying row and column values in the dim() attribute.

dim(object) <- c(3, 4)

The row value comes first, followed by the columns value. Notice that R fills out the matrix column by column.
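A brief sketch of setting, reading, and removing the two attributes described above; the vector contents and names are invented for the example:

```r
scores <- c(90, 75, 60)
names(scores) <- c("ann", "bob", "cy")   # assign names
bob_score <- scores["bob"]               # a name can be used to look up a value
names(scores) <- NULL                    # remove the names again
print(attributes(scores))                # NULL: no attributes left

grid <- 1:12
dim(grid) <- c(3, 4)                     # 3 rows first, then 4 columns
print(grid)                              # filled column by column
```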
Vector slicing and recycling

• Indexing refers to selecting and extracting a single element from a structure (in this case, a vector)
• Slicing is to select and extract a sequence of elements
• R’s notation for selecting a value is object[i], where i is the index to which your value corresponds

name:   a   b   c   d   e   f   g   h   i
value: 11  13  15  17  19  21  23  25  27
i:      1   2   3   4   5   6   7   8   9

Examples: x[2]   x["b"]   x[3:5]   x[c(1,7,9)]   x[-c(1:5,7,9)]   x[x>13 & x<21]

Note that indexing with a negative integer works best if you are keeping most of the values and only want to get
rid of some. You can also subset using TRUE and FALSE values, but that isn’t very useful at this stage of learning.
For the curious ones, it looks like this:
x <- 1:5
x[c(TRUE, FALSE, TRUE, TRUE, FALSE)]
## 1 3 4
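The six selections listed above can be run against the nine-element named vector from the figure. A minimal sketch; the result names are illustrative, and unname() is used only to print the bare values:

```r
x <- seq(11, 27, by = 2)      # 11 13 15 17 19 21 23 25 27
names(x) <- letters[1:9]      # a b c d e f g h i

by_index    <- x[2]                  # second element
by_name     <- x["b"]                # the same element, selected by name
by_range    <- x[3:5]                # a slice of consecutive elements
by_set      <- x[c(1, 7, 9)]         # several chosen positions
by_negative <- x[-c(1:5, 7, 9)]      # everything except these positions
by_test     <- x[x > 13 & x < 21]    # elements that pass a logical test

print(unname(by_negative))
print(unname(by_test))
```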
Getting HELP with R

Description: gives you a brief overview of what a function does

Usage: an example of how the function is written; if the function has many arguments, here you will see the inherent argument order

Arguments: gives you the complete list of arguments a function can take, and the kind of information R needs for each. It also tells you what that information will be used for

Value: an explanation of what the function will return when you execute it; this is essentially what your output will be

Details/References: with more complex functions this is where the author of the function may decide to draw your attention to some specifics of the function you may want to know for highly specific uses

See also: a useful pointer to related functions

Examples: example code of the function in action; guaranteed to work

Use ?function or help(function) to call help on a function

Use ??keyword to call general help on something you want to do
Creating a matrix

• Matrices are a natural extension to vectors: while vectors are 1-dimensional collections of data, matrices are two-dimensional arrays
• Matrices have a fixed number of rows and columns
• Matrices can contain only one basic data type
• You can create a matrix in the following ways:
  ✓ dim(x) <- c(i, j), where i and j are the values for rows and columns, respectively
  ✓ array(data, dim = c(i, j))
  ✓ matrix(data, nrow = , ncol = , byrow = , dimnames = )
    the byrow = argument specifies whether the matrix ought to be filled out column-by-column or row-by-row; byrow = TRUE fills out the matrix row-by-row
    If nrow = is specified, R infers what ncol = should be, and vice versa
• Naming the dimensions of a matrix happens in two ways:
  ✓ With the rownames() and colnames() functions
  ✓ By defining the dimnames = argument in the matrix() function

Example: a matrix with named rows and named columns, filled with byrow = TRUE

       a    b    c    d    e    f    g
r1     2    4    6    8   10   12   14
r2    16   18   20   22   24   26   28
r3    30   32   34   36   38   40   42
r4    44   46   48   50   52   54   56
r5    58   60   62   64   66   68   70
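The example matrix and the three creation routes can be sketched like this. The object names are illustrative, and the values follow the figure (even numbers from 2 to 70):

```r
# matrix(): filled row-by-row, with names supplied through dimnames =
m <- matrix(seq(2, 70, by = 2), nrow = 5, byrow = TRUE,
            dimnames = list(paste0("r", 1:5), letters[1:7]))
print(m)

# dim() and array() build the same 2 x 3 matrix from 1:6
via_dim <- 1:6
dim(via_dim) <- c(2, 3)
via_array <- array(1:6, dim = c(2, 3))

# Only nrow = is given, so R infers ncol; names are added afterwards
small <- matrix(1:4, nrow = 2)
rownames(small) <- c("top", "bottom")
colnames(small) <- c("left", "right")
print(small)
```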
Matrix operations
Just like with vectors, matrix operations happen in element-wise fashion
Scaling is when you do an arithmetic operation on a matrix with a single value; the operation is applied to every value in the matrix,
so all values are transformed in the same way
To do arithmetic operations with matrices, they must be of the same size (rows x columns)

Example: operation with same-size matrices

m3 <- m1 + m2

Adding two matrices together happens element-by-element

     m1                 m2                 m3
 2  4  6  8         1  5  9 13         3  9 15 21
16 18 20 22    +    2  6 10 14    =   18 24 30 36
30 32 34 36         3  7 11 15        33 39 45 51
44 46 48 50         4  8 12 16        48 54 60 66

Note that to do inner and outer matrix multiplication (linear algebra), you need to specify this to R:
m1 %*% m2 creates the product of inner multiplication
m1 %o% m2 creates the product of outer multiplication
Use t() to transpose a matrix if needed
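The addition example, scaling by a single value, and the linear-algebra operators can be tried with the matrices from the figure. A minimal sketch; the result names (doubled, inner, outer, flipped) are illustrative:

```r
m1 <- matrix(c( 2,  4,  6,  8,
               16, 18, 20, 22,
               30, 32, 34, 36,
               44, 46, 48, 50), nrow = 4, byrow = TRUE)
m2 <- matrix(1:16, nrow = 4)   # filled column by column

m3      <- m1 + m2      # element-by-element addition
doubled <- m1 * 2       # a single value is applied to every cell
inner   <- m1 %*% m2    # matrix (inner) product
outer   <- m1 %o% m2    # outer product, a 4-dimensional array
flipped <- t(m2)        # transpose: rows become columns

print(m3)
print(dim(outer))
```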
Recycling with matrix operations

Creating a matrix: you can create a matrix from a vector that does not have all the values needed to fill out the dimensions you specify, because R will recycle the vector to match the desired length.

Example: creating a matrix from a shorter vector

m1 <- matrix(1:4, nrow = 3, ncol = 4)

v1: 1 2 3 4

m1:
1 4 3 2
2 1 4 3
3 2 1 4

Vector × matrix operations: you can do operations with a matrix and a vector, and if the vector has fewer values than the matrix, R will again recycle.

Example: multiplying a matrix with a vector

m2 <- m1 * v1

v1: 1 2 3 4

  m1        v1 (recycled)        m2
1 4 7          1 4 3          1 16 21
2 5 8     ×    2 1 4     =    4  5 32
3 6 9          3 2 1          9 12  9
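Both recycling examples can be reproduced as below. The names m_filled, m_base, and m_product are illustrative, chosen so the two examples do not overwrite each other. The 3 x 3 matrix is assumed to be matrix(1:9, nrow = 3), which matches the values in the figure. suppressWarnings() is used because 9 cells is not a multiple of the 4 values in v1, so R would otherwise warn.

```r
v1 <- 1:4

# 12 cells from 4 values: the vector is reused three times, column by column
m_filled <- matrix(v1, nrow = 3, ncol = 4)
print(m_filled)

# Matrix times shorter vector: v1 is recycled down the columns
m_base    <- matrix(1:9, nrow = 3)
m_product <- suppressWarnings(m_base * v1)
print(m_product)
```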
Relational and Logical Operators

Relational

Relational operators (comparison operators) are for evaluating R objects in relation to one another.

• "==" equal to;
• "!=" not equal to;
• "<" less than;
• ">" more than;
• "<=" less than or equal to;
• ">=" more than or equal to.

Can be used with all data types and larger structures like matrices and data frames.

Logical

Logical operators, also called Boolean operators, are useful when you want to combine the results of two or more comparisons.

• AND "&": if all comparison outcomes are TRUE, the result is TRUE (an exclusive operator);
• OR "|": if there is a TRUE value anywhere in the expression, R returns TRUE (an inclusive operator);
• NOT "!": flips the result of a logical test.

Keep in mind!

• A single equals sign "=" is not a relational operator; this is used to assign information to an object;
• %in% operator: tests if an object is in a group of objects;
• double AND "&&": instead of doing a logical test on all available data, it does a single logical test and outputs a single value. Older versions of R examined only the first element of each vector; since R 4.3.0, passing a vector longer than one element to "&&" is an error.

Comparing a single value to a vector: R compares vectors in an element-by-element fashion; if you pass a vector in your command, the output is a vector of logical values that has the length of the vector you’ve passed.

Comparing two vectors: R compares the first element of the vector on the right with the first element of the vector on the left, then it proceeds to the second elements, and so on.
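A compact sketch of the operators above on a made-up vector of ages; all object names are illustrative:

```r
ages <- c(15, 22, 37, 64)

adult           <- ages >= 18               # one value compared to a whole vector
working_age     <- ages >= 18 & ages < 65   # AND, element by element
minor_or_senior <- ages < 18 | ages >= 60   # OR, element by element
not_adult       <- !adult                   # NOT flips every result
has_22          <- 22 %in% ages             # membership test
both_first_two  <- (ages[1] >= 18) && (ages[2] >= 18)  # && on single values only
pairwise        <- c(1, 5, 3) == c(1, 2, 3) # two vectors, position by position

print(adult)
print(pairwise)
```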
If and Else Statements

The general syntax of an if else statement:

if(A){
  Scenario 1
} else {
  Scenario 2
}

• A: the condition to be checked
• Scenario 1: the command to be executed if the condition is met
• else: the instruction for what to do if the condition is not met
• Scenario 2: the command to be executed if the condition is not met

• If else statements build up on logical tests and allow the programmer to instruct a program to take action based on the outcome of a test (whether it is TRUE or FALSE). An if statement is an instruction to R to do something if a condition is met. An else statement is an instruction to R to do something if a condition is not met.
• An if statement begins with if, followed by the condition R should check in parentheses, and curly brackets that hold the code R must execute if the condition in the parentheses is TRUE;
• If an if statement’s condition does not evaluate to TRUE, R skips the code in the curly brackets and moves on to the rest of the program, unless an else statement is defined;
• An else statement begins with else, followed by the code R must execute in curly brackets, in case the if condition is not met;
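A minimal sketch of the if else pattern; the variable names, threshold, and values are made up:

```r
temperature <- 12

if (temperature > 20) {
  advice <- "t-shirt"   # runs only when the condition is TRUE
} else {
  advice <- "jacket"    # runs when the condition is not met
}

print(advice)
```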
Else If Statements

The general syntax of an else if statement:

if(A){
  Scenario 1
} else if(B){
  Scenario 2
} else if(C){
  Scenario 3
} else{
  Scenario 4
}

• A: the condition to be checked
• B, C: a new condition to be checked in case the previous condition isn’t met
• Scenario 1 to Scenario 3: the command to be executed if the corresponding condition is met
• Scenario 4: the command to be executed if none of the conditions is met

• Else if statements tell R to check if a different condition is met in case the previous one is not; typically used when there are more than two mutually exclusive cases or when you want to specify a special case in which R does something;
• There is no limit on the number of else if statements you can string.

Important
• All conditions must evaluate to a single logical value; it cannot be a vector of TRUEs and FALSEs. Older versions of R looked only at the first element of such a vector and issued a warning; since R 4.2.0, a condition longer than one element is an error;
• An if statement needs only a single condition to evaluate to TRUE in order to stop its search and execute the code for that condition. This is especially relevant for situations in which two or more conditions are not mutually exclusive and there could be more than one TRUE condition.
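The point about overlapping conditions can be seen in a small sketch with made-up grade boundaries: a score of 85 satisfies both the second and the third condition, but only the first TRUE branch runs.

```r
score <- 85

if (score >= 90) {
  grade <- "A"
} else if (score >= 80) {
  grade <- "B"            # first TRUE condition: the search stops here
} else if (score >= 70) {
  grade <- "C"            # also TRUE for 85, but never reached
} else {
  grade <- "D"
}

print(grade)
```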
Creating a data frame

• Data frames are list structures that can store variables of different basic types, like numeric and string values
• Data types can differ between, but not within, columns
• A row can be composed of cells with different data types
• The cells of a column (variable) can only be of a single data type
• You can create a data frame with the data.frame() function:
  ✓ my.df <- data.frame(var.a, var.b, var.c, stringsAsFactors = FALSE)
  ✓ the stringsAsFactors = argument specifies whether the character variables in your data should be coded as factors or not; it is often better to set it to FALSE (the default since R 4.0.0), and encode the variables which are factors manually with the as.factor() function.
• Naming the columns of a data frame:
  ✓ With the names() function: this will name your columns
  ✓ rownames() will name your rows if needed
  ✓ By defining the names as you create your data frame: my.df <- data.frame("A" = var.a, "B" = var.b)

Example: a data frame with named rows and columns. Each column holds one data type; each row spans multiple data types.

           Months old   Size          Weight   Breed
Flipper    53           medium        21       dog
Bromley    19           small         8        dog
Nox        34           medium        4        cat
Orion      41           large         6        cat
Dagger     84           small         7        dog
Zizi       140          extra small   2        cat
Carrie     109          large         36       dog
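The example table can be built as follows. This is a sketch: the column names are shortened to valid R names (months_old, size, weight, breed), and the object name pets is invented for illustration.

```r
pets <- data.frame(
  months_old = c(53, 19, 34, 41, 84, 140, 109),
  size       = c("medium", "small", "medium", "large",
                 "small", "extra small", "large"),
  weight     = c(21, 8, 4, 6, 7, 2, 36),
  breed      = c("dog", "dog", "cat", "cat", "dog", "cat", "dog"),
  stringsAsFactors = FALSE
)
rownames(pets) <- c("Flipper", "Bromley", "Nox", "Orion",
                    "Dagger", "Zizi", "Carrie")

# Encode a column as a factor manually where that makes sense
pets$breed <- as.factor(pets$breed)

print(pets)
print(sapply(pets, class))   # one data type per column
```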
Importing and exporting data frames
Importing data is the primary way of getting your data into R. You can import data from text files (CSV, tab-delimited), Excel,
SAS, SPSS, Stata, and so on.
read.table("file/path", sep = "", header = TRUE, …) is the general-purpose member of a family of functions for reading text data into R

Importing data

Importing a text file with Comma Separated Values (CSV) is done with the read.csv() function from the read.table() family

read.csv2() imports data saved as a CSV but where the decimals are denoted with a comma (π = 3,14) instead of a dot (π = 3.14); such files use a semicolon to separate the values

read.delim() imports data in which the values are separated by a tab; there is a read.delim2() version of the call, too

Exporting data

Exporting a data frame out of R to share it with others is done with the write.table() function, in much the same way as importing data.

write.csv() exports a data frame as a CSV text file, with a write.csv2() for European users

write.table() with a sep = "\t" argument exports a data frame as a tab-delimited text file

The row.names = argument is best set to FALSE. The default TRUE creates a redundant column of numbers in the beginning of your data frame, if your rows aren’t already named.
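A round trip through a CSV file and a tab-delimited file can be sketched as below. The data frame and object names are made up, and tempfile() is used so the example does not depend on any particular folder on your machine.

```r
pets <- data.frame(name = c("Nox", "Orion"), weight = c(4, 6))

# Export as CSV without the redundant row-number column, then import it back
csv_path <- tempfile(fileext = ".csv")
write.csv(pets, csv_path, row.names = FALSE)
pets_csv <- read.csv(csv_path)

# Export as a tab-delimited text file, then import it back
tab_path <- tempfile(fileext = ".txt")
write.table(pets, tab_path, sep = "\t", row.names = FALSE)
pets_tab <- read.delim(tab_path)

print(pets_csv)
print(pets_tab)
```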
