Course 7) Data Analysis Through R Programming
1.1) The exciting world of programming
Computer Programming :
* Giving instructions to a computer to perform an action or set of actions
Programming languages :
* The words and symbols we use to write instructions for computers to follow
* Programming languages have their own set of rules for how these words and symbols
should be used, called syntax
* Syntax shows you how to arrange the words and symbols you enter so they make
sense to a computer.
* Coding is writing instructions to the computer in the syntax of a specific programming
language
Benefits of using Programming Language in Data analysis :
1. Clarity
* Programming languages have specific rules and guidelines for giving instructions to the
computer.
* When you're telling a computer what to do, your instructions have to be very clear. There
can't be any inconsistency in the way you write code. If there is, the code won't work.
* Translating your thoughts into code forces you to figure out exactly how to write each
step of your analysis and how all the steps fit together.
2. Saves time
* Using a programming language for data analysis also saves you lots of time.
* For example, take the process of cleaning and transforming your data. With one line of
code, you can create a separate dataset without any missing values. With another line,
you can apply multiple filters on your data.
* This lets you spend less time preparing your data and more time on the analysis itself.
3. Reproduce and share your work
* Finally, programming languages make it easy to reproduce your analysis.
* Data analysis is most useful when you can reproduce your work and share it with other
people.
* They can double-check it and help you solve problems. Code automatically stores all of
the steps of your analysis so you can reproduce and share your work at any time in the
future: weeks, months, or even years later.
Here's an example. Let's say you're working on a project. You've collected and cleaned your
data and started your analysis, but the results don't add up. You suspect a mistake was
made in the process. You'd like to discuss the issue with a teammate and get their
feedback. If you used a spreadsheet, you both might have to redo the entire analysis to
discover the error. There's no easy way to record and reproduce your steps in a spreadsheet,
but if you use a programming language, all your work can be reproduced and shared in a
moment, from loading the data, to creating visualizations, to reporting the results. Plus, you
can easily update your analysis and fix any errors simply by changing the code.
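The time-saving and reproducibility points above can be sketched in a few lines of R. This is only an illustration, not part of the course material: it assumes the built-in airquality dataset as stand-in data, and the thresholds and the variable names clean and hot_windy_may are made up for the example.

```r
# airquality ships with base R; some Ozone and Solar.R values are missing (NA)
data(airquality)

# One line: a separate dataset with every incomplete row dropped
clean <- na.omit(airquality)

# Another line: several filters applied at once (thresholds are illustrative)
hot_windy_may <- subset(clean, Month == 5 & Temp > 70 & Wind > 10)

nrow(airquality)  # number of rows before cleaning
nrow(clean)       # number of rows after cleaning
```

Because every step is stored in the script, a teammate can rerun it from the top and get the same result, and fixing a mistake means changing one line and running it again.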
1.2) Programming as a data analyst
Spreadsheets, SQL, and R: a comparison :
As a data analyst, there is a good chance you will work with SQL, R, and spreadsheets at
some point in your career. Each tool has its own strengths and weaknesses, but they all
make the data analysis process smoother and more efficient. There are two main things
that all three have in common:
* They all use filters: for example, you can easily filter a dataset using any of these tools.
In R, you can use the filter() function (from the dplyr package). This performs the same task
as a basic SELECT-FROM-WHERE SQL query. In a spreadsheet, you can create a filter using
the menu options.
* They all use functions: In spreadsheets, you use functions in formulas, and in SQL, you
include them in queries. In R, you will use functions in the code that is part of your
analysis.
The table below presents key questions to explore a few more ways that these tools
compare to each other. You can use this as a general guide as you begin to navigate R.

Key question: What is it?
* Spreadsheets: A program that uses rows and columns to organize data and allows for
analysis and manipulation through formulas, functions, and built-in features
* SQL: A database programming language used to communicate with databases to conduct
an analysis of data
* R: A general purpose programming language used for statistical analysis, visualization,
and other data analysis

Key question: What is a primary advantage?
* Spreadsheets: Includes a variety of visualization tools and features
* SQL: Allows users to manipulate and reorganize data as needed to aid analysis
* R: Provides an accessible language to organize, modify, and clean data frames, and
create insightful data visualizations

Key question: Which datasets does it work best with?
* Spreadsheets: Smaller datasets
* SQL: Larger datasets
* R: Larger datasets

Key question: What is the source of the data?
* Spreadsheets: Entered manually or imported from an external source
* SQL: Accessed from an external database
* R: Loaded with R when installed, imported from your computer, or loaded from
external sources

Key question: Where is the data from my analysis usually stored?
* Spreadsheets: In a spreadsheet file on your computer
* SQL: Inside tables in the accessed database
* R: In an R file on your computer

Key question: Do I use formulas and functions?
* Spreadsheets: Yes
* SQL: Yes
* R: Yes

Key question: Can I create visualizations?
* Spreadsheets: Yes
* SQL: Yes, by using an additional tool like a database management system (DBMS) or a
business intelligence (BI) tool
* R: Yes

History of R Programming :
* R is a programming language frequently used for statistical analysis, visualization, and
other data analysis.
* R is based on another programming language named S.
* In the 1970s, John Chambers created S for internal use at Bell Labs, a famous scientific
research facility.
* In the 1990s, Ross Ihaka and Robert Gentleman developed R at the University of
Auckland, New Zealand. The name R refers to the first names of its two authors and
plays on the single-letter name of its predecessor, S.
* Since then, R has become a preferred programming language of scientists,
statisticians and data analysts around the world.
Benefits of R:
1. Accessible :
* First, R is an accessible language for beginners. Lots of people without a traditional
programming background learn R.
* R really appeals to anyone who wants to solve problems that involve data. And that's one
of the things that's so great about R.
2. Data-centric :
* R is what's known as a data-centric programming language. It's specifically designed to
make data analysis easier, more efficient and more powerful.
3. Open source :
* Another awesome thing about R is that it's open source. Open source means that the
code is freely available and may be modified and shared by the people who use it.
* First, anyone can use R for free. Second, anyone can modify the code, fix bugs and improve it.
In fact, over the years, lots of excellent programmers have made improvements and fixes
to the R code.
4. Community :
* As an R user, you enjoy the benefit of this shared knowledge.
* And let me just add, the R community is the best. This vibrant, diverse and accessible
community is so supportive of new learners.
* You can go online anytime to find answers to all your R questions. Check out websites
like R for Data Science Online Learning Community and RStudio Community.
* RStudio is an IDE, or integrated development environment.
* An integrated development environment (IDE) is a software application that brings
together all the tools you may want to use in a single place.
To make a scatter plot:
1. Install the "tidyverse" package (and "palmerpenguins", which provides the penguins dataset)
2. Load the libraries > library(ggplot2) and library(palmerpenguins)
3. Write the code for the scatter plot
> ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(aes(colour = species))
2.1) Understanding basic programming concepts
The Basic Concepts of R :
* Functions
* Comments
* Variables
* Datatypes
* Vectors
* Pipes
Function : A body of reusable code used to perform specific tasks in R
Argument : Information that a function in R needs in order to run
Variable : A representation of a value in R that can be stored for use later during
programming
Vector : A group of data elements of the same type, stored in a sequence in R
Pipe : A tool in R for expressing a sequence of multiple operations, represented with "%>%"
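To make these definitions concrete, here is a minimal sketch in base R. The values and the variable names flipper_lengths and average_length are made up for illustration. The pipe is left out because %>% comes from an add-on package rather than base R.

```r
# Comment: R ignores everything after the # symbol on a line

# Variable and vector: store three values of the same type for later use
flipper_lengths <- c(181, 186, 195)

# Function and argument: mean() is the function, flipper_lengths is its argument
average_length <- mean(flipper_lengths)

# A function can take several arguments; digits controls the rounding
round(average_length, digits = 1)
```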
Vector Types (Atomic)
Type        Description                          Example
Logical     True/False                           TRUE
Integer     Positive and negative whole values   3
Double      Decimal values                       101.175
Character   String/character values              "Coding"
[Figure: 4-level hierarchy of vectors]
2.2) Explore coding in R
Operator : A symbol that names the type of operation or calculation to be performed in a
formula.
Assignment operators : Used to assign values to variables and vectors
Arithmetic operators : Used to complete math calculations
* + (addition)
* - (subtraction)
* * (multiplication)
* / (division)
Vectors and lists in R
In programming, a data structure is a format for organizing and storing data. Data structures are
important to understand because you will work with them frequently when you use R for data
analysis. The most common data structures in the R programming language include:
+ Vectors
+ Data frames
+ Matrices
+ Arrays
Think of a data structure like a house that contains your data.
This reading will focus on vectors. Later on, you'll learn more about data frames, matrices, and
arrays.
There are two types of vectors: atomic vectors and lists. Coming up, you'll learn about the
basic properties of atomic vectors and lists, and how to use R code to create them.
Atomic vectors
First, we will go through the different types of atomic vectors. Then, you will learn how to use R
code to create, identify, and name the vectors.
Earlier, you learned that a vector is a group of data elements of the same type, stored in a
sequence in R. You cannot have a vector that contains both logicals and numerics.
There are six primary types of atomic vectors: logical, integer, double, character (which contains
strings), complex, and raw. The last two, complex and raw, aren't as common in data analysis,
so we will focus on the first four. Together, integer and double vectors are known as numeric
vectors because they both contain numbers. This table summarizes the four primary types:
Type Description Example
Logical True/False TRUE
Integer Positive and negative whole values 3
Double Decimal values 101.175
Character String/character values “Coding”
This diagram illustrates the hierarchy of relationships among these four main types of vectors:
[Figure: vector hierarchy. Vector > Atomic > Logical, Numeric (Integer, Double), Character]
Creating vectors
One way to create a vector is by using the c() function (called the "combine" function). The c()
function in R combines multiple values into a vector. In R, this function is just the letter "c"
followed by the values you want in your vector inside the parentheses, separated by commas:
c(x, y, z, ...).
For example, you can use the c() function to store numeric data in a vector.
c(2.5, 48.5, 101.5)
To create a vector of integers using the c() function, you must place the letter "L" directly after
each number.
c(1L, 5L, 15L)
You can also create a vector containing characters or logicals.
c("Sara", "Lisa", "Anna")
c(TRUE, FALSE, TRUE)
Determining the properties of vectors
Every vector you create will have two key properties: type and length.
You can determine what type of vector you are working with by using the typeof() function.
Place the code for the vector inside the parentheses of the function. When you run the function,
R will tell you the type. For example:
typeof(c("a", "b"))
#> [1] "character"
Notice that the output of the typeof() function in this example is "character". Similarly, if you
use the typeof() function on a vector with integer values, then the output will include "integer"
instead.
typeof(c(1L, 3L))
#> [1] "integer"
You can determine the length of an existing vector, meaning the number of elements it contains,
by using the length() function. In this example, we use an assignment operator to assign the
vector to the variable x. Then, we apply the length() function to the variable. When we run the
function, R tells us the length is 3.
x <- c(33.5, 57.75, 120.05)
length(x)
#> [1] 3
You can also check if a vector is a specific type by using an is function: is.logical(), is.double(),
is.integer(), is.character(). In this example, R returns a value of TRUE because the vector
contains integers.
x <- c(2L, 5L, 11L)
is.integer(x)
#> [1] TRUE
In this example, R returns a value of FALSE because the vector does not contain characters;
rather, it contains logicals.
y <- c(TRUE, TRUE, FALSE)
is.character(y)
#> [1] FALSE
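The vector properties described above can be checked together in one short script. This is a sketch using the example values from this reading; the variable names are chosen for illustration. The last lines show what the same-type rule means in practice: when c() is given mixed types, R converts every element to a single common type.

```r
numbers <- c(2.5, 48.5, 101.5)
whole   <- c(1L, 5L, 15L)
people  <- c("Sara", "Lisa", "Anna")
flags   <- c(TRUE, FALSE, TRUE)

typeof(numbers)       # "double"
typeof(whole)         # "integer"
length(people)        # 3
is.logical(flags)     # TRUE
is.character(whole)   # FALSE

# Mixing types does not produce a mixed vector:
# c() coerces all elements to one type, here character
mixed <- c(1L, 2.5, "a")
typeof(mixed)         # "character"
```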
Naming vectors
All types of vectors can be named. Names are useful for writing readable code and describing
objects in R. You can name the elements of a vector with the names() function. As an example,
let's assign the variable x to a new vector with three elements:
x <- c(1, 3, 5)
You can use the names() function to assign a different name to each element of the vector.
names(x) <- c("a", "b", "c")
Now, when you run the code, R shows that the first element of the vector is named a, the second
b, and the third c.
x
#> a b c
#> 1 3 5
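One practical use of names, sketched here as an illustration, is looking up elements by name instead of by position. The lookups below go slightly beyond what the reading covers, but they use only base R.

```r
x <- c(1, 3, 5)
names(x) <- c("a", "b", "c")

x["b"]      # select one element by its name (keeps the name)
x[["b"]]    # select the value only
names(x)    # read the names back
unname(x)   # the same values with the names removed
```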
Remember that an atomic vector can only contain elements of the same type. If you want to
store elements of different types in the same data structure, you can use a list.
Creating lists
Lists are different from atomic vectors because their elements can be of any type—like dates,
data frames, vectors, matrices, and more. Lists can even contain other lists.
You can create a list with the list() function. Similar to the c() function, the list() function is just
list followed by the values you want in your list inside parentheses: list(x, y, z, ...). In this
example, we create a list that contains four different kinds of elements: character ("a"), integer
(1L), double (1.5), and logical (TRUE).
list("a", 1L, 1.5, TRUE)
Like we already mentioned, lists can contain other lists. If you want, you can even store a list
inside a list inside a list, and so on.
list(list(list(1, 3, 5)))
Determining the structure of lists
If you want to find out what types of elements a list contains, you can use the str() function. To
do so, place the code for the list inside the parentheses of the function. When you run the
function, R will display the data structure of the list by describing its elements and their types.
Let's apply the str() function to our first example of a list.
str(list("a", 1L, 1.5, TRUE))
We run the function, then R tells us that the list contains four elements, and that the elements
consist of four different types: character (chr), integer (int), number (num), and logical (logi).
#> List of 4
#>  $ : chr "a"
#>  $ : int 1
#>  $ : num 1.5
#>  $ : logi TRUE
Let's use the str() function to discover the structure of our second example. First, let's assign the
list to the variable z to make it easier to input in the str() function.
z <- list(list(list(1, 3, 5)))
Let's run the function.
str(z)
#> List of 1
#>  $ :List of 1
#>   ..$ :List of 3
#>   .. ..$ : num 1
#>   .. ..$ : num 3
#>   .. ..$ : num 5
The indentation of the $ symbols reflects the nested structure of this list. Here, there are three
levels (so there is a list within a list within a list).
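As a small sketch of how the nesting works in practice, the code below builds both example lists and reaches into them with double square brackets. The [[ ]] indexing and the sapply() call are not covered in this reading; they are included only to show where each element lives.

```r
mixed <- list("a", 1L, 1.5, TRUE)
length(mixed)            # 4 elements
sapply(mixed, typeof)    # the type of each element, in order

z <- list(list(list(1, 3, 5)))
length(z)                # 1: the outer list holds a single list
z[[1]][[1]][[2]]         # step down two levels, then take the second value (3)
str(z)                   # prints the nested structure
```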
Naming lists
Lists, like vectors, can be named. You can name the elements of a list when you first create it
with the list() function:
list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)
$Chicago
[1] 1
$`New York`
[1] 2
$`Los Angeles`
[1] 3
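Named elements can then be retrieved by name. The sketch below assumes the city values 1, 2, and 3 and the variable name cities purely for illustration. Names that contain a space need backticks with $ or quotes with [[ ]].

```r
cities <- list("Chicago" = 1, "New York" = 2, "Los Angeles" = 3)

cities$Chicago           # access by name with $
cities[["New York"]]     # names with spaces work inside [[ ]]
cities$`Los Angeles`     # or with backticks after $
names(cities)            # all element names
```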
Additional resource
To learn more about vectors and lists, check out R for Data Science, Chapter 20: Vectors. R for
Data Science is a classic resource for learning how to use R for data science and data analysis.
It covers everything from cleaning to visualizing to communicating your data. If you want to get
more details about the topic of vectors and lists, this chapter is a great place to start.
Dates and times in R
In this reading, you will learn how to work with dates and times in R using the lubridate package.
Coming up, you will use tools in the lubridate package to convert different types of data in R into
date and date-time formats.
Loading tidyverse and lubridate packages
Before you get started working with dates and times, you should load both tidyverse and
lubridate. Lubridate is part of tidyverse.
First, open RStudio.
If you haven't already installed tidyverse, you can use the install.packages() function to do so:
* install.packages("tidyverse")
Next, load the tidyverse and lubridate packages using the library() function. First, load the core
tidyverse to make it available in your current R session:
* library(tidyverse)
Then, load the lubridate package:
* library(lubridate)
Now you're ready to be introduced to the tools in the lubridate package.
Working with dates and times
This section covers the data types for dates and times in R and how to convert strings to
date-time formats.
Types
In R, there are three types of data that refer to an instant in time:
* A date ("2016-08-16")
* A time within a day ("[Link] UTC")
* A date-time. This is a date plus a time ("2018-03-31 [Link] UTC")
The time is given in UTC, which stands for Coordinated Universal Time. This is the primary
standard by which the world regulates clocks and time.
For example, to get the current date you can run the today() function. The date appears as year,
month, and day.
today()
#> [1] "2021-01-20"
To get the current date-time you can run the now() function. Note that the time appears to the
nearest second.
now()
#> [1] "2021-01-20 [Link] UTC"
When working with R, there are three ways you are likely to create date-time formats:
* From a string
+ From an individual date
+ From an existing date/time object
R creates dates in the standard yyyy-mm-dd format by default.
Let's go over each.
Converting from strings
Date/time data often comes as strings. You can convert strings into dates and date-times using
the tools provided by lubridate. These tools automatically work out the date/time format. First,
identify the order in which the year, month, and day appear in your dates. Then, arrange the
letters y, m, and d in the same order. That gives you the name of the lubridate function that will
parse your date. For example, for the date 2021-01-20, you use the order ymd:
ymd("2021-01-20")
When you run the function, R returns the date in yyyy-mm-dd format.
#> [1] "2021-01-20"
It works the same way for any order. For example, month, day, and year. R still returns the date
in yyyy-mm-dd format.
mdy ("January 20th, 2021")
#> [1] "2021-01-20"
Or, day, month, and year. R still returns the date in yyyy-mm-dd format.
dmy ("20-Jan-2021")
#> [1] "2021-01-20"
These functions also take unquoted numbers and convert them into the yyyy-mm-dd format.
ymd (20210120)
#> [1] "2021-01-20"
Creating date-time components
The ymd() function and its variations create dates. To create a date-time from a date, add an
underscore and one or more of the letters h, m, and s (hours, minutes, seconds) to the name of
the function:
ymd_hms("2021-01-20 [Link]")
#> [1] "2021-01-20 [Link] UTC"
mdy_hm("01/20/2021 08:01")
#> [1] "2021-01-20 08:01:00 UTC"
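For comparison, here is a rough base R sketch of what the lubridate parsers are doing. It deliberately uses as.Date() and as.POSIXct() instead of lubridate so that it runs without installing any package. Unlike ymd() and its variations, these base functions need the format spelled out by hand, which is the work lubridate saves you. The variable names are illustrative.

```r
# Year-month-day is the default format, similar to ymd("2021-01-20")
d1 <- as.Date("2021-01-20")

# Other orders need an explicit format string, similar to mdy() and dmy()
d2 <- as.Date("01/20/2021", format = "%m/%d/%Y")
d3 <- as.Date("20-01-2021", format = "%d-%m-%Y")

# A date-time in UTC, similar to mdy_hm("01/20/2021 08:01")
dt <- as.POSIXct("01/20/2021 08:01", format = "%m/%d/%Y %H:%M", tz = "UTC")

format(d2)                          # dates print as yyyy-mm-dd
format(dt, "%Y-%m-%d %H:%M:%S %Z")  # date-time with seconds and time zone
```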
Optional: Switching between existing date-time objects
Finally, you might want to switch between a date-time and a date.
You can use the function as_date() to convert a date-time to a date. For example, put the current
date-time, now(), in the parentheses of the function.
as_date(now())
#> [1] "2021-01-20"
Additional resources
To learn more about working with dates and times in R, check out the following resources:
+ lubridate tidyverse: This is the “lubridate” entry from the official tidyverse documentation,
which offers a comprehensive reference guide to the various tidyverse packages. Check
out this link for an overview of key concepts and functions.
+ Dates and times with lubridate: Cheat Sheet: This “cheat sheet” gives you a detailed map
of all the different things you can do with the lubridate package. You don't need to know all
of this information, but the cheat sheet is a useful reference for any questions you might
have about working with dates and times in R.
Other common data structures
In this reading, you will continue on the topic of data structures with an introduction to data
frames and matrices. You will learn about the basic properties of each structure, and simple
ways to make use of them using R code. You will also briefly explore files, which are often used
to access and store data and related information.
Data structures
Recall that a data structure is like a house that contains your data.
Data frames
Data frames are the most common way of storing and analyzing data in R, so it's important to
understand what they are and how to create them. A data frame is a collection of columns,
similar to a spreadsheet or SQL table. Each column has a name at the top that represents a
variable, and includes one observation per row. Data frames help summarize data and organize
it into a format that is easy to read and use.
For example, the data frame below shows the "diamonds" dataset, which comes preloaded with
the ggplot2 package (part of tidyverse). Each column contains a single variable that is related
to diamonds: carat, cut, color, clarity, depth, and so on. Each row represents a single observation.
[Screenshot: the diamonds data frame in the RStudio data viewer, with columns carat, cut, color,
clarity, depth, table, price, x, y, and z. Showing 1 to 22 of 53,940 entries, 10 total columns.]
There are a few key things to keep in mind when you are working with data frames:
+ First, columns should be named.
+ Second, data frames can include many different types of data, like numeric, logical, or
character.
+ Finally, elements in the same column should be of the same type.
You will learn more about data frames later on in the program, but this is a great starting point.
If you need to manually create a data frame in R, you can use the data.frame() function. The
data.frame() function takes vectors as input. In the parentheses, enter the name of the column,
followed by an equals sign, and then the vector you want to input for that column. In this
example, the x column is a vector with elements 1, 2, 3, and the y column is a vector with
elements 1.5, 5.5, 7.5.
data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))
If you run the function, R displays the data frame in ordered rows and columns.
  x   y
1 1 1.5
2 2 5.5
3 3 7.5
In most cases, you won't need to manually create a data frame yourself, as you will typically
import data from another source, such as a .csv file, a relational database, or a software
program.
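The three points about data frames (named columns, mixed types across columns, one type within a column) can be seen in a short sketch. The label column, the total column, and the variable name df are additions for illustration and are not part of the reading's example.

```r
df <- data.frame(
  x = c(1, 2, 3),
  y = c(1.5, 5.5, 7.5),
  label = c("a", "b", "c")   # a character column next to two numeric ones
)

names(df)       # column names
nrow(df)        # number of observations (rows)
sapply(df, typeof)   # each column has a single type

df$y            # pull out one column as a vector
df$total <- df$x + df$y   # add a new column computed from existing ones
df
```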
Files
Let's go over how to create, copy, and delete files in R. For more information on working with
files in R, check out R documentation: files. R documentation is a tool that helps you easily find
and browse the documentation of almost all R packages on CRAN. It's a useful reference guide
for functions in R code. Let's go through a few of the most useful functions for working with files.
Use the dir.create() function to create a new folder, or directory, to hold your files. Place the name
of the folder in the parentheses of the function.
dir.create("destination_folder")
Use the file.create() function to create a blank file. Place the name and the type of the file in the
parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv.
file.create("new_text_file.txt")
file.create("new_word_file.docx")
file.create("new_csv_file.csv")
If the file is successfully created when you run the function, R will return a value of TRUE (if not,
R will return FALSE).
file.create("new_csv_file.csv")
[1] TRUE
Copying a file can be done using the file.copy() function. In the parentheses, add the name of
the file to be copied. Then, type a comma, and add the name of the destination folder that you
want to copy the file to.
file.copy("new_text_file.txt", "destination_folder")
If you check the Files pane in RStudio, a copy of the file appears in the relevant folder.
[Screenshot: the RStudio Files pane showing the copied file inside destination_folder]
You can delete R files using the unlink() function. Enter the file's name in the parentheses of the
function.
unlink("some_.[Link]")
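The file functions above can be tried safely in a temporary folder. This sketch works inside tempdir() so it does not touch your project files; that choice, and the variable names, are assumptions made for the example. The file and folder names mirror the ones used in this reading.

```r
# Work inside R's session-specific temporary directory
work_dir    <- file.path(tempdir(), "destination_folder")
source_file <- file.path(tempdir(), "new_text_file.txt")

dir.create(work_dir)                          # create the folder
created <- file.create(source_file)           # TRUE if the blank file was created
copied  <- file.copy(source_file, work_dir)   # TRUE if the copy succeeded

# The copy keeps its file name inside the destination folder
copy_exists <- file.exists(file.path(work_dir, "new_text_file.txt"))

# Clean up: delete the file, then the folder and its contents
unlink(source_file)
unlink(work_dir, recursive = TRUE)
```

Note that unlink() needs recursive = TRUE to delete a folder; without it, only files are removed.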
Additional resource
If you want to learn more about working with data frames, matrices, and arrays in R, check out
the Data Wrangling section of Stat Education's Introduction to R course. The section includes
modules on data frames, matrices, and arrays (and more), and each module contains helpful
examples of key coding concepts.
Optional: Matrices
A matrix is a two-dimensional collection of data elements. This means it has both rows and
columns. By contrast, a vector is a one-dimensional sequence of data elements. But like vectors,
matrices can only contain a single data type. For example, you can't have both logicals and
numerics in a matrix.
To create a matrix in R, you can use the matrix() function. The matrix() function has two main
arguments that you enter in the parentheses. First, add a vector. The vector contains the values
you want to place in the matrix. Next, add at least one matrix dimension. You can choose to
specify the number of rows or the number of columns by using the code nrow = or ncol =.
For example, imagine you want to create a 2x3 (two rows by three columns) matrix containing
the values 3-8. First, enter a vector containing that series of numbers: c(3:8). Then, enter a
comma. Finally, enter nrow = 2 to specify the number of rows.
matrix(c(3:8), nrow = 2)
If you run the function, R displays a matrix with two rows and three columns (typically referred to
as a "2x3") that contains the numeric values 3, 4, 5, 6, 7, 8. R places the first value (3) of the
vector in the uppermost row and leftmost column of the matrix, and fills the matrix column by
column: it continues down each column before moving right to the next one.
     [,1] [,2] [,3]
[1,]    3    5    7
[2,]    4    6    8
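The fill order is easy to confirm with a short sketch. By default matrix() fills column by column; the byrow = TRUE argument, which this reading does not cover, switches to filling row by row. The variable names are illustrative.

```r
by_column <- matrix(3:8, nrow = 2)                 # default: fill down each column
by_row    <- matrix(3:8, nrow = 2, byrow = TRUE)   # fill across each row instead

dim(by_column)     # 2 rows, 3 columns
by_column[1, 2]    # row 1, column 2 of the column-filled matrix: 5
by_row[1, 2]       # row 1, column 2 of the row-filled matrix: 4
```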
You can also choose to specify the number of columns (ncol = ) instead of the number of rows
(nrow = ).
matrix(c(3:8), ncol = 2)
When you run the function, R infers the number of rows automatically.
     [,1] [,2]
[1,]    3    6
[2,]    4    7
[3,]    5    8
Logical operators and conditional statements
Tip: You may refresh on the concepts presented in Understanding Boolean logic to help you
understand how logical operators work.
Earlier, you learned that an operator is a symbol that identifies the type of operation or
calculation to be performed in a formula. In this reading, you will learn about the main types of
logical operators and how they can be used to create conditional statements in R code.
Logical operators
Logical operators return a logical data type such as TRUE or FALSE.
There are three primary types of logical operators:
* AND (sometimes represented as & or && in R)
* OR (sometimes represented as | or || in R)
* NOT (!)
Review the summarized logical operators below.
AND operator “&”
The AND operator takes two logical values. It returns TRUE only if both individual values
are TRUE. This means that TRUE & TRUE evaluates to TRUE. However, FALSE & TRUE,
TRUE & FALSE, and FALSE & FALSE all evaluate to FALSE.
If you run the corresponding code in R, you get the following results:
> TRUE & TRUE
[1] TRUE
> TRUE & FALSE
[1] FALSE
> FALSE & TRUE
[1] FALSE
> FALSE & FALSE
[1] FALSE
You can illustrate this using the results of our comparisons. Imagine you create a variable x
that is equal to 10.
x <- 10
To check if x is greater than 3 but less than 12, you can use x > 3 and x < 12 as the values
of an “AND” expression.
x > 3 & x < 12
When you run the code, R returns the result TRUE.
[1] TRUE
The first part, x > 3, will evaluate to TRUE since 10 is greater than 3. The second part, x <
12, will also evaluate to TRUE since 10 is less than 12. So, since both values are TRUE, the
result of the AND expression is TRUE. The number 10 lies between the numbers 3 and 12.
However, if you make x equal to 20, the expression x > 3 & x < 12 will return a different
result.
x <- 20
x > 3 & x < 12
[1] FALSE
Although x > 3 is TRUE (20 > 3), x < 12 is FALSE (20 is not less than 12). If one part of an AND
expression is FALSE, the entire expression is FALSE (TRUE & FALSE = FALSE). So, R
returns the result FALSE.
OR operator “|”
The OR operator (|) works in a similar way to the AND operator (&). The main difference is
that at least one of the values of the OR operation must be TRUE for the entire OR
operation to evaluate to TRUE. This means that TRUE | TRUE, TRUE | FALSE, and FALSE
| TRUE all evaluate to TRUE. When both values are FALSE, the result is FALSE.
If you write out the code, you get the following results:
> TRUE | TRUE
[1] TRUE
> TRUE | FALSE
[1] TRUE
> FALSE | TRUE
[1] TRUE
> FALSE | FALSE
[1] FALSE
For example, suppose you create a variable y equal to 7. To check if y is less than 8 or
greater than 16, you can use the following expression:
y <- 7
y < 8 | y > 16
The comparison result is TRUE (7 is less than 8) | FALSE (7 is not greater than 16). Since
only one value of an OR expression needs to be TRUE for the entire expression to be
TRUE, R returns a result of TRUE.
[1] TRUE
Now, suppose y is 12. The expression y < 8 | y > 16 now evaluates to FALSE (12 is not less
than 8) | FALSE (12 is not greater than 16). Both comparisons are FALSE, so the result is FALSE.
y <- 12
y < 8 | y > 16
[1] FALSE
NOT operator “!”
The NOT operator (!) simply negates the logical value it applies to. In other words, !TRUE
evaluates to FALSE, and !FALSE evaluates to TRUE.
When you run the code, you get the following results:
> !TRUE
[1] FALSE
> !FALSE
[1] TRUE
Just like the OR and AND operators, you can use the NOT operator in combination with
logical operators. Zero is considered FALSE and non-zero numbers are taken as TRUE.
The NOT operator evaluates to the opposite logical value.
Let's imagine you have a variable x that equals 2:
> x <- 2
The NOT operation evaluates to FALSE because it takes the opposite logical value of a
non-zero number (TRUE).
> !x
[1] FALSE
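The three operators can be tried side by side. This sketch reuses the values from the examples above and adds one case the reading does not show: applied to a vector, & and | compare element by element and return one logical value per element. The vector named values is made up for illustration.

```r
x <- 10
x > 3 & x < 12      # AND: both comparisons are TRUE, so TRUE

y <- 12
y < 8 | y > 16      # OR: neither comparison is TRUE, so FALSE

!(x > 3)            # NOT: negates TRUE, so FALSE
!2                  # a non-zero number counts as TRUE, so FALSE
!0                  # zero counts as FALSE, so TRUE

# On a vector, the operators work element by element
values <- c(2, 10, 20)
values > 3 & values < 12   # FALSE TRUE FALSE
```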
Let's check out an example of how you might use logical operators to analyze data. Imagine you
are working with the airquality dataset that is preloaded in R. It contains data on daily air
quality measurements in New York from May to September of 1973.
The data frame has six columns: Ozone (the ozone measurement), Solar.R (the solar
measurement), Wind (the wind measurement), Temp (the temperature in Fahrenheit), and the
Month and Day of these measurements (each row represents a specific month and day
combination).
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
Let's go through how the AND, OR, and NOT operators might be helpful in this situation.
AND example
Imagine you want to specify rows that are extremely sunny and windy, which you define as
having a Solar measurement of over 150 and a Wind measurement of over 10.
In R, you can express this logical statement as Solar.R > 150 & Wind > 10.
Only the rows where both of these conditions are true fulfil the criteria:
  Ozone Solar.R Wind Temp Month Day
1    18     313 11.5   62     5   4
OR example
Next, imagine you want to specify rows where it's extremely sunny or it's extremely windy, which
you define as having a Solar measurement of over 150 or a Wind measurement of over 10.
In R, you can express this logical statement as Solar.R > 150 | Wind > 10.
All the rows where either of these conditions is true fulfil the criteria:
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    12     149 12.6   74     5   3
3    18     313 11.5   62     5   4
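The OR condition can be sketched the same way. Again, this assumes only the first four rows of airquality are in play, and the object names are illustrative.

```r
first_rows <- head(airquality, 4)

# Keep the rows where AT LEAST ONE condition is TRUE.
sunny_or_windy <- first_rows[first_rows$Solar.R > 150 | first_rows$Wind > 10, ]
print(sunny_or_windy)
```

The second row is dropped because it fails both tests (Solar.R is 118 and Wind is 8.0).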
NOT example
Now, imagine you just want to focus on the weather measurements for days that aren't the first
day of the month.
In R, you can express this logical statement as Day != 1.
The rows where this condition is true fulfill the criteria:
  Ozone Solar.R Wind Temp Month Day
1    36     118  8.0   72     5   2
2    12     149 12.6   74     5   3
3    18     313 11.5   62     5   4
Finally, imagine you want to focus on scenarios that aren't extremely sunny and not extremely
windy, based on your previous definitions of extremely sunny and extremely windy. In other
words, the following statement should not be true: either a Solar measurement greater than 150
or a Wind measurement greater than 10.
Notice that this statement is the opposite of the OR statement used above. To express this
statement in R, you can put an exclamation point (!) in front of the previous OR statement:
!(Solar.R > 150 | Wind > 10). R will apply the NOT operator to everything within the
parentheses.
In this case, only one row fulfills the criteria:
  Ozone Solar.R Wind Temp Month Day
1 36 118 8.0 72 5 2
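Both NOT examples can be reproduced with a short sketch, under the same assumption that only the first four rows of airquality are considered. The object names are made up for illustration.

```r
first_rows <- head(airquality, 4)

# != keeps every row whose Day is not 1.
not_first_day <- first_rows[first_rows$Day != 1, ]

# ! negates the whole OR statement inside the parentheses.
not_sunny_not_windy <- first_rows[!(first_rows$Solar.R > 150 | first_rows$Wind > 10), ]

print(not_first_day)
print(not_sunny_not_windy)
```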
Optional: Conditional statements
A conditional statement is a declaration that if a certain condition holds, then a certain event
must take place. For example, “if the temperature is above freezing, then I will go outside for a
walk.” If the first condition is true (the temperature is above freezing), then the second condition
will occur (I will go for a walk). Conditional statements in R code have a similar logic.
Let's discuss how to create conditional statements in R using three related statements:
+ if()
+ else()
+ else if()
if statement
The if statement sets a condition, and if the condition evaluates to TRUE, the R code associated
with the if statement is executed.
In R, you place the code for the condition inside the parentheses of the if statement. The code
that has to be executed if the condition is TRUE follows in curly braces (expr). Note that in this
case, the second curly brace is placed on its own line of code and identifies the end of the code
that you want to execute.
if (condition) {
  expr
}
For example, let's create a variable x equal to 4.
x <- 4
Next, let's create a conditional statement: if x is greater than 0, then R will print out the string "x
is a positive number".
if (x > 0) {
  print("x is a positive number")
}
Since x = 4, the condition is true (4 > 0). Therefore, when you run the code, R prints out the
string "x is a positive number".
[1] "x is a positive number"
But if you change x to a negative number, like -4, then the condition will be FALSE (-4 is not
greater than 0). If you run the code, R will not execute the print statement, so nothing is
printed.
else statement
The else statement is used in combination with an if statement. This is how the code is
structured in R:
if (condition) {
expr
} else {
expr2
}
The code associated with the else statement gets executed whenever the condition of the if
statement is not TRUE. In other words, if the condition is TRUE, then R will execute the code in
the if statement (expr); if the condition is not TRUE, then R will execute the code in the else
statement (expr2).
Let's try an example. First, create a variable x equal to 7.
x <- 7
Next, let's set up the following conditions:
+ If x is greater than 0, R will print “x is a positive number”
+ If x is less than or equal to 0, R will print “x is either a negative number or
zero”
In our code, the first condition (x > 0) will be part of the if statement. The second condition of x
less than or equal to 0 is implied in the else statement. If x > 0, then R will print “x is a
positive number”. Otherwise, R will print “x is either a negative number or zero”.
x <- 7
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}
Since 7 is greater than 0, the condition of the if statement is true. So, when you run the code, R
prints out “x is a positive number”.
[1] "x is a positive number"
But if you make x equal to -7, the condition of the if statement is not true (-7 is not greater than
0). Therefore, R will execute the code in the else statement. When you run the code, R prints out
“x is either a negative number or zero”.
x <- -7
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}
[1] "x is either a negative number or zero"
else if statement
In some cases, you might want to customize your conditional statement even further by adding
the else if statement. The else if statement comes in between the if statement and the else
statement. This is the code structure:
if (condition1) {
expr1
} else if (condition2) {
expr2
} else {
expr3
}
If the if condition (condition1) is met, then R executes the code in the first expression (expr1). If
the if condition is not met, and the else if condition (condition2) is met, then R executes the code
in the second expression (expr2). If neither of the two conditions is met, R executes the code in
the third expression (expr3).
In our previous example, using only the if and else statements, R can only print “x is either
a negative number or zero” if x equals 0 or x is less than zero. Imagine you want R to print
the string “x is zero” if x equals 0. You need to add another condition using the else if
statement.
Let's try an example. First, create a variable x equal to negative 1.
x <- -1
Now, you want to set up the following conditions:
+ If x is less than 0, print “x is a negative number”
+ If x equals 0, print “x is zero”
+ Otherwise, print “x is a positive number”
In the code, the first condition will be part of the if statement, the second condition will be part of
the else if statement, and the third condition will be part of the else statement. If x < 0, then R will
print “x is a negative number”. If x == 0, then R will print “x is zero”. Otherwise, R will
print “x is a positive number”.
x <- -1
if (x < 0) {
  print("x is a negative number")
} else if (x == 0) {
  print("x is zero")
} else {
  print("x is a positive number")
}
Since -1 is less than 0, the condition for the if statement evaluates to TRUE, and R prints “x is
a negative number”.
[1] "x is a negative number"
If you make x equal to 0, R will first check the if condition (x < 0), and determine that it is
FALSE. Then, R will evaluate the else if condition. This condition, x==0, is TRUE. So, in this
case, R prints “x is zero”.
If you make x equal to 1, both the if condition and the else if condition evaluate to FALSE. So, R
will execute the else statement and print “x is a positive number”.
As soon as R discovers a condition that evaluates to TRUE, R executes the corresponding code
and ignores the rest.
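To try all three branches without editing x by hand, you could wrap the same chain in a function. This is only a sketch: describe_sign() is a hypothetical helper name, not something defined in this reading.

```r
# describe_sign() is a made-up helper that returns the message instead of printing it.
describe_sign <- function(x) {
  if (x < 0) {
    "x is a negative number"
  } else if (x == 0) {
    "x is zero"
  } else {
    "x is a positive number"
  }
}

# R stops at the first condition that is TRUE and skips the remaining branches.
for (value in c(-1, 0, 1)) {
  print(describe_sign(value))
}
```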
Additional resource
To learn more about logical operators and conditional statements, check out DataCamp's tutorial
Conditionals and Control Flow in R. DataCamp is a popular resource for people learning about
computer programming. The tutorial is filled with useful examples of coding applications for
logical operators and conditional statements (and relational operators), and offers a helpful
overview of each topic and the connections between them.
More on R operators
You might remember that an operator is a symbol that identifies the type of operation or
calculation to be performed in a formula. In an earlier video, you learned how to use the
assignment and arithmetic operators to assign variables and perform calculations. In this
reading, you will review a detailed summary of the main types of operators in R, and learn how to
use specific operators in R code.
Operators
In R, there are four main types of operators:
1. Arithmetic
2. Relational
3. Logical
4. Assignment
Review the specific operators in each category and check out some examples of how to use
them in R code.
Arithmetic operators
Arithmetic operators let you perform basic math operations like addition, subtraction,
multiplication, and division.
The table below summarizes the different arithmetic operators in R. The examples used in the
table are based on the creation of two variables: x equals 2 and y equals 5. Note that you use
the assignment operator to store these values:
x <- 2
y <- 5
Operator | Description | Example Code | Result/Output
+ | Addition | x + y | [1] 7
- | Subtraction | x - y | [1] -3
* | Multiplication | x * y | [1] 10
/ | Division | x / y | [1] 0.4
%% | Modulus (returns the remainder after division) | y %% x | [1] 1
%/% | Integer division (returns an integer value after division) | y %/% x | [1] 2
^ | Exponent | y ^ x | [1] 25
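The arithmetic operators can be checked in one go with a short sketch. It uses the same x and y as the table; the vector name results and its element names are only for illustration.

```r
x <- 2
y <- 5

# Collect each operation in a named vector so the results print side by side.
results <- c(
  addition         = x + y,    # 7
  subtraction      = x - y,    # -3
  multiplication   = x * y,    # 10
  division         = x / y,    # 0.4
  modulus          = y %% x,   # remainder of 5 divided by 2 is 1
  integer_division = y %/% x,  # 5 divided by 2, rounded down, is 2
  exponent         = y ^ x     # 5 squared is 25
)
print(results)
```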
Relational operators
Relational operators, also known as comparators, allow you to compare values. Relational
operators identify how one R object relates to another—like whether an object is less than, equal
to, or greater than another object. The output for relational operators is either TRUE or FALSE
(which is a logical data type, or boolean).
The table below summarizes the six relational operators in R. The examples used in the table
are based on the creation of two variables: x equals 2 and y equals 5. Note that you use the
assignment operator to store these values:
x <- 2
y <- 5
If you perform calculations with each operator, you get the following results. In this case, the
output is boolean: TRUE or FALSE. Note that the [1] that appears before each output is used to
represent how output is displayed in RStudio.
Operator | Description | Example Code | Result/Output
< | Less than | x < y | [1] TRUE
> | Greater than | x > y | [1] FALSE
<= | Less than or equal to | x <= 2 | [1] TRUE
>= | Greater than or equal to | y >= 10 | [1] FALSE
== | Equal to | y == 5 | [1] TRUE
!= | Not equal to | x != 2 | [1] FALSE
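A minimal sketch of the six comparisons, assuming x equals 2 and y equals 5 as described above. The vector name comparisons is illustrative.

```r
x <- 2
y <- 5

# Each comparison returns a single logical value (TRUE or FALSE).
comparisons <- c(
  less_than        = x < y,    # TRUE
  greater_than     = x > y,    # FALSE
  less_or_equal    = x <= 2,   # TRUE
  greater_or_equal = y >= 10,  # FALSE
  equal_to         = y == 5,   # TRUE
  not_equal_to     = x != 2    # FALSE
)
print(comparisons)
```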
Logical operators
Logical operators allow you to combine logical values. Logical operators return a logical data
type or boolean (TRUE or FALSE). You encountered logical operators in an earlier reading,
Logical operators and conditional statements, but here is a quick refresher.
The table below summarizes the logical operators in R.
Operator   Description
&          Element-wise logical AND
&&         Logical AND
|          Element-wise logical OR
||         Logical OR
!          Logical NOT
Next, check out some examples of how logical operators work in R code.
Element-wise logical AND (&) and OR (|)
You can illustrate logical AND (&) and OR (|) by comparing numerical values. Create a variable x
that is equal to 10.
x <- 10
The AND operator returns TRUE only if both individual values are TRUE.
x > 2 & x < 12
[1] TRUE
10 is greater than 2 and 10 is less than 12. So, the operation evaluates to TRUE.
The OR operator (|) works in a similar way to the AND operator (&). The main difference is that
just one of the values of the OR operation needs to be TRUE for the entire OR operation to
evaluate to TRUE. Only if both values are FALSE will the entire OR operation evaluate to FALSE.
Now try an example with the same variable (x <- 10):
x > 2 | x < 8
[1] TRUE
10 is greater than 2, but 10 is not less than 8. But since at least one of the values (10>2) is
TRUE, the OR operation evaluates to TRUE.
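The following sketch repeats both checks and adds one case where OR evaluates to FALSE. The third comparison (x < 2 | x > 12) is an extra example that is not in the reading.

```r
x <- 10

both_true   <- x > 2 & x < 12   # TRUE & TRUE   gives TRUE
one_true    <- x > 2 | x < 8    # TRUE | FALSE  gives TRUE
none_true   <- x < 2 | x > 12   # FALSE | FALSE gives FALSE

print(c(both_true, one_true, none_true))
```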
Logical AND (&&) and OR (||)
The main difference between element-wise logical operators (&, |) and logical operators (&&, ||)
is the way they apply to operations with vectors. The operations with single signs, AND (&) and
OR (|), examine all the elements of each vector. The operations with double signs, AND (&&)
and OR (||), are meant for single logical values: in older versions of R they only examined the
first element of each vector, and since R 4.3.0 they raise an error if either side has more than
one element.
For example, imagine you are working with two vectors that each contain three elements: c(3,
5, 7) and c(2, 4, 6). The element-wise logical AND (&) will compare the first element of the
first vector with the first element of the second vector (3&2), the second element with the second
element (5&4), and the third element with the third element (7&6).
Now check out this example in R code.
First, create two variables, x and y, to store the two vectors:
x <- c(3, 5, 7)
y <- c(2, 4, 6)
Then run the code with a single ampersand (&). The output is boolean (TRUE or FALSE).
x < 5 & y < 5
[1] TRUE FALSE FALSE
When you compare each element of the two vectors, the output is TRUE, FALSE, FALSE. The
first element of both x (3) and y (2) is less than 5, so this is TRUE. The second element of x is
not less than 5 (it's equal to 5) but the second element of y is less than 5, so this is FALSE
(because you used AND). The third element of both x and y is not less than 5, so this is also
FALSE.
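Here is the element-wise comparison as a runnable sketch, with the intermediate results spelled out. The OR line is an extra illustration that is not in the reading.

```r
x <- c(3, 5, 7)
y <- c(2, 4, 6)

x_small <- x < 5              # TRUE FALSE FALSE
y_small <- y < 5              # TRUE TRUE  FALSE

and_result <- x_small & y_small   # TRUE FALSE FALSE: both must be TRUE in each position
or_result  <- x_small | y_small   # TRUE TRUE  FALSE: one TRUE per position is enough

print(and_result)
print(or_result)
```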
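Now, run the same operation using the double ampersand (&&).
x < 5 && y < 5
[1] TRUE
In this case, R only compares the first elements of each vector: 3 and 2. So, the output is TRUE
because 3 and 2 are both less than 5. Note that this output comes from older versions of R.
Since R 4.3.0, using && or || with an operand longer than one element is an error, so reserve
the double-sign operators for single TRUE or FALSE values.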
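If you need a single TRUE or FALSE from two vectors, one approach that works across R versions is to pick the elements explicitly or to collapse the element-wise result with all() or any(). This is a sketch of that idea, not something covered in the reading.

```r
x <- c(3, 5, 7)
y <- c(2, 4, 6)

# Compare only the first elements; each side of && is a single value.
first_pair <- x[1] < 5 && y[1] < 5     # TRUE

# Collapse the element-wise result into one value.
every_pair <- all(x < 5 & y < 5)       # FALSE: not every position passes
some_pair  <- any(x < 5 & y < 5)       # TRUE: at least one position passes

print(c(first_pair, every_pair, some_pair))
```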
Depending on the type of work you do, you might make use of single sign operators more often
than double sign operators. But it is helpful to know how all of the operators work regardless.
Logical NOT (!)
The NOT operator simply negates the logical value, and evaluates to its opposite. In R, zero is
considered FALSE and all non-zero numbers are considered TRUE.
For example, apply the NOT operator to your variable (x <- 10):
!(x < 15)
[1] FALSE
The NOT operation evaluates to FALSE because it takes the opposite logical value of the
statement x < 15, which is TRUE (10 is less than 15).
Assignment operators
Assignment operators let you assign values to variables.
In many scripting programming languages you can just use the equal sign (=) to assign a
variable. For R, the best practice is to use the arrow assignment (<-). Technically, the single
arrow assignment can be used in the left or right direction. But the rightward assignment is not
generally used in R code.
You can also use the double arrow assignment, known as a scoping assignment. But the
scoping assignment is for advanced R users, so you won't learn about it in this reading.
The table below summarizes the assignment operators and example code in R. Notice that the
output for each variable is its assigned value.
After running each line of example code, typing x will generate the output in the last column.
Operator | Description | Example Code | Result/Output
<- | Leftwards assignment | x <- 2 | [1] 2
<<- | Leftwards assignment | x <<- 7 | [1] 7
= | Leftwards assignment | x = 9 | [1] 9
-> | Rightwards assignment | 11 -> x | [1] 11
->> | Rightwards assignment | 21 ->> x | [1] 21
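The sketch below runs the single-arrow and equal-sign assignments in sequence, then shows one common use of the double arrow: updating a variable that lives outside a function. The counter and increment() names are hypothetical and only serve the illustration.

```r
x <- 2     # leftwards assignment (the recommended style)
x = 9      # equal sign also assigns leftwards; x is now 9
11 -> x    # rightwards assignment; x is now 11
print(x)

# <<- assigns to a variable in an enclosing environment instead of
# creating a new local variable inside the function.
counter <- 0
increment <- function() {
  counter <<- counter + 1
}
increment()
increment()
print(counter)
```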
The operators you learned about in this reading are a great foundation for using operators in R.
2.3) Learning about R packages
Packages (R)
* Units of reproducible R code that you can use to add more functionality to R & they
make it easier to keep track of code.
* They're created by members of the R community to keep track of the R functions that
they write & reuse.
+ These community members might then make the packages available to other users.
Packages include :
* Reusable R functions
* Documentation about the functions
* Sample datasets
* Tests for checking your code
CRAN (Comprehensive R Archive Network)
* An online archive with R packages, source code, manuals, and documentation
Tidyverse :
+ A system of packages in R with a common design philosophy for data manipulation,
exploration, and visualization
* Using tidyverse can help you work your way through pretty much the entire data
analysis process.
* The packages in tidyverse work together naturally.
8 Core Tidyverse Packages :
* ggplot2
* tibble
* tidyr
* readr
* purrr
* dplyr
* stringr
* forcats
2.4) Explore the Tidyverse
Four packages that are an essential part of the workflow for data analysts:
1. ggplot2:
* Ggplot2 is used for data visualization, specifically plots.
* With ggplot2, we can create a variety of data visualizations by applying different visual
properties to the data variables in R.
2. tidyr:
* Tidyr is a package used for data cleaning to make tidy data.
3. readr:
* Readr is used for importing data.
* The most common function from readr is read_csv(). This will import a CSV file into R.
4. dplyr:
* Dplyr offers a consistent set of functions that help you complete some common data
manipulation tasks.
Nesting : In programming, nesting describes code that performs a particular function and is
contained within code that performs a broader function.
When using pipes:
* Add the pipe operator at the end of each line of the piped
operation except the last one
* Check your code after you've programmed your pipe
* Revisit piped operations to check for parts of your code to fix
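To compare nesting with piping, here is a small sketch. It assumes the dplyr package is installed and uses the built-in airquality dataset; the threshold of 90 degrees is an arbitrary choice for illustration.

```r
library(dplyr)  # assumption: dplyr is installed; it provides %>%, filter(), and arrange()

# Nested version: filter() sits inside arrange(), so you read it from the inside out.
nested <- arrange(filter(airquality, Temp > 90), Wind)

# Piped version: the same steps, read top to bottom.
# The pipe operator ends every line except the last one.
piped <- airquality %>%
  filter(Temp > 90) %>%
  arrange(Wind)

print(head(piped))
```

Both versions return the same rows in the same order; the pipe only changes how the steps are written.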
3.1) Explore Data & R
Data frames : A collection of columns.
Rules of Data frames -
* Columns should be named
* Data stored can be of many different types, like numeric, factor, or character
* Each column should contain the same number of data items
Tibbles :
* In the tidyverse, tibbles are like streamlined data frames. They make working with data
easier, but they're a little different from standard data frames
+ Tibbles never change the data types of the inputs. They won't change your strings to
factors or anything else. You can make more changes to base data frames, but tibbles
are easier to use. This saves time because you won't have to do as much cleaning or
changing data types in tibbles.
+ Never change the names of your variables
+ Never create row names
* Make printing easier
Tidy Data:
* A way of standardizing the organization of data within R
Tidy data standards -
1. Variables are organized into columns
2. Observations are organized into rows
3. Each value must have its own cell
Basic functions of R:
1. head() : to get first 6 rows in the table
2. str() : to get the structure of the table
3. colnames() : to get a list of all column names in the table
4. mutate() : to create new column in the table
5. rename() : to rename a certain column name
6. rename_with() : to rename all column names
7. clean_names() : to get column names in clean format
8. arrange() : to sort rows by the values of one or more columns
9. group_by() : to group rows by the values of one or more columns
10. summarize() : used with group by function to get aggregate data
11. separate() : to separate one column into 2
12. unite() : to merge columns
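Most of the functions in this list can be tried on a tiny table. The sketch below assumes the dplyr and tidyr packages are installed; the people data frame and its values are made up for illustration. clean_names() is left out because it comes from the separate janitor package.

```r
library(dplyr)  # assumption: installed; mutate(), rename(), arrange(), group_by(), summarize()
library(tidyr)  # assumption: installed; separate(), unite()

# A small made-up table, used only for illustration.
people <- data.frame(
  full_name = c("Ada Lovelace", "Alan Turing", "Grace Hopper"),
  team      = c("a", "b", "a"),
  score     = c(90, 85, 70)
)

head(people)       # first rows (up to 6)
str(people)        # structure: column types and a preview of the values
colnames(people)   # column names

with_pct <- mutate(people, score_pct = score / 100)  # add a new column
renamed  <- rename(people, points = score)           # new_name = old_name
upper    <- rename_with(people, toupper)             # apply a function to every column name
sorted   <- arrange(people, score)                   # sort rows by score, ascending

# group_by() plus summarize() returns one aggregate row per group.
team_means <- people %>%
  group_by(team) %>%
  summarize(mean_score = mean(score))

# separate() splits one column into two; unite() merges them back.
split_names <- separate(people, full_name, into = c("first", "last"), sep = " ")
rejoined    <- unite(split_names, "full_name", first, last, sep = " ")
```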
4.1) Creating Data Visualization in R
Some useful packages for visualization in R:
1. ggplot2
2. Plotly
3. Lattice
4. RGL
5. Dygraphs
6. Leaflet
7. Highcharter
8. patchwork
9. gganimate
10. ggridges
* Ggplot2 is a favorite visualization package among data analysts. It is both powerful and
flexible. With a little bit of code, you can create all kinds of different plots.
* Ggplot2 was originally created by the statistician and developer Hadley Wickham in
2005.
+ Wickham's inspiration for creating ggplot2 came from the 1999 book The Grammar of
Graphics, a scholarly study of data visualization by computer scientist Leland
Wilkinson.
* The first two letters of ggplot2 actually stand for grammar of graphics.
+ And in the same way the grammar of a human language gives us rules to build any kind
of sentence, the grammar of graphics gives us rules to build any kind of visual.
Benefits of Ggplot2:
1. You can create all different types of plots including scatter plots, bar charts, line
diagrams and tons more.
2. You can change the colors, layout and dimensions of your plots and add text elements
like titles, captions and labels.
3. With just a little bit of code you can create high-quality visuals.
4. Plus ggplot2 lets you combine data manipulation and visualization using the pipe
operator.
Core Concepts in ggplot2:
Aesthetic:
* In ggplot2, an aesthetic is a visual property of an object in your plot.
* For example, in a scatter plot aesthetics include things like the size, shape or color of
your data points.
+ Think of an aesthetic as a connection or mapping between a visual feature in your plot
and a variable in your data.
Geom:
+ The geometric object used to represent your data.
+ For example, you can use points to create a scatter plot, bars to create a bar chart, or
lines to create a line diagram.
+ You can choose a geom to fit the type of data you have.
Facet:
* Facets let you display smaller groups, or subsets of data.
+ With facets, you can create separate plots for all the variables in your dataset.
Labels and annotations:
* Let you customize your plot.
+ You can add text like titles, subtitles and captions to communicate the purpose of your
plot or highlight important data.
To create your own plot using code, follow these three steps:
1. Start with the ggplot() function and choose a dataset to work with.
2. Add a geom_function to display your data.
3. Map the variables you want to plot in the arguments of the aes() function.
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<AESTHETIC MAPPINGS>))
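The three steps can be sketched with the built-in airquality dataset. This assumes the ggplot2 package is installed; the choice of Wind and Temp as the plotted variables is arbitrary.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Step 1: ggplot() with a dataset.
# Step 2: a geom function (geom_point() draws a scatter plot).
# Step 3: map variables to the x and y axes inside aes().
wind_temp_plot <- ggplot(data = airquality) +
  geom_point(mapping = aes(x = Wind, y = Temp))

# Typing the object name at the console draws the plot.
wind_temp_plot
```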
4.2) Explore Aesthetics in analysis
Aesthetics in ggplot2:
Ggplot2 is an R package that allows you to create different types of data visualizations right
in your R workspace. In ggplot2, an aesthetic is defined as a visual property of an object in
your plot.
There are three aesthetic attributes in ggplot2:
* Color: this allows you to change the color of all of the points on your plot, or the color of
each data group
* Size: this allows you to change the size of the points on your plot by data group
* Shape: this allows you to change the shape of the points on your plot by data group
Here's an example of how aesthetic attributes are displayed in R:
ggplot(data) + geom_point(mapping = aes(x = distance, y = dep_delay,
color = carrier, size = air_time, shape = carrier))
Aesthetics for points:
* x
* y
* Colour
* Shape
* Size
* Alpha
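As an illustration of these aesthetics, the sketch below maps color to a variable and sets alpha to a fixed value. It assumes ggplot2 is installed and swaps in the built-in airquality dataset, since the data behind the example above is not included in these notes.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Inside aes(): color is MAPPED to a variable, so each month gets its own color.
# Outside aes(): alpha is SET to one fixed value for every point.
month_plot <- ggplot(data = airquality) +
  geom_point(
    mapping = aes(x = Wind, y = Temp, color = factor(Month)),
    alpha = 0.7
  )

month_plot
```

Month is wrapped in factor() so that ggplot2 treats it as a set of groups rather than a continuous number.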
Facets in ggplot2:
Facet functions:
1. facet_wrap() :- To facet your plot by a single variable, use facet_wrap()
2. facet_grid() :- To facet your plot by 1 or more variables, use facet_grid()
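Both facet functions can be sketched with airquality, assuming ggplot2 is installed. The windy column is a made-up flag added only so that facet_grid() has a second variable to work with.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Add a hypothetical TRUE/FALSE column marking days with Wind above 10.
aq <- transform(airquality, windy = Wind > 10)

# facet_wrap(): one panel per value of a single variable.
by_month <- ggplot(data = aq) +
  geom_point(mapping = aes(x = Day, y = Temp)) +
  facet_wrap(~Month)

# facet_grid(): rows for one variable, columns for another.
by_wind_and_month <- ggplot(data = aq) +
  geom_point(mapping = aes(x = Day, y = Temp)) +
  facet_grid(windy ~ Month)

by_month
by_wind_and_month
```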
4.3) Annotate & save visualization
Annotate :
+ To add notes to a document or diagram to explain or comment upon it
* In ggplot2, adding annotations to your plot can help explain the plot's purpose or
highlight important data.
* When you present your data visuals to stakeholders, you may not have much time to
meet with them.
* Labels and annotations will point their attention to key things and help them quickly
understand your plot.
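A minimal sketch of labels and annotations, assuming ggplot2 is installed. The title text, the annotation text, and its position (x = 18, y = 95) are placeholders chosen for illustration.

```r
library(ggplot2)  # assumption: ggplot2 is installed

labeled_plot <- ggplot(data = airquality) +
  geom_point(mapping = aes(x = Wind, y = Temp)) +
  # labs() adds text around the plot.
  labs(
    title    = "Temperature vs. wind speed",
    subtitle = "New York, May to September 1973",
    caption  = "Data: airquality dataset"
  ) +
  # annotate() places text inside the plot at the given coordinates.
  annotate("text", x = 18, y = 95, label = "Example annotation")

labeled_plot
```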
5.1) Explore documentation & reports in R
R Markdown :
* A file format for making dynamic documents with R.
+ R Markdown lets you create a record of your analysis and conclusions in a document.
* It ties together your code and your report so you can share every step of your analysis.
* This document will help stakeholders and team members understand what you did in
your analysis to reach your conclusions. Their feedback will also help you improve your
analysis.
* R Markdown lets you convert your files into lots of different formats too. You can create
HTML, PDF, and Word documents, or you can convert to a slide presentation or
dashboard.
+ R Markdown documents are written in Markdown.
Markdown :
* A syntax for formatting plain text files.
* Using Markdown makes it easier to write and format text in your document.
* Markdown is also easy to read and to learn.
R Notebook :
* Lets users run your code and show the graphs and charts that visualize the code.
+ Any R Markdown document can be used as a notebook. This creates a clear overall
picture of your analysis and conclusions.
5.2) Create R Markdown documents
YAML :
* A language for data that translates it so it’s readable
+ YAML originally stood for Yet Another Markup Language
Code Chunk
* Code added in an .Rmd file
* Code chunk delimiters: ```{r} to open a chunk and ``` to close it
* Code chunk keyboard shortcuts -
PC/Chromebook: Ctrl + Alt + I
Mac: Cmd + Option + I