MR4103
WEEK 6A
DATA PROCESSING AND VISUALIZATION
PRADITYA AJIDARMA
[email protected]
A BRIEF INTRO TO R PROGRAMMING
R Demo presented during Class Session
MERGE DATASET
Joining rows and columns in R:
cbind(): combine data frames that have the same number of rows; joins them horizontally (adds columns)
rbind(): combine data frames that have the same columns (matching names); joins them vertically (adds rows)
(Figure: RBIND and CBIND illustrations)
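A minimal sketch of both functions in base R. The data frames `a`, `b`, and `extra` are made-up examples, not data from the class demo.

```r
# Two small data frames with the same columns (hypothetical data)
a <- data.frame(id = 1:2, score = c(90, 85))
b <- data.frame(id = 3:4, score = c(70, 95))

# rbind(): stack vertically; the column names must match
stacked <- rbind(a, b)
dim(stacked)   # 4 rows, 2 columns

# cbind(): attach side by side; the row counts must match
extra <- data.frame(passed = c(TRUE, TRUE))
wide <- cbind(a, extra)
names(wide)    # "id" "score" "passed"
```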
MERGE DATASET
merge() function: join two data frames on one or more common columns, or on their row names
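As a rough illustration, the sketch below joins two made-up tables on a shared key. The names `employees`, `salaries`, and `emp_id` are hypothetical.

```r
# Hypothetical tables that share the key column emp_id
employees <- data.frame(emp_id = c(1, 2, 3), name = c("Ani", "Budi", "Citra"))
salaries  <- data.frame(emp_id = c(2, 3, 4), salary = c(500, 650, 700))

# Inner join (the default): keep only emp_id values present in both tables
inner <- merge(employees, salaries, by = "emp_id")

# Left join: keep every employee; a missing salary becomes NA
left <- merge(employees, salaries, by = "emp_id", all.x = TRUE)

nrow(inner)  # 2
nrow(left)   # 3
```

Setting all.y = TRUE or all = TRUE instead gives a right join or a full outer join.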
RESHAPE AND PIVOT DATASET (MELT AND CAST)
melt() function takes data in wide format and stacks a set of columns into a single column (long format)
dcast() function takes a molten (long) data frame, aggregates it if needed, and spreads it back into multiple columns (wide format)
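A small round trip from wide to long and back might look like this. It assumes melt() and dcast() come from the reshape2 package (the slides do not name the package), and the `wide` table is invented for illustration.

```r
library(reshape2)  # assumption: melt() and dcast() are taken from reshape2

# Hypothetical wide table: one column per quarter
wide <- data.frame(
  city = c("Bandung", "Jakarta"),
  q1 = c(10, 20),
  q2 = c(15, 25)
)

# Wide -> long: q1 and q2 are stacked into a quarter/sales pair of columns
long <- melt(wide, id.vars = "city",
             variable.name = "quarter", value.name = "sales")

# Long -> wide: one column per quarter again
back <- dcast(long, city ~ quarter, value.var = "sales")
```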
BASIC DATA TRANSFORMATION
Five functions from the dplyr package cover most data manipulation tasks:
Pick observations by their values: filter()
Reorder the rows: arrange()
Pick variables by their names: select()
Create new variables with functions of existing variables: mutate()
Collapse many values down to a single summary: summarise()
All dplyr functions work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using variable names.
The result is a new data frame.
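The five verbs can be sketched on a toy table as follows. The `flights` data frame and its columns are hypothetical, chosen only to give each verb something to act on.

```r
library(dplyr)

# Hypothetical flight records
flights <- data.frame(
  carrier  = c("GA", "GA", "JT", "JT", "QG"),
  distance = c(500, 1200, 300, 900, 700),
  hours    = c(1.0, 2.0, 0.8, 1.5, 1.2)
)

long_haul  <- filter(flights, distance > 600)            # pick rows by value
sorted     <- arrange(flights, desc(distance))           # reorder rows
slim       <- select(flights, carrier, distance)         # pick columns by name
with_speed <- mutate(flights, speed = distance / hours)  # add a new variable
overall    <- summarise(flights, mean_distance = mean(distance))  # one-row summary
```

Each call takes the data frame first and returns a new data frame; `flights` itself is never modified.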
USE LOGIC OPERATORS EFFECTIVELY
filter(data, month == 11 | month == 12) is equivalent to filter(data, month %in% c(11, 12))
filter(data, !(X > 120 | Y > 120)) is equivalent to filter(data, X <= 120, Y <= 120)
filter(data, !is.na(X), X > 1) keeps the rows where X > 1 and X is not NA
near(sqrt(2) ^ 2, 2) tests whether two floating-point values are close enough to count as equal; it returns TRUE even though sqrt(2) ^ 2 == 2 is FALSE
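These equivalences can be checked on a small example. The data frame `d` below is made up, and includes an NA to show that both forms drop rows where the condition is unknown.

```r
library(dplyr)

# Hypothetical data with one missing X value
d <- data.frame(
  month = c(1, 11, 12, 11),
  X = c(100, 130, NA, 110),
  Y = c(90, 100, 110, 125)
)

# Two ways to keep November and December
a1 <- filter(d, month == 11 | month == 12)
a2 <- filter(d, month %in% c(11, 12))
all.equal(a1, a2)

# De Morgan: !(A | B) selects the same rows as (!A) & (!B)
b1 <- filter(d, !(X > 120 | Y > 120))
b2 <- filter(d, X <= 120, Y <= 120)
all.equal(b1, b2)

# Floating-point comparison
sqrt(2)^2 == 2      # FALSE
near(sqrt(2)^2, 2)  # TRUE
```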
IMAGE PROCESSING – PIXEL TO NUMERICAL DATA TRANSFORMATION
Handwriting Recognition
Each pixel represents a color intensity scaled between 0 and 255, where 0 is the lowest intensity (white) and 255 is the highest intensity.
Sentence Completion
Each word is labeled with its own ID.
Using the training set, the model learns, from the frequency of occurrence, how each ID is followed by another word (another ID).
(Remember Naïve Bayes?)
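Both transformations can be sketched in base R. The 3x3 image and the six-word sequence are invented, and the counting table is only one simple way to record which ID follows which; it is not a full model.

```r
# Hypothetical 3x3 grayscale image: one intensity (0-255) per pixel
img <- matrix(c(0, 128, 255,
                0, 255,   0,
                64, 255,  32), nrow = 3, byrow = TRUE)

# Flatten row by row into one numeric feature vector, rescaled to [0, 1]
features <- as.vector(t(img)) / 255

# Sentence completion: give each distinct word an integer ID ...
words <- c("i", "like", "data", "i", "like", "r")
ids <- as.integer(factor(words, levels = unique(words)))

# ... then count how often each ID is followed by each other ID
following <- table(current = head(ids, -1), nxt = tail(ids, -1))
following["1", "2"]   # "i" is followed by "like" twice
```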
DATE AND TIME DATA MANIPULATION
R allows you to work with date and time easily.
The simplest data type to use for dates is the "Date" class.
Dates are stored internally as the number of days since 1970-01-01.
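A short base R sketch of the Date class; the dates are arbitrary examples.

```r
d <- as.Date("2020-10-08")

class(d)      # "Date"
unclass(d)    # the underlying number of days since 1970-01-01

# Arithmetic works in whole days
later <- d + 30
gap <- as.numeric(as.Date("2020-10-08") - as.Date("2020-09-08"))

# format() turns a Date back into text
iso_month <- format(d, "%Y-%m")
```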
DATE AND TIME DATA MANIPULATION
Date and time manipulation using built-in POSIXt functionality
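The POSIXt classes (POSIXct and POSIXlt) can be sketched as below. The timestamps are arbitrary, and the time zone is fixed to UTC here only so the results do not depend on the machine's locale.

```r
# Parse timestamps with an explicit time zone so results are reproducible
start <- as.POSIXct("2020-10-08 09:00:00", tz = "UTC")
end   <- as.POSIXct("2020-10-09 12:30:00", tz = "UTC")

# POSIXct counts seconds since 1970-01-01 UTC, so adding a number adds seconds
one_hour_later <- start + 3600

# difftime() reports the gap in the unit you ask for
gap_hours <- as.numeric(difftime(end, start, units = "hours"))  # 27.5

# POSIXlt exposes the calendar components as list elements
parts <- as.POSIXlt(start)
parts$hour          # 9
parts$year + 1900   # 2020 (years are stored as an offset from 1900)
```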
WHEN DO I TURN ONE BILLION SECONDS OLD?
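billbday <- function(bday, age = 10^9, format = "%Y-%m-%d %H:%M:%S") {
  # Birth timestamp plus `age` seconds (POSIXct arithmetic is in seconds)
  x <- as.POSIXct(bday, format = format) + age
  # Whole days from now until that moment (negative if it has already passed)
  togo <- as.numeric(round(difftime(x, Sys.time(), units = "days")))
  when <- format(x, "%Y-%m-%d")
  # Print the age in full (1,000,000,000) rather than as 1e+09
  age_txt <- format(age, big.mark = ",", scientific = FALSE)
  if (togo > 0) {
    msg <- sprintf("You will be %s seconds old on %s, which is %s days from now.", age_txt, when, togo)
  } else {
    msg <- sprintf("You turned %s seconds old on %s, which was %s days ago.", age_txt, when, -togo)
  }
  print(msg)
  when
}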
DATA VISUALIZATION: DENSITY PLOT
library("ggplot2")
# `data` is assumed to be a data frame with a numeric `assets` column
ggplot(data, aes(x = assets)) + geom_density()
ggplot(data, aes(x = log10(assets))) + geom_density()
DATA VISUALIZATION: QUANTILE - QUANTILE PLOT
qplot(sample = assets, data = data)
qplot(sample = profits, data = data)
qplot(sample = marketvalue, data = data)
DATA VISUALIZATION: DENSITY PLOT FOR TWO CATEGORIES
library("dplyr")    # provides %>%, as_tibble(), and mutate()
library("ggplot2")
newdata <- data %>%
  as_tibble() %>%
  mutate(US_Country = country == "United States")
ggplot(newdata, aes(x = log10(marketvalue), fill = US_Country)) +
  geom_density(alpha = 0.4)
WEEK 6 GROUP EXERCISE
Forbes Dataset Auto-MPG Dataset
FORBES DATASET PRACTICE
(SUBMIT: R CODE, CONSOLE SCREENSHOT, & REPORTS)
AUTO-MPG DATASET PRACTICE
(SUBMIT: R CODE, CONSOLE SCREENSHOT, & REPORTS)
MIDTERM PROJECTS
Find an interesting dataset related to Indonesia from any literature/paper/website:
State the Background: why you chose the data, including its significance from your perspective
Describe the Structure of the data: which variables it contains
Conduct any data cleaning and processing, if necessary
Perform Diagnostics on the data, using descriptive statistics, visualizations, and an ML model (if possible)
Conclude with a preliminary Analysis: the insights you gained from performing the activities above
Due on Thursday 08 October 2020, 09.00 AM; Submit to LMS
Attached Submission: PowerPoint, R File