MKTG 631
Marketing Analytics
Module 2 – R Tutorial
Jinhee Huh
Marketing Analytical Tools
Data File Types
• XLS and XLSX
• CSV (Comma Separated Value)
• Each row of data is stored in a text file with
a comma separating each column’s values from one another.
• Tab delimited file
• The columns of data are stored as a text file with
a TAB character between values.
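As a rough illustration of the two text formats above, the sketch below writes a small made-up data frame to temporary files and reads each one back. The object names (toy, csv_path, tab_path) are hypothetical. XLS/XLSX files are left out because base R has no reader for them; an add-on package is needed.

```r
# Hypothetical toy data, written to temporary files so the example is self-contained.
toy <- data.frame(id = 1:3, age = c(25, 41, 37))

csv_path <- tempfile(fileext = ".csv")
tab_path <- tempfile(fileext = ".txt")

write.csv(toy, csv_path, row.names = FALSE)                 # comma-separated
write.table(toy, tab_path, sep = "\t", row.names = FALSE)   # tab-delimited

from_csv <- read.csv(csv_path)     # read.csv() uses sep = "," by default
from_tab <- read.delim(tab_path)   # read.delim() uses sep = "\t" by default

from_csv
from_tab
```

Both calls return a data frame with the same rows and columns; only the separator inside the file differs.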
Open and Proprietary Programming Tools
• Open-source programming tools:
• Programming tools that are made freely available, often developed by
and for the community.
• Quick to adopt new methodologies
• Proprietary programming tools:
• Programming tools that are developed by a firm and distributed for
sale to the public.
• Expensive
• Slower to incorporate new methodologies
R and Python
• R program advantages
• Developed by data scientists
• Larger number of ready-made packages for statistical analysis than Python
• Supported by an integrated development environment called RStudio
• Built-in ways to professionally visualize data
• Easy to install and set up the work environment
• R program disadvantages
• Less efficient for general computations, sometimes due to inefficiently
written packages
R Program Practice
• Let’s practice the descriptive statistics calculation, t-test, multivariate
descriptive statistics, and plot drawing.
• How can we open "Module2_Demographics.csv"?
• demo <- read.csv("your directory path/Module2_Demographics.csv")
Descriptive Statistics
• Central tendency measurement: mean, median, and mode
• Mean function: mean()
• Median function: median()
• Mode function: Mode() in DescTools package
• What if there is a missing value in a vector?
• If you just type mean(x) for a vector x that contains a missing value, R returns NA.
• You can get the central tendency statistics after dropping the missing
observations by adding the na.rm=TRUE (or na.rm=T) argument.
• mean(demo$age, na.rm=T)
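A minimal sketch of the missing-value behavior described above, using a made-up vector rather than the course data. The last two lines show one possible way to get the mode in base R when DescTools is not installed; that approach is an illustration, not part of the slides.

```r
age <- c(23, 35, 35, 41, NA, 52)   # hypothetical values with one missing observation

mean(age)                   # NA, because one value is missing
mean(age, na.rm = TRUE)     # 37.2
median(age, na.rm = TRUE)   # 35

# Base R has no built-in statistical mode; one possible approach without DescTools:
counts <- table(age)        # table() drops NA by default
mode_value <- as.numeric(names(counts)[counts == max(counts)])
mode_value                  # 35
```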
Descriptive Statistics
• Measures of variability: range, variance, and standard deviation
• Range: max() - min()
• Variance: var()
• Standard deviation: sd() or sqrt(var())
• Let’s try the following code
• max(demo$age) - min(demo$age)
• var(demo$age)
• sd(demo$age)
• sqrt(var(demo$age))
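The same functions can be tried on a small made-up vector, which makes the results easy to check by hand. The values below are hypothetical.

```r
age <- c(23, 35, 35, 41, 52)   # hypothetical values

max(age) - min(age)   # range as a single number: 29
diff(range(age))      # same result; range() returns c(min, max)
var(age)              # sample variance (divides by n - 1): 111.2
sd(age)               # standard deviation
all.equal(sd(age), sqrt(var(age)))   # TRUE
```

If the vector contains missing values, these functions also need na.rm = TRUE, just like mean().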
Descriptive Statistics
• Frequency table
• Function: table()
• How can I make the frequency table with pre-defined bin ranges?
• table(cut(x, breaks)), where breaks is the vector of bin boundaries
• Let’s try the following code
• br <- c(0, 20, 30, 50, 60, 70)
• table(cut(demo$age, br))
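A small sketch of how cut() assigns values to bins, using made-up ages. By default each interval excludes its left boundary and includes its right boundary, and values outside the breaks become NA, so they are not counted by a plain table().

```r
age <- c(18, 20, 25, 30, 31, 49, 65, 80)   # hypothetical values
br  <- c(0, 20, 30, 50, 60, 70)

bins <- cut(age, br)           # intervals: (0,20], (20,30], (30,50], (50,60], (60,70]
table(bins)                    # counts per bin; 80 is outside the breaks and is not counted
table(bins, useNA = "ifany")   # adds a count for values outside the breaks
```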
In-class practice
• Use "Module2_Demographics.csv" to answer the questions below.
• Q1. What are the mean, median, and mode of female, age, and income?
• Q2. What are the range, variance, and standard deviation of female, age, and
income?
• Q3. Create frequency tables for female, age, and income using the following bins.
• female: c(-2, -1, 0, 1, 2)
• age: c(20, 30, 50, 60, 70)
• income: c(10000, 30000, 50000, 70000, 90000)
One-Sample t-test
• Hypotheses
• 𝐻0: 𝜇=𝜇_0; 𝐻1: 𝜇≠𝜇_0
• 𝐻0: 𝜇≤𝜇_0; 𝐻1: 𝜇>𝜇_0
• 𝐻0: 𝜇≥𝜇_0; 𝐻1: 𝜇<𝜇_0
• A t-test is suitable if a variable is believed to be drawn from a normal
distribution, or if the sample size is large.
One-Sample t-test
• How can I conduct the one-sample t-test?
• t.test(demo$age, mu=0)
• Rejects the null hypothesis that the average age is 0.
• t.test(demo$age, mu=40)
• Cannot reject the null hypothesis that the average age is 40.
• t.test(demo$age, mu=45, alternative="greater")
• 𝐻0: 𝜇≤𝜇_0; 𝐻1: 𝜇>𝜇_0
• Cannot reject the null hypothesis that the average age is less than or equal to 45.
• t.test(demo$age, mu=45, alternative="less")
• 𝐻0: 𝜇≥𝜇_0; 𝐻1: 𝜇<𝜇_0
• Rejects the null hypothesis that the average age is greater than or equal to 45.
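The three forms of the one-sample test can be sketched on simulated data. The numbers below are generated, not the course data, so the conclusions will differ from those on the slide; the seed and distribution parameters are arbitrary assumptions.

```r
set.seed(631)                            # arbitrary seed for reproducibility
age <- rnorm(200, mean = 42, sd = 12)    # simulated ages, not the course data

two_sided <- t.test(age, mu = 40)                            # H1: mu != 40
greater   <- t.test(age, mu = 45, alternative = "greater")   # H1: mu > 45
less      <- t.test(age, mu = 45, alternative = "less")      # H1: mu < 45

two_sided$statistic   # t value
two_sided$p.value     # two-sided p-value

# For the same mu, the two one-sided p-values sum to 1.
greater$p.value + less$p.value
```

Because the one-sided p-values are complementary, at most one of the two one-sided tests can reject at the usual significance levels.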
In-class practice
• Q4. One sample t-test
• Q4-1. Is the female variable average significantly different from zero?
• Q4-2. Is the age variable average significantly different from 40?
• Tage <- t.test(demo$age, mu=40)
• Tage
• Tage$statistic
• Tage$p.value
• if (Tage$p.value < .05) print("Reject the null hypothesis")
• Q4-3. Is the income variable average significantly greater than 30000?
Two-sample t-test
• Hypotheses
• 𝐻0: 𝜇_1=𝜇_2; 𝐻1:𝜇_1≠𝜇_2
• Test the null hypothesis that the means of groups are equal.
• Test the null hypothesis that the means of variables are equal.
• 𝐻0: 𝜇_1≤𝜇_2; 𝐻1:𝜇_1>𝜇_2
• 𝐻0: 𝜇_1≥𝜇_2; 𝐻1:𝜇_1<𝜇_2
Two-sample t-test
• R function
• t.test(x, y, alternative = "two.sided", var.equal = FALSE)
• alternative = "two.sided", "greater", or "less"
• alternative = "greater"
• 𝐻0: 𝜇_1≤𝜇_2; 𝐻1:𝜇_1>𝜇_2
• alternative = "less"
• 𝐻0: 𝜇_1≥𝜇_2; 𝐻1:𝜇_1<𝜇_2
• var.equal = TRUE or FALSE (T or F): whether the variances of the two groups are treated as equal
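To see what var.equal changes, the sketch below runs both versions on two simulated groups. The group means, standard deviations, and seed are arbitrary assumptions chosen for illustration.

```r
set.seed(631)
x <- rnorm(100, mean = 50, sd = 10)   # simulated group 1
y <- rnorm(100, mean = 47, sd = 10)   # simulated group 2

welch  <- t.test(x, y, var.equal = FALSE)   # Welch test, the default
pooled <- t.test(x, y, var.equal = TRUE)    # pooled-variance test

welch$method       # name of the test that was run
pooled$method
welch$parameter    # Welch degrees of freedom, generally not an integer
pooled$parameter   # n1 + n2 - 2 = 198
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances the results can diverge more noticeably.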
Two-sample t-test
• Do we need to use var.equal = FALSE or TRUE?
• var(female$age); var(male$age)
• var.test(female$age, male$age, alternative = "two.sided")
• Cannot reject the null hypothesis that the variances of female age and male age are the same.
• t.test(female$age, male$age, var.equal=T)
• Rejects the null hypothesis that the average female age and the average male age are the same (given the
equal variance assumption).
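The female and male data frames used above have to be created first. One possible way is sketched below on a simulated stand-in for demo, assuming the female column is coded 1 for female and 0 for male; the 0.05 threshold for the variance test is also an assumption.

```r
# Hypothetical stand-in for the course data; female is assumed to be coded 1/0.
set.seed(631)
demo <- data.frame(female = rep(c(1, 0), each = 50),
                   age    = round(rnorm(100, mean = 42, sd = 12)))

female <- demo[demo$female == 1, ]   # rows for female respondents
male   <- demo[demo$female == 0, ]   # rows for male respondents

# Step 1: test whether the two variances differ.
vt <- var.test(female$age, male$age, alternative = "two.sided")
equal_var <- vt$p.value >= 0.05      # TRUE when equal variances cannot be rejected

# Step 2: run the t-test with the matching var.equal setting.
tt <- t.test(female$age, male$age, var.equal = equal_var)
tt
```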
In-class practice
• Q5. Two-sample t-test
• Tip: For the following questions, first test whether the variances are the same, then
set the var.equal argument accordingly.
• Q5-1. Is average female income significantly different from average male income?
• Q5-2. Is average female age significantly different from average male age?
Multivariate Descriptive Statistics
• Measures of the relationship between two variables
• Covariance: cov()
• Correlation
• cor(x, y, method = c("pearson", "kendall", "spearman"))
• cor.test(x, y, method=c("pearson", "kendall", "spearman"))
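A short sketch of the three functions on simulated data. The relationship between age and income below is invented for illustration and says nothing about the course data.

```r
set.seed(631)
age    <- rnorm(100, mean = 42, sd = 12)                  # simulated
income <- 20000 + 800 * age + rnorm(100, sd = 9000)       # simulated, related to age

cov(age, income)                                  # covariance (depends on units)
r  <- cor(age, income, method = "pearson")        # correlation coefficient, between -1 and 1
ct <- cor.test(age, income, method = "pearson")   # tests H0: correlation = 0

ct$estimate   # same value as r
ct$p.value    # p-value of the test

# Pearson correlation is the covariance rescaled by both standard deviations.
all.equal(r, cov(age, income) / (sd(age) * sd(income)))
```

cor() only returns the coefficient; cor.test() is the one to use when the question is whether the coefficient is significantly different from zero.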
In-class practice
• Q6. What are the correlations and covariances of the following pairs? Use
Pearson correlation for correlation coefficient calculation. Are the
correlation coefficients significantly different from zero?
• female and age
• female and income
• age and income
Plotting
• Histogram
• hist()
• Let’s try the following code
• hist(demo$female, main = "Female histogram", xlab="Female or not", col="blue")
• Histogram by group
• ggplot2 package
• Data visualization package.
• Lots of resources are available to learn the details, e.g., https://ggplot2.tidyverse.org/
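For the base hist() function, bin boundaries can be set with the breaks argument, which plays a role similar to binwidth in ggplot2. The sketch below uses simulated ages; the values and the bin width of 5 are assumptions for illustration.

```r
set.seed(631)
age <- sample(18:80, 200, replace = TRUE)   # simulated ages between 18 and 80

# Bins of width 5 covering the whole range of the data.
h <- hist(age, breaks = seq(15, 80, by = 5),
          main = "Age histogram", xlab = "Age", col = "blue")

h$breaks   # bin boundaries that were used
h$counts   # number of observations in each bin
sum(h$counts) == length(age)   # every observation falls in exactly one bin
```

hist() returns its bin counts invisibly, so storing the result in an object is a convenient way to check the numbers behind the plot.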
Plotting
• Let’s try the following code
• install.packages("ggplot2")
• library(ggplot2)
• ggplot(demo, aes(x = female)) +
geom_histogram(color = "grey30", fill = "blue") +
ggtitle("Female ggplot histogram")
Plotting
• Let’s try the following code
• ggplot(demo, aes(x = age, fill = factor(female))) +
geom_histogram(binwidth = 1)
• ggplot(demo, aes(x = age, fill = factor(female))) +
geom_histogram(binwidth = 5)
• ggplot(demo, aes(x = age, fill = factor(female))) +
geom_histogram(binwidth = 5, position = "dodge")
• ggplot(demo, aes(x = age)) +
geom_histogram(binwidth = 1, color = "grey30") +
facet_grid(female ~ .)
In-class practice
• Q7. Plots
• Q7-1. Create histograms that show the distribution of female, age, and income.
• Q7-2. Create a histogram that shows the distribution of age by sex (female or male).
Use binwidth = 5.
• Q7-3. Create a histogram that shows the distribution of income by sex (female or
male). Use binwidth = 5000.