Module2 Analytical Tool


MKTG 631

Marketing Analytics
Module2 – R tutorial

Jinhee Huh
Marketing Analytical Tools
Data Type

• XLS and XLSX (Microsoft Excel workbooks)
• CSV (Comma-Separated Values)
  • Each row of data is stored as a line in a text file, with a comma separating each column’s value from the next.
• Tab-delimited file
  • The columns of data are stored in a text file, with a TAB character between values.
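As a minimal sketch of the difference between the two text formats, the R code below writes a small made-up table to a temporary CSV file and a temporary tab-delimited file, then reads both back. The table values and the temporary file names are illustrative assumptions; read.csv() and read.delim() are base R functions.

```r
# Small made-up table (hypothetical values, for illustration only).
toy <- data.frame(id = 1:3, age = c(25, 41, 37))

# Write the same data in both text formats to temporary files.
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".txt")
write.csv(toy, csv_path, row.names = FALSE)
write.table(toy, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Peek at the raw text: commas in one file, TAB characters in the other.
readLines(csv_path)
readLines(tsv_path)

# read.csv() expects commas; read.delim() expects TABs.
from_csv <- read.csv(csv_path)
from_tsv <- read.delim(tsv_path)

# Both files hold the same table once they are read in.
all.equal(from_csv, from_tsv)
```

XLS and XLSX files are not plain text, so base R cannot read them directly; an add-on package such as readxl is one common option.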
Open and Proprietary Programming Tools

• Open-source programming tools:
  • Programming tools that are made freely available, often developed by and for the community.
  • Adapt quickly to new methodology
• Proprietary programming tools:
  • Programming tools that are developed by a firm and sold to the public.
  • Expensive
  • Take longer to adapt to new methodology
R and Python

• R program advantages
  • Developed by data scientists
  • Larger number of ready-made packages for statistical analysis than Python
  • Supported by an integrated development environment called RStudio
  • Built-in ways to professionally visualize data
  • Easy to install and set up the work environment
• R program disadvantages
  • Less efficient for general computations, sometimes due to inefficiently written packages
R Program Practice
• Let’s practice calculating descriptive statistics, running t-tests, calculating multivariate descriptive statistics, and drawing plots.
• How can we open "Module2_Demographics.csv"?
  • demo <- read.csv("your directory path/Module2_Demographics.csv")
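The course file is not reproduced here, so the sketch below first writes a made-up stand-in to a temporary folder and then loads it the same way. The column names female, age, and income come from the in-class questions; the values and the 1/0 coding of female are assumptions.

```r
# Stand-in for Module2_Demographics.csv (made-up values; the real file
# is not reproduced here). Coding female as 1/0 is an assumption.
stand_in <- data.frame(
  female = c(1, 0, 1, 0, 1),
  age    = c(34, 45, 29, 52, 41),
  income = c(42000, 58000, 35000, 61000, 47000)
)
csv_path <- file.path(tempdir(), "Module2_Demographics.csv")
write.csv(stand_in, csv_path, row.names = FALSE)

# Load the file, then check that it arrived as expected.
demo <- read.csv(csv_path)
str(demo)      # column names and types
head(demo)     # first rows
summary(demo)  # quick per-column summary
```

If read.csv() reports that it cannot open the file, file.exists() on the same path and getwd() are quick ways to check whether the directory path is right.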
Descriptive Statistics
• Central tendency measurement: mean, median, and mode
  • Mean function: mean()
  • Median function: median()
  • Mode function: Mode() in the DescTools package
• What if there is a missing value in a vector?
  • If you just call mean() on a variable that contains a missing value, R returns NA.
  • You can get the central tendency statistics after dropping the missing observations by adding the na.rm=TRUE (or na.rm=T) argument.
  • mean(demo$age, na.rm=T)
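A small sketch of the missing-value behavior, using a made-up age vector rather than the course data. The mode is computed here with base R's table() as an illustrative alternative to DescTools::Mode(), so that the example runs without installing a package.

```r
# Hypothetical ages; the NA stands in for a missing observation.
age <- c(23, 35, 35, 41, NA, 52)

mean(age)                  # NA, because one value is missing
mean(age, na.rm = TRUE)    # 37.2
median(age, na.rm = TRUE)  # 35

# Base-R mode: the most frequent non-missing value(s).
freq <- table(age)         # table() drops NA by default
mode_age <- as.numeric(names(freq)[freq == max(freq)])
mode_age                   # 35
```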
Descriptive Statistics
• Measures of variability: range, variance, and standard deviation
  • Range: max() - min()
  • Variance: var()
  • Standard deviation: sd() or sqrt(var())
• Let’s try the following code
  • max(demo$age) - min(demo$age)
  • var(demo$age)
  • sd(demo$age)
  • sqrt(var(demo$age))
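The same calculations on a short made-up vector, so the results can be checked by hand. The values are assumptions for illustration only.

```r
# Hypothetical ages (not the course data).
age <- c(23, 35, 35, 41, 52)

# Range as a single number: largest value minus smallest value.
max(age) - min(age)   # 29
diff(range(age))      # same result; range() returns c(min, max)

# var() is the sample variance (it divides by n - 1).
var(age)              # 111.2

# sd() is the square root of var().
sd(age)
all.equal(sd(age), sqrt(var(age)))
```

Like mean(), these functions return NA when the vector contains a missing value unless na.rm = TRUE is added.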
Descriptive Statistics

• Frequency table
  • Function: table()
• How can I make a frequency table with pre-defined bin ranges?
  • table(cut(x, bin range vector))
• Let’s try the following code
  • br <- c(0, 20, 30, 50, 60, 70)
  • table(cut(demo$age, br))
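A sketch of how cut() assigns values to bins, using made-up ages. By default each bin is open on the left and closed on the right, so 20 falls in (0,20], and any value outside the break points becomes NA.

```r
# Hypothetical ages chosen to land on and around the bin edges.
age <- c(18, 20, 25, 30, 31, 47, 50, 64, 70, 75)
br  <- c(0, 20, 30, 50, 60, 70)

bins <- cut(age, br)   # factor with levels (0,20], (20,30], ...
table(bins)            # counts per bin: 2, 2, 3, 0, 2

# 75 is above the last break point, so it is not counted above.
sum(is.na(bins))                # 1
table(bins, useNA = "ifany")    # shows the NA count as well
```

When the counts do not add up to the number of observations, a value outside the bin range is the usual reason.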
In-class practice
• Use "Module2_Demographics.csv" to answer the questions below.
• Q1. What are the mean, median, and mode of female, age, and income?
• Q2. What are the range, variance, and standard deviation of female, age, and income?
• Q3. Create frequency tables for female, age, and income using the following bins.
  • female: c(-2, -1, 0, 1, 2)
  • age: c(20, 30, 50, 60, 70)
  • income: c(10000, 30000, 50000, 70000, 90000)
One-Sample t-test

• Hypotheses
  • 𝐻0: 𝜇=𝜇_0; 𝐻1: 𝜇≠𝜇_0
  • 𝐻0: 𝜇≤𝜇_0; 𝐻1: 𝜇>𝜇_0
  • 𝐻0: 𝜇≥𝜇_0; 𝐻1: 𝜇<𝜇_0
• A t-test is suitable if a variable is believed to be drawn from a normal distribution, or if the sample size is large.
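To make the two-sided test concrete, the sketch below computes the t statistic by hand as (sample mean - 𝜇_0) divided by (sample standard deviation / square root of n) and compares it with what t.test() reports. The ages and the value of 𝜇_0 are made up for illustration.

```r
# Hypothetical ages and hypothesized mean (not the course data).
x   <- c(38, 42, 45, 36, 41, 39, 44, 40)
mu0 <- 40
n   <- length(x)

# t statistic: distance of the sample mean from mu0 in standard-error units.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# t.test() should report the same numbers.
res <- t.test(x, mu = mu0)
all.equal(unname(res$statistic), t_manual)
all.equal(res$p.value, p_manual)
```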
One-Sample t-test
• How can I conduct the one-sample t-test?
• t.test(demo$age, mu=0)
  • Reject the null hypothesis that the average age is 0.
• t.test(demo$age, mu=40)
  • Cannot reject the null hypothesis that the average age is 40.
• t.test(demo$age, mu=45, alternative="greater")
  • 𝐻0: 𝜇≤𝜇_0; 𝐻1: 𝜇>𝜇_0
  • Cannot reject the null hypothesis that the average age is less than or equal to 45.
• t.test(demo$age, mu=45, alternative="less")
  • 𝐻0: 𝜇≥𝜇_0; 𝐻1: 𝜇<𝜇_0
  • Reject the null hypothesis that the average age is greater than or equal to 45.
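The calls above can be tried without the course file on simulated data, as in the sketch below. The simulated ages, the seed, and the helper decide() are assumptions for illustration, so the conclusions printed here need not match the ones reported for the real data.

```r
set.seed(631)  # arbitrary seed so the simulated data are reproducible
age <- rnorm(100, mean = 40, sd = 10)  # simulated ages, not the course data

two_sided <- t.test(age, mu = 0)
greater   <- t.test(age, mu = 45, alternative = "greater")
less      <- t.test(age, mu = 45, alternative = "less")

# Pieces of the result object that are useful to report.
two_sided$statistic   # t value
two_sided$p.value     # p-value
two_sided$conf.int    # 95% confidence interval for the mean

# Hypothetical helper: turn a p-value into a decision at level alpha.
decide <- function(res, alpha = 0.05) {
  if (res$p.value < alpha) "reject H0" else "cannot reject H0"
}
decide(two_sided)
decide(greater)
decide(less)

# For the same mu, the two one-sided p-values add up to 1.
greater$p.value + less$p.value
```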
In-class practice

• Q4. One-sample t-test
  • Q4-1. Is the female variable average significantly different from zero?
  • Q4-2. Is the age variable average significantly different from 40?
    • Tage <- t.test(demo$age, mu=40)
    • Tage
    • Tage$statistic
    • Tage$p.value
    • if (Tage$p.value < .05) print("Reject the null hypothesis")
  • Q4-3. Is the income variable average significantly greater than 30000?


Two-sample t-test
• Hypotheses
  • 𝐻0: 𝜇_1=𝜇_2; 𝐻1: 𝜇_1≠𝜇_2
    • Test the null hypothesis that the means of two groups are equal.
    • Test the null hypothesis that the means of two variables are equal.
  • 𝐻0: 𝜇_1≤𝜇_2; 𝐻1: 𝜇_1>𝜇_2
  • 𝐻0: 𝜇_1≥𝜇_2; 𝐻1: 𝜇_1<𝜇_2


Two-sample t-test
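• R function
  • t.test(x, y, alternative = "two.sided", var.equal = FALSE)
  • alternative = "two.sided", "greater", or "less"
    • alternative = "greater"
      • 𝐻0: 𝜇_1≤𝜇_2; 𝐻1: 𝜇_1>𝜇_2
    • alternative = "less"
      • 𝐻0: 𝜇_1≥𝜇_2; 𝐻1: 𝜇_1<𝜇_2
  • var.equal = TRUE or FALSE (T or F): whether the variances of the two groups are assumed to be the same. These are logical values, so they are written without quotation marks. With the default, FALSE, R runs Welch’s t-test.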
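A sketch of how the two arguments change the test, run on two simulated groups. The group sizes, means, and seed are assumptions for illustration only.

```r
set.seed(2)  # arbitrary seed so the simulated data are reproducible
x <- rnorm(40, mean = 50, sd = 8)  # simulated group 1
y <- rnorm(50, mean = 47, sd = 8)  # simulated group 2

# var.equal = FALSE (the default) gives Welch's test;
# var.equal = TRUE gives the pooled-variance test.
welch  <- t.test(x, y, alternative = "two.sided", var.equal = FALSE)
pooled <- t.test(x, y, alternative = "two.sided", var.equal = TRUE)

welch$parameter    # Welch degrees of freedom, usually not a whole number
pooled$parameter   # n1 + n2 - 2 = 88

# The direction depends on the order of x and y:
# "greater" for (x, y) is the same test as "less" for (y, x).
p_greater <- t.test(x, y, alternative = "greater")$p.value
p_less    <- t.test(y, x, alternative = "less")$p.value
all.equal(p_greater, p_less)
```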


Two-sample t-test

• Do we need to use var.equal = FALSE or TRUE?
  • var(female$age); var(male$age)
  • var.test(female$age, male$age, alternative = "two.sided")
    • Cannot reject the null hypothesis that the variances of female age and male age are the same.
  • t.test(female$age, male$age, var.equal=T)
    • Can reject the null hypothesis that the average female age and the average male age are the same (given the equal variance assumption).
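The calls above use two data frames, female and male, that hold the rows for each group. One way to build them and to carry the variance test result into t.test() is sketched below on simulated data. The 1/0 coding of the female column, the simulated values, and the 0.05 cutoff are assumptions.

```r
set.seed(5)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for demo; female coded 1/0 is an assumption.
demo <- data.frame(
  female = rep(c(1, 0), each = 30),
  age    = c(rnorm(30, mean = 38, sd = 9), rnorm(30, mean = 43, sd = 9))
)

# Split the rows by group.
female <- demo[demo$female == 1, ]
male   <- demo[demo$female == 0, ]

# Step 1: F test of equal variances.
vt <- var.test(female$age, male$age, alternative = "two.sided")

# Step 2: assume equal variances only if the F test does not reject.
equal_var <- vt$p.value >= 0.05

# Step 3: two-sample t-test with the matching var.equal setting.
res <- t.test(female$age, male$age, var.equal = equal_var)
res
```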
In-class practice
• Q5. Two-sample t-test

• Tip: For the following questions, first test whether the variances are the same, then set the var.equal argument accordingly.

• Q5-1. Is average female income significantly different from average male income?

• Q5-2. Is average female age significantly different from average male age?
Multivariate Descriptive Statistics

• Measures of the relationship between two variables
  • Covariance: cov()
  • Correlation
    • cor(x, y, method = c("pearson", "kendall", "spearman"))
    • cor.test(x, y, method = c("pearson", "kendall", "spearman"))
    • Pick one method when calling the function, e.g. method = "pearson"; "pearson" is the default.
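A sketch of the three functions on made-up age and income values. It also checks the link between them: the Pearson correlation is the covariance divided by the product of the two standard deviations.

```r
# Hypothetical values (not the course data).
age    <- c(25, 32, 41, 47, 53, 60)
income <- c(30000, 38000, 52000, 50000, 64000, 71000)

# Covariance depends on the units of both variables.
cov(age, income)

# Correlation is unit-free and lies between -1 and 1.
r <- cor(age, income, method = "pearson")
all.equal(r, cov(age, income) / (sd(age) * sd(income)))

# cor.test() adds a test of H0: the correlation is zero.
ct <- cor.test(age, income, method = "pearson")
ct$estimate   # same value as r
ct$p.value    # p-value for H0: correlation = 0
```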
In-class practice

• Q6. What are the correlations and covariances of the following pairs? Use the Pearson correlation for the correlation coefficient calculation. Are the correlation coefficients significantly different from zero?
  • female and age
  • female and income
  • age and income
Plotting
• Histogram
  • hist()
  • Let’s try the following code
    • hist(demo$female, main = "Female histogram", xlab="Female or not", col="blue")
• Histogram by group
  • ggplot2 package
    • Data visualization package.
    • Lots of resources are available to learn the details, e.g. https://ggplot2.tidyverse.org/
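A base R sketch of hist() on made-up ages, including the breaks argument for choosing the bin edges and plot = FALSE for reading the counts without drawing. The values and bin edges are assumptions for illustration.

```r
# Hypothetical ages (not the course data).
age <- c(22, 27, 31, 35, 38, 41, 44, 49, 53, 58, 62, 67)

# Draw the histogram with a title, an axis label, and a fill color.
hist(age, main = "Age histogram", xlab = "Age", col = "blue")

# breaks sets the bin edges; every value must fall inside them.
# plot = FALSE returns the histogram object without drawing it.
h <- hist(age, breaks = c(20, 30, 40, 50, 60, 70), plot = FALSE)
h$counts   # 2 3 3 2 2
h$mids     # bin midpoints
```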
Plotting
• Let’s try the following code
  • install.packages("ggplot2")
  • library(ggplot2)
  • ggplot(demo, aes(x = female)) +
      geom_histogram(color = "grey30", fill = "blue") +
      ggtitle("Female ggplot histogram")
Plotting
• Let’s try the following code
  • If female is stored as a numeric 0/1 variable, wrap it in factor() so that ggplot2 treats it as a group when filling.
  • ggplot(demo, aes(x = age, fill = factor(female))) +
      geom_histogram(binwidth = 1)
  • ggplot(demo, aes(x = age, fill = factor(female))) +
      geom_histogram(binwidth = 5)
  • ggplot(demo, aes(x = age, fill = factor(female))) +
      geom_histogram(binwidth = 5, position = "dodge")
  • ggplot(demo, aes(x = age)) +
      geom_histogram(binwidth = 1, color = "grey30") +
      facet_grid(female ~ .)
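A sketch of a grouped histogram on made-up data. It assumes female is a numeric 0/1 column and first converts it to a labelled factor (the column name sex and the labels are hypothetical), which also gives a readable legend. The requireNamespace() guard only skips the plot when ggplot2 is not installed.

```r
# Made-up stand-in for demo; female coded 0/1 is an assumption.
demo <- data.frame(
  female = rep(c(0, 1), times = c(6, 6)),
  age    = c(24, 31, 38, 45, 52, 59, 22, 29, 36, 43, 50, 57)
)

# Turn the numeric indicator into a labelled grouping variable.
demo$sex <- factor(demo$female, levels = c(0, 1),
                   labels = c("male", "female"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One set of bars per group, placed side by side within each bin.
  p <- ggplot(demo, aes(x = age, fill = sex)) +
    geom_histogram(binwidth = 5, position = "dodge", color = "grey30")
  print(p)
}
```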
In-class practice

• Q7. Plots

• Q7-1. Create histograms that show the distribution of female, age, and income.
• Q7-2. Create a histogram that shows the distribution of age by sex (female or male). Use binwidth = 5.
• Q7-3. Create a histogram that shows the distribution of income by sex (female or male). Use binwidth = 5000.
