In R, the chi-square test is used to check whether the distribution of one categorical variable differs across the categories of another. It compares the observed counts in each combination of categories with the counts that would be expected if the variables were unrelated, for two or more independent groups.
The chi-square test of independence helps to find out if there's a relationship between the categories of two variables. Of the two main types of data, numerical (numbers) and categorical (categories), the test is performed on categorical data.
Syntax:
chisq.test(data)
Parameters:
- data: a table or matrix containing the counts for each combination of categories of the variables.
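As a minimal sketch of the call, the example below builds a small 2x2 table of made-up counts (the matrix counts and its labels are illustrative assumptions, not data from this article), runs chisq.test() on it, and recomputes the statistic by hand from the expected counts. Setting correct = FALSE switches off the Yates continuity correction that chisq.test() applies to 2x2 tables by default, so the two values agree.

```r
# Hypothetical counts: rows are groups, columns are answer categories
counts <- matrix(c(20, 30,
                   25, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Group = c("A", "B"),
                                 Answer = c("Yes", "No")))

# Pearson chi-square test without the continuity correction
res <- chisq.test(counts, correct = FALSE)
print(res)

# Same statistic by hand: expected = row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
manual_stat <- sum((counts - expected)^2 / expected)
print(manual_stat)
```

The hand computation shows what the test measures: the larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.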
Implementation of Chi-Square test
We can implement the Chi-Square test in the R programming language using the built-in chisq.test() function, with the survey dataset from the MASS package as example data.
1. Installing the libraries
We can install the MASS package using the install.packages() function and load it using the library() function once installed. The MASS library contains various datasets and functions for statistical analysis. We will use the str() function to display the structure of the survey dataset.
R
install.packages("MASS")
library(MASS)
# str() prints the structure itself; wrapping it in print() would also print NULL
str(survey)
Output:
Summary of survey dataset
2. Creating a Contingency Table from Survey Data
We first select the Smoke and Exer variables from the survey dataset into a data frame. The Smoke column records the students' smoking habits, while the Exer column records their exercise level. Then, we use the table() function to create the contingency table stu_data, which counts the students in each combination of Smoke and Exer. Our aim is to test, at the .05 significance level, the null hypothesis that the students' smoking habit is independent of their exercise level.
R
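# Keep only the two variables of interest
stu_df <- data.frame(Smoke = survey$Smoke, Exer = survey$Exer)
# Cross-tabulate: rows are Smoke levels, columns are Exer levels
stu_data <- table(stu_df$Smoke, stu_df$Exer)
print(stu_data)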
Output:
Table of smoke and exercise variables
3. Applying Chi-Square Test
We are applying the chisq.test() function to the stu_data contingency table to perform a chi-square test. This test evaluates whether there is a significant association between the Smoke and Exer variables in the dataset.
R
print(chisq.test(stu_data))
Output:
Chi-Square test result
As the p-value 0.4828 is greater than the .05 significance level, we fail to reject the null hypothesis that the smoking habit is independent of the exercise level of the student. In other words, the data provide no significant evidence of an association between the two variables.
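To look beyond the printed summary, the sketch below stores the test result and inspects its components. It rebuilds stu_data so that it runs on its own, and the names res and res_sim are illustrative. Some cells in this table have small expected counts, in which case R warns that the chi-squared approximation may be incorrect. One common option, shown here as an assumption rather than as part of the original analysis, is to request a Monte Carlo p-value with simulate.p.value = TRUE.

```r
library(MASS)  # provides the survey dataset

stu_data <- table(survey$Smoke, survey$Exer)

# Store the result; the warning about small expected counts is suppressed here
res <- suppressWarnings(chisq.test(stu_data))

res$statistic           # chi-square statistic
res$parameter           # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value             # p-value compared against the .05 level
round(res$expected, 2)  # expected counts under independence

# Monte Carlo p-value, which does not rely on the chi-square approximation
set.seed(1)  # arbitrary seed so the simulation is repeatable
res_sim <- chisq.test(stu_data, simulate.p.value = TRUE, B = 10000)
res_sim$p.value
```

If the simulated p-value is also above .05, the conclusion of the test is unchanged.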
4. Visualize the Chi-Square Test data
We are creating a bar plot using the barplot() function to visualize the relationship between smoking habits and exercise levels from the stu_data contingency table. Each group of bars corresponds to an exercise level, and beside = TRUE places the bars for the four smoking categories side by side, with one color per category (red, lightgreen, lightblue and blue) as listed in the legend.
R
barplot(stu_data, beside = TRUE, col = c("red", "lightgreen","lightblue","blue"),
main = "Smoking Habits vs Exercise Levels",
xlab = "Exercise Level", ylab = "Number of Students")
legend("center", legend = rownames(stu_data), fill = c("red", "lightgreen","lightblue","blue"))
Output:
Bar plot of smoking habits vs exercise levels
In this article, we explored how to create a contingency table, apply the chi-square test and visualize the relationship between two variables using a bar plot in R.