
Chi-Square Test for Independence

Last Updated : 23 Jul, 2025

The Chi-Square Test for Independence is a statistical method used to determine whether there is a significant association between two categorical variables. It helps assess whether the distribution of one variable differs depending on the level of another variable. This test is widely applied in various fields, including market research, social sciences, and healthcare.

Objective of the Chi-Square Test for Independence

The Chi-Square Test for Independence determines if two categorical variables are independent or associated.

Null Hypothesis (H₀): There is no association between the two variables, implying they are independent.
Alternative Hypothesis (H₁): There is an association between the two variables, implying they are dependent.

Mathematical Formula

The Chi-Square statistic is calculated using the formula:

\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}

Where:

O_i is the observed frequency in cell i of the contingency table.
E_i is the expected frequency in cell i under independence, computed as (row total × column total) / grand total.
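As a minimal sketch of the formula, the following computes the expected frequencies and the statistic by hand with NumPy. The 2×2 table of counts is hypothetical and chosen only for illustration, and the variable names are not part of any library.

```python
import numpy as np

# Hypothetical 2x2 table of observed counts (illustrative values only)
observed = np.array([[10, 20],
                     [30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected count per cell under independence:
# (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2_manual, 4))
```

For this table the expected counts are 12, 18, 28 and 42, and the statistic is about 0.7937.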

Assumptions of the Chi-Square Test

Categorical Variables: Both variables should be categorical.
Independence: The observations should be independent.
Sufficient Sample Size: As a rule of thumb, the expected frequency in each cell should be at least 5 for the Chi-Square approximation to be valid.
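The sample-size assumption can be checked before running the test. The sketch below uses `scipy.stats.contingency.expected_freq` to compute the expected counts; the helper name `expected_counts_ok`, the threshold argument, and both example tables are assumptions made for illustration.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def expected_counts_ok(observed, minimum=5):
    """Return True if every expected cell count is at least `minimum`."""
    expected = expected_freq(np.asarray(observed))
    return bool((expected >= minimum).all())

# Illustrative tables: the first has large counts, the second very small ones
large_table = [[25, 20, 15], [30, 10, 20]]
small_table = [[3, 2], [1, 4]]

print(expected_counts_ok(large_table))   # smallest expected count is 15
print(expected_counts_ok(small_table))   # smallest expected count is 2
```

A table that fails this check may call for merging sparse categories or using a different test.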

Chi-Square Test for Independence in Python

Let’s implement a Chi-Square Test for Independence using Python. We will analyze whether gender and preference for a particular brand are independent.

Python Code Implementation


import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Creating the contingency table
data = np.array([[25, 20, 15],   # Male preferences
                 [30, 10, 20]])  # Female preferences

# Converting to DataFrame for better visualization
df = pd.DataFrame(data, columns=['Prefer_A', 'Prefer_B', 'Prefer_C'], index=['Male', 'Female'])

print("Contingency Table:")
print(df)

# Performing Chi-Square Test
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Displaying results
print(f"\nChi-Square Statistic: {chi2_stat:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(expected)
print(f"P-value: {p_value:.4f}")

# Significance level
alpha = 0.05

# Decision
if p_value <= alpha:
    print("\nReject Null Hypothesis: There is a significant association between gender and brand preference.")
else:
    print("\nFail to Reject Null Hypothesis: No significant association between gender and brand preference.")

Output

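The script prints the contingency table, then a Chi-Square statistic of 4.5022, 2 degrees of freedom, the expected frequencies table, and a p-value of 0.1053. It ends with the "Fail to Reject Null Hypothesis" message.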

Interpreting the Results

Chi-Square Statistic: The value χ² = 4.5022 summarizes how far the observed frequencies deviate from the frequencies expected under independence. Larger values indicate a greater discrepancy.

Degrees of Freedom: The degrees of freedom for a contingency table are calculated as:

dof = (number of rows − 1) × (number of columns − 1)

In this case: dof = (2 − 1) × (3 − 1) = 2
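The degrees of freedom also determine the critical value against which the statistic can be compared. The article itself uses the p-value; the critical-value comparison sketched here is an equivalent, standard alternative, and the variable names are chosen for illustration.

```python
from scipy.stats import chi2

n_rows, n_cols = 2, 3   # shape of the gender-by-brand table
alpha = 0.05            # significance level

dof = (n_rows - 1) * (n_cols - 1)

# Value the statistic must exceed to reject H0 at level alpha
critical_value = chi2.ppf(1 - alpha, dof)

print(dof)
print(round(critical_value, 4))
```

With 2 degrees of freedom the critical value is about 5.9915. The observed statistic of 4.5022 falls below it, so the null hypothesis is not rejected.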

P-value: The p-value is 0.1053, which is greater than the significance level α = 0.05, so we fail to reject the null hypothesis. The data do not provide sufficient evidence of an association between gender and brand preference.
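The p-value is the upper-tail probability of the Chi-Square distribution at the observed statistic. As a small sketch, it can be recovered from the reported statistic and degrees of freedom with `scipy.stats.chi2.sf`; the statistic is typed in here rounded to four decimals, as reported.

```python
from scipy.stats import chi2

chi2_stat = 4.5022   # statistic reported for the gender-by-brand table
dof = 2              # (2 - 1) * (3 - 1)

# Survival function: P(X >= chi2_stat) for X ~ chi-square(dof)
p_value = chi2.sf(chi2_stat, dof)

print(round(p_value, 4))
```

This reproduces the p-value of 0.1053.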

When to Use Chi-Square Test for Independence?

When you need to check for associations between two categorical variables.
When analyzing survey responses or contingency tables.
When working in fields such as market research, sociology, and healthcare, where categorical data is common.
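Survey data usually arrives as one row per respondent rather than as a ready-made table. The sketch below shows one way to build the contingency table with `pandas.crosstab` and pass it to `chi2_contingency`. The column names and the eight responses are hypothetical, and a sample this small is far too sparse for a valid test; it only illustrates the tabulation step.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw survey responses, one row per respondent
responses = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female",
               "Male", "Female", "Female", "Male"],
    "brand":  ["A", "A", "B", "C", "A", "B", "A", "C"],
})

# Cross-tabulate the two categorical columns into a table of counts
table = pd.crosstab(responses["gender"], responses["brand"])
print(table)

# The resulting table can be passed directly to the test
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(dof)
```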

Limitations of the Chi-Square Test

Sample Size Dependency: The Chi-Square approximation can be unreliable when expected cell frequencies are small (below about 5).
Non-Causality: It only identifies associations, not causal relationships.
Categorical Limitation: Only applicable to categorical data.


Bhumi Mittal
