Exploratory Data Analysis
• Your goal during EDA is to develop an understanding of your data. The easiest
way to do this is to use questions as tools to guide your investigation.
• When you ask a question, the question focuses your attention on a specific part
of your dataset and helps you decide which graphs, models, or transformations
to make.
• You can loosely word these questions as:
  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?
Important terms in EDA
• To make the discussion easier, let’s define some terms:
• A variable is a quantity, quality, or property that you can measure.
• A value is the state of a variable when you measure it. The value of a variable may change from
  measurement to measurement.
• An observation, or a case, is a set of measurements made under similar conditions (you usually make all of
  the measurements in an observation at the same time and on the same object). An observation will contain
  several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.
• Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if
  each value is placed in its own “cell,” each variable in its own column, and each observation in its own row.
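• As a small illustration (a made-up data frame, not from the text), a tidy data set in R has one observation per row and one variable per column:
# A tiny tidy data set: each row is an observation, each column a variable, each cell one value
measurements <- data.frame(
  subject   = c("A", "B", "C"),
  eye_color = c("brown", "blue", "green"),
  height_cm = c(170, 165, 180)
)
print(measurements)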
Variation
• Variation is the tendency of the values of a variable to change from
measurement to measurement.
• Categorical variables can also vary if you measure across different subjects
(e.g., the eye colors of different people), or different times (e.g., the energy
levels of an electron at different moments).
• Every variable has its own pattern of variation, which can reveal interesting
information. The best way to understand that pattern is to visualize the
distribution of variables’ values.
Visualizing distributions
• How you visualize the distribution of a variable will depend on whether the
variable is categorical or continuous. A variable is categorical if it can only
take one of a small set of values. In R, categorical variables are usually saved
as factors or character vectors. To examine the distribution of a categorical
variable, use a bar chart.
• A variable is continuous if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables. To examine
the distribution of a continuous variable, use a histogram.
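• For example, a bar chart of a categorical variable can be drawn with ggplot2; the diamonds data set bundled with ggplot2 is used here purely as an illustrative assumption:
library(ggplot2)

# Bar chart: distribution of the categorical variable `cut` in the diamonds data set
ggplot(diamonds) +
  aes(x = cut) +
  geom_bar()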
Typical Values
• In both bar charts and histograms, tall bars show the common values of a variable,
and shorter bars show less-common values. Places that do not have bars reveal
values that were not seen in your data.
• To turn this information into useful questions, look for anything unexpected:
• Which values are the most common? Why?
• Which values are rare? Why?
• Does that match your expectations?
• Can you see any unusual patterns? What might explain them?
• As an example, the following histogram of diamond carat sizes suggests several interesting questions:
  • Why are there more diamonds at whole carats and common fractions of carats?
  • Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
  • Why are there no diamonds bigger than 3 carats?
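• A minimal sketch that reproduces this kind of histogram, assuming the diamonds data set from ggplot2; a fine binwidth makes the peaks at whole and common fractional carats visible:
library(ggplot2)

# Histogram of carat with a small binwidth; peaks appear at common carat sizes
ggplot(diamonds) +
  aes(x = carat) +
  geom_histogram(binwidth = 0.01)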
Unusual values
• Outliers are observations that are unusual; data points that don’t seem to fit the pattern.
Sometimes outliers are data entry errors; other times outliers suggest important new
science. When you have a lot of data, outliers are sometimes difficult to see in a histogram.
• Minimum and maximum
• The first step to detect outliers in R is to start with some descriptive statistics, and in
particular with the minimum and maximum.
• In R, this can be done using the summary() function.
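• For example, assuming a data frame dat with a numeric column hwy (the same hypothetical data used in the histogram example below):
# Five-number summary plus the mean; the Min. and Max. values are a quick first check for outliers
summary(dat$hwy)

# Or the extremes directly
min(dat$hwy)
max(dat$hwy)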
• Histogram
• Another basic way to detect outliers is to draw a histogram of the data.
• library(ggplot2)
• ggplot(dat) +
•   aes(x = hwy) +
•   geom_histogram(
•     bins = round(sqrt(length(dat$hwy))),  # set number of bins
•     fill = "steelblue",
•     color = "black"
•   ) +
•   theme_minimal()
Outliers
• Boxplot
• In addition to histograms, boxplots are also useful to detect potential outliers.
• Using R base: boxplot(dat$hwy, ylab = "hwy")
• A boxplot helps to visualize a quantitative variable by displaying five common location summaries
  (minimum, first quartile, median, third quartile and maximum) and any observation that is classified as a
  suspected outlier using the interquartile range (IQR) criterion.
• The IQR criterion means that all observations above q0.75 + 1.5·IQR or below q0.25 − 1.5·IQR (where q0.25 and
  q0.75 correspond to the first and third quartiles respectively, and IQR is the difference between the third and
  first quartiles) are considered potential outliers by R.
• In other words, all observations outside of the following interval will be considered potential outliers:
• I = [q0.25 − 1.5·IQR; q0.75 + 1.5·IQR]
• Observations considered potential outliers by the IQR criterion are displayed as points in the boxplot.
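• A minimal sketch of the same IQR criterion computed by hand (again assuming the hypothetical dat$hwy column):
# First and third quartiles and the interquartile range
q25 <- quantile(dat$hwy, 0.25)
q75 <- quantile(dat$hwy, 0.75)
iqr <- q75 - q25          # equivalently: IQR(dat$hwy)

# Observations outside [q25 - 1.5*IQR, q75 + 1.5*IQR] are potential outliers
dat$hwy[dat$hwy < q25 - 1.5 * iqr | dat$hwy > q75 + 1.5 * iqr]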
Outliers
• Percentiles
• This method of outliers detection is based on the percentiles.
• With the percentiles method, all observations that lie outside the interval formed
by the 2.5 and 97.5 percentiles will be considered as potential outliers. Other
percentiles such as the 1 and 99, or the 5 and 95 percentiles can also be
considered to construct the interval.
• The values of the lower and upper percentiles (and thus the lower and upper
limits of the interval) can be computed with the quantile() function:
• lower_bound <- quantile(dat$hwy, 0.025)
• lower_bound
• upper_bound <- quantile(dat$hwy, 0.975)
• upper_bound
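• Once the bounds are computed, the observations outside them can be listed, for example with which() (a small extension of the snippet above):
# Indices and values of observations outside the 2.5th–97.5th percentile interval
outlier_idx <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
dat$hwy[outlier_idx]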
Missing values
• Dealing with Missing Values in R
• Missing values in R are handled with the use of some pre-defined functions:
• is.na() function for finding missing values:
• This function returns a logical vector that indicates which elements are NA: TRUE where a value is
  missing and FALSE otherwise.
• x <- c(NA, 3, 4, NA, NA, NA)
• is.na(x)
  [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
Missing values
• na.omit() − It simply rules out any rows that contain a missing value and forgets those rows forever.
• na.exclude() − This argument ignores rows having at least one missing value.
• na.pass() − Take no action.
• na.fail() − It terminates the execution if any missing values are found.
• myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
• na.exclude(myVector)
• OUTPUT:
  [1] "TP"  "4"   "6.7" "c"   "12"
  attr(,"na.action")
  [1] 1 6
  attr(,"class")
  [1] "exclude"
Missing values
• myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
• na.omit(myVector)
  [1] "TP"  "4"   "6.7" "c"   "12"
  attr(,"na.action")
  [1] 1 6
  attr(,"class")
  [1] "omit"
• myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
• na.fail(myVector)
  Error in na.fail.default(myVector) : missing values in object
Importance of Data Transformation and Normalization
Data transformation and normalization are crucial preprocessing
steps in data analysis, especially for machine learning models. They
help ensure that features contribute equally to the model, improving
convergence speed and prediction accuracy.
Normalization adjusts the values in a dataset to a common scale
without distorting differences in ranges, which is particularly
important when features have different units or scales.
Methods of Normalization in R
Min-Max Normalization:
This method scales the data values to a specific range, typically [0, 1]. It is useful when you
want to maintain the relationships between values while ensuring they fit within a defined scale.
Implementation in R:
• # Sample data
• data <- c(1200, 34567, 3456, 12, 3456)
• min_max_normalization <- function(x) {
• return((x - min(x)) / (max(x) - min(x)))
• }
• normalized_data <- min_max_normalization(data)
• print(normalized_data)
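• To scale every column of a data frame the same way (an extension of the snippet above, using a made-up data frame df):
# Apply min-max scaling column by column
df <- data.frame(a = c(1, 5, 10), b = c(100, 200, 300))
df_scaled <- as.data.frame(lapply(df, min_max_normalization))
print(df_scaled)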
Methods of Normalization in R
Z-Score Normalization (Standardization):
This technique transforms data to have a mean of 0 and a standard deviation of 1. It is
particularly beneficial when dealing with algorithms that assume normally distributed data.
Implementation in R:
• # Sample data
• data <- c(5, 10, 15, 20, 25)
• z_score_normalization <- function(x) {
• return((x - mean(x)) / sd(x))
• }
• standardized_data <- z_score_normalization(data)
• print(standardized_data)
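• Base R’s scale() function performs the same centering and scaling (it returns a matrix with the centering and scaling values stored as attributes):
# Equivalent standardization with base R
scale(data)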
Role of Inferential Statistics
• Inferential statistics involves making predictions or inferences about a population based on a sample of data. It allows
researchers to draw conclusions and make decisions based on statistical evidence, rather than relying solely on
descriptive statistics.
• Key concepts include hypothesis testing, confidence intervals, and correlation analysis, which help in understanding
relationships between variables and generalizing findings beyond the sample.
• Correlation Analysis:
• Correlation analysis is a specific method within inferential statistics used to assess the strength and direction of the
relationship between two quantitative variables.
• The Pearson correlation coefficient is the most commonly used measure for linear relationships.
• Pearson Correlation Coefficient:
• The Pearson correlation coefficient (denoted as r) ranges from -1 to 1:
• 1: Perfect positive correlation
• -1: Perfect negative correlation
• 0: No correlation
• It quantifies how much one variable changes when another variable changes.
Pearson Correlation
• Assumptions for Pearson Correlation:
• Both variables should be continuous and measured on an interval or ratio scale.
• The data should be normally distributed for both variables.
• There should be a linear relationship between the two variables (linearity).
• Homoscedasticity should be present, meaning that the spread of residuals is
constant across all levels of the independent variable.
• Computing Pearson Correlation in R:
To compute the Pearson correlation coefficient in R, you can use the cor() function
or the cor.test() function.
Pearson Correlation
• # Sample data
• x <- c(10, 20, 30, 40, 50)
• y <- c(15, 25, 35, 45, 55)
• # Compute Pearson correlation
• correlation_coefficient <- cor(x, y, method = "pearson")
• print(paste("Pearson Correlation Coefficient:", correlation_coefficient))
• # Perform correlation test
• correlation_test <- cor.test(x, y, method = "pearson")
• print(correlation_test)
• Interpreting Results:
• The output from cor.test() will provide the Pearson correlation coefficient along with a p-value.
• A significant p-value (typically < 0.05) indicates that there is a statistically significant linear relationship between the two
variables.
• Practical significance should also be considered; even if a correlation is statistically significant, it may not imply a strong or practically meaningful relationship.
Non-Parametric Tests
• A non-parametric test for investigating associations between two categorical,
binary variables
• There are many occasions when we want to investigate whether there is an
association between two categorical, binary variables, such as looking at
whether gender (male/female) and voting (yes/no) are related.
• Fisher’s exact test is a statistical hypothesis test used to assess the association
between two binary variables in a contingency table and is particularly useful
when working with small sample sizes. It is a non-parametric test, meaning it makes no
assumption about the distribution of the data, and it is analogous to the
chi-squared test for independence.
Fisher’s Exact Test
• Fisher’s exact test makes use of contingency tables to calculate the probability of
observing the data as it is, considering all other possible arrangements of the observed
data while maintaining the row and column totals fixed.
• Assumptions for Fisher’s exact test are as follows:
• Both variables should be categorical and binary, meaning they can take one of two
values, so that a 2 x 2 contingency table can be populated.
• Data should be randomly selected from independent samples; groups should have no
relationship to each other, and observations cannot fall into more than one category
simultaneously.
• One or more cell value counts in the contingency table is small (less than 5). Where all
values are more than 5, a chi-squared test should be performed instead. While Fisher’s
exact test is theoretically valid when samples are large, it is computationally intensive
and so usually only used for small samples.
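• A minimal sketch in R, using a made-up 2 x 2 contingency table (gender by vote, echoing the earlier example) and the base fisher.test() function:
# Hypothetical 2 x 2 contingency table: rows = gender, columns = vote
votes <- matrix(c(3, 9,
                  7, 2),
                nrow = 2, byrow = TRUE,
                dimnames = list(gender = c("male", "female"),
                                vote   = c("yes", "no")))

# Fisher's exact test of association between the two binary variables
fisher.test(votes)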
Mann Whitney U-Test
• Mann-Whitney U Test: Overview
• The Mann-Whitney U test (also known as the Wilcoxon rank-sum test) is a non-
parametric statistical test used to compare two independent groups to determine
whether their distributions are the same or if one tends to have higher values than the
other. It is an alternative to the independent samples t-test when the assumptions of
normality or homogeneity of variance are not met.
• When to Use the Mann-Whitney U Test
1. Non-Normal Data: When the data does not follow a normal distribution.
2. Ordinal Data: When the data is ordinal (ranked) rather than continuous.
3. Small Sample Sizes: When the sample size is small, and parametric tests are not suitable.
4. Independent Samples: When comparing two independent groups.
Mann Whitney U-Test
• Scenario:
• A researcher wants to compare the effectiveness of two teaching methods
(Method A and Method B) on student test scores. The test scores (out of 100)
for two independent groups of students are as follows:
• Method A: 78, 85, 80, 90, 88
• Method B: 72, 75, 70, 80, 77
• Step 1: Formulate Hypotheses
• H₀: There is no difference in test scores between Method A and Method B.
• H₁: There is a difference in test scores between Method A and Method B.
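• A minimal sketch of running the test for this scenario (method_a and method_b simply hold the scores listed above):
# Test scores from the scenario
method_a <- c(78, 85, 80, 90, 88)
method_b <- c(72, 75, 70, 80, 77)

# Mann-Whitney U test (wilcox.test() performs the rank-sum test for two independent samples)
wilcox.test(method_a, method_b)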
Mann Whitney U-Test
1. Hypotheses:
Null Hypothesis (H0): There is no difference in the central tendency between the two
groups.
Alternative Hypothesis (H1): There is a difference in the central tendency between the
two groups.
2. Implementation in R:
To perform a Mann-Whitney U test in R, you can use the wilcox.test() function. Here’s a
simple code snippet:
# Sample data
group1 <- c(5, 7, 8, 6, 9)
group2 <- c(4, 3, 5, 2, 6)
# Mann-Whitney U Test
result <- wilcox.test(group1, group2)
# Output results
print(result)
Mann Whitney U test
Interpretation of Results:
The output will provide the test statistic (labelled W by wilcox.test(), equivalent to the Mann-Whitney U) and a p-value.
If the p-value is less than the significance level
(commonly set at 0.05), you would reject the null
hypothesis, indicating a statistically significant
difference between the two groups.
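The result returned by wilcox.test() is a standard htest object, so its pieces can be extracted programmatically (a small extension of the snippet above):
# Pull out the test statistic and p-value
result$statistic
result$p.value

# Simple decision at the 0.05 significance level
if (result$p.value < 0.05) {
  print("Reject H0: the two groups differ")
} else {
  print("Fail to reject H0")
}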
Wilcoxon Signed-Rank Test
• The Wilcoxon Signed-Rank Test is a non-parametric statistical test used to
compare two related samples or repeated measurements on a single sample to
assess whether their population mean ranks differ. It is an alternative to the
paired samples t-test when the assumptions of normality are not met.
• When to Use the Wilcoxon Signed-Rank Test
1. Paired Data: When comparing two related samples (e.g., pre-test and post-test scores).
2. Non-Normal Data: When the differences between pairs are not normally distributed.
3. Ordinal Data: When the data is ordinal (ranked) rather than continuous.
4. Small Sample Sizes: When the sample size is small, and parametric tests are not suitable.
Wilcoxon Signed-Rank Test
1. Assumptions of the Wilcoxon Signed-Rank Test:
The samples must be paired or matched (dependent
samples).
The differences between paired observations should be
continuous and can be ranked.
The distribution of differences should be symmetric
around the median.
2. Hypotheses:
Null Hypothesis (H0): The median difference between the paired
observations is zero.
Alternative Hypothesis (H1): The median difference between the
paired observations is not equal to zero.
Wilcoxon Signed-Rank Test
3. Implementation in R:
To perform a Wilcoxon Signed-Rank Test in R, you can use
the wilcox.test() function with paired = TRUE. Here’s a simple code snippet:
• # Sample data
• before <- c(5, 7, 8, 6, 9)
• after <- c(4, 3, 5, 2, 6)
• # Wilcoxon Signed-Rank Test
• result <- wilcox.test(before, after, paired = TRUE)
• # Output results
• print(result)
Wilcoxon Signed-Rank Test
4. Interpretation of Results:
The output will provide the signed-rank statistic (labelled V by wilcox.test()) and a p-value.
If the p-value is less than the significance level
(commonly set at 0.05), you would reject the null
hypothesis, indicating a statistically significant
difference in the medians of the paired samples.
Kruskal-Wallis Rank Sum Test: Overview
• The Kruskal-Wallis Rank Sum Test is a non-parametric statistical test
used to compare three or more independent groups to determine if their
distributions are the same or if at least one group tends to have higher or
lower values than the others. It is an extension of the Mann-Whitney U test
for more than two groups and is an alternative to the one-way ANOVA when
the assumptions of normality or homogeneity of variance are not met.
• When to Use the Kruskal-Wallis Test
1. Non-Normal Data: When the data does not follow a normal distribution.
2. Ordinal Data: When the data is ordinal (ranked) rather than continuous.
3. Independent Samples: When comparing three or more independent groups.
4. Small Sample Sizes: When the sample size is small, and parametric tests are not suitable.
Kruskal-Wallis Rank Sum Test: Overview
• # Data
• diet_a <- c(2.5, 3.0, 2.8, 3.2)
• diet_b <- c(1.8, 2.0, 1.9, 2.1)
• diet_c <- c(3.5, 3.7, 3.6, 3.8)
• # Combine data into a list
• data <- list(DietA = diet_a, DietB = diet_b, DietC = diet_c)
• # Perform Kruskal-Wallis test
• result <- kruskal.test(data)
• # Print the result
• print(result)
• Output:
  Kruskal-Wallis rank sum test
  data:  data
  Kruskal-Wallis chi-squared = 9.846, df = 2, p-value = 0.007
• Interpretation:
  • The p-value (0.007) is less than 0.05, indicating a significant difference in weight loss among the three diets.
  • The chi-squared statistic (9.846) with 2 degrees of freedom confirms this result.