Box plot
Pivot table
Box Plot
A box and whisker plot (box plot) is a method for summarizing a set
of data that is measured on an interval scale.
Parts of Box Plots
Minimum: The minimum value in the given dataset.
First Quartile (Q1): The first quartile is the median of the lower
half of the data set.
Median: The median is the middle value of the ordered dataset, which
divides the dataset into two equal parts. The median is
also called the second quartile (Q2).
Third Quartile (Q3): The third quartile is the median of the
upper half of the data.
Maximum: The maximum value in the given dataset.
Interquartile Range (IQR): The difference between the third
quartile and the first quartile is known as the interquartile range,
i.e. IQR = Q3 - Q1.
Outlier: A value that lies far to the left or right of the rest of the
ordered data is tested as a possible outlier. Generally, outliers
fall more than a specified distance below the first quartile or above
the third quartile: an outlier is greater than Q3 + (1.5 × IQR) or
less than Q1 - (1.5 × IQR).
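The outlier rule can be sketched in R as follows. This is only a minimal sketch under stated assumptions: the helper name find_outliers and the sample vector are made up for illustration, and the quartiles are computed as the medians of the lower and upper halves, as defined above. R's built-in quantile() uses a different default convention and may give slightly different quartiles.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR fences.
find_outliers <- function(x) {
  sorted <- sort(x)
  n <- length(sorted)
  half <- floor(n / 2)                    # size of each half (the median is left out when n is odd)
  q1 <- median(sorted[1:half])            # median of the lower half
  q3 <- median(sorted[(n - half + 1):n])  # median of the upper half
  iqr <- q3 - q1
  lower_fence <- q1 - 1.5 * iqr
  upper_fence <- q3 + 1.5 * iqr
  list(q1 = q1, q3 = q3, iqr = iqr,
       fences = c(lower_fence, upper_fence),
       outliers = sorted[sorted < lower_fence | sorted > upper_fence])
}

# Made-up data, for illustration only
values <- c(2, 4, 5, 6, 7, 8, 9, 10, 30)
result <- find_outliers(values)
print(result$iqr)       # Q3 - Q1
print(result$fences)    # lower and upper fence
print(result$outliers)  # values beyond the fences
```

With this made-up vector, only the value 30 lies beyond the upper fence.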
Suppose you have the math test results for a class of 15
students. Here are the results:
91 95 54 69 80 85 88 73 71 70 66 90 86 84 73
Step 1: Order the data points from least to greatest.
54 66 69 70 71 73 73 80 84 85 86 88 90 91 95
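Step 2: Find the median of the data.
With 15 ordered values, the median is the 8th value: 80.
Step 3: Find the middle points of the two halves divided by the median (the
lower and upper quartiles).
Lower half: 54 66 69 70 71 73 73, so Q1 = 70.
Upper half: 84 85 86 88 90 91 95, so Q3 = 88.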
Step 4: Find the extreme values.
This is the easiest part. You need to find the largest and
smallest data values.
Extreme values = 54 and 95.
So, we can determine that the five-number summary for the
class of students is 54, 70, 80, 88, 95.
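Steps 1 to 4 can be reproduced in R with a short sketch like the one below. The variable names are chosen for illustration. The quartiles are computed by hand as medians of the two halves, because R's built-in fivenum() and quantile() follow other conventions and may report slightly different quartiles for the same data.

```r
# Math test results for the 15 students
scores <- c(91, 95, 54, 69, 80, 85, 88, 73, 71, 70, 66, 90, 86, 84, 73)

sorted <- sort(scores)                  # Step 1: order the data
n <- length(sorted)
med <- median(sorted)                   # Step 2: median
half <- floor(n / 2)                    # size of each half (median excluded, n is odd)
q1 <- median(sorted[1:half])            # Step 3: median of the lower half
q3 <- median(sorted[(n - half + 1):n])  #         median of the upper half

# Step 4: add the extreme values to complete the five-number summary
five_number <- c(min = min(sorted), Q1 = q1, median = med,
                 Q3 = q3, max = max(sorted))
print(five_number)
```

The printed values are 54, 70, 80, 88 and 95, matching the summary worked out by hand.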
We are now ready to draw the box and whisker plot.
The plot is divided into four groups: a lower
whisker, a lower box half, an upper box half, and an upper
whisker. Each of those groups holds about 25% of the data, because
the quartiles split the ordered data into four parts of roughly equal size.
Interpreting the box and whisker plot results:
✔ The box and whisker plot shows that about 50% of the students
have scores between 70 and 88 points (the box, from Q1 to Q3).
✔ In addition, about 75% scored 88 points or lower, and about 50% have
test results above 80. So, if your test result is somewhere
in the lower whisker, you may need to study more.
Comparative double box and whisker plot
Suppose an IT company has two stores that sell computers. The
company recorded the number of sales each store made each
month. In the past 12 months, we have the following numbers of
sold computers:
Store 1:
350, 460, 20, 160, 580, 250, 210, 120, 200, 510, 290, 380.
Store 2:
520, 180, 260, 380, 80, 500, 630, 420, 210, 70, 440, 140.
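A comparative plot for the two stores could be drawn in R along these lines. This is a sketch only: the title, axis label, and variable names are assumptions chosen for illustration. Note that boxplot() uses its own convention for the box edges, so they may differ slightly from quartiles computed by hand.

```r
# Monthly sales for the past 12 months
store1 <- c(350, 460, 20, 160, 580, 250, 210, 120, 200, 510, 290, 380)
store2 <- c(520, 180, 260, 380, 80, 500, 630, 420, 210, 70, 440, 140)

# Draw both box plots side by side; boxplot() also returns the
# statistics it used (one column per group) invisibly.
stats <- boxplot(store1, store2,
                 names = c("Store 1", "Store 2"),
                 main = "Monthly computer sales by store",
                 ylab = "Computers sold",
                 col = "steelblue")

# Compare the medians of the two stores
medians <- c(store1 = median(store1), store2 = median(store2))
print(medians)
```

Placing the two boxes on the same axis makes it easy to compare the medians and the spread of the two stores at a glance.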
Syntax: boxplot()
Parameters:
x: a vector or a formula.
data: the data frame (used when x is a formula).
main: the title of the chart.
names: the group labels that will be shown under each box plot.
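As an illustration of these parameters used together, the sketch below passes a formula and a data frame to boxplot(). It relies on R's built-in mtcars dataset. The title, axis labels, and group labels are assumptions chosen for this example.

```r
# One box per number of cylinders: mpg is split by the cyl column
result <- boxplot(mpg ~ cyl,
                  data = mtcars,
                  main = "Miles per gallon by number of cylinders",
                  names = c("4 cyl", "6 cyl", "8 cyl"),
                  xlab = "Cylinders",
                  ylab = "mpg")

# Number of cars in each group, as returned by boxplot()
print(result$n)
```

The formula mpg ~ cyl reads as "mpg grouped by cyl", so the labels in names must be given in the same order as the sorted group values.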
✔ The mtcars dataset is a built-in dataset in R that contains
measurements on 11 different attributes for 32 different cars.
✔ Load the mtcars dataset:
data(mtcars)
Summarize the mtcars Dataset
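We can use the summary() function to quickly summarize each
variable in the dataset, and dim() and names() to check its size
and column names:
# summary statistics for each variable
summary(mtcars)
# number of rows and columns
dim(mtcars)
# column names
names(mtcars)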
# histogram of the mpg values
hist(mtcars$mpg,
     col='steelblue',
     main='Histogram',
     xlab='mpg',
     ylab='Frequency')
# box plot of the mpg values
boxplot(mtcars$mpg,
        main='Distribution of mpg values',
        ylab='mpg',
        col='steelblue',
        border='black')
# scatterplot of mpg against wt
plot(mtcars$mpg, mtcars$wt,
     col='steelblue',
     main='Scatterplot',
     xlab='mpg',
     ylab='wt',
     pch=19)
Pivot table
✔ The pivot table is one of Microsoft Excel's most powerful features.
It lets us extract meaning from a large and detailed data set.
✔ A pivot table often shows a summary statistic for the dataset,
computed by grouping together the values of a column. To do so in the
R programming language, we use the group_by() and
summarize() functions of the dplyr package.
✔ The dplyr package in the R programming language is a
grammar of data manipulation that provides a consistent set of
verbs for preprocessing large data.
✔ The group_by() function groups the data by one or more
variables, and the summarize() function then creates a summary
of the data for each group using the aggregate function passed to it.
Syntax:
df %>% group_by(grouping_variables) %>% summarize(label = aggregate_fun())
Parameters:
df: the data frame in use.
grouping_variables: the variable or variables used to group the data.
label: the name given to the summary column.
aggregate_fun(): the function used for the summary, for
example sum or mean.
# create sample data frame
sample_data <- data.frame(label=c('x', 'y', 'z', 'x',
                                  'y', 'z', 'x', 'y',
                                  'z'),
                          value=c(222, 18, 51, 52, 44, 19, 100, 98, 34))
# load library dplyr
library(dplyr)
# create pivot table with sum of value as summary
sample_data %>% group_by(label) %>%
  summarize(sum_values = sum(value))
Output:
  label sum_values
1 x            374
2 y            160
3 z            104
# create sample data frame
sample_data <- data.frame(label=c('x', 'y', 'z', 'x',
                                  'y', 'z', 'x', 'y',
                                  'z'),
                          value=c(222, 18, 51, 52, 44, 19, 100, 98, 34))
# load library dplyr
library(dplyr)
# create pivot table with mean of value as summary
sample_data %>% group_by(label) %>%
  summarize(average_values = mean(value))
Output:
  label average_values
1 x              125.
2 y               53.3
3 z               34.7
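The same pattern extends to more than one grouping variable and more than one summary column. The sketch below is an illustration only: it assumes the dplyr package is installed, uses R's built-in mtcars dataset, and the column names mean_mpg and cars are made up for this example.

```r
# load library dplyr (assumed to be installed)
library(dplyr)

# pivot table grouped by two variables, with two summary columns
pivot <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg = mean(mpg),  # average mpg in each group
            cars = n(),            # number of cars in each group
            .groups = "drop")      # return an ungrouped result

print(pivot)
```

Each row of the result corresponds to one combination of cyl and gear that occurs in the data, which is what a pivot table with two row fields would show in a spreadsheet.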