Notes On Statistics
The statistical process involves designing studies, collecting good data, describing the data with numbers and graphs, analyzing the data, and then drawing conclusions.
In statistics, we have a saying: “Garbage in equals garbage out.” If you select your subjects in a way that is biased — that
is, favoring certain individuals or groups of individuals — then your results will also be biased.
Descriptive statistics are numbers that summarize some characteristic about a set of data. They provide you with easy-to-
understand information that helps answer questions. They also help researchers get a rough idea about what’s happening in
their experiments so later they can do more formal and targeted analyses. Descriptive statistics make a point clearly and
concisely.
The data are the individual pieces of factual information recorded, and they are used for the purpose of analysis. The two processes of data analysis are interpretation and presentation. Statistics are the results of data analysis.
There are two main types of data: categorical (or qualitative) data and numerical (or quantitative) data.
Categorical data record qualities or characteristics about the individual, such as eye color, gender, political party, or
opinion on some issue (using categories such as agree, disagree, or no opinion). Categorical data place individuals into
groups. For example: male/female, own your home/don't own, or Democrat/Republican/Independent/Other.
Numerical data record measurements or counts regarding each individual, which may include weight, age, height, or time
to take an exam; counts may include number of pets, or the number of red lights you hit on your way to work.
The important difference between the two is that with categorical data, any numbers involved do not have real numerical
meaning (for example, using 1 for male and 2 for female), while all numerical data represents actual numbers for which
math operations make sense.
A third type of data, ordinal data, falls in between, where data appear in categories, but the categories have a
meaningful order, such as ratings from 1 to 5, or class ranks of freshman through senior. Ordinal data can be
analyzed like categorical data, and the basic numerical data techniques also apply when categories are represented by
numbers that have meaning.
Qualitative or Categorical Data
Qualitative data, also known as categorical data, describes data that fits into categories. Qualitative data are
not numerical. Categorical information involves categorical variables that describe features such as a person's
gender or home town. Categorical measures are defined in terms of natural-language descriptions, not in terms of
numbers.
Sometimes categorical data can hold numerical values, but those values do not carry mathematical
meaning. Examples of categorical data are birthdate, favourite sport, and school postcode. Here, the birthdate and school
postcode hold numerical values, but those values have no numerical meaning.
Nominal Data
Nominal data is a type of qualitative data that labels variables without providing any numerical value. Nominal data
is also called the nominal scale. It cannot be ordered or measured, although nominal values are sometimes written
as numbers. Examples of nominal data are letters, symbols, words, and gender.
Nominal data are examined using the grouping method: the data are grouped into categories, and then the
frequency or percentage of each category is calculated. These data are commonly represented visually using pie charts.
Ordinal Data
Ordinal data is a type of data that follows a natural order. The significant feature of ordinal data is that the
differences between the data values are not determined. This type of variable is frequently found in surveys,
questionnaires, finance, economics, and so on.
Ordinal data are commonly represented using a bar chart. These data can be investigated and interpreted through many
visualisation tools, and the information may be expressed using tables in which each row shows a distinct
category.
Quantitative or Numerical Data
Quantitative data, also known as numerical data, represents numerical values (i.e., how much, how often,
how many). Numerical data gives information about the quantities of a specific thing. Some examples of numerical data
are height, length, size, and weight. Quantitative data can be classified into two different types: discrete data
and continuous data.
Discrete Data
Discrete data can take only certain values. Discrete information contains a finite (or countable) number of possible
values, and those values cannot be subdivided meaningfully. Here, things are counted in whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data is data that can be measured rather than counted. It can take on an infinite number of possible
values within a given range.
Example: Temperature range
There are two major types of random variables: discrete and continuous. Discrete random variables basically count things
(number of heads on 10 coin flips, number of female Democrats in a sample, and so on). The most well known discrete
random variable is the binomial. A continuous random variable measures things and takes on values within an interval, or
it has so many possible values that it might as well be deemed continuous (for example, time to complete a task,
exam scores, and so on).
We say that X has a normal distribution if its values fall into a smooth (continuous) curve with a bell-shaped, symmetric
pattern, meaning it looks the same on each side when cut down the middle. The total area under the curve is 1.
The figure below illustrates three different normal distributions with different means and standard deviations.
Note that the inflection points (highlighted by arrows in the figure, on either side of the mean) on each graph are
where the graph changes from concave down to concave up. The distance from the mean out to either inflection point
equals the standard deviation of the normal distribution. For any normal distribution, almost all of its values lie
within three standard deviations of the mean.
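This rule of thumb can be checked empirically. The sketch below (a minimal Python example; the mean of 100 and standard deviation of 15 are arbitrary choices for illustration) draws random values from a normal distribution and counts how many fall within three standard deviations of the mean:

```python
import random

# Fix the seed so the simulation is repeatable.
random.seed(42)

mu, sigma = 100, 15  # arbitrary mean and standard deviation
samples = [random.gauss(mu, sigma) for _ in range(10_000)]

# Count how many samples fall within 3 standard deviations of the mean.
within_3sd = sum(1 for x in samples if abs(x - mu) <= 3 * sigma)
fraction = within_3sd / len(samples)
print(f"fraction within 3 standard deviations: {fraction:.4f}")
```

For a normal distribution the theoretical fraction is about 0.997, and the simulated fraction lands very close to that.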
In ordinal measurement the attributes can be rank-ordered. Here, distances between attributes do not have any
meaning. For example, on a survey you might code Educational Attainment as 0 = less than high school; 1 = some
high school; 2 = high school degree; 3 = some college; 4 = college degree; 5 = post college. In this measure, higher
numbers mean more education. But is the distance from 0 to 1 the same as the distance from 3 to 4? Of course not. The
interval between values is not interpretable in an ordinal measure.
Individuals competing in a contest may be fortunate to achieve first, second, or third place. First, second, and third
place represent ordinal data. If Roscoe takes first and Wilbur takes second, we do not know if the competition was
close; we only know that Roscoe outperformed Wilbur. Likert-type scales (such as "On a scale of 1 to 10 with one
being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data.
Fundamentally, these scales do not represent a measurable quantity. An individual may respond 8 to this question
and be in less pain than someone else who responded 5.
A scale which represents quantity and has equal units but for which zero represents simply an additional point of
measurement is an interval scale. The Fahrenheit scale is a clear example of the interval scale of measurement.
Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit are interval data. Measurement of sea level is another
example of an interval scale. With each of these scales there is a direct, measurable quantity with equality of units. In
addition, zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above
and below it (for example, -10 degrees Fahrenheit). In interval measurement the distance between
attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30 to 40
is the same as the distance from 70 to 80. The interval between values is interpretable. Because of this, it makes sense to
compute an average of an interval variable, where it doesn’t make sense to do so for ordinal scales. But note that in
interval measurement ratios don’t make any sense - 80 degrees is not twice as hot as 40 degrees (although the
attribute value is twice as large).
Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can
construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social
research most “count” variables are ratio, for example, the number of clients in past six months. Why? Because
you can have zero clients and because it is meaningful to say that “…we had twice as many clients in the past six
months as we did in the previous six months.”
It’s important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of
measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the
hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is
desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or
ordinal).
Measures of Variability
Variability is what the field of statistics is all about. Results vary from individual to individual, from group to group, from
city to city, from moment to moment. Variation always exists in a data set, regardless of which characteristic you’re
measuring, because not every individual will have the same exact value for every characteristic you measure. Without a
measure of variability you can’t compare two data sets effectively. What if two sets of data have about the
same average and the same median? Does that mean the data sets are the same? Not at all. For example, the data sets
199, 200, 201, and 0, 200, 400 both have the same average, which is 200, and the same median, which is also 200.
Yet they have very different amounts of variability. The first data set has a very small amount of variability
compared to the second.
By far the most commonly used measure of variability is the standard deviation. The standard deviation of a data set,
denoted by s, represents the typical distance from any point in the data set to the center. It’s roughly the average distance
from the center, and in this case, the center is the average. Most often, you don’t hear a standard deviation given just by
itself; if it’s reported (and it’s not reported nearly enough) it’s usually in the fine print, in parentheses, like “(s = 2.68).”
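The two data sets from the example above make this concrete. A minimal Python sketch using the standard library's statistics module:

```python
import statistics

# Two data sets from the text: same mean (200) and same median (200),
# but very different amounts of variability.
a = [199, 200, 201]
b = [0, 200, 400]

assert statistics.mean(a) == statistics.mean(b) == 200
assert statistics.median(a) == statistics.median(b) == 200

# Sample standard deviation (the s reported in the text).
print(statistics.stdev(a))  # 1.0   - small spread
print(statistics.stdev(b))  # 200.0 - large spread
```

Same center, wildly different standard deviations, which is exactly why a measure of variability must accompany the average.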
Percentiles: The most common way to report the relative standing of a number within a data set is by using percentiles. A
percentile is the percentage of individuals in the data set who fall below your particular number. If your
exam score is at the 90th percentile, for example, that means 90% of the people taking the exam with you scored lower
than you did (it also means that 10% scored higher than you did).
Finding a percentile
To calculate the kth percentile (where k is any number between one and one hundred), do the following steps:
1. Order all the numbers in the data set from smallest to largest.
2. Multiply k percent times the total number of numbers, n.
3a. If your result from Step 2 is a whole number, go to Step 4. If the result from Step 2 is not a whole number, round it up
to the nearest whole number and go to Step 3b.
3b. Count the numbers in your data set from left to right (from the smallest to the largest number) until you reach the value
from Step 3a. This corresponding number in your data set is the kth percentile.
4. Count the numbers in your data set from left to right until you reach that whole number. The kth percentile is the
average of that corresponding number in your data set and the next number in your data set.
For example, suppose you have 25 test scores, in order from lowest to highest: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71,
72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for these (ordered) scores start by
multiplying 90% times the total number of scores, which gives 90% × 25 = 0.90 × 25 = 22.5 (Step 2). This is not a whole
number; Step 3a says round up to the nearest whole number — 23 — then go to step 3b. Counting from left to right (from
the smallest to the largest number in the data set), you go until you find the 23rd number in the data set. That number is 98,
and it’s the 90th percentile for this data set. If you want to find the 20th percentile, take 0.20 × 25 = 5; this is a whole
number so proceed to Step 4, which tells us the 20th percentile is the average of the 5th and 6th numbers in the ordered
data set (62 and 66). The 20th percentile then comes to (62+66)/2 = 64.
The median is the 50th percentile, the point in the data where 50% of the data fall below that point and 50% fall above it.
The median for the test scores example is the 13th number, 77.
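The percentile steps above can be sketched in Python. Note that the function name `percentile` is our own, and this counting method is just one of several conventions used by statistical software:

```python
import math

def percentile(data, k):
    """kth percentile using the counting method described above:
    sort the data, compute k% of n; if that is not a whole number,
    round up and take that position; if it is a whole number,
    average that position and the next one."""
    ordered = sorted(data)
    n = len(ordered)
    index = k / 100 * n
    if index != int(index):
        # Step 3: not a whole number, round up and count to that position.
        return ordered[math.ceil(index) - 1]
    # Step 4: whole number, average that position and the next.
    i = int(index)
    return (ordered[i - 1] + ordered[i]) / 2

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
print(percentile(scores, 90))  # 98
print(percentile(scores, 20))  # 64.0
print(percentile(scores, 50))  # 77  (the median)
```

The three calls reproduce the worked example: the 90th percentile is 98, the 20th is 64, and the 50th is the median, 77.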
A percentile is not a percent; a percentile is a number that is a certain percentage of the way through the data set,
when the data set is ordered. Suppose your score on the GRE was reported to be the 80th percentile. This doesn’t
mean you scored 80% of the questions correctly. It means that 80% of the students’ scores were lower than yours,
and 20% of the students’ scores were higher than yours.
The five-number summary is a set of five descriptive statistics that divide the data set into four equal sections. The five
numbers in a five number summary are:
1. The minimum (smallest) number in the data set.
2. The 25th percentile, aka the first quartile, or Q1.
3. The median (or 50th percentile).
4. The 75th percentile, aka the third quartile, or Q3.
5. The maximum (largest) number in the data set.
For example, we can find the five-number summary of the 25 (ordered) exam scores 43, 54, 56, 61, 62, 66, 68, 69, 69, 70,
71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. The minimum is 43, the maximum is 99, and the median is the
number directly in the middle, 77.
To find Q1 and Q3, you use the steps shown in the section, “Finding a percentile,” where n = 25. Step 1 is done since the
data are ordered. For Step 2, since Q1 is the 25th percentile, multiply 0.25 * 25 = 6.25. This is not a whole number, so
Step 3a says round it up to 7 and proceed to Step 3b. Count from left to right in the data set until you reach the 7th number,
68; this is Q1. For Q3 (the 75th percentile) multiply 0.75 * 25 = 18.75; round up to 19, and the 19th number on the list is
89, or Q3. Putting it all together, the five-number summary for the test scores data is 43, 68, 77, 89, and 99.
The purpose of the five-number summary is to give descriptive statistics for center, variability, and relative
standing all in one shot. The measure of center in the five-number summary is the median, and the first quartile, median,
and third quartiles are measures of relative standing. To obtain a measure of variability based on the five-number
summary, you can find what’s called the Interquartile Range (or IQR). The IQR equals Q3 – Q1 and reflects the distance
taken up by the innermost 50% of the data. If the IQR is small, you know there is much data close to the median. If the
IQR is large, you know the data are more spread out from the median. The IQR for the test scores data set is 89 – 68 = 21,
which is quite large seeing as how test scores only go from 0 to 100.
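A short Python sketch that computes the five-number summary and the IQR for the test scores, using the same percentile-counting method described earlier (the helper function name is our own):

```python
import math

def percentile(data, k):
    """kth percentile via the counting method from the text."""
    ordered = sorted(data)
    index = k / 100 * len(ordered)
    if index != int(index):
        return ordered[math.ceil(index) - 1]   # round up, count to that spot
    return (ordered[int(index) - 1] + ordered[int(index)]) / 2  # average two

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

# Minimum, Q1, median, Q3, maximum.
five_number = (min(scores), percentile(scores, 25), percentile(scores, 50),
               percentile(scores, 75), max(scores))
iqr = five_number[3] - five_number[1]  # Q3 - Q1
print(five_number)  # (43, 68, 77, 89, 99)
print(iqr)          # 21
```

This matches the summary worked out above: 43, 68, 77, 89, 99, with an IQR of 21.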
Boxplots: Boxplots are a standardized way of displaying the distribution of data based on the five-number summary
(minimum, first quartile (Q1), median, third quartile (Q3), and maximum). Use box and whisker plots when
you have multiple data sets from independent sources that are related to each other in some way. Examples include test
scores between schools or classrooms, or data from duplicate machines manufacturing the same products.
Suppose you wanted to compare the performance of three lathes responsible for the rough turning of a motor shaft. The
design specification is 18.85 +/- 0.1 mm. Diameter measurements from a sample of shafts taken from each roughing
lathe are displayed in a box and whisker plot in the figure.
A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which includes the
minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value.
In essence, these five descriptive statistics divide the data set into four equal parts.
Box and whisker plots are used to display and analyze data conveniently. They show several important parameters
required for further analysis, such as the median, the 25th and 75th percentile marks, and the outliers in the data.
This helps in many fields, like machine learning and deep learning, which involve the representation of huge amounts
of data. A box and whisker plot can also represent multiple sets of data in the same graph.
We will first find the median, and then the lower quartile and the upper quartile, in order to draw
the box and whisker plot.
Step 1: Arrange the data in ascending order.
19, 20, 20, 21, 21, 25, 25, 26, 28, 28, 29, 29, 30, 32, 35, 35, 35, 36.
Step 2: Find the median of the data
M = [(n/2)th term + (n/2 + 1)th term]/2
M = [9th term + 10th term]/2
M = (28 + 28)/2
M = 28
Step 3: Find the minimum and maximum values of the data.
The minimum is the lowest number in the data set which is 19.
The maximum is the highest number in the data set which is 36.
Step 4: Find the first quartile which lies at 25% of the data and the third quartile which lies at 75% of the data.
The first quartile (Q1) is the median of the lower half of data which lies at 25% of the data.
19, 20, 20, 21, 21, 25, 25, 26, 28.
Q1 = 21
The third quartile (Q3) is the median of the upper half of data which lies at 75% of the data.
28, 29, 29, 30, 32, 35, 35, 35, 36.
Q3 = 32
Step 5: Draw a box and whisker using the data below.
Minimum: 19
First quartile: 21
Median: 28
Third quartile: 32
Maximum: 36
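Steps 1-5 can be sketched in Python. Note this uses the median-of-halves method from the example above, which differs slightly from the percentile-counting rule given earlier; for an even n the data split cleanly into two halves:

```python
import statistics

# Step 1: arrange the data in ascending order (n = 18, an even count).
data = sorted([19, 20, 20, 21, 21, 25, 25, 26, 28, 28,
               29, 29, 30, 32, 35, 35, 35, 36])
n = len(data)

# Step 2: median is the average of the 9th and 10th values.
median = statistics.median(data)

# Step 4: quartiles are the medians of the lower and upper halves.
lower_half = data[:n // 2]   # first 9 values
upper_half = data[n // 2:]   # last 9 values
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)

# Step 5: the five numbers that define the box and whisker plot.
print(min(data), q1, median, q3, max(data))  # 19 21 28.0 32 36
```

The output reproduces the summary above: minimum 19, Q1 = 21, median 28, Q3 = 32, maximum 36.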
The Binomial Distribution: A random variable is a characteristic, measurement, or count that changes randomly
according to some set of probabilities; its notation is X, Y, Z, and so on. A list of all possible values of a random variable,
along with their probabilities is called a probability distribution. One of the most well-known probability distributions is
the binomial. Binomial means “two names” and is associated with situations involving two outcomes: success or failure
(hitting a red light or not; developing a side effect or not).
Characteristics of a Binomial: A random variable has a binomial distribution if all of following conditions are met:
1. There are a fixed number of trials (n).
2. Each trial has two possible outcomes: success or failure.
3. The probability of success (call it p) is the same for each trial.
4. The trials are independent, meaning the outcome of one trial doesn’t influence that of any other.
Let X equal the total number of successes in n trials; if all of the above conditions are met, X has a binomial distribution
with probability of success equal to p.
Checking the Binomial conditions step by step:
You flip a fair coin 10 times and count the number of heads. Does this represent a binomial random variable? You can
check by reviewing your responses to the questions and statements in the list that follows:
1. Are there a fixed number of trials? You’re flipping the coin 10 times, which is a fixed number. Condition 1 is met, and n
= 10.
2. Does each trial have only two possible outcomes — success or failure? The outcome of each flip is either heads or tails,
and you’re interested in counting the number of heads, so flipping a head represents success and flipping a tail is a failure.
Condition 2 is met.
3. Is the probability of success the same for each trial? Because the coin is fair, the probability of success (getting a head) is
p = 1/2 for each trial. You also know that 1 - 1/2 = 1/2 is the probability of failure (getting a tail) on each trial. Condition
3 is met.
4. Are the trials independent? We assume the coin is being flipped the same way each time, which means the outcome of
one flip doesn’t affect the outcome of subsequent flips. Condition 4 is met.
Because the coin-flipping example meets the four conditions, the random variable X, which counts the number of
successes (heads) that occur in 10 trials, has a binomial distribution with n = 10 and p = 1/2.
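For such an X, the probability of exactly x successes is C(n, x) * p^x * (1 - p)^(n - x). A minimal Python sketch for the coin-flip example (the function name `binomial_pmf` is our own):

```python
import math

def binomial_pmf(n, p, x):
    """P(X = x) for a binomial: C(n, x) * p**x * (1 - p)**(n - x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Coin-flip example from the text: n = 10 flips, p = 1/2.
n, p = 10, 0.5
print(binomial_pmf(n, p, 5))  # 0.24609375 - chance of exactly 5 heads

# Sanity check: the probabilities over all possible values of X sum to 1.
total = sum(binomial_pmf(n, p, x) for x in range(n + 1))
print(total)  # 1.0
```

So even with a fair coin, exactly 5 heads in 10 flips happens only about a quarter of the time.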
Properties of Measurement
Identity: Identity refers to each value having a unique meaning.
Magnitude: Magnitude means that the values have an ordered relationship to one another, so there is a
specific order to the variables.
Equal intervals: Equal intervals mean that data points along the scale are equal, so the difference between
data points one and two will be the same as the difference between data points five and six.
A minimum value of zero: A minimum value of zero means the scale has a true zero point. Degrees, for
example, can fall below zero and still have meaning. But if you weigh nothing, you don't exist.
Ratio variables, on the other hand, never fall below zero. A ratio scale permits not only addition, subtraction,
and multiplication but also division; that is, you can calculate the ratio of two values. A ratio variable has all
the properties of an interval variable but also has a clear definition of zero.
To summarise, nominal scales are used to label or describe values. Ordinal scales are used to provide
information about the specific order of the data points, as commonly seen in satisfaction surveys. The
interval scale is used to understand both the order of values and the differences between them. Ratio scales give
the most information: identity, order, and difference, plus a true zero that makes ratios of values
meaningful.
Using quantitative and qualitative data in statistics: Once data scientists have a conclusive data set from
their sample, they can start to use the information to draw descriptions and conclusions. To do this, they can use
both descriptive and inferential statistics.
Descriptive statistics
Descriptive statistics help demonstrate, represent, analyse and summarise the findings contained in a sample. They
present data in an easy-to-understand and presentable form, such as a table or graph. Without description, the data
would be in its raw form with no explanation.
Frequency counts
One way data scientists can describe statistics is using frequency counts, or frequency statistics, which describe the
number of times a variable exists in a data set. For example, the number of people with blue eyes or the number of
people with a driver’s license in the sample can be counted by frequency. Other examples include qualifications of
education, such as high school diploma, a university degree or doctorate, and categories of marital status, such as
single, married or divorced.
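Frequency counts are easy to sketch with Python's standard library; the eye-color sample below is made up for illustration:

```python
from collections import Counter

# A made-up sample of a categorical variable.
eye_colors = ["blue", "brown", "brown", "green", "blue", "brown", "hazel"]

# Frequency counts: how many times each category appears.
counts = Counter(eye_colors)
print(counts.most_common())  # [('brown', 3), ('blue', 2), ('green', 1), ('hazel', 1)]

# Frequencies as percentages of the sample.
for color, count in counts.items():
    print(f"{color}: {count / len(eye_colors):.0%}")
```

Reporting both the raw counts and the percentages is a common way to present frequency statistics for categorical data.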
Frequency data is a form of discrete data, as the values can't be broken down into smaller parts. To describe continuous
data points, such as age, data scientists can use central-tendency statistics instead, such as finding the mean or
average of the data points. Using the age example, this can tell them the average age of participants in the sample.
While data scientists can draw summaries from the use of descriptive statistics and present them in an
understandable form, they can’t necessarily draw conclusions. That’s where inferential statistics come in.
Inferential statistics
Inferential statistics are used to develop a hypothesis from the data set. It would be impossible to get data from an
entire population, so data scientists can use inferential statistics to extrapolate their results. Using these statistics,
they can make generalisations and predictions about a wider sample group, even if they haven’t surveyed them all.
An example of using inferential statistics is in an election. Even before the entire country has voted, data scientists
can use these kinds of statistics to make assumptions regarding who might win based on a smaller sample size.
Stem and Leaf Plot: A stem and leaf plot is a special table where each data value is split into a stem and a leaf. The
first digit or digits are written in the stem, and the last digit is written in the leaf. Let us learn more about this
type of plot.
What is Stem and Leaf Plot?
A stem and leaf plot, also called a stem and leaf diagram, is a way of organizing data into a form that makes it easy to
observe the frequency of different types of values. It is a graph that shows numerical data arranged in order. Each
data value is broken into a stem and a leaf.
A stem and leaf plot is represented in the form of a special table where the first digit or digits of each data value
form the stem and the last digit forms the leaf. The " | " symbol is used to separate stem values from leaf values, and
it is called the stem and leaf plot key. For example, 46 is represented as 4 on the stem and 6 on the leaf, written
using the key as 4 | 6.
In the image given below, the stem values are listed one below the other in ascending order, and the leaf values are
listed left to right from the stem values in ascending order.
As the stem and leaf plot definition states, each data value is read by joining its stem and leaf. For example,
6 | 7 ⇒ 6 on the stem and 7 on the leaf read as 67.
6 | 8 ⇒ 6 on the stem and 8 on the leaf read as 68.
Finding the mean, median, and mode is part of stem and leaf plot statistics. Let's understand this with the help of an
example. Consider a stem and leaf plot that shows 5 data values.
The data values are already in ascending order. They are 20, 32, 32, 35 and 41.
Mean of the data = Sum of data values ÷ Total number of values = (20 + 32 + 32
+ 35 + 41) ÷ 5 = 160 ÷ 5 = 32.
Mean = 32
Mode is the data value that appears most frequently.
Mode = 32
Median = The middle value of the data.
Median = 32
Thus, calculating the stem and leaf plot statistics is very easy.
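The whole example can be sketched in Python: build the stem and leaf plot (stem = tens digit, leaf = ones digit) and then compute the three statistics:

```python
from collections import defaultdict
import statistics

# The five data values from the example above.
values = [20, 32, 32, 35, 41]

# Group each value's ones digit (leaf) under its tens digit (stem).
plot = defaultdict(list)
for v in sorted(values):
    plot[v // 10].append(v % 10)

# Print the plot with the " | " key separating stems from leaves.
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
# 2 | 0
# 3 | 2 2 5
# 4 | 1

print(f"mean={statistics.mean(values)}, "
      f"median={statistics.median(values)}, "
      f"mode={statistics.mode(values)}")
```

All three statistics come out to 32, matching the worked example.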