
Variance and Standard Deviation

Vaariyaansii fi istaandard daayiveeshinii

Uploaded by

Aliye mohammed


Introduction to the Standard Deviation and Variance


Other Measures of Spread

5 Number Summary

In the previous sections, we have seen how to calculate the values associated with the five-number
summary (min, Q1, Q2, Q3, max), as well as the measures of spread associated with these values (range
and IQR).

For datasets that are not symmetric, the five-number summary and a corresponding box plot are a great
way to get started with understanding the spread of your data. Although I still prefer a histogram in
most cases, box plots can make it easier to compare two or more groups. You will see this in the quizzes
towards the end of this lesson.

Variance and Standard Deviation

Two additional measures of spread that are used all the time are the variance and standard deviation. At
first glance, the variance and standard deviation can seem overwhelming. If you do not understand the
expressions below, don't panic! In this section, I just want to give you an overview of what the next
sections will cover. We will walk through each of these parts thoroughly in the next few sections, but the
big picture goal is to generally understand the following:

How the mean, variance, and standard deviation are calculated.

Why the measures of variance and standard deviation make sense to capture the spread of our data.

Fields where you might see these values used.

Why we might use the standard deviation or variance as opposed to the values associated with the 5
number summary for a particular dataset.

Calculation

We calculate the variance in the following way:

$$\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

The variance is the average squared difference of each observation from the mean.

To calculate the variance of a set of 10 values in a spreadsheet application, with our 10 data points in
column A, we would create a new column B by typing in something like =A1-AVERAGE(A$1:A$10) and
copying this down for all 10 rows. This would find us the difference between each data point and the
mean average of all the data. Then we create a new column C having the square of these differences,
using the formula =B1^2 in cell C1, and copying that down for all rows. Then in the cell below this new
column, cell C11, type in =SUM(C1:C10). This adds up all these values in column C. Finally in cell C12, we
divide this sum by the number of data points we have, in this case, ten: =C11/10. This cell C12 now
contains the variance for our 10 data points.
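
The spreadsheet steps above can also be sketched in Python. This is only an illustration: the ten data values are made up (the lesson does not list them), and `population_variance` is a hypothetical helper name. Like cell C12, it divides by the number of data points.

```python
# Made-up values standing in for cells A1:A10.
data = [4, 8, 15, 16, 23, 42, 7, 9, 11, 5]


def population_variance(values):
    """Average squared difference of each value from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n                     # AVERAGE(A$1:A$10)
    differences = [x - mean for x in values]   # column B: value minus mean
    squares = [d ** 2 for d in differences]    # column C: squared differences
    total = sum(squares)                       # cell C11: sum of column C
    return total / n                           # cell C12: divide by the count


print(population_variance(data))
```

Each intermediate list corresponds to one spreadsheet column, so the result can be checked against the spreadsheet step by step.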

More detailed guidance on using spreadsheets like this may be included in a future lesson in your
program.

The standard deviation is the square root of the variance. Therefore, the formula for the standard
deviation is the following:

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

In the same spreadsheet as above, to find the standard deviation of our same set of 10 data values, we
would use another cell like C13 to take the square root of our variance measure, by typing in =sqrt(C12).

The standard deviation is a measurement that has the same units as our original data, while the units of
the variance are the square of the units in our original data. For example, if the units in our original data
were dollars, then units of the standard deviation would also be dollars, while the units of the variance
would be dollars squared.
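
As a rough illustration of the units point, the sketch below uses Python's standard `statistics` module. The salary values are made up for this example. `pvariance` and `pstdev` are the versions that divide by the number of data points.

```python
import statistics

# Made-up salaries in dollars, used only for illustration.
salaries = [40_000, 45_000, 50_000, 55_000, 60_000]

variance = statistics.pvariance(salaries)  # units: dollars squared
std_dev = statistics.pstdev(salaries)      # units: dollars

print(f"Variance: {variance:,.0f} dollars squared")
print(f"Standard deviation: {std_dev:,.2f} dollars")
```

The standard deviation can be read directly against the original salaries, which is much harder to do with a value in dollars squared.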

Again, this section is designed as background knowledge for the following sections. If it doesn't make
sense on this first pass, do not worry. You will be guided in future sections in performing these
calculations, and building your intuition, as you work through an example using the salary data. Then we
will provide context about why these calculations are important, and where you might see them!


Quiz: Applied Standard Deviation and Variance


Investment Data

Suppose we have two investment opportunities:


Returns

               Year 1   Year 2   Year 3   Year 4   Year 5   Year 6
Investment 1     5%       5%       5%       5%       5%       5%
Investment 2    12%      -2%      10%       0%       7%       3%

The returns for 6 consecutive years for each investment are shown above. Use this information to
answer the questions below.

Question 1 of 3

Use the information above to match the mean/expected return for each investment.

Investment 1

Investment 2

Investment Data

In the previous two questions, you should have found that these investments have the same mean! That
is, regardless of which investment opportunity you choose, you are expected to earn the same amount.
So how are they different? Let's look at some additional questions to see if we can find some
differences.
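
One way to check the equal means, and to get a first look at how the two investments differ, is the short sketch below. It uses the six yearly returns from the table and Python's standard `statistics` module. The variable names are chosen here for illustration.

```python
import statistics

# Yearly returns in percent, taken from the table above.
investment_1 = [5, 5, 5, 5, 5, 5]
investment_2 = [12, -2, 10, 0, 7, 3]

for name, returns in [("Investment 1", investment_1),
                      ("Investment 2", investment_2)]:
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns)  # divides by n, as in the formula above
    print(f"{name}: mean = {mean}%, standard deviation = {std:.2f}%")
```

Both means come out to 5%, but only Investment 2 has a standard deviation above zero.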


Question 2 of 3

Using the information above, mark all of the below that are true statements.

Question 3 of 3

Based on the observed data, which of the two investments above has the better chance of earning more
than 7%?

Useful Insight

The above example is a simplified version of the real world, but it does point out something useful that
you may have heard before. If, instead of being fully invested in either Investment 1 or Investment 2,
you were diversified across both, your expected return would still be 5%, but your returns would vary
less from year to year than with Investment 2 alone. This is the benefit of diversifying your portfolio
for long-term gains. For short-term gains, you might not need or want to diversify. You could get lucky
and hit the upswings (12%, 10%, or 7%) of Investment 2. However, you might also get unlucky, hit a
downturn, and earn nothing or even lose money on your investment using this same strategy.
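
To see what diversifying does to the observed returns, here is a minimal sketch. It assumes a hypothetical 50/50 split between the two investments, rebalanced every year. The split is an assumption made for this example and is not part of the lesson.

```python
import statistics

# Yearly returns in percent, taken from the investment table.
investment_1 = [5, 5, 5, 5, 5, 5]
investment_2 = [12, -2, 10, 0, 7, 3]

# Assumed 50/50 portfolio: each year's return is the average of the two.
portfolio = [0.5 * a + 0.5 * b for a, b in zip(investment_1, investment_2)]

print("Portfolio returns:", portfolio)
print("Mean:", statistics.mean(portfolio))
print("Standard deviation:", round(statistics.pstdev(portfolio), 2))
print("Investment 2 standard deviation:",
      round(statistics.pstdev(investment_2), 2))
```

On these six years the split keeps the same 5% average return while its standard deviation is half that of Investment 2.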

12/12/2016


Shape


Histograms

We learned how to build a histogram in this video, as this is the most popular visual for quantitative
data.

Shape

From a histogram, we can quickly identify the shape of our data, which influences all of the
measures we learned in the previous concepts. We learned that the distribution of our data is frequently
associated with one of the three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Summary

Shape               Mean vs. Median            Real-World Applications
Symmetric (Normal)  Mean equals Median         Height, Weight, Errors, Precipitation
Right-skewed        Mean greater than Median   Amount of drug remaining in a bloodstream, Time between phone calls at a call center, Time until light bulb dies
Left-skewed         Mean less than Median      Grades as a percentage in many universities, Age of death, Asset price changes

The mode of a distribution is essentially the tallest bar in a histogram. There may be multiple modes
depending on the number of peaks in our histogram.
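
The relationships in the summary table can be illustrated with three small datasets. All values below are made up to produce each shape; they are not real measurements.

```python
import statistics

# Made-up right-skewed data: most values are small, a few are large.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 10, 22]
# Mirror image of the same values, which gives a left-skewed dataset.
left_skewed = [23 - x for x in right_skewed]
# Made-up symmetric data centered on 4.
symmetric = [1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7]

for name, values in [("Right-skewed", right_skewed),
                     ("Left-skewed", left_skewed),
                     ("Symmetric", symmetric)]:
    print(name,
          "mean:", statistics.mean(values),
          "median:", statistics.median(values),
          "mode:", statistics.mode(values))
```

The long tail pulls the mean toward it, while the median stays near the bulk of the data.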


The Shape For Data In The World


When working with data, building a quick plot lets you quickly see the shape of your data.

Distribution Shape   Types of Data
Bell Shaped          Heights, Weight, Scores
Left Skewed          GPA, Age of Death, Price
Right Skewed         Distribution of Wealth, Athletic Abilities

References

These are the references used to pull the applications of each shape.

Quora

University of Texas

Stack Exchange




Quiz: Shape and Outliers (What's the Impact?)


Question 1 of 2

Match the distribution shape with the correct relationship in comparing the mean to the median.

Right-skewed

Left-skewed

Symmetric

Question 2 of 2

Check all of the below that must be true.

14/12/2016 Tuesday


Measures of Center and Spread Summary


Recap

Variable Types

We have covered a lot up to this point! We started with identifying data types as either categorical or
quantitative. We then learned we could identify quantitative variables as either continuous or discrete.
We also found we could identify categorical variables as either ordinal or nominal.
Categorical Variables

When analyzing categorical variables, we commonly just look at the count or percent of a group that
falls into each level of a category. For example, suppose a dog category has two levels: lab and not lab.
We might say that 32% of the dogs were labs (percent), or that 32 of the 100 dogs we saw were labs
(count).
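
The count and percent summaries can be sketched in Python with `collections.Counter`. The list of 100 dogs below is constructed to match the example's numbers; it is not real data.

```python
from collections import Counter

# Constructed observations matching the example: 32 labs out of 100 dogs.
dogs = ["lab"] * 32 + ["not lab"] * 68

counts = Counter(dogs)  # count per level
percents = {level: 100 * count / len(dogs)
            for level, count in counts.items()}  # percent per level

for level in counts:
    print(f"{level}: count = {counts[level]}, percent = {percents[level]}%")
```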

However, the 4 aspects associated with describing quantitative variables are not used to describe
categorical variables.

Quantitative Variables

Then we learned there are four main aspects used to describe quantitative variables:

Measures of Center

Measures of Spread

Shape of the Distribution

Outliers

We looked at calculating measures of Center

Means

Medians

Modes

We also looked at calculating measures of Spread

Range

Interquartile Range

Standard Deviation

Variance
Calculating Variance

We saw that we could calculate the variance as:

$$\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

You will also see:

$$\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

The reason for this is beyond the scope of what we have covered thus far, but explanations are easy to
find. You can commonly find answers to your questions with a quick Google search. Now is a great time
to get started with this practice! The answer should make more sense at the completion of this lesson.
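
Python's standard `statistics` module offers both versions, which makes the difference easy to see. The data values below are made up for illustration.

```python
import statistics

# Made-up data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_var = statistics.pvariance(data)  # divides by n
sample_var = statistics.variance(data)       # divides by n - 1

print("Dividing by n:    ", population_var)
print("Dividing by n - 1:", sample_var)
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the gap shrinks as the number of data points grows.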

Standard Deviation vs. Variance

The standard deviation is the square root of the variance. In practice, you usually use the standard
deviation rather than the variance, because the standard deviation shares the same units as our
original data, while the variance has squared units.

What Next?

In the next sections, we will be looking at the last two aspects of quantitative variables: shape and
outliers. What we know about measures of center and measures of spread will assist in your
understanding of these final two aspects.

27/12/2016 or MONDAY, SEPTEMBER 2, 2024

Descriptive Statistics Summary


Recap

Variable Types

We have covered a lot up to this point! We started with identifying data types as either categorical or
quantitative. We then learned we could identify quantitative variables as either continuous or discrete.
We also found we could identify categorical variables as either ordinal or nominal.

Categorical Variables

When analyzing categorical variables, we commonly just look at the count or percent of a group that
falls into each level of a category. For example, suppose a dog category has two levels: lab and not lab.
We might say that 32% of the dogs were labs (percent), or that 32 of the 100 dogs we saw were labs
(count).

However, the 4 aspects associated with describing quantitative variables are not used to describe
categorical variables.

Quantitative Variables

Then we learned there are four main aspects used to describe quantitative variables:

Measures of Center

Measures of Spread

Shape of the Distribution

Outliers

Measures of Center

We looked at calculating measures of Center

Means

Medians

Modes

Measures of Spread

We also looked at calculating measures of Spread

Range

Interquartile Range

Standard Deviation

Variance

Shape

We learned that the distribution of our data is frequently associated with one of the three shapes:
1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Depending on the shape associated with our dataset, certain measures of center or spread may be
better for summarizing our dataset.

When we have data that follows a normal distribution, we can completely understand our dataset using
the mean and standard deviation.

However, if our dataset is skewed, the 5 number summary (and measures of center associated with it)
might be better to summarize our dataset.

Outliers

We learned that outliers have a larger influence on measures like the mean than on measures like the
median. We learned that we should work with outliers on a situation by situation basis. Common
techniques include:

1. At least note that they exist and their impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist, and their impact on the questions we are trying to answer about our data.

4. Reporting the 5 number summary values is often a better indication than measures like the mean and
standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.
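
The effect of an outlier on these summaries can be sketched as follows. The salary values are made up, and `five_number_summary` is a hypothetical helper. Quartile conventions differ between tools, so the Q1 and Q3 values depend on the assumed "inclusive" method of `statistics.quantiles`.

```python
import statistics

# Made-up salaries in thousands of dollars.
salaries = [48, 50, 52, 55, 58, 60, 62]
with_outlier = salaries + [500]  # one extreme value added


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


for name, values in [("Without outlier", salaries),
                     ("With outlier", with_outlier)]:
    print(name)
    print("  mean:", statistics.mean(values))
    print("  median:", statistics.median(values))
    print("  standard deviation:", round(statistics.pstdev(values), 2))
    print("  five-number summary:", five_number_summary(values))
```

The single extreme value roughly doubles the mean and inflates the standard deviation, while the median and quartiles barely move.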

Histograms and Box Plots

We also looked at histograms and box plots to visualize our quantitative data. Identifying outliers and
the shape associated with the distribution of our data is easier when using a visual than when using
summary statistics.

What Next?

Up to this point, we have only looked at Descriptive Statistics, because we are describing our collected
data. In the final sections of this lesson, we will be looking at the difference between Descriptive
Statistics and Inferential Statistics.

Descriptive vs. Inferential Statistics


Video Transcript
0:00 The topics covered thus far have all been aimed at descriptive statistics.
0:06 That is, describing the data we've collected.
0:09 There's an entire other field of statistics
0:13 known as inferential statistics that's aimed at drawing
0:16 conclusions about a population of individuals
0:19 based only on a sample of individuals from that population.
0:23 Imagine I want to understand what proportion of all Udacity students drink coffee.
0:30 We know you're busy,
0:31 and in order to get projects in on time,
0:33 we assume you all must drink a ton of coffee.
0:37 I send out an email to all Udacity alumni and
0:40 current students asking the question, do you drink coffee?
0:44 For purposes of this exercise,
0:46 let's say the list contained 100,000 emails.
0:50 Unfortunately, not everyone responds to my email blast.
0:54 Some of the emails don't even go through.
0:57 Therefore, I only receive 5,000 responses.
1:00 I find that 73% of the individuals that responded to my email blast
1:05 say they do drink coffee.
1:08 Descriptive statistics is about describing the data we have.
1:13 That is, any information we have and share regarding the 5,000 responses is descriptive.
1:20 Inferential statistics is about drawing conclusions
1:24 regarding the coffee drinking habits of all Udacity students,
1:28 only using the data from the 5,000 responses.
1:32 Therefore, inferential statistics in our example is all about drawing conclusions
1:38 regarding all 100,000 Udacity students using only the 5,000 responses from our sample.
1:45 The general language associated with this scenario is as shown here.
1:50 We have a population, which is our entire group of interest.
1:54 In our case, the 100,000 students.
1:57 We collect a subset from this population, which we call a sample.
2:01 In our case, the 5,000 students.
2:05 Any numeric summary calculated from the sample is called a statistic.
2:10 In our case, the 73% of the 5,000 that drink coffee.
2:15 This 73% is the statistic.
2:18 A numeric summary of the population is known as a parameter.
2:23 In our case, we don't know this value as it's
2:27 a number that requires information from all Udacity students.
2:31 Drawing conclusions regarding a parameter based on our statistics is known as inference.

In this section, we learned about how Inferential Statistics differs from Descriptive Statistics.

Descriptive Statistics

Descriptive statistics is about describing our collected data.

Inferential Statistics

Inferential statistics is about using our collected data to draw conclusions about a larger population.

We looked at specific examples that allowed us to identify the following:

Population - our entire group of interest

Parameter - a numeric summary about a population

Sample - a subset of the population

Statistic - a numeric summary about a sample
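
The four terms can be tied together with a small simulation. Everything below is hypothetical. The population is generated by the code, and the true proportion of 0.70 is an arbitrary choice for illustration, since in the coffee example the parameter is unknown. The sketch also assumes a simple random sample, which an email survey with non-response would not guarantee.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: True means the student drinks coffee.
POPULATION_SIZE = 100_000
ASSUMED_TRUE_PROPORTION = 0.70  # arbitrary value chosen for illustration
population = [random.random() < ASSUMED_TRUE_PROPORTION
              for _ in range(POPULATION_SIZE)]

# Parameter: numeric summary of the whole population (usually unknown).
parameter = sum(population) / len(population)

# Sample: a subset of the population (here a simple random sample of 5,000).
sample = random.sample(population, 5_000)

# Statistic: numeric summary of the sample, used to infer the parameter.
statistic = sum(sample) / len(sample)

print(f"Parameter (population proportion): {parameter:.3f}")
print(f"Statistic (sample proportion):     {statistic:.3f}")
```

In a real study only the statistic is available, and inference is the step of reasoning from it back to the parameter.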
