EDA_W3_Obtaining-Data

The document outlines various statistical measures for data analysis, including measures of central tendency (mean, median, mode), measures of variation (range, variance, standard deviation), and measures of shape (skewness, kurtosis). It provides definitions, computational procedures, and examples for each measure, emphasizing their importance in understanding data distribution and variability. Additionally, it includes exercises for calculating these measures using grouped data.

Uploaded by

Grizelle Mae

ENGINEERING DATA ANALYSIS

Obtaining Data

Editha A. Macorol
Measures for Describing Data

• Measure of Central Tendency
- Also known as Measure of Central Location
- Describes the center of the dataset through the mean, median, or mode

• Measure of Position
- Locates the kth element of the distribution

• Measure of Variation
- Describes how the data are distributed about the mean

• Measure of Shape
- Describes the degree of symmetry of a distribution
The Mean
Weighted Mean
Example:
1. The Carter Construction Company pays its hourly
employees $16.50, $19.00, or $25.00 per hour. There
are 26 hourly employees, 14 of which are paid at the
$16.50 rate, 10 at the $19.00 rate, and 2 at $25.00
rate. What is the mean hourly rate paid of the 26
employees?
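The weighted mean in this example can be checked with a short Python sketch. The function name `weighted_mean` is only illustrative; the rates and head counts are the ones stated in the problem.

```python
# Hourly rates and number of employees at each rate (Carter Construction example).
rates = [16.50, 19.00, 25.00]
counts = [14, 10, 2]

def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

mean_rate = weighted_mean(rates, counts)
print(f"Mean hourly rate: ${mean_rate:.2f}")  # Mean hourly rate: $18.12
```

The numerator is 14(16.50) + 10(19.00) + 2(25.00) = 471, so the mean is 471 / 26, or about $18.12 per hour.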
The Median
Characteristics
• There is a unique median for each data set.
• It is not affected by extremely large or small values and
is therefore a valuable measure of central tendency
when such values occur.
• It can be computed for ratio-level, interval-level, and
ordinal-level data.
• It can be computed for an open-ended frequency
distribution if the median does not lie in an open-
ended class.
The Median
• The midpoint of the values after they have been
ordered from the smallest to largest
• There are as many values above the median as below it
in the data array.
• For an even set of values, the median will be the
arithmetic average of the two middle numbers.
Median: Computational Procedure
First Procedure
Arrange the observations in an ordered array.
If there is an odd number of terms, the median is the middle term of
the ordered array.
If there is an even number of terms, the median is the average of the
middle two terms.
Second Procedure
The median’s position in an ordered array is given by (n+1)/2.
The Median
Example:

Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22

There are 17 terms in the ordered array.


Position of median = (n+1)/2 = (17+1)/2 = 9
The median is the 9th term, 15.
If the 22 is replaced by 100, the median is 15.
If the 3 is replaced by -103, the median is 15.
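The position rule and the effect of extreme values can be illustrated with a minimal Python sketch. The helper name `median_by_position` is an assumption made for this illustration; it follows the first procedure (middle term for an odd count, average of the two middle terms for an even count).

```python
ordered = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]

def median_by_position(values):
    """Median via the (n + 1) / 2 position in the ordered array."""
    data = sorted(values)
    n = len(data)
    mid = n // 2  # 0-based index of the ((n + 1) / 2)-th term when n is odd
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median_by_position(ordered))                  # 15
print(median_by_position(ordered[:-1] + [100]))     # 15 (22 replaced by 100)
print(median_by_position([-103] + ordered[1:]))     # 15 (3 replaced by -103)
```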
The Mode
• The value of the observation that appears most
frequently
The Mode
Class Interval Frequency
25-29 1
30-34 1
35-39 5
40-44 8
45-49 15
50-54 4
55-59 4
60-64 3
65-69 4
70-74 3
75-79 2
Mean of Grouped Data
$\bar{x} = \frac{\sum f x}{n}$
• $n$ – sample size
• $x$ – class mark
• $f$ – frequency
Median of Grouped Data

$mdn = x_{lb} + \left( \frac{\frac{n}{2} - cf}{f_m} \right) i$

• $x_{lb}$ - lower boundary of the median class
• $cf$ - cumulative frequency for class interval preceding the median class
• $f_m$ - frequency in the median class
• $i$ – class width or the interval size
• $n$ – sample size
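As a rough sketch, the grouped-median computation can be written in Python using the frequency table shown earlier. The function name `grouped_median` and the tuple layout are assumptions for illustration, and the lower boundary is taken as the lower class limit minus 0.5, which presumes integer class limits.

```python
# Frequency table from the slides: (lower limit, upper limit, frequency).
classes = [
    (25, 29, 1), (30, 34, 1), (35, 39, 5), (40, 44, 8), (45, 49, 15),
    (50, 54, 4), (55, 59, 4), (60, 64, 3), (65, 69, 4), (70, 74, 3),
    (75, 79, 2),
]

def grouped_median(classes):
    """Lower boundary + ((n/2 - cf) / f_m) * class width."""
    n = sum(f for _, _, f in classes)
    width = classes[1][0] - classes[0][0]  # class width (5 here)
    cf = 0  # cumulative frequency before the current class
    for lower, _, f in classes:
        if f > 0 and cf + f >= n / 2:
            lower_boundary = lower - 0.5  # assumes integer class limits
            return lower_boundary + (n / 2 - cf) / f * width
        cf += f
    raise ValueError("frequency table is empty")

median = grouped_median(classes)
print(round(median, 2))  # 47.83
```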
Mode of Grouped Data

$mo = x_{lb} + \left( \frac{d_1}{d_1 + d_2} \right) i$

• $x_{lb}$ - lower boundary of the modal class
• $d_1$ - difference between the frequency in the modal class and the frequency in the preceding class interval
• $d_2$ - difference between the frequency in the modal class and the frequency in the succeeding class interval
• $i$ – class width or the interval size
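The grouped-mode computation can be sketched the same way. The name `grouped_mode` is hypothetical, the table is the one from the slides, and a missing neighbouring class is treated as having frequency 0, which is an assumption of this sketch.

```python
# Frequency table from the slides: (lower limit, upper limit, frequency).
classes = [
    (25, 29, 1), (30, 34, 1), (35, 39, 5), (40, 44, 8), (45, 49, 15),
    (50, 54, 4), (55, 59, 4), (60, 64, 3), (65, 69, 4), (70, 74, 3),
    (75, 79, 2),
]

def grouped_mode(classes):
    """Lower boundary + (d1 / (d1 + d2)) * class width for the modal class."""
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))  # first class with the highest frequency
    prev_f = freqs[k - 1] if k > 0 else 0
    next_f = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1 = freqs[k] - prev_f
    d2 = freqs[k] - next_f
    if d1 + d2 == 0:
        raise ValueError("modal class is not defined for this table")
    width = classes[1][0] - classes[0][0]
    lower_boundary = classes[k][0] - 0.5  # assumes integer class limits
    return lower_boundary + d1 / (d1 + d2) * width

mode = grouped_mode(classes)
print(round(mode, 2))  # 46.44
```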


Measures of Location
Quartiles
• Dividing the dataset into 4 groups.

Deciles
• Dividing the dataset into 10 groups.

Percentiles
• Dividing the dataset into 100 groups.
Measures of Location
• Quartile – One fourth
First (1/4), Second (1/2), Third (3/4)
Quartile locator (Lq):
• Decile – One tenth
10%, 20%, …, 90%
Decile locator (Ld):
• Percentile − One hundredth
1%, 2%, …, 99%
Measures of Variation (Dispersion)
Why study dispersion?
• A measure of location, such as the mean or the median, does not tell us anything about the spread of the data.
• For example, if your nature guide told you that the river averaged 3 feet in depth, would you want to wade across on foot without additional information? Probably not. You would want to know something about the variation in depth.
• A second reason is to compare the spread in two or more distributions.
• Measures of dispersion describe the average distance of each observation from the center of the distribution.
• They measure the homogeneity or heterogeneity of a particular group.
Measures of Variation
• Range
- The difference between the largest and smallest number in
the set
• Interquartile Range
- Range of values between the first and third quartiles
- Range of the “middle half”
• Mean Deviation
- The average of the absolute (unsigned) deviations from the mean
• Variance
- The average of the squared deviations from the mean
• Standard Deviation (SD)
- The population/sample standard deviation is given as the
positive square root of population/sample variance
• Coefficient of Variation (CV)
- The percentage of the ratio of standard deviation to the
mean
Range
R = H − L, where H is the highest value and L is the lowest value
Consider the following data.
Grades in Statistics
Males          Females
Jon    100     Ann    84
Ron     65     Ria    86
Dan     75     Let    85
Tom     85     Bel    82
Bob     95     Nel    83
Range   35     Range   4
Range
Conclusion: Grades of males are more scattered while grades of females are more compressed. Females are more homogeneous in their statistics grades.

Disadvantages of the range:
1. Unstable for a very large class
2. Unreliable since only two values are taken into account
3. Ranges of two sets of data with unequal numbers of scores are not directly comparable
Variance and Standard Deviation
• Sample variance ($s^2$)
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

• Sample standard deviation ($s$)
- Positive square root of $s^2$

The quantity $n - 1$ is often called the degrees of freedom associated with the variance estimate.
• Mean Deviation
Variance and Standard Deviation
• Population variance ($\sigma^2$)
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

• Population standard deviation ($\sigma$)
- Positive square root of $\sigma^2$
Variance
Determine the variance in the previous example treating
the data as a population and sample.
Grades in Statistics
Males          Females
Jon    100     Ann    84
Ron     65     Ria    86
Dan     75     Let    85
Tom     85     Bel    82
Bob     95     Nel    83
Mean    84     Mean   84
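Variance
Males
Sum of squared deviations from the mean: 256 + 361 + 81 + 1 + 121 = 820
Population: $\sigma^2 = \frac{820}{5} = 164$
Sample: $s^2 = \frac{820}{4} = 205$
Variance
Females
Sum of squared deviations from the mean: 0 + 4 + 1 + 4 + 1 = 10
Population: $\sigma^2 = \frac{10}{5} = 2$
Sample: $s^2 = \frac{10}{4} = 2.5$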
Variance
Conclusion: Males showed more variability. The higher
the variance, the more variable or far apart the values are
from each other.

Remark: Since the variance is in squared units, it does not reflect the true meaning of data being measured.
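Standard Deviation
Males
Population: $\sigma = \sqrt{164} \approx 12.81$
Sample: $s = \sqrt{205} \approx 14.32$

Females
Population: $\sigma = \sqrt{2} \approx 1.41$
Sample: $s = \sqrt{2.5} \approx 1.58$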
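The range, variance, and standard deviation of the two groups of grades can be checked with Python's standard `statistics` module. This is a minimal sketch; `pvariance` and `pstdev` treat the data as a population, while `variance` and `stdev` treat it as a sample.

```python
import statistics

males = [100, 65, 75, 85, 95]
females = [84, 86, 85, 82, 83]

results = {}
for label, grades in (("Males", males), ("Females", females)):
    results[label] = {
        "range": max(grades) - min(grades),
        "pop_var": statistics.pvariance(grades),   # divides by N
        "samp_var": statistics.variance(grades),   # divides by n - 1
        "pop_sd": statistics.pstdev(grades),
        "samp_sd": statistics.stdev(grades),
    }
    r = results[label]
    print(f"{label}: range={r['range']}, "
          f"pop var={r['pop_var']:.2f}, sample var={r['samp_var']:.2f}, "
          f"pop SD={r['pop_sd']:.2f}, sample SD={r['samp_sd']:.2f}")
```

Both groups have the same mean of 84, so the difference between them shows up only in the measures of variation.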
Measures of Variation
Example:

Consider the following test scores:


Test      1   2   3   4   5   6   7   8   9  10
Student  12   6  13   2   5   0   9   6  10   7
Student   8  10   9  12   5   1   4   7   9   3
a. Who performed better?
b. Who is more consistent?
Measures of Variation
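a. Compute the average score of each student.

The averages are 7.0 for the student in the first row and 6.8 for the student in the second row. The student in the first row performed better because of the higher computed average.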
Measures of Variation
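b. Compute the sample standard deviations.

The sample standard deviations are about 4.14 for the student in the first row and 3.46 for the student in the second row. The student in the second row is more consistent because of the lower standard deviation.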
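Both answers can be reproduced with a few lines of Python. The student labels did not survive in the table, so the variable names `first_row` and `second_row` are placeholders that simply follow the order of the rows.

```python
import statistics

# Scores on tests 1 to 10, in the order the two rows appear in the table.
first_row = [12, 6, 13, 2, 5, 0, 9, 6, 10, 7]
second_row = [8, 10, 9, 12, 5, 1, 4, 7, 9, 3]

mean_first = statistics.mean(first_row)
mean_second = statistics.mean(second_row)
sd_first = statistics.stdev(first_row)    # sample standard deviation
sd_second = statistics.stdev(second_row)

print(f"Averages: {mean_first:.1f} vs {mean_second:.1f}")
print(f"Sample SDs: {sd_first:.2f} vs {sd_second:.2f}")
```

A higher average indicates better performance, and a lower standard deviation indicates more consistent scores.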
Measures of Variation
Remark: Standard deviation and variance are both reliable
but cannot be used in comparing two sets of data of
different units.

Example: Consistency of a player − assist or making points


Measures of Variation

• Interquartile Range (IR)
$IR = Q_3 - Q_1$

• Quartile Deviation (QD)
$QD = \frac{Q_3 - Q_1}{2}$

• Coefficient of Variation
$CV = \frac{s}{\bar{x}} (100\%)$
Coefficient of Variation
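As an illustration, the coefficient of variation can be computed for the two groups of statistics grades used earlier. The function name `coefficient_of_variation` is an assumption of this sketch, and the sample standard deviation is used for s.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (s / mean) * 100, using the sample standard deviation."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100

males = [100, 65, 75, 85, 95]
females = [84, 86, 85, 82, 83]

cv_males = coefficient_of_variation(males)
cv_females = coefficient_of_variation(females)
print(f"CV males: {cv_males:.2f}%")      # CV males: 17.05%
print(f"CV females: {cv_females:.2f}%")  # CV females: 1.88%
```

Because the CV is a unit-free percentage, it can also be used to compare data sets measured in different units.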
Measures of Shape
• Skewness
- Degree of asymmetry of a distribution about its mean. It is a measure of how far the data depart from being symmetrical
- Can be interpreted as symmetric, positively skewed or negatively skewed

• Kurtosis
- The degree of peakedness exhibited by the distribution
- Computed from the fourth moment about the mean
Skewness
Pearsonian Coefficient of Skewness (Pearson’s Coefficient of Skewness)

$Sk = \frac{3(\bar{x} - mdn)}{s}$
Interpretation of values:
1. Sk < 0, “negatively skewed” or “skewed to the left”
2. Sk = 0, symmetrical
3. Sk > 0, “positively skewed” or “skewed to the right”
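A minimal sketch of this interpretation in Python, assuming the coefficient is computed as 3(mean − median) / s, which is the form used in the worked grouped-data solution later in these slides. The function names are illustrative, and the inputs are the mean, median, and standard deviation from that solution.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / std_dev

def interpret_skewness(sk):
    """Map the sign of Sk to the interpretation listed above."""
    if sk < 0:
        return "negatively skewed"
    if sk > 0:
        return "positively skewed"
    return "symmetrical"

# Summary values from the worked grouped-data example.
sk = pearson_skewness(mean=50.9, median=47.83, std_dev=11.96)
print(round(sk, 2), interpret_skewness(sk))  # 0.77 positively skewed
```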
Skewness
• A measure of the asymmetry of the frequency distribution

a. Positive skewness: mode < median < mean


b. Symmetrical: mode = median = mean
c. Negative skewness: mode > median > mean
Skewness
Other formulas

Interpretation of values from formulas above:


1. Sk < 0, “negatively skewed” or “skewed to the left”
2. Sk = 0, symmetrical
3. Sk > 0, “positively skewed” or “skewed to the right”
Kurtosis
• A measure of the degree to which a uni-modal
distribution is peaked
• The state or quality of flatness or peakedness of the
curve describing a frequency distribution about its
mode

[Figure: leptokurtic, mesokurtic, and platykurtic curves]
Kurtosis
Moment Based Coefficient of Kurtosis

Interpretation of values from formulas above:
1. K < 3, “platykurtic”
2. K = 3, “mesokurtic”
3. K > 3, “leptokurtic”
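The slide's own formula did not survive, so the following Python sketch assumes the common moment-based definition K = m4 / m2², where m2 and m4 are the second and fourth moments about the mean computed with divisor n. The function names are hypothetical, and the data are the ordered array from the median example.

```python
def moment_kurtosis(values):
    """Assumed definition: K = m4 / m2**2 with moments taken about the mean."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2

def interpret_kurtosis(k, tol=1e-9):
    """Apply the K < 3, K = 3, K > 3 rule listed above."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "platykurtic" if k < 3 else "leptokurtic"

ordered = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
k = moment_kurtosis(ordered)
print(round(k, 2), interpret_kurtosis(k))
```

Other textbooks subtract 3 from this ratio (excess kurtosis), in which case the cut-off in the interpretation becomes 0 instead of 3.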
SOLVE THE FOLLOWING:
1. Mean
2. Median
3. Mode
4. 1st quartile
5. 3rd Quartile
6. 35th Percentile
7. 67th Percentile
8. IQR
9. Mean Deviation
10. Standard Deviation
11. Skewness
12. Kurtosis
Class Interval   Frequency
25-29             1
30-34             1
35-39             5
40-44             8
45-49            15
50-54             4
55-59             4
60-64             3
65-69             4
70-74             3
75-79             2
Class Interval   Frequency (f)   Class Mark (x)   Cumulative Frequency (cf)
25-29             1              27                1
30-34             1              32                2
35-39             5              37                7
40-44             8              42               15
45-49            15              47               30
50-54             4              52               34
55-59             4              57               38
60-64             3              62               41
65-69             4              67               45
70-74             3              72               48
75-79             2              77               50
Class Interval   f    x   cf     fx
25-29            1   27    1     27
30-34            1   32    2     32
35-39            5   37    7    185
40-44            8   42   15    336
45-49           15   47   30    705
50-54            4   52   34    208
55-59            4   57   38    228
60-64            3   62   41    186
65-69            4   67   45    268
70-74            3   72   48    216
75-79            2   77   50    154
Total           50             2545
Class Interval   f    x   cf     fx   f|x − x̄|   f(x − x̄)²    f(x − x̄)⁴
25-29            1   27    1     27     23.9       571.21     326280.86
30-34            1   32    2     32     18.9       357.21     127598.98
35-39            5   37    7    185     69.5       966.05     186650.52
40-44            8   42   15    336     71.2       633.68      50193.79
45-49           15   47   30    705     58.5       228.15       3470.16
50-54            4   52   34    208      4.4         4.84          5.86
55-59            4   57   38    228     24.4       148.84       5538.34
60-64            3   62   41    186     33.3       369.63      45542.11
65-69            4   67   45    268     64.4      1036.84     268759.30
70-74            3   72   48    216     63.3      1335.63     594635.83
75-79            2   77   50    154     52.2      1362.42     928094.13
Total           50             2545    484        7014.5     2536769.88
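The column totals in this table can be reproduced with a short Python sketch. The variable names are illustrative. The divisor n − 1 = 49 for both the mean deviation and the standard deviation follows the worked solution in these slides; many texts divide the mean deviation by n instead.

```python
import math

# Frequency table from the slides: (lower limit, upper limit, frequency).
table = [
    (25, 29, 1), (30, 34, 1), (35, 39, 5), (40, 44, 8), (45, 49, 15),
    (50, 54, 4), (55, 59, 4), (60, 64, 3), (65, 69, 4), (70, 74, 3),
    (75, 79, 2),
]

freqs = [f for _, _, f in table]
marks = [(lo + hi) / 2 for lo, hi, _ in table]  # class marks x

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, marks))
mean = sum_fx / n

# Totals of the three deviation columns.
sum_abs = sum(f * abs(x - mean) for f, x in zip(freqs, marks))
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks))
sum_fourth = sum(f * (x - mean) ** 4 for f, x in zip(freqs, marks))

mean_dev = sum_abs / (n - 1)           # divisor taken from the worked solution
std_dev = math.sqrt(sum_sq / (n - 1))  # sample standard deviation

print(f"n = {n}, sum fx = {sum_fx:.0f}, mean = {mean:.1f}")
print(f"sum f|x - mean| = {sum_abs:.1f}, sum f(x - mean)^2 = {sum_sq:.1f}")
print(f"mean deviation = {mean_dev:.2f}, s = {std_dev:.2f}")
```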
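$\bar{x} = \frac{\sum f x}{n} = \frac{2545}{50} = 50.9$
$n$ – sample size
$x$ – class mark
$f$ – frequency

$mdn = x_{lb} + \left( \frac{\frac{n}{2} - cf}{f_m} \right) i$
$x_{lb}$ - lower boundary of the median class
$cf$ - cumulative frequency for class interval preceding the median class
$f_m$ - frequency in the median class
$i$ – class width or the interval size
$n$ – sample size

$mdn = 44.5 + \left( \frac{25 - 15}{15} \right) 5$
$mdn = 47.83$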
$mo = x_{lb} + \left( \frac{d_1}{d_1 + d_2} \right) i$
$x_{lb}$ - lower boundary of the modal class
$d_1$ - difference between the frequency in the modal class and the frequency in the preceding class interval
$d_2$ - difference between the frequency in the modal class and the frequency in the succeeding class interval
$i$ – class width or the interval size

$mo = 44.5 + \left( \frac{7}{7 + 11} \right) 5$
$mo = 46.44$
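$x_{lb}$ - lower boundary of the 1st quartile class
$cf$ - cumulative frequency for class interval preceding the 1st quartile class
$f_m$ - frequency in the 1st quartile class
$i$ – class width or the interval size
$n$ – sample size

$Q_1 = 39.5 + \left( \frac{12.5 - 7}{8} \right) 5$
$Q_1 = 42.94$

$Q_3 = 54.5 + \left( \frac{37.5 - 34}{4} \right) 5$
$Q_3 = 58.88$

$P_{35} = x_{lb} + \left( \frac{\frac{35n}{100} - cf}{f_m} \right) i = 44.5 + \left( \frac{17.5 - 15}{15} \right) 5$
$P_{35} = 45.33$

$P_{67} = x_{lb} + \left( \frac{\frac{67n}{100} - cf}{f_m} \right) i = 49.5 + \left( \frac{33.5 - 30}{4} \right) 5$
$P_{67} = 53.88$

$D_3 = 39.5 + \left( \frac{15 - 7}{8} \right) 5$
$D_3 = 44.5$

$D_8 = 59.5 + \left( \frac{40 - 38}{3} \right) 5$
$D_8 = 62.83$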
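The quartile, decile, and percentile solutions all follow one pattern, which can be sketched as a single Python function. The name `grouped_quantile` and its `(k, parts)` arguments are assumptions for this illustration: `k` out of `parts` equal groups, so Q1 is (1, 4), D8 is (8, 10), and P35 is (35, 100). The lower boundary is taken as the lower class limit minus 0.5.

```python
# Frequency table from the slides: (lower limit, upper limit, frequency).
classes = [
    (25, 29, 1), (30, 34, 1), (35, 39, 5), (40, 44, 8), (45, 49, 15),
    (50, 54, 4), (55, 59, 4), (60, 64, 3), (65, 69, 4), (70, 74, 3),
    (75, 79, 2),
]

def grouped_quantile(classes, k, parts):
    """Lower boundary + ((k*n/parts - cf) / f) * class width."""
    n = sum(f for _, _, f in classes)
    target = k * n / parts
    width = classes[1][0] - classes[0][0]
    cf = 0  # cumulative frequency before the current class
    for lower, _, f in classes:
        if f > 0 and cf + f >= target:
            return (lower - 0.5) + (target - cf) / f * width
        cf += f
    raise ValueError("quantile lies outside the frequency table")

q1 = grouped_quantile(classes, 1, 4)
q3 = grouped_quantile(classes, 3, 4)
p35 = grouped_quantile(classes, 35, 100)
p67 = grouped_quantile(classes, 67, 100)
d3 = grouped_quantile(classes, 3, 10)
d8 = grouped_quantile(classes, 8, 10)
iqr = q3 - q1

for name, value in [("Q1", q1), ("Q3", q3), ("P35", p35),
                    ("P67", p67), ("D3", d3), ("D8", d8), ("IQR", iqr)]:
    print(f"{name} = {value:.2f}")
```

Calling `grouped_quantile(classes, 1, 2)` gives the grouped median, since the median is the second quartile.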


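$IR = Q_3 - Q_1 = 58.88 - 42.94$
$IR = 15.94$

$md = \frac{484}{49}$
$md = 9.9$

$s = \sqrt{\frac{7014.5}{49}}$
$s = 11.96$

$Sk = \frac{3(50.9 - 47.83)}{11.96}$
$Sk = 0.77$

$k = \frac{2536769.88}{50\,(\ldots)}$
$k = 1.12$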