Probability and Statistics, Lecture 4
Dr. Sumeyye BAKIM
2024
1
Outline

• Inferential Statistics
✓ The Normal Curve
✓ Sample and Population
✓ Probability

2
The Normal Curve
Graphs showing the distributions of some of the variables engineers work with are unimodal, roughly
symmetric, bell-shaped curves. These bell-shaped, smooth histograms represent a precise and
significant mathematical distribution called the normal distribution, or more simply, the normal
curve. The normal curve is a mathematical (or theoretical) distribution. Researchers often compare
the actual distributions of the variables they study (i.e., the distributions they find in their studies)
with the normal curve. They do not expect the distributions of their variables to match the normal
curve perfectly (as the normal curve is a theoretical distribution); however, they check whether the
distributions of their variables are approximately normal.

A normal curve. 3
For example, let’s consider the number of different letters a specific person can correctly remember
across various tests (with different random letters each time). In some tests, the number of
remembered letters is high, in others it’s low, and in most cases, it may fall somewhere in between.
In other words, the number of different letters a person can remember across various tests likely
follows a normal curve.

Let’s assume that the person has a basic ability to remember seven letters in such a memory task.
However, the actual number remembered in any given test will be influenced by various factors—
such as the noise in the room, their current mood, a combination of random letters resembling a
familiar name, etc.

These various effects cause the person to remember more than seven letters in some tests and
fewer than seven in others. However, the specific combination of these effects that emerges in any
given test is essentially random; therefore, in most tests, the positive and negative effects should
cancel each other out. It is unlikely that all of the random positive effects, and none of the
random negative effects, will come together in a single test (or the reverse). Thus, in general, the person remembers an
average number of letters where all opposing effects cancel each other out. Very high or very low
scores are much less common.

This creates a distribution where the scores are concentrated around the midpoint, with fewer
scores at the extreme points => Normal Distribution
4
Central Limit Theorem
The Central Limit Theorem states that, regardless of the shape of the population
distribution, the sampling distribution approaches a normal distribution as the sample size
increases.

As the sample size increases beyond n > 30, the sampling distribution of the mean becomes
narrower and more sharply peaked around the population mean.
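As a quick illustration (not part of the original slides), the following Python sketch, assuming NumPy is installed, draws repeated samples from a deliberately skewed population and shows that the means of larger samples cluster more tightly, and more symmetrically, around the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential with mean 1 (far from normal).
population = rng.exponential(scale=1.0, size=100_000)

for n in (2, 30, 200):
    # Draw 10,000 samples of size n and record each sample mean.
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    # As n grows, the sample means cluster more tightly around the
    # population mean and their distribution looks increasingly normal.
    print(f"n={n:3d}  mean of sample means={means.mean():.3f}  "
          f"SD of sample means={means.std():.3f}")
```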
The Normal Curve and the Percentage of Scores Between the Mean

The shape of the normal curve is standard. Therefore, a known percentage of scores falls above or below a certain score. For example, exactly 50% of the scores in a normal curve are below the mean, as half of the scores in any symmetric distribution are below the mean. More interestingly, as shown in the figure, approximately 34% of the scores always fall within about 1 standard deviation of the mean.

Reminder: The Z-score is a unit of measurement given in terms of standard deviations (it indicates how many standard deviations a score is from the mean).

6
Ex 1: IQ Test
In commonly used intelligence tests, the average IQ is set at 100, with a standard deviation of 15,
and IQ scores are represented using a normal curve.

A person in the top 2% is expected to have a score at least 2 standard deviations above the mean, meaning above 130.

Since the normal curve and the percentage of scores around the mean are known, one standard
deviation above the mean shows that 34% of IQ scores fall between 100 and 115.

Similarly, because the normal curve is symmetric, we conclude that another 34% of IQ scores fall
between 100 and 85. Thus, 68% (34% + 34%) of scores are within the range of 85 to 115.

Additionally, 14% of scores fall between 115 (one standard deviation above the mean) and 130 (two
standard deviations above the mean).
7
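A short Python sketch (assuming SciPy is available, not part of the slides) reproduces the approximate 34%, 14%, and 2% figures quoted above for the IQ example with mean 100 and SD 15:

```python
from scipy.stats import norm

mean, sd = 100, 15   # IQ example from the slide

# % of IQ scores between 100 and 115 (mean to +1 SD): ~34%
print(round((norm.cdf(115, mean, sd) - norm.cdf(100, mean, sd)) * 100, 2))  # 34.13
# % between 115 and 130 (+1 SD to +2 SD): ~14%
print(round((norm.cdf(130, mean, sd) - norm.cdf(115, mean, sd)) * 100, 2))  # 13.59
# % above 130 (+2 SD), i.e. roughly the "top 2%"
print(round((1 - norm.cdf(130, mean, sd)) * 100, 2))                        # 2.28
```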
Ex 2: Equipment Selection
Imagine you need to select equipment for a project. Assume that you want to choose equipment
with a typical level of precision, avoiding the extremes (not the highest or lowest precision). The
precision capabilities of the equipment follow a normal distribution, and we are looking for the
middle 2/3 group that represents average performance.

2/3 = 66.6% ≈ 68%

In this case, the equipment should be selected from the range between 1 standard deviation above and
1 standard deviation below the mean (34% + 34% = 68% — the desired percentage).

Remember that 1 standard deviation above the mean is represented by Z = +1, and 1 standard
deviation below the mean is represented by Z = -1.

(We discussed converting raw scores to Z-scores and transforming Z-scores back to raw scores in the
previous class.)
8
The Normal Curve Table and Z-Scores
The normal curve table shows the percentages of scores associated with the normal curve. It usually includes the percentage of scores between the mean and various numbers of standard deviations above the mean, and the percentage of scores more positive than (farther into the tail beyond) various numbers of standard deviations above the mean.
The percentages 50%, 34%, and 14%
are important practical guidelines when
working with a group of scores that
follow a normal distribution. However,
in many research and applied
situations, scientists need more precise
information.
Since the normal curve is an exact
mathematical curve, you can determine
the exact percentage of scores between
any two points on the normal curve.

According to the table, when the Z-score is 0.07, the percentage of scores between the mean and that Z-score is 2.79%, and the percentage of scores in the tail beyond it is 47.21%.

Can't a Z-score be negative?
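The table values quoted above can be checked with a small SciPy sketch (not part of the slides); norm.cdf gives the exact cumulative proportion, from which both the "% Mean to Z" and "% in Tail" figures follow:

```python
from scipy.stats import norm

z = 0.07
# "% Mean to Z": percentage of scores between the mean (Z = 0) and Z = 0.07
mean_to_z = (norm.cdf(z) - 0.5) * 100      # ~2.79
# "% in Tail": percentage of scores above Z = 0.07
in_tail = (1 - norm.cdf(z)) * 100          # ~47.21
print(round(mean_to_z, 2), round(in_tail, 2))   # 2.79 47.21
```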
Ex 3: Technical Skills Assessment
Imagine that 30% of engineering students scored higher than
Alex on a technical skills assessment. If these assessment
scores follow a normal curve, you can figure out Alex’s Z-score
as follows: if 30% of students scored higher than he did, then
30% of the scores are in the tail above his score.

To find his Z-score, look at the “% in Tail” column of the Z-table and find the percentage closest to 30%. In this case, the closest value is 30.15%. Now, check the “Z” column to the left of this percentage, which shows a Z-score of 0.52. Thus, Alex’s Z-score for his technical skills assessment is 0.52.

If you know the mean and standard deviation for the technical
skills assessment scores of engineering students, you can
calculate Alex’s actual raw score on the test by converting his
Z-score of 0.52 to a raw score using the formula:

X = (Z × SD) + Mean

where X is Alex’s raw score, Z is his Z-score (0.52), SD is the standard deviation, and Mean is the average score for the engineering students.
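A hedged sketch of the same lookup in Python (SciPy assumed): norm.ppf inverts the curve exactly rather than reading the printed table, and the mean and SD used below are purely hypothetical, since the slide leaves them unspecified:

```python
from scipy.stats import norm

# 30% of scores lie in the tail above Alex, so he is at the 70th percentile.
z_alex = norm.ppf(1 - 0.30)       # ~0.524; the printed table rounds this to 0.52
print(round(z_alex, 2))

# Hypothetical mean and SD for the assessment (not given on the slide),
# used only to show the Z-to-raw-score conversion X = (Z * SD) + Mean.
mean, sd = 70, 10
x_alex = z_alex * sd + mean
print(round(x_alex, 1))           # ~75.2 under these assumed values
```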
The Normal Curve Table

11
12
Steps Required to Find the Percentage of Scores Above or Below a Specific Raw Score or Z-Score Using
the Normal Curve Table:

1. If the initial score is a raw score, first convert it to a Z-score.


Z = (X − M) / SD

2. Draw a normal curve graph. Determine where the Z-score falls on this curve (if the Z-
score is positive, it is above the mean; if the Z-score is negative, it is below the mean) and
shade the area for which you want to find the percentage.

3. Estimate the percentage of the shaded area approximately as 50%, 34%, or 14%.

4. Using the normal curve table, calculate the exact percentage corresponding to the Z-
score.

13
Example 4
Let's assume that in an IQ test, the average IQ is 100, and the standard deviation is 15.

A person's IQ score is 125. What percentage of people have an IQ score higher than 125?

1. Z-score: Z = (125 − 100) / 15 = +1.67

2. (Normal curve figure: the area above Z = +1.67 is shaded.)

3. If the shaded area started exactly at Z=+1, the area above it would be expressed as 16%. If it started exactly
at Z=+2, we would say 2%. In this case, the value should be somewhere between 2% and 16%.

4. According to the table, the tail percentage corresponding to a Z-score of +1.67 is 4.75%. This means that 4.75%
of the people who took the test have an IQ higher than 125.

(4.75 is a value between 2 and 16, confirming the rough estimate.) 14


Example 5
Let's assume that in an IQ test, the average IQ is 100, and the standard deviation is 15.

A person's IQ score is 95. What percentage of people have an IQ score higher than 95?

1. Z-score: Z = (95 − 100) / 15 = −0.33

2. (Normal curve figure: the area above Z = −0.33 is shaded.)

3. If the shaded area started exactly at Z = 0, the area above it would be expressed as 50%. If it started exactly at Z = −1, we would say 84%. In this case, the value should be somewhere between 50% and 84%.

4. According to the table, the percentage between the mean and a Z-score of −0.33 is 12.93%. The area above −0.33 therefore consists of this 12.93% (between −0.33 and the mean) plus the 50% above the mean, for a total of 62.93%.

(62.93 is a value between 50 and 84, confirming the rough estimate.) 15
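The four-step procedure, and both Examples 4 and 5, can be checked with a short SciPy sketch (not part of the slides); note that using the exact (unrounded) Z gives answers a few hundredths of a percentage point away from the table values:

```python
from scipy.stats import norm

def percent_above(raw, mean, sd):
    """Step 1: convert the raw score to a Z-score; then use the exact curve
    (norm.cdf) in place of the printed normal curve table."""
    z = (raw - mean) / sd
    return (1 - norm.cdf(z)) * 100

# Example 4: % of IQ scores above 125 (mean 100, SD 15)
print(round(percent_above(125, 100, 15), 2))  # ~4.78 (table, with Z rounded to +1.67: 4.75)
# Example 5: % of IQ scores above 95
print(round(percent_above(95, 100, 15), 2))   # ~63.06 (table, with Z rounded to -0.33: 62.93)
```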


Steps for Finding Z-Scores and Raw Scores Using Percentages in the Normal Curve
Table:

1. Draw a normal curve graph. Shade the area using approximate percentages of 50%, 34%,
or 14%, based on your percentage.

2. Make a rough estimate of the Z-score based on where the shaded area ends.

3. Use the normal curve table to find the exact Z-score that corresponds to the percentage.

4. Check if your calculated Z-score falls within your estimated range.

5. If you are looking for a raw score, convert from the Z-score using the formula:

X = (Z × SD) + M (the rearranged form of Z = (X − M) / SD)

16
Example 6
Let's assume that in an IQ test, the average IQ is 100, and the standard deviation is 15.

What IQ score does a person need to score within the top 5%?

1. Since a score of 130 corresponds to +2 standard deviations, which is in the top 2%, the top 5%
should be as follows:

2. The Z-score should be between +1 and +2.


3. In the table, look for 5 in the “% in tail” column. According to the table, the closest values are 5.05
(or 4.95). Thus, the corresponding Z-score is +1.64 (or +1.65 for 4.95).

4. The estimate was that the Z-score should be between +1 and +2: +1.64 is within this range.

5. Raw score: X = (Z × SD) + M = (1.64)(15) + 100 = 124.60.


6. (To be in the top 5%, a person needs to score at least 124.60 on the IQ test.) 17
Example 7
Let's assume that in an IQ test, the average IQ is 100, and the standard deviation is 15.

What IQ score does a person need to score within the top 55%?

Since a score of 100 corresponds to 0 standard deviations, the top 55% should be as follows:

1. The Z-score should be between -1 and 0.

2. The percentage between the mean and the Z-score should be 5%. According to the table, the closest value is 5.17%. Thus, the corresponding Z-score is 0.13. Since it is on the left side of the mean, it is −0.13.

3. The estimate was that the Z-score should be between −1 and 0: −0.13 is within this range.

4. Raw score: X = (Z × SD) + M = (−0.13)(15) + 100 = 98.05. (To be in the top 55%, a person needs to score at least 98.05 on the IQ test.)
18
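A small SciPy sketch (not part of the slides) of the reverse procedure used in Examples 6 and 7 — from a percentage to a Z-score to a raw score. The exact inverse, norm.ppf, gives slightly different raw scores than the two-decimal table Z-scores used on the slides:

```python
from scipy.stats import norm

def cutoff_score(top_percent, mean, sd):
    """Raw score needed to be in the top `top_percent` % of a normal distribution."""
    z = norm.ppf(1 - top_percent / 100)   # exact Z; the printed table rounds to 2 decimals
    return z * sd + mean

# Example 6: top 5% on an IQ test (mean 100, SD 15)
print(round(cutoff_score(5, 100, 15), 2))    # ~124.67 (slide: 124.60, using table Z = +1.64)
# Example 7: top 55%
print(round(cutoff_score(55, 100, 15), 2))   # ~98.12  (slide: 98.05, using table Z = -0.13)
```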
Sample and Population
Sample: The scores of the specific group of data being studied, generally accepted as representative of the scores in a larger population.

Population: The entire group of data that a researcher aims to make conclusions about in a study; a larger group from which conclusions are drawn based on the smaller groups (samples) examined.

(a) Populations and samples: The entire pot of beans is the population, and the spoonful is the
sample.
(b) The entire large circle represents the population, and the inner circle is the sample.
(c) The histogram represents the population, with the shaded scores indicating the sample. 19
In engineering studies, samples of data are often analyzed to make inferences about a larger group (population).

For example, a sample might consist of measurements from 50 steel beams produced in a factory to determine the overall quality. In this case, the population would be the quality of all steel beams produced by that factory.

In a survey on energy consumption, 1,000 households might be selected from a country’s total residential population to ask about their energy usage. The responses of these 1,000 households form the sample, while the entire population consists of all households whose energy consumption patterns the engineers aim to understand.

(Figure labels: Steel Beams — All Steel Beams Produced by the Factory; Households — All Households in the Country.)
20
The entire purpose of research is generally to make generalizations or
predictions about events that you cannot directly access.

A researcher may conduct an experiment with 20 students to understand how people store words in short-term memory. However, the purpose of the experiment is not to find out how these 20 students respond under experimental conditions. Rather, the aim is to learn something about human memory in general under these conditions.

In almost all research, the strategy is to examine a sample of data that is believed to be representative of the general population (the entire dataset).

21
Methods of Sampling
Typically, the ideal method for selecting a sample to study is called random selection (which generally means each person in the population has an equal chance of being chosen). The researcher starts with a complete list of the population and selects a random portion to study.

An example of random selection would be putting each name on a ping-pong ball, placing all the balls in a large container, shaking it, and having a blindfolded person select as many as needed. (In practice, most researchers use a list of random numbers generated by a computer.)

Haphazard Selection:

This is a type of sampling where no systematic method is followed in the selection of participants. It does not guarantee representation of the entire population.

For example, imagine conducting a survey about your statistics professor by only asking friends sitting closest to you. This survey would be influenced by where students sit (which might indirectly relate to how much they like the professor or the class). Therefore, asking students who sit near you will lead to opinions more like yours, rather than a truly random sample.

22
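A minimal sketch of computer-based random selection (the ping-pong-ball procedure done with random numbers, as mentioned above), assuming NumPy and a purely hypothetical population list:

```python
import numpy as np

# Hypothetical population list (stands in for the complete list of the population).
population = [f"person_{i}" for i in range(1, 501)]

rng = np.random.default_rng(42)
# Random selection: every member has an equal chance of being chosen,
# and no one is selected twice (replace=False).
sample = rng.choice(population, size=20, replace=False)
print(sample)
```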
Statistical Terminology for Sample and Population
The mean, variance, and standard deviation of a population are called population
parameters. A population parameter is generally unknown and can only be estimated
based on what you know from a sample taken from that population.

We don’t taste all the beans; we taste just a spoonful and say, “The beans are cooked,”
making an inference about the entire pot.

The mean, variance, and standard deviation calculated from scores obtained from a sample are
called sample statistics.

23
PROBABILITY
Probability is very important in science. In particular, probability underlies inferential statistics, the methods scientists use to move from the results of research studies to conclusions about theories or practical applications.

In statistics, we usually define probability as the expected relative frequency of a particular outcome. An outcome is the result of an experiment (or an event where the outcome is
unknown beforehand, like a coin landing on heads or whether it will rain tomorrow).
Frequency is the number of times something occurs. Relative frequency is the ratio of the
number of times something occurs to the number of trials. (If heads come up 8 times out of
12 coin tosses, the relative frequency is 8/12.) The expected relative frequency is what you
would expect to get in the long run if you repeated the experiment many times. (In the case
of a coin toss, you would expect heads to come up 1/2 of the time in the long run). This is
known as the long-term relative frequency interpretation of probability.

It can be helpful to think of probability as the likelihood of a particular outcome occurring.


If the probability is very low, the outcome is unlikely; if the probability is higher, the
outcome is more likely to occur.
24
We also use probability to express how confident we are that a certain event will occur. This is known as the
subjective interpretation of probability.
Imagine you estimate there’s a 95% chance that a particular machine in your lab will be operational tonight. You
might be using a form of relative frequency interpretation—perhaps you checked how often this machine was
operational on similar days in the past and found that it was operational 95% of the time on such days.
However, what you actually mean might be more subjective: on a scale from 0% to 100%, you’re expressing a 95%
confidence in the machine being operational tonight. In other words, you’d consider it a fair bet based on a 95%
chance that the machine will be running.
This interpretation doesn’t affect how probability is calculated, but it reflects how confident you are in the event
occurring.

Probability can be defined as the ratio of the occurrence or observation of the event of interest to all possible events after a trial.
25
Figuring Probabilities
Probabilities are generally figured as the ratio of successful outcomes to possible outcomes: calculated by dividing the number of successful outcomes by the total number of possible outcomes.

Consider the probability of getting heads when a coin is flipped. There is one successful outcome (heads) out of
two possible outcomes (heads or tails). This probability is 1/2 or 0.5.

For a die roll, the probability of rolling a 2 (or any specific face of the die) is 1/6 or approximately 0.17. This is
because there is only one successful outcome out of six possible outcomes.

The probability of rolling a number 3 or lower on a die is 3/6 or 0.5. Out of six possible outcomes, there are three
successful outcomes: 1, 2, or 3.

Now, let’s consider a slightly more complex example. Imagine a database containing 200 different algorithms, of
which 40 are optimized for real-time processing. If you were to randomly select an algorithm from this database,
the probability of selecting one that is optimized for real-time processing would be 40/200 or 0.20. This is
because there are 40 successful outcomes (selecting a real-time optimized algorithm) out of 200 possible

outcomes.

26
Calculating Probability
1. Determine the number of possible successful outcomes.
2. Determine the total number of possible outcomes.
3. Divide the number of possible successful outcomes by the total number of outcomes.
To convert a ratio to a percentage, multiply by 100. The percentage of 0.13 is: 13%.
To convert a percentage to a ratio, divide by 100. The ratio for 4% is: 0.04.

A ratio cannot be less than 0 or greater than 1. In percentage terms, it should be between 0% and 100%.
The probability of impossible events is 0. Events with a probability of 0 are called impossible events.
The probability of certain events is 1. Events with a probability of 1 are called certain events.
When an event has a low probability, such as 1% or 5%, it is called a low-probability event (but it is not
impossible).
1. When a die is rolled and the event of interest is getting a number of 3 or lower (1, 2, or 3), the number of possible successful outcomes is: 3.
2. The total number of possible outcomes (1, 2, 3, 4, 5, or 6) is: 6.
3. The ratio of possible successful outcomes to the total number of outcomes: 3/6 = 0.5.

Probability is represented by the letter p. For an event with a 50% chance, p = 0.5.

If p<0.05, it means the probability is less than 0.05.


27
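The three calculation steps amount to a single division; a tiny Python sketch (not part of the slides) applied to the die-roll and algorithm-database examples above:

```python
def probability(successful_outcomes, total_outcomes):
    # Step 3: divide the number of successful outcomes by the total number of outcomes.
    return successful_outcomes / total_outcomes

# Die roll: 3 successful outcomes (1, 2, or 3) out of 6 possible outcomes -> 0.5
print(probability(3, 6))
# Algorithm database: 40 real-time-optimized algorithms out of 200 -> 0.20
print(probability(40, 200))
# As a percentage: multiply by 100 -> 20.0
print(probability(40, 200) * 100)
```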
Probability, Z Score and Normal Curve
So far, we’ve generally discussed the probabilities of specific events occurring or not occurring.
It’s also possible to discuss the probabilities of events occurring within a certain range.
Examples include the probability of rolling a number on a die that is 3 or lower, or the
probability of selecting a component from a batch that has a lifespan between 30 and 40 hours.

If you think of probability as a ratio of scores, you’ll see that probability aligns directly with
frequency distributions. According to the distribution shown below, 3 out of 50 components
have a lifespan of 9 or 10 hours. If you randomly select a component from these 50, the
probability of selecting one with a lifespan of 9 or 10 hours is calculated by dividing the number
of successful outcomes (3 components) by the total number of outcomes (50 components).
Thus, p = 3/50 = 0.06.

28
The normal distribution can be thought of as a probability distribution. With a normal curve,
the percentage of scores between any two Z-scores is known.

The probability of selecting a value between any two Z-scores is the same as the percentage of
scores between those two Z-scores.

As you know, in a normal curve, approximately 34% of scores lie between the mean and one
standard deviation above the mean. This means that the probability of a score falling between
the mean and a Z-score of +1 is p = 0.34.

In the previous IQ test examples, let’s assume that 95% of scores in a normal curve fall between
a Z-score of +1.96 and -1.96. This represents a high probability. At the same time, in such a
distribution, the probability of selecting a score above +1.96 or below -1.96 is 0.05 (or 5%).
This is a very low probability and corresponds to the tails of the distribution graph.

(Figures: normal distribution percentages; IQ test example.)
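These probabilities can be read directly from the cumulative normal curve; a short SciPy sketch (not part of the slides) reproducing the 0.34, 0.95, and 0.05 figures:

```python
from scipy.stats import norm

# Probability of a score falling between the mean (Z = 0) and Z = +1: ~0.34
print(round(norm.cdf(1) - norm.cdf(0), 4))          # 0.3413

# Probability of a score falling between Z = -1.96 and Z = +1.96: ~0.95
print(round(norm.cdf(1.96) - norm.cdf(-1.96), 4))   # 0.95

# Probability of a score beyond +/-1.96 (both tails combined): ~0.05
print(round(2 * (1 - norm.cdf(1.96)), 4))           # 0.05
```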


Probability, Sample and Population
Probability also applies to samples and populations. Let’s use an example to illustrate the
role of probability in sample and population analysis.

Suppose a component in a sample has a durability score of 4. However, you don’t know if this
component came from Supplier A or Supplier B. Let’s say that the durability scores of
components from Supplier A typically follow a normal distribution with a mean of 10 and a
standard deviation of 3. How likely is it that your sample component came from Supplier A?
Based on your knowledge of the normal curve, you know that in a normal distribution with a
mean of 10 and a standard deviation of 3, there are very few scores as low as 4. This suggests
that the component may not have come from Supplier A's population.

But what if the sample component had a durability score of 9? In this case, it would be more
likely that the component came from Supplier A’s population since a score of 9 is within the
expected range for that population.

This type of reasoning introduces the concept of hypothesis testing.

30
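As a rough sketch of this reasoning in Python (SciPy assumed), using the Supplier A figures from the example (mean 10, SD 3):

```python
from scipy.stats import norm

mean_a, sd_a = 10, 3    # Supplier A durability scores (from the example)

def prob_at_most(score):
    """P(a Supplier A component scores this low or lower)."""
    z = (score - mean_a) / sd_a
    return norm.cdf(z)

# A durability score of 4 is Z = -2: only ~2.3% of Supplier A components
# score that low, so the component probably did not come from Supplier A.
print(round(prob_at_most(4), 3))   # 0.023
# A score of 9 (Z = -0.33) is common for Supplier A: ~37% score 9 or lower.
print(round(prob_at_most(9), 3))   # 0.369
```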
