
Sampling Bias

Dr. K. Prabhakar
Bias
• Once we collect the data, we represent it with a model. Let us assume a
linear model.
• This may be written as y (outcome) = a1x1 + a2x2 + a3x3 + … + anxn + error.
• The outcome is expressed as a set of predictor variables, each multiplied
by a coefficient. The coefficients (the a's in the equation) are the
parameters, and they tell us about the relationship between each predictor
and the outcome variable.
• The prediction will not be perfect: there will be an error, because we are
using sample data to predict the outcome variable.
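As a minimal sketch of this idea (assuming Python with NumPy, which the slides do not specify, and hypothetical data), we can fit such a linear model by ordinary least squares and inspect the estimated coefficients and the residual error:

```python
import numpy as np

# Hypothetical sample data: 100 observations, 3 predictors (x1, x2, x3).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y = X @ true_coefs + rng.normal(scale=1.0, size=100)  # outcome = a1*x1 + a2*x2 + a3*x3 + error

# Estimate the coefficients (the a's in the slide's equation) by least squares.
X_design = np.column_stack([np.ones(len(y)), X])      # add an intercept column
coefs, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ coefs                      # the "error" term: what the model misses
print("estimated coefficients:", coefs[1:])
print("residual standard deviation:", residuals.std(ddof=X_design.shape[1]))
```

The estimated coefficients will be close to, but not exactly equal to, the true ones, which is precisely the imperfection the slide attributes to working with sample data.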
The contexts for bias
• Things that bias the parameter estimates
• Things that bias standard errors and confidence intervals
• Things that bias test statistics and p-values
• These biases are related: if the test statistics are biased, the
confidence intervals will be biased, and a bias in the confidence intervals
will bias the test statistics.
• If the test statistics are biased, the results will be biased, so we need
to identify and eliminate the biases as much as possible.
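To make the connection between these quantities concrete, here is a small numeric sketch (hypothetical numbers, not from the slides; assuming Python with SciPy available) showing how a biased standard error propagates into both the confidence interval and the p-value for a single coefficient:

```python
from scipy import stats

estimate = 1.8          # hypothetical coefficient estimate
true_se = 0.6           # standard error if the assumptions hold
biased_se = 1.2         # standard error inflated by, e.g., outliers

for label, se in [("unbiased SE", true_se), ("biased SE", biased_se)]:
    z = estimate / se                                  # test statistic
    p = 2 * (1 - stats.norm.cdf(abs(z)))               # two-sided p-value
    ci = (estimate - 1.96 * se, estimate + 1.96 * se)  # 95% confidence interval
    print(f"{label}: z = {z:.2f}, p = {p:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

With the inflated standard error the interval widens and the p-value rises above conventional thresholds, illustrating how a bias in one quantity carries over to the others.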
Assumptions that lead to bias
1. Presence of outliers
2. Additivity and linearity
3. Normality
4. Homoscedasticity or homogeneity of variance
5. Independence
Outliers
• The presence of outliers will bias the data.
• For example, if the class average mark is 60 with a standard deviation of
10, then marks of zero or 100 scored by a few students may bias the data.
• Outliers need to be identified and removed or replaced to obtain a better
representation of the data. An outlier generally affects the mean of the
data as well as the sum of squared errors. The sum of squares is used to
compute the standard deviation, which in turn is used to estimate the
standard error. The standard error is used for confidence intervals around
the parameter estimates. Thus an outlier has a domino effect on the results.
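The domino effect described above can be shown with a short sketch (hypothetical marks, assuming Python with NumPy): one zero mark shifts the mean, inflates the standard deviation and standard error, and widens the confidence interval.

```python
import numpy as np

marks = np.array([55, 58, 60, 62, 63, 65, 67, 70])   # hypothetical class marks near the average
marks_with_outlier = np.append(marks, 0)              # one student scores zero

for label, data in [("without outlier", marks), ("with outlier", marks_with_outlier)]:
    mean = data.mean()
    sd = data.std(ddof=1)                             # built from the sum of squared deviations
    se = sd / np.sqrt(len(data))                      # standard error of the mean
    ci = (mean - 1.96 * se, mean + 1.96 * se)         # confidence interval around the mean
    print(f"{label}: mean = {mean:.1f}, SD = {sd:.1f}, SE = {se:.1f}, "
          f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```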
Outliers
(Figure: data points labelled OUTLIERS shown relative to the MEAN)
Do Outliers Always Lie?
1. Detecting Errors & Data Quality
- Extreme values might indicate errors or anomalies in data collection,
measurement, or input. For example, if an online store records an item as
sold for $1,000,000, it is likely a mistake.
- Identifying and addressing outliers ensures data accuracy and better
predictions.
2. Understanding Rare but Critical Events
- In finance, extreme values (stock market crashes or spikes) affect
investment strategies.
- In medicine, an unusually high fever or blood pressure reading could
indicate serious health risks.
3. Risk Management & Decision Making
- Engineers study extreme weather conditions when designing bridges and
buildings to ensure safety.
- Businesses analyze extreme sales fluctuations to prepare for economic
downturns or sudden demand spikes.
Do Outliers Always Lie?
4. Improving Statistical Models
- Some outliers provide valuable insights rather than noise, helping
researchers discover new trends.
- Instead of removing them, statisticians develop specialized models that
account for extreme values, leading to better forecasts.
5. Security & Fraud Detection
- Banks use outliers in transaction data to detect fraudulent activities
(e.g., an unusually large withdrawal).
- Cybersecurity teams monitor extreme deviations in network traffic to
identify potential cyber attacks.
• In short, extreme values can signal problems, reveal opportunities, or
improve the way we analyze and predict real-world events.
Final Notes on Outliers
Every outlier is an extreme point, but not every extreme point is an outlier.

1. Every Outlier is an Extreme Point
Imagine test scores in a class: 60, 65, 70, 72, 75, 80, 85, 90, 95, 150
• The score 150 is much higher than all other values; it is clearly an
outlier because it deviates significantly from the rest.
• Since 150 is far from the majority, it is also an extreme point.

2. Not Every Extreme Point is an Outlier
Consider another set of test scores: 60, 65, 70, 72, 75, 80, 85, 90, 95, 100
• The score 100 is at the high end, making it an extreme point.
• However, it is not an outlier, because it is still within a reasonable
range relative to the other scores.
• It does not deviate as drastically as 150 did in the previous example.
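One conventional way to formalise this distinction is Tukey's IQR rule (flagging values more than 1.5 interquartile ranges beyond the quartiles). The rule and the 1.5 multiplier are a common convention, not taken from the slides; the sketch below (Python with NumPy) flags 150 in the first set but nothing in the second.

```python
import numpy as np

def flag_outliers(scores, k=1.5):
    """Flag values beyond Tukey's fences: more than k*IQR outside the quartiles."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    return scores[(scores < q1 - k * iqr) | (scores > q3 + k * iqr)]

print(flag_outliers([60, 65, 70, 72, 75, 80, 85, 90, 95, 150]))  # flags 150
print(flag_outliers([60, 65, 70, 72, 75, 80, 85, 90, 95, 100]))  # flags nothing (empty array)
```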
Additivity and Linearity
• The assumption is that the outcome variable is linearly related to each
predictor. That means the relationship may be summed up as a straight line.
• If there are several predictors, as we have seen in the equation
y (outcome) = a1x1 + a2x2 + a3x3 + … + anxn + error,
their combined effect is described by adding their individual effects
together. The model can then be described accurately by this equation, as
the sketch below illustrates.
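A tiny sketch of what additivity means in practice (hypothetical coefficients and predictor values, assuming Python with NumPy): the model's prediction for one observation is simply the sum of each predictor's individual contribution.

```python
import numpy as np

coefs = np.array([2.0, -1.0, 0.5])   # hypothetical coefficients a1, a2, a3
x = np.array([1.5, 3.0, 4.0])        # one observation's predictor values x1, x2, x3

contributions = coefs * x            # effect of each predictor on its own
prediction = contributions.sum()     # additivity: combined effect = sum of individual effects

print("individual contributions:", contributions)  # [ 3. -3.  2.]
print("predicted outcome:", prediction)            # 2.0
```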
Assumption of Normality
• There is a mistaken belief that the assumption of normality means the data
need to be normally distributed. This misconception stems from the fact that
if the data are normally distributed, then the errors in the model as well
as the sampling distribution are also normally distributed.
• The central limit theorem means that there are many situations in which we
can assume normality regardless of the shape of the sample data (see the
simulation sketch below).
• Normality matters in small samples, when you construct confidence
intervals around the parameters of the model or compute significance tests
relating to those parameters.
• As long as the sample size is fairly large and outliers are taken into
account, the assumption of normality will not be a pressing concern.
• Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of
the normality assumption in large public health data sets. Annual Review of
Public Health, 23(1), 151-169.
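The central limit theorem argument can be illustrated with a short simulation (assuming Python with NumPy; the exponential population is an arbitrary skewed example, not from the slides): even though the raw data are far from normal, the sampling distribution of the mean becomes increasingly symmetric, and hence closer to normal, as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (5, 30, 200):
    # Draw 10,000 samples of size n from a heavily skewed (exponential) population
    # and record each sample's mean.
    sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness of the sampling distribution of the mean; near 0 means roughly symmetric.
    centered = sample_means - sample_means.mean()
    skew = (centered**3).mean() / sample_means.std()**3
    print(f"n = {n:3d}: skewness of sample means = {skew:.2f}")
```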
Homoscedasticity or homogeneity of variance