
504.applied Statistics For Social Sciences 1

The document provides an overview of applied statistics for social sciences, covering key concepts such as the definition of statistics, types of data, and statistical methods including descriptive statistics, hypothesis testing, and regression analysis. It also introduces the use of STATA software for data entry and analysis, emphasizing the importance of understanding statistical principles and methodologies. Key statistical terms and measures, such as mean, variance, and correlation coefficients, are explained to aid in data interpretation and analysis.

Applied Statistics for Social Sciences I

Mr. Kabubi M. Marvin

Information and Communications University


Department of Social Research,
Applied Statistics for Social Sciences I Coordinator
Phone #: 0975158118
[email protected]
What is statistics?
 The word "statistics" is derived from the Latin word "status" (state). In everyday usage it refers to a group of numbers or figures that represent some information of human interest.
 In Fisher's view, statistics has three important functions:
(i) the study of statistical populations
(ii) the study of the variation within statistical populations
(iii) the study of methods for the reduction of data.
 Statistics is the discipline concerned with the collection, organization, analysis, interpretation and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied.
 People start to distrust statistics when:
(1) the data are not reliable
(2) spurious relationships between variables are computed
(3) results are generalized from a small sample to a population without accounting for the error involved.
 "Statistics may be defined as the collection, presentation, analysis, and interpretation of numerical data." (Croxton and Cowden)
Statistics Concepts
 Parameter — A descriptive measure of a population. It is a measurable characteristic of
the population
 Statistic — A descriptive measure of a sample. It is a measurable characteristic of a
sample
 Data are of two kinds: (i) primary data and (ii) secondary data.
 Primary data are based on a primary source of information, and secondary data are based on a secondary source of information.
 An aggregate of animate or inanimate objects under study is called a statistical population. For example, a large group of data on heights, weights, etc. is known as a statistical population.
Types of statistics
Descriptive statistics
 Descriptive statistics are methods of organizing, summarizing, and presenting data in a convenient and informative way. These methods include graphical and numerical techniques. The actual method used depends on what information we would like to extract.
 Are we interested in measures of central location, measures of variability (dispersion), or both?
Statistics Concepts
Concepts in statistics
a. Hypothesis: a testable proposition. It must be measurable, clearly stated, value free, specific, directional in nature, etc.
b. Level of significance: the maximum probability of committing a Type I error, denoted by $\alpha$. That is, $\alpha$ = P(committing a Type I error) = P(H0 is rejected when it is true). It is usually expressed as a percentage, e.g. 1%, 5% or 10%.
c. Type I error: committed if the null hypothesis (H0) is true but our test rejects it. The probability of making a Type I error is denoted by $\alpha$.
d. Type II error: committed if the null hypothesis (H0) is false but our test accepts (fails to reject) it. The probability of making a Type II error is denoted by $\beta$.
e. Power of the test
= P(H0 is rejected when it is false)
= 1 - P(H0 is accepted when it is false)
= 1 - P(committing a Type II error)
= 1 - $\beta$
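As a rough illustration of how the significance level, the Type II error probability and power fit together, the Python sketch below computes the power of an upper-tailed z-test. The numbers (hypothesized mean 100, true mean 105, standard deviation 15, sample size 36) are assumptions chosen for illustration and do not come from these slides, and the function name `power_one_sided_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(mu0, mu1, sigma, n, alpha):
    """Power of the upper-tailed z-test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()                 # standard normal: mean 0, sd 1
    z_alpha = std_normal.inv_cdf(1 - alpha)   # reject H0 when the test statistic exceeds z_alpha
    shift = (mu1 - mu0) / (sigma / sqrt(n))   # distance of the true mean from mu0, in standard errors
    beta = std_normal.cdf(z_alpha - shift)    # P(H0 is accepted when it is false) = Type II error
    return 1 - beta                           # power = 1 - beta


# Assumed, illustrative values: H0: mu = 100, true mean 105, sigma = 15, n = 36, alpha = 5%
power = power_one_sided_z(mu0=100, mu1=105, sigma=15, n=36, alpha=0.05)
print(f"power = {power:.3f}")
```

When the true mean equals the hypothesized mean, the same function returns the significance level itself, since rejecting H0 is then exactly a Type I error.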
Statistics Concepts
Concepts in statistics
f) Null hypothesis:
The hypothesis under verification is known as the null hypothesis and is denoted by H0. It is always set up for possible rejection under the assumption that it is true. For example, if we want to find out whether extra coaching has benefited the students or not, we set up the null hypothesis that "extra coaching has not benefited the students".
g) Alternative hypothesis:
The rival hypothesis, which is likely to be accepted if the null hypothesis H0 is rejected, is called the alternative hypothesis and is denoted by H1 or Ha. For example, if a psychologist wishes to test whether or not a certain class of people has a mean I.Q. of 100, the following null and alternative hypotheses can be established. The null hypothesis would be
$H_0: \mu = 100$
Then the alternative hypothesis could be any one of the statements
$H_1: \mu \neq 100$, $H_1: \mu > 100$ (or) $H_1: \mu < 100$
h) Sample: a fraction of a population randomly chosen for investigation.
i) Population: the totality of all the elements in a given area at a given time.
j) A parameter is a measurable value from the population, e.g. the mean age of the population, while a statistic is a measurable value from a sample, e.g. the mean age of the sample.
Statistics Concepts
 A variable is some characteristic of a population or sample. E.g. student grades.
 Typically denoted with a capital letter: “X, Y, Z…”
 The values of the variable are the range of possible values for a variable. E.g. student
marks (0..100)
 Data are the observed values of a variable.
E.g. student marks: {67, 74, 71, 83, 93, 55, 48}

 Interval data
Real numbers, e.g. heights, weights, prices, etc.
 Also referred to as quantitative or numerical.
 Arithmetic operations can be performed on interval data, so it is meaningful to talk about 2*Height, or Price + $1, and so on.
Statistics Concepts
 Nominal Data
 The values of nominal data are categories. E.g. responses to questions about marital status,
coded as:
Single = 1, Married = 2, Divorced = 3, Widowed = 4, Separated = 5
 Because the numbers are arbitrary, arithmetic operations don't make any sense (e.g. does Widowed ÷ 2 = Married?!)
 Nominal data are also called qualitative or categorical.

 Ordinal Data
 Ordinal Data appear to be categorical in nature, but their values have an order; a ranking to
them:
E.g. College course rating system:
poor = 1, fair = 2, good = 3, very good = 4, excellent = 5
 While it's still not meaningful to do arithmetic on these data (e.g. does 2*fair = very good?!), we can say things like:
excellent > poor or fair < very good
 That is, order is maintained no matter what numeric values are assigned to each category.
Scale of measurement of variables
Measurement of variables
Arithmetic mean
…is appropriate for describing measurement data, e.g. heights of people, marks of
student papers, etc.
…is seriously affected by extreme values called “outliers”. E.g. as soon as a billionaire
moves into a neighborhood, the average household income increases beyond what it was
previously!
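The effect of an outlier on the mean can be sketched in a few lines of Python. The incomes below are invented for illustration (they are not data from the slides). The median is printed alongside the mean only for comparison.

```python
from statistics import mean, median

# Hypothetical household incomes (in thousands) for a small neighborhood
incomes = [42, 48, 51, 55, 60]

# The same neighborhood after a billionaire (income of 1,000,000 thousand) moves in
incomes_with_outlier = incomes + [1_000_000]

print("before: mean =", mean(incomes), " median =", median(incomes))
print("after:  mean =", round(mean(incomes_with_outlier), 2),
      " median =", median(incomes_with_outlier))
```

A single extreme value pulls the mean far above every other household's income, while the middle of the data barely moves.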
Variance and standard deviation of the population vs. sample
Variance and standard deviation formulas
Statistical notation for the sample vs. population
Nominal data
Presentation of nominal data
Graphical methods that are used to summarize interval data
 There are several graphical methods that are used when the data are interval (i.e.
numeric, non-categorical).
 The most important of these graphical methods is the histogram.
 The histogram is not only a powerful graphical technique used to summarize interval
data, but it is also used to help explain probabilities.
Cumulative Relative Frequencies
Example of a Cumulative relative frequency
Ogive
Example on plotting an Ogive curve
Scatter Diagram
Example on scatter plot
 A real estate agent wanted to know to what extent the selling price of a home is
related to its size.
 Collect data.
 Determine the independent variable (X – house size) and the dependent variable (Y –
selling price)
 Use Excel to create a “scatter diagram”…
Linearity and Direction
Linearity and Direction are two concepts we are interested in
Numerical Descriptive Techniques
 Measures of Central Location: Mean, Median, Mode
 Measures of Variability: Range, Standard Deviation, Variance, Coefficient of Variation
 Measures of Relative Standing: Percentiles, Quartiles
 Measures of Linear Relationship: Covariance, Correlation, Least Squares Line
The range
 The range is the simplest measure of variability, calculated as:
 Range = Largest observation – Smallest observation
Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
 The range is the same in both cases, but the data sets have very different distributions.
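The two data sets above can be checked with a short Python sketch. The helper name `value_range` is hypothetical, and the population standard deviation is included only as one possible way of showing that the spreads differ even though the ranges agree.

```python
from statistics import pstdev

data_a = [4, 4, 4, 4, 50]
data_b = [4, 8, 15, 24, 39, 50]


def value_range(values):
    """Range = largest observation - smallest observation."""
    return max(values) - min(values)


for name, data in (("A", data_a), ("B", data_b)):
    print(name, "range =", value_range(data),
          " population sd =", round(pstdev(data), 2))
```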
Variance and standard deviation of the population vs. sample
Variance
Sampling error
 Sampling error refers to differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.
 Another way to look at this: the differences in results between different samples (of the same size) are due to sampling error.
 E.g. take two samples of size 10 from 1,000 households. If we happened to get the highest income levels in the first sample and the lowest income levels in the second, the difference between the two is due to sampling error.
 Nonsampling errors are more serious and are due to mistakes made in the acquisition of data or to the sample observations being selected improperly.
 Three types of nonsampling errors: errors in data acquisition, nonresponse errors, and selection bias.
 Note: increasing the sample size will not reduce nonsampling error.
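Sampling error can be illustrated with a small simulation in Python. The population below is generated artificially (uniform incomes between 10 and 100 thousand) purely as an assumption for the sketch. The fixed seed only makes the run reproducible.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: incomes (in thousands) of 1,000 households
population = [random.uniform(10, 100) for _ in range(1000)]
population_mean = mean(population)

# Two independent random samples of size 10 drawn from the same population
sample_1 = random.sample(population, 10)
sample_2 = random.sample(population, 10)

print(f"population mean = {population_mean:.2f}")
print(f"sample 1 mean   = {mean(sample_1):.2f}")
print(f"sample 2 mean   = {mean(sample_2):.2f}")
```

Both samples were drawn correctly, so any gap between a sample mean and the population mean here is sampling error rather than nonsampling error.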
The normal distribution curve
Calculating Normal Probabilities
Example on Calculating Normal Probabilities
Reading the z-table
• Other commonly used z values are $z_{0.05} = 1.645$ and $z_{0.01} = 2.33$
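These critical values can also be obtained without a printed z-table. The sketch below uses `NormalDist` from Python's standard library. The helper name `upper_tail_z` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution: mean 0, standard deviation 1


def upper_tail_z(alpha):
    """Return z such that P(Z > z) = alpha for a standard normal Z."""
    return std_normal.inv_cdf(1 - alpha)


for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: z = {upper_tail_z(alpha):.3f}")
```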
Scatter plot
The heights and weights of 10 students are recorded below. Construct a scatterplot for this data.
[Figure: scatterplot of Weight (kg) against Height (cm); labelled points include John (181, 61) and Adam (167, 52)]
Scatter plot
For one week the midday temperature and the number of hot drinks sold were recorded.
Construct a scatterplot for this data.
Correlation Coefficient
 It is possible to quantify the correlation between variables. This is done by calculating a correlation coefficient.
 A correlation coefficient measures the strength of the linear relationship between variables.
 Correlation coefficients can range from –1 to +1.
 A value of –1 represents a perfect negative correlation and a value of +1 represents a perfect positive correlation.
 If a data set has a correlation coefficient of zero, there is no linear correlation between the variables.
 The defining formula uses the sums of deviations from the means of both the X values and the Y values.
 However, for ease of calculation, an equivalent computational formula is often used.
 If one variate changes in sympathy with another variate, there is said to be some association between the two variates. The degree of association (the extent of the relationship) is known as the "coefficient of correlation".
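The slides' own formulas are not reproduced in this text, so the Python sketch below assumes the standard Pearson definitions: a deviation form, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² Σ(y - ȳ)²), and the usual computational form based on raw sums. The height and weight values are hypothetical, as are the function names.

```python
from math import sqrt

# Hypothetical heights (cm) and weights (kg) of five students
heights = [150, 160, 170, 180, 190]
weights = [48, 55, 60, 67, 70]


def pearson_deviation(x, y):
    """Pearson's r from sums of deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def pearson_computational(x, y):
    """Pearson's r from raw sums (algebraically equivalent to the deviation form)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_dev = pearson_deviation(heights, weights)
r_comp = pearson_computational(heights, weights)
print(f"deviation form: r = {r_dev:.4f}")
print(f"computational form: r = {r_comp:.4f}")
```

Both functions should agree up to floating-point rounding, which is a quick check that a hand calculation with either formula has been set up correctly.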
Pearson's Correlation Coefficient
Calculating Pearson's correlation coefficient
Calculate the correlation coefficient for the height and weight data below
Example 2 on calculating Pearson's correlation coefficient
Linear Regression
 The relationship between two sets of data can be represented by a linear equation
called a regression equation.
 The regression equation gives the variation of the dependent variable for a given
change in the independent variable.
 It is extremely important to correctly determine which variable is dependent.
 The regression equation can be used to construct the regression line (line of best fit) on
the associated scatterplot.
 Because the equation is for a line, the regression equation takes the general linear form $y = mx + c$.
 Usually, however, for a regression equation this is written as $y = \alpha + \beta x$, where $\alpha$ is the y-intercept and $\beta$ is the slope of the line:
$$\beta = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad \alpha = \bar{Y} - \beta\bar{X}$$
Regression Equation Computation
 Example of Regression Equation Computation
Interpretation of regression model
Interpretation of the regression ANOVA table
Hypothesis testing
Example on Hypothesis testing
Example 2 on Hypothesis testing
Introduction to STATA & its Application in CBMS

This section presents information on how to use STATA for CBMS data entry, data cleaning, data analysis and interpretation of statistical results.
Enter your data in Ms. Excel, then transfer it into STATA for further processing.
 Stata is your statistical buddy!
 If you put in a bit of effort to learn the basics, you should find the program quite easy and very helpful.
 Stata can be very intimidating the first time around.
 Stay patient, and be attentive!
STATA windows and interface
Data entry in Ms. Excel
Example of data entry in Ms. Excel: the first row holds the variable names or question IDs; data are then entered from left to right under the corresponding columns.
Frequently Asked Questions about STATA
STATA basics: questions that are often asked about using Stata
 Question: How can I enter data from Excel into Stata?
 Answer: First enter your data in Ms. Excel as shown in the previous slide, then:
1. Open your Excel data file
2. Copy all the data
3. Open Stata
4. Type the command edit in the Command window and press the Enter key
5. The Stata Data Editor opens
6. Paste your data into the first cell
7. Select "variable names" so that the first row is used as the variable names
 Your data is loaded!
Entering a small set of data
For small data sets, use the following procedure
Entering data using STATA Data Editor
STATA Commands
Question: How can I use data that is already installed in Stata for practice?
Answer:
• Open the Stata program, then use cd to show (or change) the working directory and list the data sets that are already installed in Stata
• cd
• sysuse dir, all
Redirect Stata to the data sets for practice (to be used alongside the prescribed books, namely Statistics with Stata, both the version 9 and version 12 editions)
• cd C:/ProgramData/data
• sysuse dir, all
• sysuse states.dta
Use this procedure to load any data set from the data files that have been given for your practice, and learn more from the prescribed books.
To remove the current data set from memory before loading another one, type
• clear
STATA Commands and their functions
Example of data entry directly into STATA using the input command
input str16 Manufacturer str12 Model type Price CityMPG HighwayMPG EngineSize Horsepower FuelTank Passengers Weight str7 Origin
After entering the data, label your variables to give them full descriptions
label var Manufacturer "Manufacturer of car"
label var Model "Model of car"
label var type "Type of car"
label var Price "Price of car (thousands of dollars)"
label var CityMPG "City miles per gallon"
label var HighwayMPG "Highway miles per gallon"
label var EngineSize "Engine size of car"
label var Horsepower "Horsepower of car"
label var FuelTank "Fuel tank capacity of car"
label var Passengers "Passenger capacity of car"
label var Weight "Weight of car"
label var Origin "Origin of car"
Example of data entry in STATA
input str16 Manufacturer str12 Model type Price CityMPG HighwayMPG EngineSize Horsepower FuelTank Passengers Weight str7 Origin
"Mazda" "RX-7" 3 32.5 17 25 1.3 255 20 2 2895 "non-US"
"Chevrolet" "Corvette" 3 38 17 25 5.7 300 20 2 3380 "US"
"Hyundai" "Scoupe" 3 10 26 34 1.5 92 11.9 4 2285 "non-US"
"Honda" "Prelude" 3 19.8 24 31 2.3 160 15.9 4 2865 "non-US"
"Honda" "Accord" 2 17.5 24 31 2.2 140 17 4 3040 "non-US"
"Honda" "Civic" 1 12.1 42 46 1.5 102 11.9 4 2350 "non-US"
"Geo" "Storm" 3 12.5 30 36 1.6 90 12.4 4 2475 "non-US"
"Ford" "Festiva" 1 7.4 31 33 1.3 63 10 4 1845 "US"
"Dodge" "Stealth" 3 25.8 18 24 3 300 19.8 4 3805 "US"
"Ford" "Mustang" 3 15.9 22 29 2.3 105 15.4 4 2850 "US"
"Geo" "Metro" 1 8.4 46 50 1 55 10.6 4 1695 "non-US"
………
........
end
STATA Commands and their usage
* Compute percentiles for Horsepower
summarize Horsepower, detail
* Stem-and-leaf plots of Price and EngineSize
stem Price
stem EngineSize
* Checking the relationship between variables (bivariate descriptives)
correlate EngineSize Price CityMPG HighwayMPG Horsepower FuelTank Passengers Weight
* Correlation of variables, adding p-values
pwcorr EngineSize Price CityMPG HighwayMPG Horsepower FuelTank Passengers Weight, star(0.05) sig
* Applying some logic to your data analysis
summarize Price EngineSize
list Origin Manufacturer type if Price >= 35
summarize Price if Price >= 30
list Origin Manufacturer Price if EngineSize >= 5
summarize Price if EngineSize >= 5
STATA Commands and their usage
* Most expensive car
list Origin Manufacturer type EngineSize CityMPG Horsepower FuelTank if Price >= 47.9
* Cheapest car
list Origin Manufacturer type EngineSize CityMPG Horsepower FuelTank if Price <= 8
summarize if Price >= 20
summarize Price if CityMPG > 1000 & CityMPG < 20000
tab EngineSize Origin if Price > 20
summarize CityMPG HighwayMPG FuelTank Weight EngineSize if Price < 20
* Other commands
table type Origin
spearman Price Price, stats(rho p)
display sqrt(5*(11-3^2))
encode Origin, gen(numvar)
total Price EngineSize
table Origin type, contents(freq) by(numvar)
STATA Commands and their usage
* Sorting data (sort on any variable)
sort Price in 5/17
list Price EngineSize CityMPG HighwayMPG Horsepower FuelTank Passengers Weight in 5/17
* t-tests and statistical hypotheses: now that you know how to run preliminary descriptive statistics on your data, the next step is to run statistical tests to assess whether the data support your hypotheses.
* This section describes the procedures in Stata that test the equality of means of a continuous variable across two or more groups. The remaining sections of this tutorial dive into more complicated statistical tests.
* First, we show an example of a one-sample t-test. Below, we test that the mean price of domestic cars is $15,000 (Price is recorded in thousands of dollars, so the test value is 15). Note that we can add "if" conditions to the ttest command; without the condition, we would be testing the price of all cars in the dataset.
* A t-test is a useful technique for comparing the mean of a group against some hypothesized mean (one-sample) or the means of two separate groups against each other (two-sample). The test provides a statistic that can be used to determine whether the difference between the two means is statistically significant.
ttest Price == 15 if numvar==1
STATA Commands and their usage

* The p-value is less than 0.05. Note that Stata also gives a 95% confidence interval for the mean price of US-made cars by default; since the interval does not include the hypothesized value, it also tells us that we can reject the null hypothesis.
* The null hypothesis was that the mean price equals 15, but this value lies outside the confidence interval, hence its rejection. The sample mean is 18.573, which is significantly different from our hypothesized mean of 15 thousand dollars (p-value = 0.003).
* The sign test is a nonparametric alternative that tests whether the median price equals 15
signtest Price = 15
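For readers who want to see what the one-sample t-test computes, the Python sketch below evaluates the t statistic by hand, t = (sample mean - hypothesized mean) / (s / sqrt(n)), using the prices of the 11 cars shown in the data entry example. This is only an illustration under stated assumptions: the slides elide the remaining rows and the Stata test above is restricted to US-made cars, so these numbers are not expected to match the Stata output. The function name `one_sample_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Prices (thousands of dollars) of the 11 cars listed in the data entry example.
# The full data set has more rows, so this is not the sample Stata analysed.
prices = [32.5, 38, 10, 19.8, 17.5, 12.1, 12.5, 7.4, 25.8, 15.9, 8.4]


def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)  # stdev uses the n - 1 divisor
    t_stat = (mean(values) - mu0) / standard_error
    return t_stat, n - 1


t_stat, df = one_sample_t(prices, mu0=15)
print(f"sample mean = {mean(prices):.3f}, t = {t_stat:.3f}, df = {df}")
```

The p-value then comes from the t distribution with n - 1 degrees of freedom, which Stata looks up automatically.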
* When conducting a two-sample t-test, you must first test the assumption that the variances of the two groups being compared are equal. If you have more than two groups that you want to compare, you must use an ANOVA (see next section) and also test that the variances are equal across all groups.
sdtest CityMPG, by(Origin)
STATA Commands and their usage
* Since the two-tailed p-value is less than 0.05, we reject the null hypothesis, which in this case is that the variances are equal.
* Therefore, we must include the unequal option at the end of our ttest command, which adjusts the degrees of freedom used in the analysis (Satterthwaite's approximation) to correct for unequal variances.
ttest CityMPG, by(Origin) unequal
* From the p-value at the bottom center of the output, we see that there is a significant difference between the city miles-per-gallon of domestic and foreign cars. We can also see that the 95% confidence interval of the difference of the means does not contain zero.
* Note that the top of this output reads "with unequal variances", where it would say "with equal variances" if we did not include the unequal option in our command. This is a good check if you forget to test for equality of variances prior to running your t-test.
STATA Commands and their usage
ttest Price, by(numvar)
ttest Price, by(numvar) unequal
ranksum Price, by(numvar)
* ANOVA: the oneway command fits a one-way analysis of variance; the anova command also handles models with several factors and their interactions
oneway Weight type
* The p-value for the ANOVA is <0.0001, meaning that there is a difference in weight among the different types of vehicles. In other words, we can reject the null hypothesis that all types of vehicles have equal mean weights.
* In order to get the marginal means, you must run the anova command. After running anova Weight type, you can use the margins command to get the marginal means of weight for each type of vehicle.
anova Weight type##numvar
* From the above output, we can see that the origin of the car is not significant, and neither is the interaction between origin and type. However, type is significant (p-value<0.0001), as is the overall model, which can be found on the top line of the ANOVA table.
STATA Commands and their usage
* Linear regression commands: regress dependent_variable independent_variable(s)
* Let us model the linear relationship between the engine size of the vehicles and their city miles-per-gallon
regress CityMPG EngineSize
* We can interpret these results to say that a vehicle's engine size is significantly related to its city miles-per-gallon. For each additional unit of engine size, the vehicle's city miles-per-gallon decreases by roughly 3.84 units.
regress CityMPG EngineSize, robust
regress Price EngineSize, level(95)
regress Price EngineSize, beta
predict SE, stdp
regress Price CityMPG HighwayMPG EngineSize Horsepower FuelTank Passengers Weight
regress Price CityMPG HighwayMPG EngineSize Horsepower FuelTank Passengers Weight, robust
regress Price Horsepower
* Note: there is an improvement in the model, holding other variables constant
STATA Commands and their usage
* Graphical display of variables
graph twoway lfitci Price EngineSize || scatter Price EngineSize
graph twoway scatter Price EngineSize
graph twoway scatter HighwayMPG Horsepower
graph twoway scatter Price EngineSize [aweight = Price], msymbol(oh)
graph twoway scatter Price EngineSize if Origin=="US", msymbol(sh)
graph twoway scatter Price EngineSize, msymbol(oh)
graph box Price, over(numvar) yline(19.1)
twoway bar Price type
graph bar Price, over(Origin)
STATA Commands and their usage
* Plots
scatter Price EngineSize
scatter Price CityMPG HighwayMPG
plot Price CityMPG HighwayMPG
qnorm Price, grid
symplot Price
qqplot Price EngineSize
* Charts (can be used for both pie and bar charts)
graph pie Price, over(type)
graph twoway scatter Price EngineSize [aweight = Price], msymbol(oh)
* Overall behaviour of the variables
graph matrix Price EngineSize CityMPG HighwayMPG Horsepower FuelTank Passengers Weight, half msymbol(oh)