Advanced Statistical Methods Project: Data Analysis Using SPSS
PROJECT
DATA ANALYSIS USING SPSS
Submitted to-
Ms. Shailaja Rego
By-
Mayank Bhatia
A013
MBA Banking 2nd Year
NMIMS Mumbai
On-
31st August 2012
TECHNIQUES USED
The data set used has information and statistics on 109 countries of the world.
The input file contains the following variables:
1. Name of Country
2. Population in thousands
3. Number of people / sq. kilometer
4. People living in cities (%)
5. Predominant religion
6. Average female life expectancy
7. Average male life expectancy
8. People who read (%)
9. Population increase (% per year)
10. Infant mortality (deaths per 1000 live births)
11. Gross domestic product / capita
12. Region or economic group
13. Daily calorie intake
14. AIDS cases
15. Birth rate per 1000 people
16. Death rate per 1000 people
17. Number of AIDS cases / 100000 people
18. Log (base 10) of GDP_CAP
19. Log (base 10) of AIDS_RT
20. Birth to death ratio
21. Fertility: average number of kids
22. Log (base 10) of Population
23. Males who read (%)
24. Females who read (%)
25. Predominant climate
The following is the variable view of the data:
OBJECTIVE
The objective is to examine the association between several variables and
female life expectancy. The same variables are used later in Regression.
METHODOLOGY
Bivariate Correlation is used. The first variable is Average Female Life
Expectancy, which is also used as the outcome variable in the Multiple
Regression. The other variables used are:
1. People who read (% literacy)
2. Gross Domestic Product per capita
3. Daily Calorie Intake
4. Birth Rate per 1000 people
There are 5 variables in total. The Pearson Correlation Coefficient and a
two-tailed test of significance are used.
The output is as follows:
Correlations
                                        Average      People    Gross domestic  Daily    Birth rate
                                        female life  who read  product /       calorie  per 1000
                                        expectancy   (%)       capita          intake   people
Average female    Pearson Correlation   1            .865**    .642**          .775**   -.862**
life expectancy   Sig. (2-tailed)                    .000      .000            .000     .000
                  N                     109          107       109             75       109
People who read   Pearson Correlation   .865**       1         .552**          .682**   -.869**
(%)               Sig. (2-tailed)       .000                   .000            .000     .000
                  N                     107          107       107             74       107
Gross domestic    Pearson Correlation   .642**       .552**    1               .751**   -.651**
product / capita  Sig. (2-tailed)       .000         .000                      .000     .000
                  N                     109          107       109             75       109
Daily calorie     Pearson Correlation   .775**       .682**    .751**          1        -.762**
intake            Sig. (2-tailed)       .000         .000      .000                     .000
                  N                     75           74        75              75       75
Birth rate per    Pearson Correlation   -.862**      -.869**   -.651**         -.762**  1
1000 people       Sig. (2-tailed)       .000         .000      .000            .000
                  N                     109          107       109             75       109
**. Correlation is significant at the 0.01 level (2-tailed).
The topmost row and the leftmost column contain the names of the variables.
The value of the correlation between any two variables ranges from -1 to +1.
The diagonal elements are 1 because each variable is perfectly correlated with
itself. A value of 1 is referred to as perfect positive correlation. This
means that every point falls exactly on the regression line.
The positive or negative sign shows whether the relationship is a direct
relationship or an inverse relationship.
Note that the matrix above is symmetrical about the diagonal. This shows that
the order of the variables in a correlation doesn’t matter. So, the correlation
between Daily calorie intake and GDP/capita is the same as the correlation
between GDP/capita and Daily calorie intake.
Each of the cells in the above matrix has 3 values. Let's take them one by
one.
Considering the cell for Average female life expectancy and People who read (%):
The first value is the Pearson coefficient, that is, the Pearson product-moment
correlation, also referred to as r. The value is 0.865, which is a
very high positive value. This implies that countries with a high literacy rate
tend to have a longer average female life expectancy. The second value is the
significance value, also known as the p value. Generally, if this is less than
0.05, the correlation is considered statistically significant, or reliably
different from 0. The last number is the N value, which shows the number of
countries that have data for both variables. In this case there are 107
countries with data for both average female life expectancy and literacy
level.
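The three values in each cell can be sketched in a few lines of Python. This is a minimal illustration, assuming missing values are coded as None and that each pair of variables uses only the cases with data on both (which is why N differs from cell to cell). The function names are hypothetical and this is not SPSS's own implementation. The p value itself needs a t-distribution routine, so only the test statistic is computed here.

```python
import math

def pairwise_pearson(x, y):
    """Return (r, n) using only the cases where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least 3 complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy), n

def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against 0."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Tiny made-up example (illustrative values only, not the country data).
r, n = pairwise_pearson([1, 2, 3, 4, None], [1, 3, 2, 4, 9])
print(round(r, 3), n)

# t statistic implied by the reported r = .865 with N = 107.
print(round(t_statistic(0.865, 107), 2))
```

A t statistic this large on 105 degrees of freedom is consistent with the significance value of .000 shown in the table.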
Note that all of the associations above are statistically significant, since
the 2-tailed significance value is less than 0.05 for each of the cells.
The two asterisks next to a Pearson coefficient mean that the correlation is
significant at the 0.01 level. A single asterisk would denote significance at
the 0.05 level, which is the conventional level of statistical significance.
Note that all the variables are strongly correlated, not only in terms of
significance but also in terms of absolute value. The association between
GDP/capita and literacy level has the smallest absolute value, 0.552,
which is still a large association.
The negative correlation between the birth rate per 1000 and the other
variables signifies that the higher the number of births per 1000 population,
the lower the average female life expectancy, the literacy level, the
GDP/capita and the daily calorie intake. This may be because most of the
countries in the data are developing countries, where resources may be
lacking.
MULTIPLE REGRESSION
Multiple Regression looks at the relationship between the outcome and the
predictor variables taken collectively. It is used to predict the values of a
quantitative outcome variable from several predictor variables, which can be
quantitative or categorical.
Average female life expectancy is used as the outcome. Birth rate per 1000,
Daily calorie intake, Literacy level and Gross Domestic Product per capita
are taken as the predictor variables of female life
expectancy.
OBJECTIVE-
In this case, Multiple Regression is used to examine how all the predictor
variables together predict the life expectancy of women.
METHODOLOGY-
Descriptive Statistics
The above table shows the overall distribution of the variables. There are a
total of 74 cases with data on all the variables.
Model Summary
a. Predictors: (Constant), Gross domestic product / capita, People who read (%), Daily
calorie intake, Birth rate per 1000 people
The Durbin-Watson statistic for this model is 2.071, which is within the
acceptable range of 1.5 to 2.5.
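For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below assumes the residuals are available as a plain list in case order. The function name and the sample residuals are purely illustrative and are not taken from the SPSS output.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in case order."""
    if len(residuals) < 2:
        raise ValueError("need at least 2 residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator

def within_rule_of_thumb(dw, low=1.5, high=2.5):
    """Check the 1.5 to 2.5 range used in this report."""
    return low <= dw <= high

# Illustrative residuals only (hypothetical values).
example = [0.5, -0.3, 0.1, -0.4, 0.6, -0.2]
print(round(durbin_watson(example), 3))
print(within_rule_of_thumb(2.071))  # value reported for this model
```

The statistic lies between 0 and 4, and values near 2 indicate little first-order autocorrelation. Because it depends on the order of the cases, it is most informative when that order is meaningful.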
ANOVA
Total 9567.459 73
a. Predictors: (Constant), Gross domestic product / capita, People who read (%), Daily calorie
intake, Birth rate per 1000 people
The ANOVA table for the regression analysis indicates whether the model as a
whole is significant. The model is significant if the value in the ‘Sig’
column of the above table is less than the level of significance (generally
taken as 5% or 1%). Since the reported value of .000 is less than 0.01, the
model is significant: taken together, the predictors explain a reliable share
of the variation in female life expectancy.
Variables Entered/Removed
Model   Variables Entered                            Variables Removed   Method
1       Birth rate per 1000 people, Gross domestic   .                   Enter
        product / capita, Daily calorie intake,
        People who read (%)(a)
a. All requested variables entered.
This table shows the different predictor variables which have been taken.
Coefficients
                                 Unstandardized          Standardized
                                 Coefficients            Coefficients                   Correlations
Model                            B          Std. Error   Beta          t       Sig.    Zero-order  Partial  Part
People who read (%)              .226       .050         .457          4.527   .000    .869        .479     .223
Birth rate per 1000 people       -.256      .110         -.277         -2.324  .023    -.864       -.269    -.115
Daily calorie intake             .006       .002         .271          3.190   .002    .776        .358     .157
Gross domestic product / capita  -3.589E-5  .000         -.023         -.273   .786    .676        -.033    -.013
In the coefficients table, the value of B for the Constant indicates that, with
all the predictor variables set to 0, the average female life expectancy would
be 43.778 years. The B values for the other predictor variables can be
interpreted as follows, holding the remaining predictors constant in each case:
For every percentage point increase in People who read, the average female life
expectancy increases by 0.226 years.
For each additional calorie of Daily calorie intake, the average female life
expectancy increases by 0.006 years. The value is small because daily intake
runs to thousands of calories.
For each unit increase in Birth rate per 1000, the average female life
expectancy decreases by 0.256 years.
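The interpretation above amounts to a linear prediction equation. The sketch below plugs the reported constant and B values into that equation. The function name and the example country are hypothetical, and because the coefficients are rounded as printed by SPSS, the result is only approximate.

```python
# Unstandardized coefficients (B) as reported in the Coefficients table.
CONSTANT = 43.778
B_LITERACY = 0.226       # People who read (%)
B_BIRTH_RATE = -0.256    # Birth rate per 1000 people
B_CALORIES = 0.006       # Daily calorie intake
B_GDP = -3.589e-5        # Gross domestic product / capita (not significant)

def predict_female_life_expectancy(literacy_pct, birth_rate, calories, gdp_per_capita):
    """Predicted average female life expectancy in years (Enter model)."""
    return (
        CONSTANT
        + B_LITERACY * literacy_pct
        + B_BIRTH_RATE * birth_rate
        + B_CALORIES * calories
        + B_GDP * gdp_per_capita
    )

# Hypothetical country: 90% literacy, 20 births per 1000,
# 3000 calories per day, GDP per capita of 10000.
print(round(predict_female_life_expectancy(90, 20, 3000, 10000), 2))

# Effect of one extra percentage point of literacy, other predictors fixed.
delta = (predict_female_life_expectancy(91, 20, 3000, 10000)
         - predict_female_life_expectancy(90, 20, 3000, 10000))
print(round(delta, 3))
```

The second printed value reproduces the B for People who read, which is exactly what "holding the other predictors constant" means.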
The significance value for each predictor variable shows the probability
level of that predictor. These generally need to be less than 0.05 for the
predictor to be considered reliable, significant or meaningful. All are
reliable except for GDP, whose value is 0.786.
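Below is the bivariate correlation between Average Female Life Expectancy
and the predictor variables, taken from the correlation analysis above:
                     Average      People    Gross domestic  Daily    Birth rate
                     female life  who read  product /       calorie  per 1000
                     expectancy   (%)       capita          intake   people
Pearson Correlation  1            .865**    .642**          .775**   -.862**
Sig. (2-tailed)                   .000      .000            .000     .000
N                    109          107       109             75       109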
All predictor variables, including GDP, have a high correlation with the
outcome variable when taken individually. The correlation between
GDP per capita and Average Female life expectancy is 0.642, which is quite
high and has a probability level of less than .001.
However, in the Multiple Regression model, GDP is no longer significantly
associated with the outcome. The reason is that Multiple Regression looks at
the combination of these four variables to predict the outcome. The
coefficients table shows the contribution of each variable only in combination
with the others, so a predictor such as GDP, which is itself correlated with
the other predictors, adds little once they are in the model.
Residuals Statistics
The above chart tests the validity of the assumption that the residuals are
normally distributed. Judging from the chart, the residuals appear to be
approximately normal.
Since not all of the regression coefficients are significant (the coefficient
for GDP per capita is not), the model from the Enter method is not used for
estimation.
Hence, for estimation, the Stepwise method is used.
Descriptive Statistics
Correlations
                                    Average female   People who  Birth rate per  Daily calorie  Gross domestic
                                    life expectancy  read (%)    1000 people     intake         product / capita
N  Average female life expectancy   74               74          74              74             74
   Gross domestic product / capita  74               74          74              74             74
Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
c. Predictors: (Constant), People who read (%), Daily calorie intake, Birth rate per 1000
people
In the previous method, there was only one model. The Stepwise method adds
predictors one at a time and gives the model obtained at each step. The
Durbin-Watson statistic is within the acceptable range of 1.5 to 2.5.
The last model is generally the best model.
ANOVA
Model            Sum of Squares   df   Mean Square   F         Sig.
1  Total         9567.459         73
2  Regression    7829.451         2    3914.726      159.922   .000
   Total         9567.459         73
3  Regression    7960.208         3    2653.403      115.563   .000
   Total         9567.459         73
c. Predictors: (Constant), People who read (%), Daily calorie intake, Birth rate per 1000 people
The above table gives the ANOVA for each step of the stepwise procedure, and every model is significant.
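The F values in the ANOVA table can be recomputed from the sums of squares and degrees of freedom that survive in the output. The sketch below assumes the usual decomposition, where the residual sum of squares is the total minus the regression sum of squares. The function names are illustrative.

```python
def f_statistic(ss_regression, df_regression, ss_total, df_total):
    """F = mean square regression / mean square residual."""
    ss_residual = ss_total - ss_regression
    df_residual = df_total - df_regression
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual

def r_squared(ss_regression, ss_total):
    """Share of the total variation explained by the model."""
    return ss_regression / ss_total

SS_TOTAL, DF_TOTAL = 9567.459, 73          # from the ANOVA table
MODELS = {2: (7829.451, 2), 3: (7960.208, 3)}  # model: (SS regression, df)

for model, (ss_reg, df_reg) in MODELS.items():
    f_value = f_statistic(ss_reg, df_reg, SS_TOTAL, DF_TOTAL)
    print(model, round(f_value, 3), round(r_squared(ss_reg, SS_TOTAL), 3))
```

The recomputed F values agree with the tabulated 159.922 and 115.563 up to rounding, and the R squared values show that the third model explains slightly more of the variation than the second.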
Coefficients
                               Unstandardized       Standardized
                               Coefficients         Coefficients                   Correlations
Model                          B       Std. Error   Beta          t       Sig.    Zero-order  Partial  Part
1  People who read (%)         .430    .029         .869          14.923  .000    .869        .869     .869
2  People who read (%)         .315    .034         .636          9.202   .000    .869        .738     .465
   Daily calorie intake        .007    .001         .342          4.949   .000    .776        .506     .250
3  People who read (%)         .227    .049         .460          4.605   .000    .869        .482     .226
   Daily calorie intake        .005    .002         .261          3.472   .001    .776        .383     .170
   Birth rate per 1000 people  -.245   .103         -.267         -2.386  .020    -.864       -.274    -.117
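One way to read the Part column: the squared part (semipartial) correlation of the predictor added at a step equals the increase in R squared at that step. The sketch below checks this with the figures reported in the ANOVA and Coefficients tables. It assumes that R squared for the first model is the square of the single zero-order correlation, and the variable names are illustrative. Small differences are expected because the printed values are rounded.

```python
SS_TOTAL = 9567.459

# R squared for each stepwise model.
r2_model1 = 0.869 ** 2            # one predictor: R equals the zero-order r
r2_model2 = 7829.451 / SS_TOTAL   # regression SS / total SS
r2_model3 = 7960.208 / SS_TOTAL

# Part correlations of the predictor entered at each step.
part_calories_model2 = 0.250
part_birth_rate_model3 = -0.117

gain_step2 = r2_model2 - r2_model1
gain_step3 = r2_model3 - r2_model2

print(round(gain_step2, 4), round(part_calories_model2 ** 2, 4))
print(round(gain_step3, 4), round(part_birth_rate_model3 ** 2, 4))
```

Each pair of printed numbers should be close, which shows that daily calorie intake adds about six percentage points of explained variance and birth rate adds a little over one.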
Excluded Variables
                                                                Partial       Collinearity Statistics
Model                               Beta In    t       Sig.    Correlation   Tolerance
1  Birth rate per 1000 people       -.442(a)   -4.134  .000    -.440         .242
   Daily calorie intake             .342(a)    4.949   .000    .506          .535
   Gross domestic product / capita  .215(a)    3.036   .003    .339          .606
2  Birth rate per 1000 people       -.267(b)   -2.386  .020    -.274         .192
   Gross domestic product / capita  .042(b)    .525    .601    .063          .400
3  Gross domestic product / capita  -.023(c)   -.273   .786    -.033         .355
b. Predictors in the Model: (Constant), People who read (%), Daily calorie intake
c. Predictors in the Model: (Constant), People who read (%), Daily calorie intake, Birth rate per 1000 people
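The Tolerance column can be converted into the more familiar variance inflation factor (VIF), which is simply its reciprocal. The sketch below does this for the tolerances reported for GDP per capita at each step. The function name is illustrative.

```python
def variance_inflation_factor(tolerance):
    """VIF is the reciprocal of tolerance (tolerance = 1 - R squared
    of the predictor regressed on the predictors already in the model)."""
    if not 0 < tolerance <= 1:
        raise ValueError("tolerance must be in (0, 1]")
    return 1.0 / tolerance

# Tolerance of Gross domestic product / capita in models 1, 2 and 3.
gdp_tolerance = {1: 0.606, 2: 0.400, 3: 0.355}

for model, tol in gdp_tolerance.items():
    shared = 1.0 - tol  # share of GDP's variance explained by the model's predictors
    print(model, round(variance_inflation_factor(tol), 2), round(shared, 3))
```

The shrinking tolerance shows that, as more predictors enter, a growing share of the variation in GDP per capita is already accounted for by them, which fits with GDP adding nothing significant at the final step.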
Technology survey data. The CMU technology survey was conducted to find
out how faculty currently use technology for instruction. Many
issues are addressed in the survey, including the use of technology,
awareness of the technology available at CMU, participation in workshops,
development of web sites and their purposes, difficulties faced when using
technology, priority issues and policy issues.
SOURCE OF DATA:
OBJECTIVE:
METHODOLOGY:
Descriptive Statistics
From the descriptive table we get to know that there are 126 valid cases for
our analysis.
Correlation Matrix
                     Q31A1  Q31A2  Q31A3  Q31A4  Q31A5  Q31A6  Q31A7  Q31A8  Q31A9  Q31A10  Q31A11  Q31A12
Correlation  Q31A1   1.000  .920   .687   .684   .747   .663   .727   .723   .462   .062    .245    .280
             Q31A2   .920   1.000  .749   .679   .776   .647   .765   .742   .466   .063    .242    .240
             Q31A3   .687   .749   1.000  .651   .752   .673   .667   .683   .511   .197    .372    .278
             Q31A4   .684   .679   .651   1.000  .684   .628   .588   .627   .554   .189    .384    .344
             Q31A5   .747   .776   .752   .684   1.000  .755   .742   .804   .534   .076    .381    .385
             Q31A6   .663   .647   .673   .628   .755   1.000  .630   .643   .475   .125    .382    .347
             Q31A7   .727   .765   .667   .588   .742   .630   1.000  .888   .451   -.009   .233    .197
             Q31A8   .723   .742   .683   .627   .804   .643   .888   1.000  .493   .015    .276    .203
             Q31A9   .462   .466   .511   .554   .534   .475   .451   .493   1.000  .401    .425    .602
             Q31A10  .062   .063   .197   .189   .076   .125   -.009  .015   .401   1.000   .378    .488
             Q31A11  .245   .242   .372   .384   .381   .382   .233   .276   .425   .378    1.000   .489
             Q31A12  .280   .240   .278   .344   .385   .347   .197   .203   .602   .488    .489    1.000
Principal Component Analysis can be carried out if the correlation matrix
for the variables contains at least two correlations of 0.3 or more. Here we
see that this condition is fulfilled.
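The 0.3 condition can be checked mechanically. The sketch below transcribes the correlation matrix from the table and counts the distinct pairs of variables whose correlation is at least 0.3 in absolute value. The function name is illustrative.

```python
# Correlation matrix for Q31A1 ... Q31A12, transcribed from the table.
R = [
    [1.000, .920, .687, .684, .747, .663, .727, .723, .462, .062, .245, .280],
    [.920, 1.000, .749, .679, .776, .647, .765, .742, .466, .063, .242, .240],
    [.687, .749, 1.000, .651, .752, .673, .667, .683, .511, .197, .372, .278],
    [.684, .679, .651, 1.000, .684, .628, .588, .627, .554, .189, .384, .344],
    [.747, .776, .752, .684, 1.000, .755, .742, .804, .534, .076, .381, .385],
    [.663, .647, .673, .628, .755, 1.000, .630, .643, .475, .125, .382, .347],
    [.727, .765, .667, .588, .742, .630, 1.000, .888, .451, -.009, .233, .197],
    [.723, .742, .683, .627, .804, .643, .888, 1.000, .493, .015, .276, .203],
    [.462, .466, .511, .554, .534, .475, .451, .493, 1.000, .401, .425, .602],
    [.062, .063, .197, .189, .076, .125, -.009, .015, .401, 1.000, .378, .488],
    [.245, .242, .372, .384, .381, .382, .233, .276, .425, .378, 1.000, .489],
    [.280, .240, .278, .344, .385, .347, .197, .203, .602, .488, .489, 1.000],
]

def count_large_correlations(matrix, threshold=0.3):
    """Count distinct variable pairs with |r| >= threshold (upper triangle)."""
    size = len(matrix)
    return sum(
        1
        for i in range(size)
        for j in range(i + 1, size)
        if abs(matrix[i][j]) >= threshold
    )

n_pairs = len(R) * (len(R) - 1) // 2
n_large = count_large_correlations(R)
print(n_large, "of", n_pairs, "pairs have |r| >= 0.3")
print("condition met:", n_large >= 2)
```

The number of pairs, 66, matches the degrees of freedom shown for Bartlett's test, since both count the distinct off-diagonal correlations among 12 variables.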
KMO and Bartlett's Test
Df 66
Sig. .000
Communalities
Initial Extraction
The above table shows the initial communalities and extraction communalities.
Communality is the share of each variable's variance explained by the common
factors extracted in the factor analysis. Extraction communalities are
estimates of the variance in each variable accounted for by the components.
E.g., 79.6% of the variance of the variable Q31A1 is explained by the common
factors in this factor analysis. The communalities in the above table are all
high, which indicates that the extracted components represent the variables well.
The above table gives the total variance explained by each component.
In the above table we see that only 2 eigenvalues are greater than 1. Therefore
only these 2 factors are extracted.
The pivot table shows the percentage of variance explained by each factor and
the cumulative variance. According to the Rotation Sums of Squared Loadings,
these 2 factors still account for 72% of the variance.
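As a short worked note on these percentages: with p = 12 standardized variables, the eigenvalues of the correlation matrix sum to 12, so each component's share of variance is its eigenvalue divided by 12. The symbols below (\lambda_k for the k-th eigenvalue, p for the number of variables) are notation introduced here, and the 72% figure is the rounded value quoted above.

```latex
\[
  \sum_{k=1}^{p} \lambda_k = p = 12,
  \qquad
  \text{share of variance for component } k = \frac{\lambda_k}{p}
\]
\[
  \frac{\lambda_1 + \lambda_2}{12} \approx 0.72
  \;\Longrightarrow\;
  \lambda_1 + \lambda_2 \approx 0.72 \times 12 = 8.64
\]
```

An eigenvalue greater than 1 therefore means the component explains more variance than a single original variable, which is the reasoning behind retaining only such components.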
The scree plot plots the eigenvalues against the component number and helps in
determining the optimal number of components. The scree plot supports
extracting 2 factors because the eigenvalues level off from the 3rd eigenvalue
onwards.
Component Matrix
Component: 1, 2
a. 2 components extracted.
Rotated Component Matrix
Component: 1, 2
a. Rotation converged in 3 iterations.
The largest loading in each row of the above table indicates the component to
which the respective variable belongs. We can see from the above rotated
component matrix that variables Q31A1 through Q31A8 (listed in the table from
Q31A2 to Q31A4) are highly loaded on Factor 1. On the other hand, variables
Q31A9 through Q31A12 (listed from Q31A12 to Q31A9) are highly loaded on Factor 2.
Thus this analysis puts the 12 variables into 2 factors: those which are
highly loaded on Factor 1 and those which are highly loaded on Factor 2.
Component Transformation Matrix
Component 1 2
1 .925 .379
2 -.379 .925
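The component transformation matrix describes the rotation applied to the unrotated components. As a rough check, the sketch below verifies that the reported matrix is close to orthogonal and recovers the implied rotation angle. The function names are illustrative, the sign convention of the angle is an assumption, and the entries are rounded to three decimals.

```python
import math

# Component transformation matrix as reported.
T = [[0.925, 0.379],
     [-0.379, 0.925]]

def times_transpose(m):
    """Return m multiplied by its own transpose (2 x 2 case)."""
    return [
        [sum(m[i][k] * m[j][k] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]

def is_approximately_orthogonal(m, tol=0.01):
    """An orthogonal rotation satisfies m * m^T = identity."""
    product = times_transpose(m)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    return all(
        abs(product[i][j] - identity[i][j]) <= tol
        for i in range(2)
        for j in range(2)
    )

def rotation_angle_degrees(m):
    """Angle whose cosine is m[0][0] and whose sine is m[0][1]."""
    return math.degrees(math.atan2(m[0][1], m[0][0]))

print(is_approximately_orthogonal(T))
print(round(rotation_angle_degrees(T), 1))
```

An orthogonal matrix of this form means the rotated factors remain uncorrelated, and the two components were turned by roughly 22 degrees.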
OBJECTIVE:
METHODOLOGY:
Case Processing Summary
Agglomeration Schedule
         Cluster Combined                     Stage Cluster First Appears
Stage    Cluster 1   Cluster 2   Coefficients   Cluster 1   Cluster 2   Next Stage
1        1           2           3.219          0           0           5
2        7           8           3.646          0           0           5
3        5           6           4.706          0           0           4
4        3           5           5.409          0           3           6
5        1           7           5.716          1           2           6
6        1           3           5.858          5           4           7
7        1           4           6.645          6           0           11
8        9           12          6.712          0           0           9
9        9           10          7.854          8           0           10
10       9           11          8.125          9           0           11
11       1           9           8.773          7           10          0
Cluster Membership
Case 2 Clusters
Q31A1 1
Q31A2 1
Q31A3 1
Q31A4 1
Q31A5 1
Q31A6 1
Q31A7 1
Q31A8 1
Q31A9 2
Q31A10 2
Q31A11 2
Q31A12 2
The above table shows that cluster 1 is made up of variables Q31A1 through
Q31A8 and cluster 2 is made up of variables Q31A9 to Q31A12, which is the same
result we got in the previous problem.
This shows the same thing as the dendrogram, but it doesn’t show the order in
which the clusters are combined.
* * * * * * H I E R A R C H I C A L   C L U S T E R   A N A L Y S I S * * * * * *
C A S E 0 5 10 15 20 25
Label Num +---------+---------+---------+---------+---------+
Q31A1 1 -+---------------------+
Q31A2 2 -+ |
Q31A7 7 ---+-------------------+ Branch 1
Q31A8 8 ---+ +-------+
Q31A5 5 -------------+-----+ | |
Q31A6 6 -------------+ +---+ +-----------------+
Q31A3 3 -------------------+ | |
Q31A4 4 -------------------------------+ |
Q31A9 9 -------------------------------+---------+ |
Q31A12 12 -------------------------------+ +---+ |
Q31A10 10 -----------------------------------------+ +---+
Q31A11 11 ---------------------------------------------+
Branch2
The dendrogram shows how variables combine to form clusters. Here it shows the
two clusters: cluster 1 consists of variables Q31A1 through Q31A8, and
cluster 2 consists of variables Q31A9 to Q31A12.
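The two-cluster membership can be reproduced by replaying the merge stages of the agglomeration schedule and stopping when two clusters remain. The sketch below does this with a small union-find structure. The merge pairs are the stage-by-stage combinations from the schedule (variable numbers 1 to 12 stand for Q31A1 to Q31A12), and the function name is illustrative.

```python
# Cluster pairs combined at stages 1 to 11 of the agglomeration schedule.
MERGES = [(1, 2), (7, 8), (5, 6), (3, 5), (1, 7), (1, 3),
          (1, 4), (9, 12), (9, 10), (9, 11), (1, 9)]

def cut_into_clusters(merges, n_items, n_clusters):
    """Apply the first (n_items - n_clusters) merges and return the groups."""
    parent = list(range(n_items + 1))  # index 0 is unused

    def find(item):
        # Follow parent links to the cluster representative.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in merges[: n_items - n_clusters]:
        root_a, root_b = find(a), find(b)
        # Keep the lower-numbered item as the representative.
        parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for item in range(1, n_items + 1):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())

for cluster in cut_into_clusters(MERGES, n_items=12, n_clusters=2):
    print(["Q31A%d" % number for number in cluster])
```

Stopping one merge before the end leaves exactly the two branches seen in the dendrogram. The final stage, which joins clusters 1 and 9, is the merge that a two-cluster solution leaves out.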