Regression With Stata Chapter 1 - Simple and Multiple Regression PDF
1.0 Introduction
1.1 A First Regression Analysis
1.2 Examining Data
1.3 Simple linear regression
1.4 Multiple regression
1.5 Transforming variables
1.6 Summary
1.7 Self assessment
1.8 For more information
1.0 Introduction
This book is composed of four chapters covering a variety of topics about using Stata for regression.
We should emphasize that this book is about “data analysis” and that it demonstrates how Stata can be
used for regression analysis, as opposed to a book that covers the statistical basis of multiple
regression. We assume that you have had at least one statistics course covering regression analysis
and that you have a regression book that you can use as a reference (see the Regression With Stata
(/stata/webbooks/reg/) page and our Statistics Books for Loan page (/books/) for recommended
regression analysis books). This book is designed to help you apply your knowledge of regression, combined
with instruction on Stata, to perform, understand, and interpret regression analyses.
This first chapter will cover topics in simple and multiple regression, as well as the supporting tasks that
are important in preparing to analyze your data, e.g., data checking, getting familiar with your data file,
and examining the distribution of your variables. We will illustrate the basics of simple and multiple
regression and demonstrate the importance of inspecting, checking and verifying your data before
accepting the results of your analysis. In general, we hope to show that the results of your regression
analysis can be misleading without further probing of your data, which could reveal relationships that a
casual analysis could overlook.
In this chapter, and in subsequent chapters, we will be using a data file that was created by randomly
sampling 400 elementary schools from the California Department of Education’s API 2000 dataset.
This data file contains a measure of school academic performance as well as other attributes of the
elementary schools, such as class size, enrollment, and poverty.
https://stats.idre.ucla.edu/stata/webbooks/reg/chapter1/regressionwith-statachapter-1-simple-and-multiple-regression/ 1/42
11/4/2020 Regression with Stata Chapter 1 – Simple and Multiple Regression
You can access this data file over the web from within Stata with the Stata use command as shown
below. Note: Do not type the leading dot in the command — the dot is a convention to indicate that
the statement is a Stata command.
use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi
Once you have read the file, you probably want to store a copy of it on your computer (so you don’t
need to read it over the web every time). Let’s say you are using Windows and want to store the file in
a folder called c:\regstata (you can choose a different name if you like). First, you can make this folder
within Stata using the mkdir command and then change to it with the cd command.
mkdir c:\regstata
cd c:\regstata
And then if you save the file it will be saved in the c:\regstata folder. Let’s save the file as elemapi.
save elemapi
Now the data file is saved as c:\regstata\elemapi.dta and you could quit Stata and the data file would still
be there. When you wish to use the file in the future, you would just use the cd command to change to
the c:\regstata directory (or whatever you called it) and then use the elemapi file.
cd c:\regstata
use elemapi
Let’s dive right in and perform a regression analysis using the variables api00, acs_k3, meals and full.
These measure the academic performance of the school (api00), the average class size in kindergarten
through 3rd grade (acs_k3), the percentage of students receiving free meals (meals) – which is an
indicator of poverty, and the percentage of teachers who have full teaching credentials (full). We expect
that better academic performance would be associated with lower class size, fewer students receiving
free meals, and a higher percentage of teachers having full teaching credentials. Below, we show the
Stata command for testing this regression model followed by the Stata output.
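As a minimal sketch, and assuming the elemapi file loaded above is still in memory, a model with these three predictors would be fit with regress, listing the outcome variable first and the predictors after it:

```stata
* Outcome (api00) first, then the three predictors described above
regress api00 acs_k3 meals full
```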
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
acs_k3 | -2.681508 1.393991 -1.92 0.055 -5.424424 .0614073
meals | -3.702419 .1540256 -24.04 0.000 -4.005491 -3.399348
full | .1086104 .090719 1.20 0.232 -.0698947 .2871154
_cons | 906.7392 28.26505 32.08 0.000 851.1228 962.3555
------------------------------------------------------------------------------
Let’s focus on the three predictors: whether they are statistically significant and, if so, the direction of
the relationship. The average class size (acs_k3, b=-2.68) is not statistically significant at the 0.05 level
(p=0.055), but only just so. The coefficient is negative, which would indicate that larger class size is
related to lower academic performance, which is what we would expect. Next, the effect of meals
(b=-3.70, p=.000) is significant and its coefficient is negative, indicating that the greater the percentage of
students receiving free meals, the lower the academic performance. Please note that we are not
saying that free meals are causing lower academic performance. The meals variable is highly related
to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated
with lower academic performance. This result also makes sense. Finally, the percentage of teachers
with full credentials (full, b=0.11, p=.232) seems to be unrelated to academic performance. This would
seem to indicate that the percentage of teachers with full credentials is not an important factor in
predicting academic performance; this result was somewhat unexpected.
Should we take these results and write them up for publication? From these results, we would
conclude that lower class sizes are related to higher performance, that fewer students receiving free
meals is associated with higher performance, and that the percentage of teachers with full credentials
was not related to academic performance in the schools. Before we write this up for publication, we
should do a number of checks to make sure we can firmly stand behind these results. We start by
getting more familiar with the data file, doing preliminary data checking, and looking for errors in the data.
First, let’s use the describe command to learn more about this data file. We can verify how many
observations it has and see the names of the variables it contains. To do this, we simply type
describe
obs: 400
vars: 21 25 Feb 2001 16:58
size: 14,800 (92.3% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
snum int %9.0g school number
dnum int %7.0g dname district number
api00 int %6.0g api 2000
api99 int %6.0g api 1999
growth int %6.0g growth 1999 to 2000
meals byte %4.0f pct free meals
ell byte %4.0f english language learners
yr_rnd byte %4.0f yr_rnd year round school
mobility byte %4.0f pct 1st year in school
acs_k3 byte %4.0f avg class size k-3
acs_46 byte %4.0f avg class size 4-6
not_hsg byte %4.0f parent not hsg
hsg byte %4.0f parent hsg
some_col byte %4.0f parent some college
col_grad byte %4.0f parent college grad
grad_sch byte %4.0f parent grad school
avg_ed float %9.0g avg parent ed
full float %4.0f pct full credential
emer byte %4.0f pct emer credential
enroll int %9.0g number of students
mealcat byte %18.0g mealcat Percentage free meals in 3 categories
-------------------------------------------------------------------------------
Sorted by: dnum
We will not go into all of the details of this output. Note that there are 400 observations and 21
variables. We have variables about academic performance in 2000 and 1999 and the change in
performance, api00, api99 and growth respectively. We also have various characteristics of the
schools, e.g., class size, parents’ education, percent of teachers with full and emergency credentials,
and number of students. Note that when we did our original regression analysis it said that there were
313 observations, but the describe command indicates that we have 400 observations in the data file.
If you want to learn more about the data file, you could list all or some of the observations. For
example, below we list the first five observations.
list in 1/5
(Listing of observations 1 through 5 omitted.)
This takes up lots of space on the page, but does not give us a lot of information. Listing our data can
be very helpful, but it is more helpful if you list just the variables you are interested in. Let’s list the first
10 observations for the variables that we looked at in our first regression analysis.
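One way to produce such a listing, sketched under the assumption that the elemapi data are still in memory, is to give list a variable list together with an in range:

```stata
* First 10 observations, restricted to the variables from the first regression
list api00 acs_k3 meals full in 1/10
```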
We see that among the first 10 observations, we have four missing values for meals. It is likely that the
missing data for meals had something to do with the fact that the number of observations in our first
regression analysis was 313 and not 400.
Another useful tool for learning about your variables is the codebook command. Let’s do codebook for
the variables we included in the regression analysis, as well as the variable yr_rnd. We have
interspersed some comments on this output in [square brackets and in bold].
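A sketch of the corresponding command, assuming the variable order used in the regression followed by yr_rnd:

```stata
* Codebook for the regression variables plus yr_rnd
codebook api00 acs_k3 meals full yr_rnd
```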
api00    mean: 647.622   std. dev: 142.249
acs_k3   mean: 18.5477   std. dev: 5.00493
meals    mean: 71.9937   std. dev: 24.3856
full     mean: 66.0568   std. dev: 40.2979
yr_rnd   tabulation:
Freq. Numeric Label
308 0 No
92 1 Yes
[the variable yr_rnd is coded 0=No (not year round) and 1=Yes (year round)]
[308 are non-year round and 92 are year round, and none are missing]
The codebook command has uncovered a number of peculiarities worthy of further examination. Let’s
use the summarize command to learn more about these variables. As shown below, the summarize
command also reveals the large number of missing values for meals (400 – 315 = 85) and we see the
unusual minimum for acs_k3 of -21.
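As a sketch, a summary of this kind could be requested for the regression variables as follows. The Obs column of the output is where the number of non-missing values for meals would appear, and the Min column would show the negative class size.

```stata
* Observations, mean, standard deviation, minimum and maximum for each variable
summarize api00 acs_k3 meals full
```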
Let’s get a more detailed summary for acs_k3. In Stata, the comma after the variable list indicates that
options follow, in this case, the option is detail. As you can see below, the detail option gives you the
percentiles, the four largest and smallest values, measures of central tendency and variance, etc. Note
that summarize, and other commands, can be abbreviated: we could have typed sum acs_k3, d.
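Written out, the full and abbreviated forms described above look like this; both should produce the same detailed summary:

```stata
* Full form: the option follows the comma
summarize acs_k3, detail

* Abbreviated form of the same command
sum acs_k3, d
```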
It seems as though some of the class sizes somehow became negative, as though a negative sign was
incorrectly typed in front of them. Let’s do a tabulate of class size to see if this seems plausible.
tabulate acs_k3
avg class |
size k-3 | Freq. Percent Cum.
------------+-----------------------------------
-21 | 3 0.75 0.75
-20 | 2 0.50 1.26
-19 | 1 0.25 1.51
14 | 2 0.50 2.01
15 | 1 0.25 2.26
16 | 14 3.52 5.78
17 | 20 5.03 10.80
18 | 64 16.08 26.88
19 | 143 35.93 62.81
20 | 97 24.37 87.19
21 | 40 10.05 97.24
22 | 7 1.76 98.99
23 | 3 0.75 99.75
25 | 1 0.25 100.00
------------+-----------------------------------
Total | 398 100.00
Indeed, it seems that some of the class sizes somehow got negative signs put in front of them. Let’s
look at the school and district number for these observations to see if they come from the same district.
Indeed, they all come from district 140.
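A sketch of how these observations could be pulled out, assuming an if qualifier on the class size variable. Stata treats missing values as larger than any number, so the condition acs_k3 < 0 does not pick up schools where acs_k3 is missing.

```stata
* School number, district number and class size for the negative class sizes
list snum dnum acs_k3 if acs_k3 < 0
```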
All of the observations from district 140 seem to have this problem. When you find such a problem, you
want to go back to the original source of the data to verify the values. We have to reveal that we
fabricated this error for illustration purposes, and that the actual data had no such problem. Let’s
pretend that we checked with district 140 and there was a problem with the data there, a hyphen was
accidentally put in front of the class sizes making them negative. We will make a note to fix this! Let’s
continue checking our data.
Let’s take a look at some graphical methods for inspecting data. It is useful to inspect each
variable using a histogram, boxplot, and stem-and-leaf plot. These graphs can show you
information about the shape of your variables better than simple numeric statistics can. We already
know about the problem with acs_k3, but let’s see how these graphical methods would have revealed
the problem with this variable.
First, we show a histogram for acs_k3. This shows us the observations where the average class size is
negative.
histogram acs_k3
Likewise, a boxplot would have called these observations to our attention as well. You can see the
outlying negative observations way at the bottom of the boxplot.
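A boxplot of this variable can be drawn with graph box; this is a minimal sketch with default options, not necessarily the exact command used for the figure:

```stata
* Boxplot of average class size; outliers are drawn as individual points
graph box acs_k3
```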
Finally, a stem-and-leaf plot would also have helped to identify these observations. This plot shows the
exact values of the observations, indicating that there were three -21s, two -20s, and one -19.
stem acs_k3
-2* | 11100
-1. | 9
-1s |
-1f |
-1t |
-1* |
-0. |
-0s |
-0f |
-0t |
-0* |
0* |
0t |
0f |
0s |
0. |
1* |
1t |
1f | 445
1s | 6666666666666677777777777777777777
1. | 88888888888888888888888888888888888888888888888888888888888888 ... (207)
2* | 00000000000000000000000000000000000000000000000000000000000000 ... (137)
2t | 2222222333
2f | 5
We recommend plotting all of these graphs for the variables you will be analyzing. We will omit, due to
space considerations, showing these graphs for all of the variables. However, in examining the
variables, the stem-and-leaf plot for full seemed rather unusual. Up to now, we have not seen anything
problematic with this variable, but look at the stem and leaf plot for full below. It shows 104
observations where the percent with a full credential is less than one. This is over 25% of the schools,
and seems very unusual.
stem full
7** | 60,60,60,60,70,70
7** | 80,80,80,80,90,90,90
8** | 00,00,00,00,00,00,00,00,00,00,10,10,10,10
8** | 20,20,20,30,30,30,30,30,30,30,30,30
8** | 40,40,40,40,50,50,50,50,50,50,50,50
8** | 60,60,60,60,60,70,70,70,70,70,70,70,70,70,70,70,70
8** | 80,80,80,80,80,80,90,90,90,90,90
9** | 00,00,00,00,00,00,00,00,00,10,10,10,10,10,10,10,10
9** | 20,20,20,20,20,20,20,30,30,30,30,30,30,30,30,30,30,30,30
9** | 40,40,40,40,40,40,40,40,40,40,50,50,50,50,50,50,50,50,50,50, ... (27)
9** | 60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,70,70,70, ... (28)
9** | 80,80,80,80,80,80,80,80,80
10** | 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00, ... (81)
Let’s look at the frequency distribution of full to see if we can understand this better. The values go
from 0.42 to 1.0, then jump to 37 and go up from there. It appears as though some of the percentages
are actually entered as proportions, e.g., 0.42 was entered instead of 42, or 0.96 instead of 96.
tabulate full
pct full |
credential | Freq. Percent Cum.
------------+-----------------------------------
0.42 | 1 0.25 0.25
0.45 | 1 0.25 0.50
0.46 | 1 0.25 0.75
0.47 | 1 0.25 1.00
0.48 | 1 0.25 1.25
0.50 | 3 0.75 2.00
0.51 | 1 0.25 2.25
0.52 | 1 0.25 2.50
0.53 | 1 0.25 2.75
0.54 | 1 0.25 3.00
0.56 | 2 0.50 3.50
0.57 | 2 0.50 4.00
0.58 | 1 0.25 4.25
0.59 | 3 0.75 5.00
0.60 | 1 0.25 5.25
0.61 | 4 1.00 6.25
0.62 | 2 0.50 6.75
0.63 | 1 0.25 7.00
0.64 | 3 0.75 7.75
0.65 | 3 0.75 8.50
0.66 | 2 0.50 9.00
0.67 | 6 1.50 10.50
0.68 | 2 0.50 11.00
0.69 | 3 0.75 11.75
0.70 | 1 0.25 12.00
0.71 | 1 0.25 12.25
0.72 | 2 0.50 12.75
0.73 | 6 1.50 14.25
0.75 | 4 1.00 15.25
0.76 | 2 0.50 15.75
0.77 | 2 0.50 16.25
0.79 | 3 0.75 17.00
0.80 | 5 1.25 18.25
0.81 | 8 2.00 20.25
0.82 | 2 0.50 20.75
0.83 | 2 0.50 21.25
0.84 | 2 0.50 21.75
0.85 | 3 0.75 22.50
0.86 | 2 0.50 23.00
0.90 | 3 0.75 23.75
district |
number | Freq. Percent Cum.
------------+-----------------------------------
401 | 104 100.00 100.00
------------+-----------------------------------
Total | 104 100.00
We note that all 104 observations in which full was less than or equal to one came from district 401.
Let’s count how many observations there are in district 401 using the count command and we see
district 401 has 104 observations.
count if dnum==401
104
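The district check can be sketched in two steps, assuming the cutoff of one used in the text. The second command is an added cross-check, not something taken from the original: it counts schools in district 401 whose value of full looks like a genuine percentage, and should report 0 if every school in that district is affected.

```stata
* Which districts contribute the values of full that look like proportions?
tabulate dnum if full <= 1

* Cross-check: schools in district 401 with a value above 1
* (missing values count as larger than any number, so exclude them explicitly)
count if dnum == 401 & full > 1 & !missing(full)
```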
All of the observations from this district seem to be recorded as proportions instead of percentages.
Again, let us state that this is a pretend problem that we inserted into the data for illustration purposes.
If this were a real life problem, we would check with the source of the data and verify the problem. We
will make a note to fix this problem in the data as well.
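If you wanted to repair the two recording errors directly in memory, something like the following would do it. This is only a sketch under the assumptions stated in the comments; the text goes on to use a separately corrected data file rather than edits like these, and nothing here addresses the missing values for meals.

```stata
* Assumption: every negative class size is a positive value with a stray minus sign
* (missing values are not < 0, so they are left untouched)
replace acs_k3 = -acs_k3 if acs_k3 < 0

* Assumption: every value of full at or below 1 is a proportion, not a percentage
replace full = full * 100 if full <= 1

* Confirm that the suspicious values are gone
summarize acs_k3 full
```

In real data a cutoff like full <= 1 would need checking against the source, since a school could legitimately report 1 percent.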
Another useful graphical technique for screening your data is a scatterplot matrix. While this is probably
more relevant as a diagnostic tool searching for non-linearities and outliers in your data, it can also be a
useful data screening tool, possibly revealing information in the joint distributions of your variables that
would not be apparent from examining univariate distributions. Let’s look at the scatterplot matrix for
the variables in our regression model. This reveals the problems we have already identified, i.e., the
negative class sizes and the percent full credential being entered as proportions.
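A scatterplot matrix for these variables can be sketched with graph matrix. The half option, which draws only the lower triangle, is an assumption about the layout rather than something the text specifies.

```stata
* Pairwise scatterplots of the outcome and the three predictors
graph matrix api00 acs_k3 meals full, half
```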
We have identified three problems in our data. There are numerous missing values for meals, there
were negatives accidentally inserted before some of the class sizes (acs_k3) and over a quarter of the
values for full were proportions instead of percentages. The corrected version of the data is called
elemapi2. Let’s use that data file and repeat our analysis and see if the results are the same as our
original analysis. First, let’s repeat our original regression analysis below.
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
acs_k3 | -2.681508 1.393991 -1.92 0.055 -5.424424 .0614073
meals | -3.702419 .1540256 -24.04 0.000 -4.005491 -3.399348
full | .1086104 .090719 1.20 0.232 -.0698947 .2871154
_cons | 906.7392 28.26505 32.08 0.000 851.1228 962.3555
------------------------------------------------------------------------------
Now, let’s use the corrected data file and repeat the regression analysis. We see quite a difference in
the results! In the original analysis (above), acs_k3 was nearly significant, but in the corrected analysis
(below) the results show this variable to be not significant, perhaps due to the cases where class size
was given a negative value. Likewise, the percentage of teachers with full credentials was not
significant in the original analysis, but is significant in the corrected analysis, perhaps due to the cases
where the value was given as the proportion with full credentials instead of the percent. Also, note
that the corrected analysis is based on 398 observations instead of 313 observations, due to getting
the complete data for the meals variable which had lots of missing values.
use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
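With the corrected file in memory, the same three-predictor model can be refit. As a sketch:

```stata
* Same model as before, now estimated on the corrected data
regress api00 acs_k3 meals full
```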
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
acs_k3 | -.7170622 2.238821 -0.32 0.749 -5.118592 3.684468
meals | -3.686265 .1117799 -32.98 0.000 -3.906024 -3.466505
full | 1.327138 .2388739 5.56 0.000 .857511 1.796765
_cons | 771.6581 48.86071 15.79 0.000 675.5978 867.7184
------------------------------------------------------------------------------
From this point forward, we will use the corrected, elemapi2, data file. You might want to save this on
your computer so you can use it in future analyses.
save elemapi2
So far we have covered some topics in data checking/verification, but we have not really discussed
regression analysis itself. Let’s now talk more about performing regression analysis in Stata.
Let’s begin by showing some examples of simple linear regression using Stata. In this type of
regression, we have only one predictor variable. This variable may be continuous, meaning that it may
assume all values within a range, for example, age or height, or it may be dichotomous, meaning that
the variable may assume only one of two values, for example, 0 or 1. The use of categorical variables
with more than two levels will be covered in Chapter 3. There is only one response or dependent
variable, and it is continuous.
In Stata, the dependent variable is listed immediately after the regress command followed by one or
more predictor variables. Let’s examine the relationship between the size of school and academic
performance to see if the size of the school is related to academic performance. For this example,
api00 is the dependent variable and enroll is the predictor.
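Following the syntax just described, a sketch of the command for this simple regression, assuming the elemapi2 data are in memory:

```stata
* Dependent variable first, then the single predictor
regress api00 enroll
```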
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
enroll | -.1998674 .0298512 -6.70 0.000 -.2585532 -.1411817
_cons | 744.2514 15.93308 46.71 0.000 712.9279 775.5749
------------------------------------------------------------------------------
Let’s review this output a bit more carefully. First, we see that the F-test is statistically significant, which
means that the model is statistically significant. The R-squared of .1012 means that approximately 10% of
the variance of api00 is accounted for by the model, in this case, enroll. The t-test for enroll equals
-6.70, and is statistically significant, meaning that the regression coefficient for enroll is significantly
different from zero. Note that (-6.70)^2 = 44.89, which is the same as the F-statistic (with some rounding
error). The coefficient for enroll is -.1998674, or approximately -.2, meaning that for a one unit increase
in enroll, we would expect a .2-unit decrease in api00. In other words, a school with 1100 students
would be expected to have an api score 20 units lower than a school with 1000 students. The constant
is 744.2514, and this is the predicted value when enroll equals zero. In most cases, the constant is not
very interesting. We have prepared an annotated output (/stata/webbooks/reg/chapter1/simple-
regression-analysis/) which shows the output from this regression along with an explanation of each of
the items in it.
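The arithmetic in this paragraph can be checked from the stored results. The sketch below assumes it is run immediately after the simple regression of api00 on enroll; _b[], _se[] and e(F) are the coefficients, standard errors and F statistic that regress leaves behind.

```stata
* Squared t statistic for enroll; this should agree with the F statistic
display (_b[enroll] / _se[enroll])^2
display e(F)

* Expected difference in api00 between schools with 1100 and 1000 students
* (roughly -20, given a coefficient of about -.2)
display _b[enroll] * (1100 - 1000)

* The constant is the predicted api00 when enroll equals zero
display _b[_cons]
```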
In addition to getting the regression table, it can be useful to see a scatterplot of the predicted and
outcome variables with the regression line plotted. After you run a regression, you can create a
variable that contains the predicted values using the predict command. You can get these values at
any point after you run a regress command, but remember that once you run a new regression, the
predicted values will be based on the most recent regression. To create predicted values, you just type
predict followed by the name of a new variable, and Stata will store the fitted values in it. For this example, our new
variable name will be fv, so we will type
predict fv
If we use the list command, we see that a fitted value has been generated for each observation.
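A sketch of such a listing, assuming the columns of interest are the observed api00 and the new variable fv:

```stata
* Observed outcome next to its fitted value for the first 10 schools
list api00 fv in 1/10
```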
4. 387 541.7857
5. 387 592.1523
6. 394 618.5348
7. 397 543.5845
8. 406 604.5441
9. 411 645.5169
10. 412 491.619
Below we show a scatterplot of the outcome variable, api00, and the predictor, enroll.
We can combine scatter with lfit to show a scatterplot with fitted values.
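Both plots can be sketched as follows. The first is the plain scatterplot; the second overlays the fitted line using twoway, with default options assumed throughout.

```stata
* Scatterplot of the outcome (y axis) against the predictor (x axis)
scatter api00 enroll

* Same scatterplot with the linear fit overlaid
twoway (scatter api00 enroll) (lfit api00 enroll)
```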
As you see, some of the points appear to be outliers. If you use the mlabel(snum) option on the
scatter command, you can see the school number for each point. This allows us to see, for example,
that one of the outliers is school 2910.
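A sketch of the labelled version, placing the mlabel(snum) option on the scatter part of the overlay:

```stata
* Label each point with its school number to identify outliers
twoway (scatter api00 enroll, mlabel(snum)) (lfit api00 enroll)
```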
As we saw earlier, the predict command can be used to generate predicted (fitted) values after running
regress. You can also obtain residuals by using the predict command followed by a variable name, in
this case e, with the residual option.
predict e, residual
This command can be shortened to predict e, resid or even predict e, r. The table below shows some
of the other values that can be created with the predict command.
Now, let’s look at an example of multiple regression, in which we have one outcome (dependent)
variable and multiple predictors. Before we begin with our next example, we need to make a decision
regarding the variables that we have created, because we will be creating similar variables with our
multiple regression, and we don’t want to get the variables confused. For example, in the simple
regression we created a variable fv for our predicted (fitted) values and e for the residuals. If we want
to create predicted values for our next example we could call the predicted value something else, e.g.,
fv_mr, but this could start getting confusing. We could drop the variables we have created, using drop
fv e. Instead, let’s clear out the data in memory and use the elemapi2 data file again. When we start
new examples in future chapters, we will clear out the existing data file and use the file again to start
fresh.
clear
use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
For this multiple regression example, we will regress the dependent variable, api00, on all of the
predictor variables in the data set.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ell | -.8600707 .2106317 -4.08 0.000 -1.274203 -.4459382
meals | -2.948216 .1703452 -17.31 0.000 -3.28314 -2.613293
yr_rnd | -19.88875 9.258442 -2.15 0.032 -38.09218 -1.68531
mobility | -1.301352 .4362053 -2.98 0.003 -2.158995 -.4437089
acs_k3 | 1.3187 2.252683 0.59 0.559 -3.1104 5.747801
acs_46 | 2.032456 .7983213 2.55 0.011 .462841 3.602071
full | .609715 .4758205 1.28 0.201 -.3258169 1.545247
emer | -.7066192 .6054086 -1.17 0.244 -1.89694 .4837018
enroll | -.012164 .0167921 -0.72 0.469 -.0451798 .0208517
_cons | 778.8305 61.68663 12.63 0.000 657.5457 900.1154
------------------------------------------------------------------------------
Let’s examine the output from this regression analysis. As with the simple regression, we look to the
p-value of the F-test to see if the overall model is significant. With a p-value of zero to four decimal
places, the model is statistically significant. The R-squared is 0.8446, meaning that approximately 84%
of the variability of api00 is accounted for by the variables in the model. In this case, the adjusted
R-squared indicates that about 84% of the variability of api00 is accounted for by the model, even after
taking into account the number of predictor variables in the model. The coefficient for each
variable indicates the amount of change one could expect in api00 given a one-unit change in the
value of that variable, given that all other variables in the model are held constant. For example,
consider the variable ell. We would expect a decrease of 0.86 in the api00 score for every one-unit
increase in ell, assuming that all other variables in the model are held constant. The interpretation of
much of the output from the multiple regression is the same as it was for the simple regression. We
have prepared an annotated output (/stata/webbooks/reg/chapter1/multiple-regression-analysis/) that
more thoroughly explains the output of this multiple regression analysis.
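The summary statistics discussed above can also be retrieved after the regression has run. The following is a minimal sketch, assuming the multiple regression above is still the most recent estimation command. It relies on the results that regress saves in e() and on the Ftail() function for the upper-tail F probability; the display labels are our own.

```stata
* Assumes the multiple regression above was the last estimation command
display "R-squared          = " %6.4f e(r2)
display "Adjusted R-squared = " %6.4f e(r2_a)
display "Model F            = " %8.2f e(F)
* p-value of the overall F-test, from the model and residual degrees of freedom
display "Prob > F           = " %6.4f Ftail(e(df_m), e(df_r), e(F))
* Expected change in api00 for a 10-unit increase in ell, other predictors held constant
display "10-unit change in ell: " %6.2f 10*_b[ell]
```

The last line simply scales the coefficient, which can make a small per-unit effect such as -0.86 easier to picture.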
You may be wondering what a coefficient of -0.86 for ell really means, and how you might compare the
strength of that coefficient to the coefficient for another variable, say meals. To address this problem,
we can add an option to the regress command called beta, which will give us the standardized
regression coefficients. The beta coefficients are used by some researchers to compare the relative
strength of the various predictors within the model. Because the beta coefficients are all measured in
standard deviations, instead of the units of the variables, they can be compared to one another. In
other words, the beta coefficients are the coefficients that you would obtain if the outcome and
predictor variables were all transformed to standard scores, also called z-scores, before running the
regression.
regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, beta
------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         ell |  -.8600707   .2106317    -4.08   0.000                -.1495771
       meals |  -2.948216   .1703452   -17.31   0.000                -.6607003
      yr_rnd |  -19.88875   9.258442    -2.15   0.032                -.0591404
    mobility |  -1.301352   .4362053    -2.98   0.003                -.0686382
      acs_k3 |     1.3187   2.252683     0.59   0.559                 .0127287
      acs_46 |   2.032456   .7983213     2.55   0.011                 .0549752
        full |    .609715   .4758205     1.28   0.201                 .0637969
        emer |  -.7066192   .6054086    -1.17   0.244                -.0580132
      enroll |   -.012164   .0167921    -0.72   0.469                -.0193554
       _cons |   778.8305   61.68663    12.63   0.000                        .
------------------------------------------------------------------------------
Because the coefficients in the Beta column are all in the same standardized units, you can compare
these coefficients to assess the relative strength of each of the predictors. In this example, meals has
the largest Beta coefficient in absolute value, -0.66, and acs_k3 has the smallest, 0.013. Thus, a
one standard deviation increase in meals is associated with a 0.66 standard deviation decrease in
predicted api00, with the other variables held constant. Likewise, a one standard deviation increase in
acs_k3 is associated with a 0.013 standard deviation increase in predicted api00, with the other
variables in the model held constant.
In interpreting this output, remember that the difference between the numbers listed in the Coef.
column and the Beta column is in the units of measurement. For example, to describe the raw
coefficient for ell you would say “A one-unit decrease in ell would yield a .86-unit increase in the
predicted api00.” However, for the standardized coefficient (Beta) you would say, “A one standard
deviation decrease in ell would yield a .15 standard deviation increase in the predicted api00.”
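The claim that Beta coefficients are what you would get from a regression on z-scores can be checked directly. The following is a sketch under some assumptions: the marker variable in_model and the z_ prefix are names we made up for illustration, and the z-scores are computed only on the observations the regression actually used, so that the standard deviations match those behind the Beta column.

```stata
* Mark the observations used by the multiple regression (hypothetical name in_model)
quietly regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
generate byte in_model = e(sample)

* Convert the outcome and every predictor to z-scores on that sample
foreach v of varlist api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll {
    egen z_`v' = std(`v') if in_model
}

* The coefficients here should agree with the Beta column; _cons should be essentially zero
regress z_api00 z_ell z_meals z_yr_rnd z_mobility z_acs_k3 z_acs_46 z_full z_emer z_enroll
```

In practice the beta option is far more convenient. The point of the sketch is only to show where the standardized coefficients come from.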
The listcoef command gives more extensive output regarding standardized coefficients. It is not part of
Stata, but you can download it over the internet like this.
search listcoef
and then follow the instructions (see also How can I use the search command to search for programs
and get additional help? (https://stats.idre.ucla.edu/stata/faq/search-faq/) for more information about
using search). Now that we have downloaded listcoef, we can run it like this.
listcoef
---------------------------------------------------------------------------
   api00 |         b        t   P>|t|     bStdX    bStdY   bStdXY     SDofX
---------+-----------------------------------------------------------------
     ell |  -0.86007   -4.083   0.000  -21.2891  -0.0060  -0.1496   24.7527
   meals |  -2.94822  -17.307   0.000  -94.0364  -0.0207  -0.6607   31.8960
  yr_rnd | -19.88875   -2.148   0.032   -8.4174  -0.1397  -0.0591    0.4232
mobility |  -1.30135   -2.983   0.003   -9.7692  -0.0091  -0.0686    7.5069
  acs_k3 |   1.31870    0.585   0.559    1.8117   0.0093   0.0127    1.3738
  acs_46 |   2.03246    2.546   0.011    7.8245   0.0143   0.0550    3.8498
    full |   0.60972    1.281   0.201    9.0801   0.0043   0.0638   14.8924
    emer |  -0.70662   -1.167   0.244   -8.2569  -0.0050  -0.0580   11.6851
  enroll |  -0.01216   -0.724   0.469   -2.7548  -0.0001  -0.0194  226.4732
---------------------------------------------------------------------------
Let us compare the regress output with the listcoef output. You will notice that the values in the
Coef., t, and P>|t| columns of the regress output match the b, t, and P>|t| columns of the listcoef
output. The values listed in the Beta column of the regress output are the same as the values in the
bStdXY column of listcoef. The bStdX column gives the unit change in Y expected with a one standard
deviation change in X. The bStdY column gives the standard deviation change in Y expected with a one
unit change in X. The SDofX column gives the standard deviation of each predictor variable in the model.
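The three standardized columns are simple rescalings of the raw coefficient b, which can be reproduced by hand for one predictor. The sketch below assumes the multiple regression is still the active estimation result; the scalar names sd_ell and sd_api are hypothetical, chosen for illustration.

```stata
* Standard deviations on the estimation sample (hypothetical scalar names)
quietly summarize ell if e(sample)
scalar sd_ell = r(sd)
quietly summarize api00 if e(sample)
scalar sd_api = r(sd)

* b scaled by SD of X, by SD of Y, and by both
display "bStdX  = " %9.4f _b[ell]*sd_ell
display "bStdY  = " %9.4f _b[ell]/sd_api
display "bStdXY = " %9.4f _b[ell]*sd_ell/sd_api
```

As a check against the listcoef table, b times SDofX for ell is -0.86007 × 24.7527, or about -21.29, which is the bStdX value shown.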
For example, the bStdX for ell is -21.3, meaning that a one standard deviation increase in ell would lead
to an expected 21.3 unit decrease in api00. The bStdY value for ell of -0.0060 means that for a one
unit (one percentage point) increase in English language learners, we would expect a 0.006 standard
deviation decrease in api00. Because the bStdX values are in standard units for the predictor variables,
you can use these coefficients to compare the relative strength of the predictors, just as you would
compare Beta coefficients. The difference is that bStdX coefficients are interpreted as changes in the
units of the outcome variable instead of in standardized units of the outcome variable. For example, the
bStdX for meals versus ell is -94 versus -21, or about 4.4 times as large, the same ratio as the ratio of the Beta
coefficients. We have created an annotated output (/stata/webbooks/reg/chapter1/for-listcoef-
So far, we have concerned ourselves with testing a single variable at a time, for example looking at the
coefficient for ell and determining if it is significant. We can also test sets of variables, using the test
command, to see if the set of variables is significant. First, let’s start by testing a single variable, ell,
using the test command.
test ell==0
( 1) ell = 0.0
F( 1, 385) = 16.67
Prob > F = 0.0001
If you compare this output with the output from the last regression you can see that the result of the
F-test, 16.67, is the same as the square of the result of the t-test in the regression ((-4.083)^2 = 16.67).
Note that you could get the same results if you typed the following, since Stata defaults to comparing
the term(s) listed to 0.
test ell
( 1) ell = 0.0
F( 1, 385) = 16.67
Prob > F = 0.0001
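The relationship between the t-test and the single-variable F-test can be confirmed from the stored results. This is a minimal sketch, assuming the multiple regression is still the active estimation result. _b[] and _se[] hold the coefficient and its standard error, and test saves its statistic and p-value in r(F) and r(p).

```stata
* t statistic for ell, rebuilt from the coefficient and its standard error
display "t for ell   = " %7.3f _b[ell]/_se[ell]
display "t squared   = " %7.2f (_b[ell]/_se[ell])^2

* The same quantity as reported by test
quietly test ell
display "F from test = " %7.2f r(F)
display "p from test = " %7.4f r(p)
```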
Perhaps a more interesting test would be to see if the contribution of class size is significant. Since the
information regarding class size is contained in two variables, acs_k3 and acs_46, we include both of
these with the test command.
test acs_k3 acs_46
( 1) acs_k3 = 0.0
( 2) acs_46 = 0.0
F( 2, 385) = 3.95
Prob > F = 0.0200
The significant F-test, 3.95, means that the collective contribution of these two variables is significant.
One way to think of this is that there is a significant difference between a model with acs_k3 and
acs_46 as compared to a model without them, i.e., there is a significant difference between the “full”
model and the “reduced” model.
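The full versus reduced comparison can be made concrete by fitting both models and building the F statistic from their residual sums of squares. The sketch below is one way to do this, not the only one. The names used_full, rss_full, rss_red and df_full are hypothetical, and the reduced model is restricted to the observations used by the full model so that the two fits are comparable.

```stata
* Full model: keep its residual sum of squares, residual df, and estimation sample
quietly regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
scalar rss_full = e(rss)
scalar df_full  = e(df_r)
generate byte used_full = e(sample)

* Reduced model: drop acs_k3 and acs_46, keep the same observations
quietly regress api00 ell meals yr_rnd mobility full emer enroll if used_full
scalar rss_red = e(rss)

* F = [(RSS_reduced - RSS_full)/q] / [RSS_full/df_full], with q = 2 restrictions
display "F(2, " df_full ") = " %6.2f ((rss_red - rss_full)/2)/(rss_full/df_full)
```

The displayed value should agree with the F reported by the test command. Note that after running this, the reduced model is the active estimation result, so re-run the full regression before using further postestimation commands.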
Finally, as part of doing a multiple regression analysis you might be interested in seeing the
correlations among the variables in the regression model. You can do this with the correlate command
as shown below.
correlate api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
(obs=395)
If we look at the correlations with api00, we see meals and ell have the two strongest correlations with
api00. These correlations are negative, meaning that as the value of one variable goes down, the value
of the other variable tends to go up. Knowing that these variables are strongly associated with api00,
we might predict that they would be statistically significant predictor variables in the regression model.
We can also use the pwcorr command to do pairwise correlations. The most important difference
between correlate and pwcorr is the way in which missing data are handled. With correlate, an
observation or case is dropped if any variable has a missing value; in other words, correlate uses
listwise, also called casewise, deletion. pwcorr uses pairwise deletion, meaning that an observation is
dropped only if there is a missing value for the pair of variables being correlated. Two options that you
can use with pwcorr, but not with correlate, are the sig option, which gives the significance levels for
the correlations, and the obs option, which gives the number of observations used in each correlation.
The obs option is not necessary with correlate, as Stata lists the number of observations at the top of
the output.
pwcorr api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, obs sig
             |    api00      ell    meals   yr_rnd mobility   acs_k3   acs_46
-------------+---------------------------------------------------------------
       api00 |   1.0000
             |
             |      400
             |
         ell |  -0.7676   1.0000
             |   0.0000
             |      400      400
             |
       meals |  -0.9007   0.7724   1.0000
             |   0.0000   0.0000
             |      400      400      400
             |
      yr_rnd |  -0.4754   0.4979   0.4185   1.0000
             |   0.0000   0.0000   0.0000
             |      400      400      400      400
             |
    mobility |  -0.2064  -0.0205   0.2166   0.0348   1.0000
             |   0.0000   0.6837   0.0000   0.4883
             |      399      399      399      399      399
             |
      acs_k3 |   0.1710  -0.0557  -0.1880   0.0227   0.0401   1.0000
             |   0.0006   0.2680   0.0002   0.6517   0.4245
             |      398      398      398      398      398      398
             |
      acs_46 |   0.2329  -0.1733  -0.2131  -0.0421   0.1277   0.2708   1.0000
             |   0.0000   0.0005   0.0000   0.4032   0.0110   0.0000
             |      397      397      397      397      396      395      397
             |
        full |   0.5744  -0.4848  -0.5276  -0.3977   0.0252   0.1606   0.1177
             |   0.0000   0.0000   0.0000   0.0000   0.6156   0.0013   0.0190
             |      400      400      400      400      399      398      397
             |
        emer |  -0.5827   0.4722   0.5330   0.4347   0.0596  -0.1103  -0.1245
             |   0.0000   0.0000   0.0000   0.0000   0.2348   0.0277   0.0131
             |      400      400      400      400      399      398      397
             |
      enroll |  -0.3182   0.4030   0.2410   0.5918   0.1050   0.1089   0.0283
             |   0.0000   0.0000   0.0000   0.0000   0.0360   0.0298   0.5741
             |      400      400      400      400      399      398      397
             |
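The difference between listwise and pairwise deletion shows up in the observation counts, and you can reproduce those counts yourself. The following is a small sketch; n_missing is a hypothetical variable name, and missing() and the egen function rowmiss() are used to find incomplete cases.

```stata
* Complete cases across all ten variables: the sample correlate works with
count if !missing(api00, ell, meals, yr_rnd, mobility, acs_k3, acs_46, full, emer, enroll)

* Cases available for one particular pair: what pwcorr uses for that single cell
count if !missing(mobility, acs_46)

* How many of the ten variables are missing for each school
egen n_missing = rowmiss(api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll)
tabulate n_missing
```

The first count should match the (obs=395) reported by correlate, and the second should match the number of observations pwcorr lists for the mobility and acs_46 pair.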
1.5 Transforming variables
Earlier we focused on screening your data for potential errors. In the next chapter, we will focus on
regression diagnostics to verify whether your data meet the assumptions of linear regression. Here, we
will focus on the issue of normality. Some researchers believe that linear regression requires that the
outcome (dependent) and predictor variables be normally distributed. We need to clarify this issue. In
actuality, it is the residuals that need to be normally distributed. In fact, the residuals need to be normal
only for the t-tests to be valid. The estimation of the regression coefficients does not require normally
distributed residuals. As we are interested in having valid t-tests, we will investigate issues concerning
normality.
histogram enroll
We can use the normal option to superimpose a normal curve on this graph and the bin(20) option to
use 20 bins. The distribution looks skewed to the right.
histogram enroll, normal bin(20)
You may also want to modify labels of the axes. For example, we use the xlabel() option for labeling the
x-axis below, labeling it from 0 to 1600 incrementing by 100.
histogram enroll, normal bin(20) xlabel(0(100)1600)
Histograms are sensitive to the number of bins or columns that are used in the display. An alternative to
histograms is the kernel density plot, which approximates the probability density of the variable. Kernel
density plots have the advantage of being smooth and of being independent of the choice of origin,
unlike histograms. Stata implements kernel density plots with the kdensity command.
kdensity enroll
Not surprisingly, the kdensity plot also indicates that the variable enroll does not look normal. Now
let’s make a boxplot for enroll, using the graph box command.
graph box enroll
Note the dots at the top of the boxplot which indicate possible outliers, that is, these data points are
more than 1.5*(interquartile range) above the 75th percentile. This boxplot also confirms that enroll is
skewed to the right.
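The 1.5*(interquartile range) rule described above can be applied numerically to see how many schools fall beyond the upper cutoff. This is a minimal sketch; the scalar names iqr_enroll and upper_fence are hypothetical, and the quartiles come from the r(p25) and r(p75) results saved by summarize with the detail option.

```stata
* Quartiles of enroll
quietly summarize enroll, detail
scalar iqr_enroll  = r(p75) - r(p25)
scalar upper_fence = r(p75) + 1.5*iqr_enroll
display "Upper fence = " upper_fence

* Number of schools plotted as possible outliers at the top of the boxplot
count if enroll > upper_fence & !missing(enroll)
```

The !missing() condition matters because Stata treats missing values as larger than any number, so they would otherwise be counted as exceeding the cutoff.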
There are three other types of graphs that are often used to examine the distribution of variables:
symmetry plots, normal quantile plots and normal probability plots.
A symmetry plot graphs the distance above the median for the i-th value against the distance below the
median for the i-th value. A variable that is symmetric would have points that lie on the diagonal line.
As we would expect, this distribution is not symmetric.
symplot enroll
A normal quantile plot graphs the quantiles of a variable against the quantiles of a normal (Gaussian)
distribution. qnorm is sensitive to non-normality near the tails, and indeed we see considerable
deviations from normal, the diagonal line, in the tails. This plot is typical of variables that are strongly
skewed to the right.
qnorm enroll
Finally, the normal probability plot is also useful for examining the distribution of variables. pnorm is
sensitive to deviations from normality nearer to the center of the distribution. Again, we see indications
of non-normality in enroll.
pnorm enroll
Having concluded that enroll is not normally distributed, how should we address this problem? First,
we may try entering the variable as-is into the regression, but if we see problems, which we likely
would, then we may try to transform enroll to make it more normally distributed. Potential
transformations include taking the log, the square root or raising the variable to a power. Selecting the
appropriate transformation is somewhat of an art. Stata includes the ladder and gladder commands to
help in the process. The ladder command reports numeric results and gladder produces a graphic
display. Let’s start with ladder and look for the transformation with the smallest chi-square.
ladder enroll
The log transform has the smallest chi-square. Let’s verify these results graphically using gladder.
gladder enroll
This also indicates that the log transformation would help to make enroll more normally distributed.
Let’s use the generate command with the log function to create the variable lenroll, which will be the
log of enroll. Note that log in Stata will give you the natural log, not log base 10. To get log base 10, type
log10(var).
generate lenroll = log(enroll)
Now let’s graph our new variable and see if we have normalized it.
We can see that lenroll looks quite normal. We would then use the symplot, qnorm and pnorm
commands to help us assess whether lenroll seems normal, as well as seeing how lenroll impacts the
residuals, which is really the important consideration.
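The checks mentioned above might look like the following. This is a sketch under some assumptions: we assume a histogram with a normal overlay is a reasonable graph for the new variable, the residual variable name e_log is hypothetical, and the regression simply swaps lenroll in for enroll in the earlier model.

```stata
* Create lenroll if it does not already exist (capture ignores the error if it does)
capture generate lenroll = log(enroll)

* Distribution of the transformed variable
histogram lenroll, normal
symplot lenroll
qnorm lenroll
pnorm lenroll

* Residuals from the model that uses lenroll in place of enroll
quietly regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer lenroll
predict double e_log, residuals
qnorm e_log
```

Since the t-tests depend on the residuals rather than on the predictors themselves, the last plot is the one that bears most directly on the normality question raised at the start of this section.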
1.6 Summary
In this chapter we have discussed the basics of how to perform simple and multiple regressions, the
basics of interpreting output, as well as some related commands. We examined some tools and
techniques for screening for bad data and the consequences such data can have on your results.
Finally, we touched on the assumptions of linear regression and illustrated how you can check the
normality of your variables and how you can transform your variables to achieve normality. The next
chapter will pick up where this chapter has left off, going into a more thorough discussion of the
assumptions of linear regression and how you can use Stata to assess these assumptions for your data.
In particular, the next chapter will address the following issues.
1.7 Self assessment
1. Make five graphs of api99: histogram, kdensity plot, boxplot, symmetry plot and normal
quantile plot.
2. What is the correlation between api99 and meals?
3. Regress api99 on meals. What does the output tell you?
4. Create and list the fitted (predicted) values.
5. Graph meals and api99 with and without the regression line.
6. Look at the correlations among the variables api99 meals ell avg_ed using the corr and
pwcorr commands. Explain how these commands are different. Make a scatterplot matrix for
these variables and relate the correlation results to the scatterplot matrix.
7. Perform a regression predicting api99 from meals and ell. Interpret the output.
1.8 For more information
Stata Manuals
[R] regress
[R] predict
[R] test
Related Web Pages
Stata FAQ- How can I do a scatterplot with regression line in Stata? (/stata/faq/how-can-
i-do-a-scatterplot-with-regression-line-in-stata/)
Annotated Stata Output- Regression (https://stats.idre.ucla.edu/stata/output/regression-
analysis/)
Stata Add On Programs
https://stats.idre.ucla.edu/stat/stata/ado (/stata/ado/)