stata notes
November 5, 2017
1 Data files
Variables within a data set are typically organized in columns, while rows
represent different observations of a given variable. An important feature of
data sets is their format. Our data sets will come either in the ASCII (text)
format or in STATA format. Both formats are compatible with STATA.
Data can either be stored in a separate file, which we will call DATA, or
typed in when using STATA in the interactive mode. Obviously, we won’t
be typing in long data sets each time we want to analyze them, so we will
prefer to store our data in a separate file. In STATA, text format data files
have the suffix .RAW, while STATA format data files bear the suffix
.DTA (text format data sets may bear another suffix, such as .TXT). So
assume that we have a data file, named either DATA.RAW or DATA.DTA.
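- If the data are in text format, the command is:
INFILE VAR1 VAR2 USING c:\path\DATA
where VAR1 and VAR2 (and possibly VAR3...) are names you will give
to the variables (columns) which make up DATA. You must specify the drive
(in this example c:) and the path to the directory where the DATA file is
stored (here path). In older versions of STATA the maximum length of a name
is 8 characters; current versions allow up to 32.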
- If the data are in STATA format, the command is:
USE c:\path\DATA
Unlike text files, STATA format data files already contain variable names,
so you should not re-specify these. You can create a STATA format data
file from a text format file by first loading the text format data using the
INFILE command, and then typing:
SAVE c:\path\DATA
This will create a file called DATA.DTA in your directory.
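The round trip between text format and STATA format can be sketched as follows. This is a minimal illustration rather than part of the course material: it assumes the example data set auto that ships with STATA, and the file name auto_subset is a hypothetical name chosen for the example. Commands are typed in lowercase, since STATA is case-sensitive.

```stata
* Sketch only: "auto" is STATA's bundled example data set and
* "auto_subset" is a hypothetical file name, not one from these notes.
sysuse auto, clear

* Write two numeric columns to a text file (auto_subset.raw).
outfile price mpg using auto_subset, replace

* Read the text file back. Text files carry no variable names,
* so the names are supplied in the command.
infile price mpg using auto_subset, clear

* Store the data in STATA format (auto_subset.dta), then reload it.
save auto_subset, replace
use auto_subset, clear

* The reloaded file already knows its variable names.
describe
```

Both files are written to the current working directory; put a full path in front of the file name to store them elsewhere.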
SORT VAR1 reorganizes the data in such a way that VAR1 will appear
in ascending order. For example, if we have a sample of individuals, we may
want to organize our data in ascending order of their income or education
levels. Never use this when dealing with time series, since sorting on
another variable destroys the chronological order of the observations.
TABULATE VAR1 VAR2 provides frequency tables; if VAR1 is an age
group and VAR2 is the education level, it will tell you how many individuals
of each age group have a given education level.
CORRELATE VAR1 VAR2 VAR3 provides the correlation matrix of the
listed variables.
GRAPH VAR1 VAR2 provides a scatter plot of the data with VAR2 on
the x-axis and VAR1 on the y-axis (in recent versions of STATA, the
equivalent command is SCATTER VAR1 VAR2).
GENERATE NEWVAR=VAR1+VAR2 generates a new variable called NEWVAR,
which is the sum of VAR1 and VAR2, and stores it in the session's
memory. Of course, you can create any combination of any number of
variables using +, -, *, / etc.
DROP VAR1 removes VAR1 from the session's memory.
REPLACE VAR1=VAR1+VAR2 replaces the values of VAR1 with the
sum of its old values plus VAR2. This is equivalent to (1) GENERATE
NEWVAR=VAR1+VAR2; (2) DROP VAR1; (3) RENAME NEWVAR VAR1.
You can combine these commands with the qualifier IF and with logical
operators such as & (and) and | (or). For example, you can use:
LIST VAR1 IF VAR1>100 & VAR2==1 which will display VAR1 whenever
the value of this variable is greater than 100 and the value of VAR2 is
equal to 1. Note that a single equal sign assigns values to a variable
(as in the GENERATE command) while two equal signs are needed when
testing whether a variable equals a given value (as in VAR2==1).
More commands are of course available; you can get a complete list by
typing HELP under STATA, or by consulting the user's manual. The best
way to learn STATA is through practice.
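As a way to practice, the commands described above can be tried on any data set. The sketch below assumes the example data set auto that ships with STATA; its variables (price, mpg, weight, foreign, rep78, make) stand in for VAR1, VAR2 and so on, and the name total is made up for the illustration.

```stata
* Sketch only: uses STATA's bundled "auto" data, not the course data.
sysuse auto, clear

* Put the observations in ascending order of price.
sort price

* Two-way frequency table: origin of the car by repair record.
tabulate foreign rep78

* Correlation matrix of three variables.
correlate price mpg weight

* Scatter plot with price on the y-axis and mpg on the x-axis.
scatter price mpg

* Create a new variable, overwrite an existing one, then drop one.
generate total = price + weight
replace price = price + weight
drop total

* A single = assigns (above); a double == tests equality (below).
list make price if price > 10000 & foreign == 1
list make mpg if mpg > 35 | mpg < 13
```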
4 Introduction to statistical analysis using STATA
Least squares regression is one of the essential statistical methods we will
be studying in the course. As discussed in the first lecture, this consists of
minimizing the sum of squared vertical distances between the scattered data
points and the line we are trying to fit through them. Suppose that we wish
to predict VAR1 using VAR2 and VAR3:
$$\mathrm{VAR1} = \beta_1 + \beta_2\,\mathrm{VAR2} + \beta_3\,\mathrm{VAR3} + \varepsilon$$
To do this, and to produce many of the useful statistics that go with it,
STATA has a very convenient command:
REGRESS VAR1 VAR2 VAR3
You will immediately obtain estimated values for $\beta_1$, $\beta_2$, and
$\beta_3$, as well as their standard errors, confidence intervals and other
useful statistics which have been or will be introduced in class.
To obtain fitted values or regression residuals from this regression, type:
PREDICT FITTED stores the fitted values from the regression in a data
column (variable) called FITTED, and keeps it in memory.
PREDICT RESID, RESIDUALS stores the residuals from the regression in a
data column (variable) called RESID, and keeps it in memory.
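The regression and prediction steps can be sketched on the example data set auto that ships with STATA. The choice of price, mpg and weight is an assumption made for illustration; in the notation above they play the roles of VAR1, VAR2 and VAR3.

```stata
* Sketch only: uses STATA's bundled "auto" data, not the course data.
sysuse auto, clear

* Regress price on mpg and weight (a constant is included by default).
regress price mpg weight

* Fitted values: predict with no option computes the linear prediction.
predict fitted

* Residuals: observed price minus the fitted value.
predict resid, residuals

* Quick check of the two new variables.
summarize fitted resid
list price fitted resid in 1/5
```

PREDICT always refers to the most recent regression, so it should be typed before any other estimation command is run.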
INFILE VAR1 VAR2 VAR3 USING c:\path\DATA
SUMMARIZE
GENERATE VAR4=VAR1-VAR2
REGRESS VAR1 VAR2 VAR4
PREDICT FITTED
LIST FITTED IN 1/20
Save this program under the name PROGRAM1.DO, then enter STATA
and run the program by going to the File menu and choosing DO. This will
perform all the commands of PROGRAM1 in the order you typed them in,
and provide you with the output on the output window.
Another convenient tool is to store the output of your work (regression
results, statistics, transformed data...) in an output file which can then be
printed. Sometimes, the output will be too long to fit on a single screen (as
in our example above), so it is convenient to store it in a text (ASCII)
file, which you can later view and print. To do this, you can start your
session or your program with:
LOG USING c:\path\OUTPUT
This will create an output file in text format, called OUTPUT.LOG,
stored in the directory you specified (here c:\path). Recent versions of
STATA write a .SMCL file by default; add the TEXT option to obtain a plain
text .LOG file. This file will contain all the results from your session, as
well as the commands you typed in the interactive mode. To access this file,
simply type EDIT OUTPUT.LOG at the DOS prompt, or use a word processor.
Note that the LOG command will not keep track of graphs (which you can
print directly from STATA using the PRINT GRAPH command). Note also:
LOG CLOSE stops logging a session and closes the text file containing the
log.
LOG OFF temporarily stops the logging session without closing the log
file.
LOG ON resumes logging on the open log file.
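A short session using the four LOG commands might look like the sketch below. The log name output and the example data set auto (bundled with STATA) are assumptions made for illustration; the TEXT and REPLACE options ask for a plain text file and allow an existing log of the same name to be overwritten.

```stata
* Sketch only: "output" is a hypothetical log name.
* Start a plain text log (output.log) in the working directory.
log using output, text replace

sysuse auto, clear
summarize price mpg

* Pause logging: the next command is not recorded.
log off
describe

* Resume logging on the same file.
log on
regress price mpg

* Stop logging and close the file.
log close
```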
One more thing! STATA will not let you exit with unsaved changes unless you
save your data first (or type EXIT, CLEAR to discard the changes):
SAVE c:\path\DATA will save the data in the session's memory (including
new variables, transformed variables...) in a .DTA file (STATA format), on
the c: drive (or any other drive you specify) in the path directory. You
cannot use an existing name for the data file unless you add the REPLACE
option, so it is a good idea to delete
useless data files from your subdirectory, and keep only the initial data
set and/or useful transformed data sets. It is also a good idea to delete old
output files, especially if the LOG command is written in your program (again,
STATA won’t overwrite an existing file unless REPLACE is specified).
6 Command appendix
Generate a log variable
– gen VARNAME=ln(var)
– regress x1 x2
– predict VARNAME, residuals
Find correlation
– correlate x1 x2
– regress y x1
– predict VARNAME, xb
– regress y x1
– gen VARNAME=_b[x1]
– Method 1: regress y x1 x2, level(90)
– Method 2:
regress y x1 x2
lincom x2, level(90)
Note: type help lincom in STATA to get more information.
Hypothesis testing
– regress y x1 x2
– test x1 = a
– Note: type help test in STATA to get more information.
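The appendix commands can be tried together on the example data set auto that ships with STATA. The sketch below is an illustration under that assumption; the variable names lnprice and b_mpg are made up for the example, and the tested value 0 stands in for a.

```stata
* Sketch only: uses STATA's bundled "auto" data, not the course data.
sysuse auto, clear

* Generate a log variable.
generate lnprice = ln(price)

* Correlation between two regressors.
correlate mpg weight

* Regression with 90% confidence intervals (Method 1).
regress lnprice mpg weight, level(90)

* Store an estimated coefficient in a variable.
generate b_mpg = _b[mpg]

* 90% confidence interval for one coefficient (Method 2).
lincom weight, level(90)

* Test the hypothesis that the coefficient on mpg equals 0.
test mpg = 0
```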