0% found this document useful (0 votes)
1 views

stata notes

The document provides a comprehensive guide on using STATA for data analysis, covering data file formats, loading data, and basic commands for data manipulation and statistical analysis. It explains how to perform regression analysis, create programs for batch processing commands, and manage output files. Additionally, it includes a command appendix for generating variables, finding correlations, and conducting hypothesis testing.

Uploaded by

Vinci Chan
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views

stata notes

The document provides a comprehensive guide on using STATA for data analysis, covering data file formats, loading data, and basic commands for data manipulation and statistical analysis. It explains how to perform regression analysis, create programs for batch processing commands, and manage output files. Additionally, it includes a command appendix for generating variables, finding correlations, and conducting hypothesis testing.

Uploaded by

Vinci Chan
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

Basics of STATA

November 5, 2017

1 Data …les
Variables within a data set are typically organized in columns, while rows
represent di¤erent observations of a given variable. An important feature of
data sets is their format. Our data sets will come either in the ASCII (text)
format or in STATA format. Both formats are compatible with STATA.
Data can either be stored in a separate …le - which we will call DATA or
typed in when using STATA in the interactive mode. Obviously, we won’t
be typing in long data sets each time we want to analyze them, so we will
prefer to store our data in a separate …le. In STATA, text format data …les
have the su¢ x .RAW while STATA format data …les will bear the su¢ x,
.DTA (text format data sets may bear another su¢ x, such as .TXT). So
assume that we have a data …le, named either DATA.RAW or DATA.DTA.

2 Loading data into STATA


Launching STATA from the Windows menu allows you to enter the interac-
tive mode of the program, which means that you can type commands which
will be executed one by one. The prompt ’.’ (dot) indicates that you are
within STATA. The …rst thing you will have to do is to enter the data in
STATA’s memory:
- If the data are in text (ASCII) format, the command to perform this
is:
INFILE VAR1 VAR2 USING c:npathnDATA

1
where VAR1 and VAR2 (and possibly VAR3...) are names you will give
to the variables (columns) which make up DATA. You must specify drive
(in this example c:) and the path to the directory where the DATA …le is
stored (here path). The maximum length of a name is 8 characters.
- If the data are in STATA format, the command is:
USE c:npathnDATA
Unlike text …les, STATA format data …les already contain variable names,
so you should not re-specify these. You can create a STATA format data
…le from a text format …le by …rst loading the text format data using the
INFILE command, and then typing:
SAVE c:npathnDATA
This will create a …le called DATA.DTA in your directory.

3 Getting started with data analysis


We are now ready to start analyzing our data set. The …rst thing we might
want to do is to make sure we have loaded the right …le, and to get a rough
idea of its components. A convenient way to obtain this information is the
following command:
DESCRIBE
This gives you some information on the data which STATA has in its
memory (number of variables, number of observations, names of variables,
etc...).
We may also want to view the data directly on screen: LIST VAR1 VAR2
displays the variables named VAR1, VAR2.
LIST displays all the variables in STATA’s current memory.
LIST IN 1/10 displays the …rst 10 observations of all the variables in
the data set.
Going yet further, we may be interested in summary statistics about the
data:
SUMMARIZE provides statistics such as mean, standard deviation of the
variables in STATA’s memory.
Other potentially useful commands are the following:

2
SORT VAR1 reorganizes the data in such a way that VAR1 will appear
in ascending order. For example, if we have a sample of individuals, we may
want to organize our data in ascending order of their income or education
levels. Never use this when dealing with time series !
TABULATE VAR1 VAR2 provides frequency tables; if VAR1 is an age
group and VAR2 is the education level, it will tell you how many individuals
of each age group have a given education level.
CORRELATE VAR1 VAR2 VAR3 provides the autocorrelation table of the
listed variables.
GRAPH VAR1 VAR2 provides a scatter plot of the data with VAR2 on
the x-axis and VAR1 on the y-axis.
GENERATE NEWVAR=VAR1+VAR2 generates a new variable called NEW-
VAR, which is the sum of VAR1 and VAR2, and stores it in the sessions
memory. Of course, you can create any combination of any number of
variables using +, -, *, / etc...
DROP VAR1 removes VAR1 from the sessions memory.
REPLACE VAR1=VAR1+VAR2 replaces the values of VAR1 with the
sum of its old values plus VAR2. This is equivalent to (1) GENERATE
NEWVAR=VAR1+VAR2; (2) DROP VAR1; (3) RENAME NEWVAR VAR1.
You can combine these commands with logical quali…ers such as if, &
(and) and or. For example, you can use:
LIST VAR1 IF VAR1>100 & VAR2==1 which will display VAR1 when-
ever the value of this variable is greater than 100 and the value of VAR2 is
equal to 1. Note that a single equal sign corresponds to a variables name
(as in the GENERATE command) while two equal signs are needed when
dealing with a given value of a variable (as in VAR2==1).
More commands are of course available; you can get a complete list by
typing HELP under STATA, or by consulting the users manual. The best
way to learn STATA is through practice.

3
4 Introduction to statistical analysis using STATA
Least squares regression is one of the essential statistical methods we will
be studying in the course. A discussed in the …rst lecture, this consists of
minimizing the vertical distance between the scattered data points and the
line we are trying to …t through them. Suppose that we wish to predict
VAR1 using VAR2 and VAR3:
VAR1 = 1 + 2 VAR2 + 3 VAR3 +"
To do this, and to produce many of the useful statistics that go with it,
STATA has a very convenient command:
REGRESS VAR1 VAR2 VAR3
You will immediately obtain estimated values for 1; 2 ;and 3, as well as
their standard errors, con…dence intervals and other useful statistics which
have been or will be introduced in class.
To obtain …tted values or regression residuals from this regression, type:
PREDICT FITTED stores the …tted values from the regression in a data
column (variable) called FITTED, and keeps it in memory.
PREDICT RESID, RESIDUALS stores the residuals from the regression in a
data column (variable) called RESID, and keeps it in memory.

5 Writing programs and getting output …les


The interactive mode requires you to enter commands one at a time, and
to get results one at a time. This may be inconvenient if you have many
commands to run, or if you make mistakes which require you to type a
whole chain of commands over again. In this case, you may prefer to write
a program which contains all the commands in order of desired execution.
You can then correct, modify and run the program whenever need arises.
The easiest way to write programs is the following: Write the program
in the Windows editor (notepad or any text editor), one command on each
line, and save the program …le in a convenient directory. This …le must bear
the su¢ x .DO. For example, it can be called PROGRAM1.DO.
An example of a possible STATA program is the following:

4
INFILE VAR1 VAR2 VAR3 USING c:npathnDATA
SUMMARIZE
GENERATE VAR4=VAR1-VAR2
REGRESS VAR1 VAR2 VAR4
PREDICT FITTED
LIST FITTED IN 1/20
Save this program under the name PROGRAM1.DO, then enter STATA
and run the program by going to the …le menu and choosing DO. This will
perform all the commands of PROGRAM1 in the order you typed them in,
and provide you with the output on the output window.
Another convenient tool is to store the output of your work (regression
results, statistics, transformed data ...) in an output …le which can then be
printed. Sometimes, the output will be too long to …t on a single screen (as
in our example above), so it is convenient to store it in a text (ASCII)
…le, which you can later view and print. To do this, you can start your
session or your program with:
LOG USING c:npathnOUTPUT
This will create an output …le in text format, called OUTPUT.LOG,
stored in the subdirectory from which you invoked STATA. This …le will con-
tain all the results from your session, as well as the commands you typed in
the interactive mode. To access this …le, simply type EDIT OUTPUT.LOG
at the DOS prompt, or use a word processor. Note that the LOG command
will not keep track of graphs (which you can print directly from STATA us-
ing the PRINT GRAPH command). Note also: LOG CLOSE stops logging
a session and closes the text …le containing the log.
LOG OFF temporarily stops the logging session without closing the log
…le.
LOG ON resumes logging on the open log …le.
One more thing ! You must save your data before STATA lets you exit:
SAVE c:npathnDATA will save the data in the sessions memory (including
new variables, transformed variables...) in a .DTA …le (STATA format), on
the c: drive (or any other drive you specify) in the path directory. You
cannot use an existing name for the data …le, so it is a good idea to delete

5
useless data …les from your subdirectory, and keep only the initial data
set and/or useful transformed data sets. Its also a good idea to delete old
output …les, especially if the LOG command is written in your program (again,
STATA won’t overwrite).

6 Command appendix
Generate a log variable

– gen VARNAME=ln(var)

Generate a variable with constant value

– gen VARNAME = VALUE

Store residuals from regression

– regress x1 x2
– predict VARNAME, residuals

Find correlation

– correlate x1 x2

Estimate and save the predicted value

– regress y x1
– predict VARNAME, xb

Save the coe¢ cient

– regress y x1
– gen VARNAME=_b[x1]

Test signi…cance level

6
– Method 1: regress y x1 x2, level(90)
– Method 2:
regress y x1 x2
lincom x2;level (90)
Note: type help lincom in stata to get more information.

Hypothesis testing

– regress y x1 x2
– test x1 = a
– Note: type help test in stata to get more information.

You might also like