
STATA TUTORIAL PART ONE

Stata is powerful software that is easy to learn and widely used in research

Get started

Organizing do-files

The do-file and creating a log-file

Input and load data

Combining data

Reshaping the data

Useful commands to modify data

Label variables

Create summary statistics

Import data

GET STARTED

Open Stata from the Start menu under Programs.

Use help [commandname] to open the help file for a command:
. help regress

To search all sources that have to do with
regression in general, type
. findit regress

The Internet is also a good source of help, and you can search for code
written by someone else. For example, Stata does not have a
built-in command for calculating the Gini index. To look for user-written ones, type
. net search gini

To read about how to update Stata, type
. help update

ORGANIZING DO-FILES
For reproducibility of your results!

Write all the code in do-files:

Idea: the original data stay safe, and you can always go back to
the raw data if something goes wrong. New data and
new variables are created in do-files. Working interactively in
Stata is useful for exploring the data, but as soon as a command produces
something important you should put it in a do-file.

Always create a log in the do-file. Each do-file should have a log file
in which all results are saved as text.

Separate do-files that create data from do-files that analyze data:
crdata1.do
crdata2.do

andata1.do
andata2.do

etc.

STARTING A PROJECT / CREATE DOFILE


Create a separate directory for each project. Tell Stata
to change to that directory by typing (quotes are needed
because the path contains a space):

. cd "Z:/pathname/Econometrics 1"

A do-file can be created in several ways:

In the menu: Window > Do-file Editor
By clicking the Do-file Editor icon (a small pen and lines)
By typing doedit in the Command window

Save the do-file:

Under File in the Do-file Editor menu
With Ctrl+S

Make sure the do-file is saved where you want it.

THE DO-FILE AND CREATING A LOGFILE

Open and close the log file inside the do-file.

The do-file can have the following structure:

capture log close
log using crdata.log, replace
set more off
[
..body.
]
log close

THE DO-FILE AND CREATING A LOGFILE

log close closes any log file that is open.

capture is a powerful command in do-files: it lets the
do-file or program continue even if the command that follows it
produces an error that would otherwise stop execution. Here,
capture log close keeps the do-file running when no log file is open.
log using [filename] opens the log file.
replace overwrites [filename] if it already exists;
otherwise a new file is created.
set more off tells Stata not to pause when running the
program. The default, set more on, tells Stata to
wait until you press a key before continuing whenever a
--more-- message is displayed. This is mostly annoying but
sometimes useful.
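
To see what capture does in practice, the following minimal sketch runs one command that fails and one that succeeds, and displays the return code _rc that capture leaves behind. The variables x and y are made up for the illustration and are not part of the tutorial data.

```stata
* Minimal sketch: capture suppresses the error and stores the return code in _rc
clear
set obs 1
generate x = 1

capture confirm variable y            // y does not exist, but execution continues
display "return code for y: " _rc     // nonzero, because the command failed

capture confirm variable x            // x exists
display "return code for x: " _rc     // 0, because the command succeeded
```

A do-file can test _rc after a captured command to decide what to do next, instead of stopping at the first error.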

INPUT AND LOAD DATA

Create a dataset manually (Command window):

. clear
. input id female wage2005 wage2006 wage2007
  1. 1 0 94 96 98
  2. 2 1 75 79 77
  3. 3 0 70 69 70
  4. end
. save mydata1
. clear
. input id school public
  1. 1 8 1
  2. 2 7 0
  3. 3 1 1
  4. 4 10 1
  5. end
. save mydata2

INPUT AND LOAD DATA


The last command saves your newly created
data as mydata2.dta.
You can also export the data to other formats, such as Excel (.xls):

File -> Export -> Data to Excel spreadsheet
. export excel using mydata2

If you want to work with this data, type
. use mydata2

This command loads a Stata file, that is, a file with
the extension .dta.

COMBINING DATA

append using filename [filename], [option]
appends new data (the using dataset) vertically to the data in
memory (the master dataset)
. append using mydata1.dta
. list

merge 1:1 [varlist] using [filename], [option]
combines new data (the using dataset) horizontally with the data
in memory (the master dataset), matching on the identifying variable(s)
specified in varlist. Data in the master dataset are never
replaced by the using dataset unless Stata is explicitly
asked to do this
. use mydata1, clear
. merge 1:1 id using mydata2.dta
. list

COMBINING DATA - HORIZONTALLY


+---------------------------------------------------------------------------------+
| id   female   wage2005   wage2006   wage2007   school   public           _merge |
|---------------------------------------------------------------------------------|
|  1        0         94         96         98        8        1      matched (3) |
|  2        1         75         79         77        7        0      matched (3) |
|  3        0         70         69         70        1        1      matched (3) |
|  4        .          .          .          .       10        1   using only (2) |
+---------------------------------------------------------------------------------+

Observations 1-3 contain information from both datasets (_merge==3)

Observation 4 contains information only from mydata2.dta (_merge==2)

Save the new dataset:


. save merged1, replace
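
The _merge variable is worth inspecting before moving on. The sketch below is a self-contained illustration with made-up data (the temporary file name people is hypothetical): it tabulates _merge, keeps only matched observations, and drops the indicator so that a later merge can create it again. Run it as a do-file so the temporary file name stays defined.

```stata
* Sketch with made-up data: inspect _merge after a 1:1 merge
clear
input id female
1 0
2 1
3 0
end
tempfile people
save `people'

clear
input id school
1 8
2 7
3 1
4 10
end

merge 1:1 id using `people'
tabulate _merge          // id 4 is found in the master data only (_merge==1)
keep if _merge == 3      // keep matched observations only
drop _merge              // remove the indicator before any further merge
list
```

According to help merge, the options keep(match) and nogenerate should achieve the same thing in a single step; the explicit version above is shown because it makes each decision visible.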

RESHAPING THE DATA

reshape converts data from wide to long format and
vice versa
. reshape long wage, i(id) j(year)
(note: j = 2005 2006 2007)

Data                               wide   ->   long
-----------------------------------------------------
Number of obs.                        4   ->     12
Number of variables                   7   ->      6
j variable (3 values)                     ->   year
xij variables:
        wage2005 wage2006 wage2007        ->   wage
-----------------------------------------------------

RESHAPING THE DATA


. list

     +---------------------------------------------+
     | id   year   female   wage   school   public |
     |---------------------------------------------|
  1. |  1   2005        0     94        8        1 |
  2. |  1   2006        0     96        8        1 |
  3. |  1   2007        0     98        8        1 |
  4. |  2   2005        1     75        7        0 |
  5. |  2   2006        1     79        7        0 |
     |---------------------------------------------|
  6. |  2   2007        1     77        7        0 |
  7. |  3   2005        0     70        1        1 |
  8. |  3   2006        0     69        1        1 |
  9. |  3   2007        0     70        1        1 |
 10. |  4   2005        .      .       10        1 |
     |---------------------------------------------|
 11. |  4   2006        .      .       10        1 |
 12. |  4   2007        .      .       10        1 |
     +---------------------------------------------+
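
Since reshape works in both directions, it may help to see a round trip. The following sketch uses a small made-up dataset with the same wage variables as above, converts it to long format and then back to wide.

```stata
* Sketch: reshape from wide to long and back again
clear
input id wage2005 wage2006 wage2007
1 94 96 98
2 75 79 77
3 70 69 70
end

reshape long wage, i(id) j(year)   // one row per id and year
list, sepby(id)

reshape wide wage, i(id) j(year)   // back to one row per id
list
```

The same i() and j() arguments are used in both directions; only the keyword long or wide changes.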

USEFUL COMMANDS TO MODIFY VARIABLES

count shows the number of observations in the dataset:
. save merged1.dta, replace
. count
You can also count the number of observations that fulfill a condition:
. count if female==1 & wage>=76

gen creates a new variable. For example, to create a squared wage
variable, a log transformation, a dummy variable, and a set of year dummies:
. gen wagesqr=wage^2
. gen lnwage=log(wage)
. gen femaledummy=1 if female==1 & female !=.
. replace femaledummy=0 if female==0 & female !=.
. tab year, gen(year_dum)
replace changes the contents of an existing variable
. replace wagesqr=wagesqr/1000
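
The two-step gen/replace pattern for a dummy variable can also be written in one line. The sketch below, on three made-up observations, relies on the fact that a logical expression in Stata evaluates to 1 or 0, and uses an if qualifier to keep missing values missing.

```stata
* Sketch: create a 0/1 dummy in one step while preserving missing values
clear
input female
0
1
.
end

generate femaledummy = (female == 1) if !missing(female)
list
```

Without the if qualifier, the observation with a missing value of female would be coded 0, which is usually not what is wanted.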

USEFUL COMMANDS TO MODIFY VARIABLES

rename [oldname] [newname] changes the name of an existing
variable
. rename school schyears

recode varlist (rule) (rule) [if] [in], gen(newvar) recodes
categorical variables
. recode female (1=0) (0=1), gen(male)
. recode schyears (1/6=0 primary) (7/10=1 secondary) if public==1, gen(secsch)

drop [varlist] deletes the variables in varlist
. drop secsch

keep [varlist] keeps the variables in varlist and drops all others
. keep id year female wage schyears public

LABEL VARIABLES

label var [varname] "label" attaches a descriptive label to a variable
. label var public "The school attended was public"
. describe public

label define [labelname] # "label" # "label" defines a set of value labels
. label define pp 0 "private" 1 "public"
pp is the name of the label definition for the values 0/1 of public.

label values [varname] [labelname] connects the new
definition pp with the values in variable public
. label values public pp
. list

CREATING SUMMARY STATISTICS

egen creates a new variable that holds a summary statistic of some variable,
computed over all observations or within groups
. egen avgwage=mean(wage)
. egen maxwage=max(wage), by(schyears)

collapse clist [if] [in] [weight] [, options] replaces the data in memory with a dataset
of summary statistics. The statistics in clist can be sum, min, max, median, sd etc.
The default is mean, but here we take the sum of all the wages by school
years, for observations where the school was public.
. save merged2.dta, replace
. collapse (sum) wage if public==1, by(schyears)
. list
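
Because collapse replaces the data in memory, the tutorial saves the dataset first. An alternative, sketched below on made-up data, is to wrap the collapse in preserve and restore, which brings the original data back afterwards. The variable names meanwage and totwage are hypothetical.

```stata
* Sketch: look at collapsed summary statistics without losing the original data
clear
input id wage schyears public
1 94 8 1
2 75 7 0
3 70 1 1
end

preserve                                   // take a snapshot of the data
collapse (mean) meanwage=wage (sum) totwage=wage, by(public)
list                                       // one row per value of public
restore                                    // bring the original data back

count                                      // the three original observations
```

This is convenient for a quick look; if the collapsed dataset is needed later, saving it under a new name is still the safer route.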

IMPORT DATA

Download the CPS92_08 data (Excel dataset) from:
http://wps.pearsoned.co.uk/ema_ge_stock_ie_3/193/49605/12699039.cw/index.html

If you do a Google search for cps92_08, it is on the first
page you find, under CPS data.
Open cps92_08.xlsx and save it as cps92_08.csv
instead:

File -> Export -> Change File Type -> CSV

IMPORT DATA

Work from the do-file. Write clear to get rid of any old
dataset before loading new data.
Use insheet using [filename], option to import a .csv file. If
the delimiter is a semicolon, add the option delimiter(";")
Save the result as a Stata file (.dta)
capture log close
log using mycps.log, replace text
set more off
clear
insheet using cps92_08.csv, delimiter(";")
li if _n<50
destring ahe, dpcomma replace
save mycps.dta, replace
log close

ABBREVIATIONS IN STATA

Commands and variable names can be abbreviated.

You abbreviate a variable name by using the shortest string
that uniquely identifies the variable
. list ah ba ag in 1/50

Commands work the same way. It may be hard to
know exactly what identifies a command since there
are lots of them. In the help file, the shortest allowed abbreviation of a
command is underlined.
Some common abbreviations:
gen   generate
li    list
des   describe
reg   regress

STATA TUTORIAL: PART TWO


Examine dataset
Organize your variables
Producing tables
Correlation
Regression analysis
Extracting regression results
Testing hypothesis
Graphing Data

EXAMINE DATASET

Open Stata and load the cps92_08 dataset, which we saved
as mycps.dta. Type desc for an overview of the dataset,
or inspect for a quick summary of each variable:
. cd Z:\...
. use mycps, clear
. desc
. inspect

Click the Break button in the toolbar if you want to tell
Stata to stop running, or press any key to continue when
a --more-- message is displayed. Type set more off
before running to turn off the pause message.

EXAMINE DATASET

You may also want to use the Data Editor in the Stata menu
to browse through your data in spreadsheet form:
Data -> Data Editor, or
. br

Use list [varlist] [if] [in], [options] to examine certain
variables or a particular range of observations in one or all
variables:
. list age bachelor
. list in 1/20
. list age bachelor in 1/20
. list age bachelor female if female==1 in 1/20

Tip: use help in Stata to find out all the options for each
command. For example, type help list

ORGANIZE YOUR VARIABLES

To organize your variables, generate an id number for each
observation.
. gen id=_n
. bysort female: gen id_f=_n
. list id_f female in 8720/8750

To give separate id numbers to males and females, use the bysort
prefix, since the data need to be sorted by group first.
bysort is a shorter way of writing sort followed by by:
. sort female
. by female: gen id_f2=_n
. list id_f id_f2 female in 8720/8750
. drop id_f2
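
Within a by group, _n is the running observation number and _N is the group size. The sketch below shows both on five made-up observations; putting age in parentheses after female sorts by age within each group without making age part of the grouping. The variable name n_f is hypothetical.

```stata
* Sketch: running number (_n) and group size (_N) within groups
clear
input female age
1 30
0 25
1 28
0 33
0 31
end

bysort female (age): generate id_f = _n   // 1, 2, ... within each group, ordered by age
bysort female: generate n_f = _N          // number of observations in the group
list, sepby(female)
```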

PRODUCING TABLES - TABULATE

tab produces one-way and two-way tables of frequency counts
tab1 produces a one-way table for each variable in a list
tab [varname] [if] [in], options
tab1 [varlist] [if] [in], options

. tab bachelor
frequency table
. tab1 bachelor female
frequency table for several variables
. tab bachelor female
cross-tabulation
. tab bachelor female, column row
column and/or row percentages
. tab bachelor female, column nofreq
hides frequencies and shows only percentages
. tab bachelor female if ahe>15, column nofreq
uses a condition

PRODUCING TABLES - TABLE

table calculates and displays tables of summary statistics
table [rowvar] [colvar] [if] [in], options

. table female age
frequency table
. table female age, c(mean ahe)
the cell content is the mean of ahe instead of frequencies
. table female age, c(mean ahe) center format(%9.2f)
tells Stata to center the output and show two decimals
. table female age, by(bachelor) c(mean ahe) center format(%9.2f)
can be used with the by option

TABLES - LABELLING VARIABLES

OK, let's make it easier to tell which one is bachelor and
which one is female. It is a good idea to label the variable values.

. label define mf 0 "male" 1 "female"
. label values female mf
. label define bach 0 "high school" 1 "bachelor"
. label values bachelor bach
. table female bachelor, by(year) c(mean ahe) center format(%9.2f)

PRODUCING TABLES - TABSTAT AND SUM

tabstat is another way to display a table of summary
statistics
tabstat [varlist] [if] [in], options
. tabstat ahe age, stat(mean var sd min max N)
lists summary statistics such as the mean, variance, sd, min and max
. tabstat ahe age, by(female) stat(mean sd)
uses the by option to produce summary statistics for men and women
in a single table

summarize also produces summary statistics
summarize [varlist] [if] [in], options
. sum ahe age
. bysort female: sum ahe age

CORRELATION

To get the correlation between two or more variables, use
correlate [varlist] [if] [in], options
. corr ahe age bachelor
the correlation between wage, age and education
. corr ahe female

REGRESSION ANALYSIS

The effect of age on hourly wage rates.
Use an OLS regression with dependent variable ahe and
independent variable age
. reg ahe age

What does this result tell us about the effect of age on the hourly wage
rate?

Don't forget to check:
the t-value: is the coefficient statistically significant?
the p-value: at what significance level?
the R-squared: how well does the model explain the variation in the
dependent variable?

REGRESSION ANALYSIS

Two independent variables
. reg ahe age female

What does the coefficient for female tell us?

Use the by prefix and/or the if qualifier to run separate
analyses for groups of observations.
. bysort year: reg ahe age female
Runs a separate regression for each year in a single command
. reg ahe age female if year==2008
Runs a regression on the 2008 observations only.

REGRESSION ANALYSIS

Heteroskedasticity? The assumption of homoskedastic error
terms, i.e. error terms with constant variance, is rarely satisfied
in the real world. To obtain heteroskedasticity-robust standard errors, add the
option robust after the model specification.
. reg ahe age female, robust

robust can be abbreviated to r


. reg ahe age female, r
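
The robust option changes only the standard errors, not the coefficients. The sketch below illustrates this with the auto dataset that ships with Stata (not the CPS data used in this tutorial) and the longer spelling vce(robust) of the same option.

```stata
* Sketch: conventional versus heteroskedasticity-robust standard errors
sysuse auto, clear

regress price mpg                 // conventional standard errors
regress price mpg, vce(robust)    // same coefficients, robust standard errors

display "`e(vce)'"                // type of variance estimator used by the last model
```

Comparing the two outputs side by side shows identical coefficient estimates with different standard errors, t-values and confidence intervals.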

EXTRACTING REGRESSION RESULTS

ereturn list shows the results stored by the most recent regression
. ereturn list
. matrix list e(b)
<- shows the regression coefficients

est store [name] saves your last regression under a name,
here model1
. est store model1

est table [namelist] displays a table of estimation results
. est table
. est table, b se t stats(N r2 F)
If you want to see more statistics, add the desired
statistics inside stats()

EXTRACTING REGRESSION RESULTS

Example: Let's run another regression and
store it as model2
. reg ahe age female bachelor, robust
. est store model2
. est table model1 model2, b se stats(N r2 F)
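
Individual stored results can also be used directly in expressions. The sketch below uses the auto dataset that ships with Stata (not the CPS data) to pull out a coefficient, its standard error and the R-squared, and to recompute the t-value by hand.

```stata
* Sketch: use stored regression results in expressions
sysuse auto, clear
regress price mpg

display "coefficient on mpg:    " _b[mpg]
display "standard error of mpg: " _se[mpg]
display "R-squared:             " e(r2)
display "t-value by hand:       " _b[mpg] / _se[mpg]
```

The hand-computed t-value should agree with the one shown in the regression table, which is a quick way to check that the right results are being picked up.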

PREDICTIONS

predict [newvar], option computes the predicted
(fitted) value or the residual for each observation after
estimation.
. predict yhat
Calculates predicted values of the hourly wage from our regression and stores them as yhat
. predict uhat, resid
Calculates the residuals and stores them as uhat
. des yhat uhat
Check your new variables
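
Fitted values and residuals are linked by the identity residual = observed value - fitted value. The sketch below checks this on the auto dataset that ships with Stata (not the CPS data); the variable name check is hypothetical.

```stata
* Sketch: residuals equal observed minus fitted values
sysuse auto, clear
regress price mpg

predict yhat            // fitted values (the default)
predict uhat, resid     // residuals

generate check = price - yhat - uhat
summarize check         // should be zero up to rounding error
```

Small nonzero values in check are expected, because predict stores its results with limited (float) precision.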

TESTING HYPOTHESIS (T-TEST)


t-test (mean-comparison test). To test the equality of means,
we use:
ttest [varname] == # [if] [in], level(#)
One-sample mean-comparison test

. ttest ahe==15
Tests whether the mean of the specified variable (ahe) is equal to a
hypothesized value (15)
. ttest ahe==15, level(99)
The confidence level is 95% by default; here it is
set to 99%

Two-group mean-comparison test


. ttest ahe, by(female)
tests if men and women on average earn the same wage
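
Like regress, ttest leaves its results behind for later use, in this case in r(). The sketch below uses the auto dataset that ships with Stata (not the CPS data) and compares mean prices of domestic and foreign cars.

```stata
* Sketch: retrieve results stored by ttest
sysuse auto, clear
ttest price, by(foreign)

return list                                  // everything ttest stored in r()
display "t statistic:       " r(t)
display "two-sided p-value: " r(p)
display "group sizes:       " r(N_1) " and " r(N_2)
```

Stored results such as r(p) can be saved in a scalar or written to a log, which avoids copying numbers from the output by hand.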

GRAPHING DATA

hist [varname] [if] [in], option produces histograms
. hist ahe
. hist uhat
. hist uhat, normal
Superimposes a normal curve

graph twoway scatter [varlist] [if] [in], option
or, for short,
twoway scatter [varlist] [if] [in], option
produces a scatter plot of two or more variables
. twoway scatter ahe age
. twoway scatter ahe age, ti("Hourly wage vs Age")
Adds a title to your graph

GRAPHING DATA

There are many options for graphing. Type help twoway to
explore them. For example,
. twoway scatter ahe age, ms(o) mc(red)
changes the marker symbol to o and the marker color to red
. twoway lfit ahe age
draws a fitted regression line, which shows any linear relationship more clearly
. twoway lfit ahe age, lc(blue)
changes the color of the fitted line to blue
. twoway (lfit ahe age) (scatter ahe age), ti("Hourly wage vs Age")
draws the fitted line and the scatter plot in the same graph
. twoway (lfit ahe age, lw(0.5) lc(blue)) (scatter ahe age, ms(o) mc(red)), ti("Hourly wage vs Age") xline(30, lw(1) lc(black))
adds a vertical line at age 30 with line width 1 and line color black

GRAPHING DATA

graph box [yvarlist] [if] [in], options creates
box plots. This command draws vertical boxes
. graph box ahe, over(female) over(year)

graph hbox [yvarlist] [if] [in], options creates
horizontal boxes
. graph hbox ahe, over(female) over(bachelor)

graph export exports the graph, here as a PostScript file
. graph export mygraph.ps

Copy the graph directly to MS Word by right-clicking it and choosing
Copy.

SOME EXTRA TIPS - LOOPS


Loops are very useful in some circumstances and
come in two flavours:

foreach, which loops over the items in a list:
. foreach var in age bachelor female {
.     reg ahe `var', r
. }

forvalues, which loops over numbers:
. tab age, gen(age_dum)
. forvalues i = 1/10 {
.     qui sum ahe if age_dum`i' == 1
.     scalar a = r(mean)
.     di "Average hourly earnings is " a " in age group `i'"
. }
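
When the loop should run over the actual values of a variable rather than over dummy numbers, levelsof can collect those values into a local macro first. The sketch below uses the auto dataset that ships with Stata (not the CPS data); the macro name levels is hypothetical.

```stata
* Sketch: loop over the distinct values of a variable
sysuse auto, clear
levelsof rep78, local(levels)      // stores the distinct nonmissing values of rep78

foreach l of local levels {
    quietly summarize price if rep78 == `l'
    display "rep78 = `l': mean price = " %9.2f r(mean)
}
```

With this pattern the displayed value is the value of the variable itself, and no dummy variables need to be generated beforehand.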

SOME EXTRA TIPS - ESTOUT


estout is a package for producing nice output tables. It is not
a standard feature of Stata, so it may have to be
installed:
. ssc install estout

It has many features; explore the help file to get to
know it! Here is an example:
. reg ahe age, r
. eststo mod1
. reg ahe age bachelor, r
. eststo mod2
. reg ahe age bachelor female, r
. eststo mod3
. esttab mod1 mod2 mod3 using table1, rtf replace
