Last modified on 8 February 2014, at 23:56

Stata/Data Management

Read and import data

Usually, data are loaded into memory with the use command. The clear option discards the dataset currently in memory, including any unsaved changes, before the new one is loaded.

use "W:\Data\…\table.dta" , clear     

The cd command sets the working directory, so that datasets can then be loaded by file name alone.

cd "W:\Data\" 
use table, clear

Stata 9 users can read Stata 10 datasets using the user-written use10 command.

use10 table, clear

Some example datasets are stored in the Stata directory. They can be loaded into memory using the sysuse command.

. sysuse cancer, clear
. sysuse smoking, clear
. sysuse auto, clear
. sysuse jspmix, clear

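You can import a comma-separated values (CSV) file using insheet. The delim() option sets the separator, here a semicolon. In Stata 13 and later, import delimited supersedes insheet.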

insheet using "W:\Data\…\table.csv", delim(";")     

See also

  • 'webuse' for internet data
  • 'xmluse' for xml files
  • 'infile' for text files
  • 'input' for entering data from keyboard
  • 'infix' for fixed-format text files
  • 'fdause' for SAS xport data
  • If none of these commands works, you may use Stat/Transfer
  • FTRANS: module to batch convert file formats

Save and export data

  • save
save table, replace

If you use Stata 10, you can save the dataset in Stata 9 format using saveold.

saveold table, replace
  • outsheet : export to tab-delimited or CSV format.
outsheet using "W:\Data\…\table.csv", replace comma
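As a rough illustration of the export and import commands, the following sketch writes part of the auto example dataset to a CSV file and reads it back. The file name auto_subset.csv is a hypothetical name chosen for this example; it is written to the current working directory.

```stata
* Sketch: round trip between Stata and a CSV file.
* auto_subset.csv is a hypothetical file name, created in the working directory.
sysuse auto, clear
keep make price mpg

* Export with commas as separators; replace overwrites an existing file
outsheet using auto_subset.csv, comma replace

* Read the file back; clear drops the data currently in memory
insheet using auto_subset.csv, comma clear
describe
```

The first line of the exported file holds the variable names, which insheet uses to name the variables when it reads the file back.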

See also

  • outfile
  • xmlsave
  • fdasave

Append and merge

The standard Stata command for merging datasets is merge. However, the user-written command mmerge is often considered safer and produces more readable output. It can be installed with ssc install mmerge or located with findit mmerge.

  • dmerge
  • joinby : forms all pairwise combinations of observations within groups across the two datasets.
  • append : if you have two datasets with the same variables but different observations, you can stack them into one dataset using the append command.
use data_1, clear
append using data_2
br
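The difference between append (adding observations) and merge (adding variables) can be sketched with small made-up datasets. This is only an illustration: the variables id, x and y and their values are invented, and the merge 1:1 syntax assumes Stata 11 or later (earlier versions use a different merge syntax).

```stata
* Sketch with invented data; assumes Stata 11 or later for merge 1:1.

* First dataset: observations 1 and 2
clear
input id x
1 10
2 20
end
tempfile part1
save "`part1'"

* Second dataset: same variables, different observations
clear
input id x
3 30
4 40
end

* append stacks the observations of both datasets
append using "`part1'"
sort id
tempfile stacked
save "`stacked'"

* Third dataset: same key id, but a different variable y
clear
input id y
1 100
2 200
3 300
end

* merge adds the variable x to the observations matched on id
merge 1:1 id using "`stacked'"
tab _merge
```

The _merge variable created by merge records, for each observation, whether it was found in the dataset in memory, in the using dataset, or in both.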

Describe a dataset

  • des
  • des, s
  • codebook
  • codebook2
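A minimal sketch of the built-in commands from this list, applied to the auto example dataset (the user-written codebook2 is left out because it may not be installed):

```stata
* Sketch: inspect the auto example dataset.
sysuse auto, clear

* Full description: variable names, storage types and labels
des

* Short description: number of observations and variables only
des, s

* Detailed summary of a single variable, including missing values
codebook rep78
```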

Detect missing values

  • tabmiss
  • npresent
  • nmissing

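You can recode missing values to a numeric value using the mvencode command. In the example below, missing values in the listed variables are replaced by 0; the override option allows this even if 0 already occurs in those variables.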

mvencode exg ga dvg verts eco dr dvd fn reg mnr div, mv(0) override
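The commands listed above are user-written and may not be installed. As a sketch that relies only on built-in commands, missing values can be counted with count and the missing() function, then recoded with mvencode. The variable name rep78_filled is made up for this example.

```stata
* Sketch: count and recode missing values in the auto example dataset.
sysuse auto, clear

* Number of observations where rep78 is missing
count if missing(rep78)

* Work on a copy (rep78_filled is a hypothetical name) to keep the original
gen rep78_filled = rep78

* Replace missing values by 0; no override is needed here because
* 0 does not occur among the observed values of rep78
mvencode rep78_filled, mv(0)
count if missing(rep78_filled)
```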

Variables

Very often you have to convert a variable from string to numeric format. There are several ways to do it. If the string variable already contains numbers stored as text, use destring. Otherwise, use the encode command: encode creates a new numeric variable and uses the string values of the original variable as its value labels.
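The two cases can be sketched as follows. The data and the variable names (price_str, region, price, region_id) are invented for the illustration.

```stata
* Sketch with invented data: destring versus encode.
clear
input str5 price_str str6 region
"12.5" "north"
"7"    "south"
"30"   "north"
end

* price_str holds numbers stored as text: destring converts them
destring price_str, gen(price)

* region holds categories: encode assigns integer codes in alphabetical
* order and attaches the original strings as value labels
encode region, gen(region_id)

list
label list region_id
```

Using gen() in both commands keeps the original string variables, which makes it easy to check the conversion before dropping them.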


  • gen
  • egen
  • replace
  • recode
  • drop
  • keep
  • rename
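A short sketch that strings these commands together on the auto example dataset. All new variable names (price_k, mean_mpg, high_mpg, rep3, price_thousands) are hypothetical and chosen only for the illustration.

```stata
* Sketch: create, modify and select variables in the auto example dataset.
sysuse auto, clear

* gen creates a variable from an expression
gen price_k = price / 1000

* egen offers functions over observations, such as the mean
egen mean_mpg = mean(mpg)

* replace changes the values of an existing variable
gen high_mpg = 0
replace high_mpg = 1 if mpg > mean_mpg

* recode groups values; gen() stores the result in a new variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3)

* rename, drop and keep manage the list of variables
rename price_k price_thousands
drop mean_mpg
keep make price_thousands high_mpg rep3
```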

'vallist' gives the list of all categories of a categorical variable in Stata.

vallist codep

Dealing with labels

  • lab var
  • lab list
  • lab define
  • lab value
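As an illustration, the following sketch labels a variable and its values. The data, the variable sex and the label name sexlbl are invented for the example.

```stata
* Sketch with invented data: variable labels and value labels.
clear
input id sex
1 0
2 1
3 1
end

* Attach a descriptive label to the variable itself
lab var sex "Sex of respondent"

* Define a set of value labels, then attach it to the variable
lab define sexlbl 0 "male" 1 "female"
lab values sex sexlbl

* Show the definition and check the result in a table
lab list sexlbl
tab sex
```

A value label is defined once and can be attached to several variables that share the same coding.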


Expand

You can expand a dataset (i.e. replicate each observation a given number of times) using the expand command. This is useful for simulating panel data. In the first example, we draw 10 observations from a standard normal distribution and duplicate each one with expand 2, which leaves 20 observations.

clear
set obs 10
gen u = invnorm(uniform())
expand 2
sort u
br 

It is also possible to pass an integer variable as an argument to expand.

clear
set obs 10
gen u = uniform()
gen var = 1 + int(10 * uniform())
expand var
sort u 
br
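
The expandcl command duplicates whole clusters of observations and stores an identifier for the resulting clusters in a new variable. It requires the cluster() option; in this example each observation is its own cluster, identified by the variable id.

clear
set obs 10
gen u = invnorm(uniform())
gen id = _n
expandcl 2, cluster(id) gen(cl)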
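To show how expand helps when simulating panel data, here is a minimal sketch that builds a panel of 5 individuals observed over 3 periods. The variable names (id, alpha, t, y) and the data-generating process are assumptions made for the illustration.

```stata
* Sketch: simulate a small panel (5 individuals, 3 periods) with expand.
clear
set obs 5
gen id = _n

* Individual effect, constant over time (drawn before expanding)
gen alpha = invnorm(uniform())

* Replace each individual by 3 copies, then number the copies as periods
expand 3
bysort id: gen t = _n

* Outcome: individual effect plus an idiosyncratic shock
gen y = alpha + invnorm(uniform())

* Declare the panel structure and inspect the result
tsset id t
list id t alpha y, sepby(id)
```

Variables created before expand are copied to every duplicate, which is why alpha is constant within each individual while y varies over time.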

Data storage types

All numeric types in Stata are ordinary signed quantities, except that the highest 27 values are reserved for the missing values (., .a, .b, ..., .z). The storage size of each type is as follows:

Storage type   Size (in bytes)
byte           1
int            2
long           4
float          4
double         8
string         1 per character (therefore only ASCII characters, not full Unicode/UTF-8)
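The following sketch illustrates a few consequences of these storage types. The variable names (b, n, d, f) are made up for the example.

```stata
* Sketch: storage types, missing values and compress.
clear
set obs 3

* A byte holds integers from -127 to 100; larger codes are reserved for missing values
gen byte b = 100

* A long is larger than needed for the values 1, 2, 3
gen long n = _n

* The same expression stored with different precision
gen double d = _n / 3
gen float f = _n / 3

* Missing values sort above every number, and . < .a < ... < .z
assert 1e300 < .
assert . < .a & .a < .z

* Count the observations where float and double storage disagree
count if d != f

* compress demotes variables to the smallest type that loses no information
compress
describe
```

After compress, n is stored as a byte, while d stays a double because demoting it would lose precision.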