1. Introduction to Stata
Overview of Stata interface
Reading: Complete parts 1.1.1–1.1.3 of Prof. Rodriguez’s tutorial.
Goals: By the end of this section, you should be comfortable with
- The basics of the Stata interface
- The following commands:
  - Basic math: `display`
  - Get help for a command: `help`
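As a quick illustration of the two commands above, the lines below can be typed into the Command window one at a time. The expressions are arbitrary examples chosen for this sketch and do not come from the tutorial.

```stata
* display evaluates an expression and prints the result
display 2 + 2
display (10 - 4) / 3
display 2^10
display sqrt(16)

* help opens the documentation for a command
help display
```

Lines beginning with `*` are comments, so Stata ignores them; they are included only to label each step.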
Do files and Log files
Reading: Complete all of part 1.2 of Prof. Rodriguez’s tutorial.
Goals: By the end of this section, you should be comfortable with
- The concept of a .do file and its purpose
- The concept of a .log file and its purpose
- How to add comments to a .do file, and why this is important
- Structure of Stata commands: `command varlist, options`
Anatomy of a Stata command:
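A minimal sketch of how these pieces fit together in a single .do file is shown here. The file name `example.log` and the choice of the bundled `auto` dataset are assumptions made for illustration; they are not part of the tutorial.

```stata
* example.do: a minimal do-file sketch (file names are hypothetical)

capture log close                        // close any log left open by an earlier run
log using "example.log", replace text    // start recording commands and output

sysuse auto, clear                       // load a dataset bundled with Stata

* Anatomy of the next line:
*   command = summarize
*   varlist = price mpg
*   options = detail (everything after the comma)
summarize price mpg, detail

/* Comments explain why a step is done,
   which helps when the file is reread later */

log close                                // stop recording
```

The `//` and `/* */` comment styles work inside .do files only; a line starting with `*` also works when typed directly into the Command window.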
Questions (Part 1)
- Provide the correct Stata syntax to perform the following actions:
  - Calculate `(3*4)^2 - 18^4`
  - Get more information about the command `display`
2. Describing data
Descriptive Statistics: Part 1
Reading: Complete parts 1.1.4–1.1.5 of Prof. Rodriguez’s tutorial and the supplementary section on browsing (see below).
Goals: By the end of this section, you should be comfortable with
- Browsing your dataset or listing particular observations
- Getting basic summary statistics for your dataset
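The two goals above can be sketched with a few commands. This example assumes the bundled `lifeexp` dataset and its variables `country`, `lexp`, and `gnppc`; the cutoff of 60 is an arbitrary value picked for illustration.

```stata
* Load the bundled life-expectancy dataset
sysuse lifeexp, clear

* Show variable names, storage types, and labels
describe

* Observations, mean, standard deviation, min, and max
summarize lexp gnppc

* List only the observations that satisfy a condition
list country lexp if lexp < 60
```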
Browsing
Section 1.1.5 of Prof. Rodriguez’s tutorial describes how to use the `list` command to view observations that match certain criteria. Instead of the `list` command, we can also use `browse`. This is very similar to `list`, except that Stata will open a separate window that is reminiscent of an Excel spreadsheet. If we are looking at only a few observations, there is not much difference between these two commands. However, just entering `browse` without any other criteria will give you a view of the entire dataset – all observations, all variables. Use `browse` if you would like to get a view of the raw data that you are working with.
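To make the comparison concrete, here is a short sketch that runs the same selection through `list` and `browse`. It assumes the bundled `lifeexp` dataset and a Stata session with the graphical interface, since `browse` opens a window.

```stata
sysuse lifeexp, clear

* Print the matching observations in the Results window
list country gnppc if missing(gnppc)

* Show the same observations in the Data Browser window
browse country gnppc if missing(gnppc)

* With no variables or conditions, browse shows the whole dataset
browse
```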
Descriptive Statistics: Part 2
Reading: Supplementary tutorial (see below).
Goals: By the end of this section, you should be comfortable with
- Getting more advanced summary statistics such as medians and percentiles
- Conducting one-way and two-way tabulations
Summarize, detail
Now that you have some familiarity with the Stata command `summarize`, let’s see how we can further explore our data.
Repeat the `summarize` exercise from section 1.1.5 in the tutorial. This time, add `detail` as an option to your command:
summarize lexp gnppc, detail
You can now explore the variables in your dataset in much more…detail. The column of percentages shows the values for different percentiles in your dataset, with 50% as the median. We’ll learn much more about using medians once class starts, but in the meantime think about this question:
Why might we be interested in summarizing data using medians and percentiles rather than the information that we get from a simple `summarize` command?
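One way to look at these numbers side by side is to read them back from the results that `summarize` stores after it runs. The sketch below assumes the bundled `lifeexp` dataset; `r(mean)`, `r(p25)`, `r(p50)`, and `r(p75)` are the stored results left behind by `summarize, detail`.

```stata
sysuse lifeexp, clear

* The detail option adds percentiles to the usual summary
summarize gnppc, detail

* Stored results can be printed individually
display "mean            = " r(mean)
display "median (p50)    = " r(p50)
display "25th percentile = " r(p25)
display "75th percentile = " r(p75)
```

Typing `return list` straight after `summarize` shows every stored result that is available.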
Tab
`tab` is another command that’s extremely useful when exploring a dataset for the first time. Put simply, this command tabulates the unique values of a given variable, and shows how many observations have that particular value. It’s much easier to see the utility of this in practice, so enter the command:
sysuse auto
This will load a different Example Dataset that’s built into Stata, one with many more variables. This dataset describes 74 different car models which were available in 1978. One of these variables is whether or not the car was foreign made. To see how many of these cars are foreign made, enter the command:
tab foreign
You should see that 22 cars are foreign made – about 30% of our observations. This is called a one-way tabulation, as it shows the frequency counts for a single variable.
Suppose, though, that we would like to see how the repair record (a scale from 1–5, with 5 being the best) of the cars is split between foreign and domestic cars. For this we need a two-way tabulation, since we want to see the frequency counts of one variable, categorized by a second variable. Enter the command:
tab rep78 foreign
You can think of this command as saying “show me a tabulation for repair status, grouped by whether a car was made domestically or abroad.” Of course, this can get pretty unwieldy when we have lots of possible values (try `tab weight foreign` as an example), so this command is more useful when we have just a few categories.
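The tabulations above report counts. A few standard options of `tab` (short for `tabulate`) add percentages or show missing values; the sketch below uses the same `auto` dataset, and the choice of options is only a suggestion.

```stata
sysuse auto, clear

* One-way tabulation: frequency of each value of foreign
tab foreign

* Two-way tabulation with percentages within each column
* (the share of each repair score among domestic and among foreign cars)
tab rep78 foreign, column

* Same table with percentages within each row
tab rep78 foreign, row

* Include observations where rep78 is missing
tab rep78 foreign, missing

* The 22 foreign cars as a share of all 74 observations
count if foreign == 1
display r(N) / _N
```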
Questions (Part 2)[1]
Save your work as a .do file and keep a .log file for your records.
Using the MDG.dta file (remember to unzip the file once it’s downloaded), answer the following questions:
- How many variables are in this dataset and what are the variables? Give a detailed description, not just the abbreviated variable names.
- How many observations are in the dataset?
- For how many countries do we have data on the dollar a day poverty rate in 2000? For how many countries do we have the year 2000 literacy rate?
- Fill out the table below with the mean, median, min, max, and SD of GNI, population size, and poverty rate for the world:
Indicator (1) | Mean (2) | Median (3) | Min (4) | Max (5) | SD (6) |
---|---|---|---|---|---|
GNI | | | | | |
Population size | | | | | |
Poverty rate ($1) | | | | | |
3. Creating new variables
Generating and replacing variables
Reading: Complete all of part 2.4 of Prof. Rodriguez’s tutorial.
Goals: By the end of this section, you should be comfortable with
- Manipulating existing variables and generating new variables
- The following commands:
  - Generate a new variable from existing variables: `gen`
  - Replacing the values of a variable using operators: e.g., `replace var1 = 1 if var2 > 10`
  - Operators and expressions
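The commands in this list can be sketched on the bundled `auto` dataset. The new variable names (`price_k`, `pricey`, `good_rep`) and the cutoffs are hypothetical, chosen only for illustration.

```stata
sysuse auto, clear

* Arithmetic on an existing variable: price in thousands of dollars
gen price_k = price / 1000

* Indicator built in two steps: start at 0, then switch to 1 where the condition holds
gen byte pricey = 0
replace pricey = 1 if price > 10000

* A logical expression evaluates to 1 (true) or 0 (false).
* Stata treats missing values as larger than any number, so the
* if-condition keeps cars with a missing rep78 out of both groups.
gen byte good_rep = (rep78 >= 4) if !missing(rep78)

* Check the result, including the missing category
tab good_rep, missing
```

Common operators include `==` (equal), `!=` (not equal), `>`, `>=`, `<`, `<=`, `&` (and), `|` (or), and `!` (not). A single `=` assigns a value; a double `==` tests for equality.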
Questions (Part 3)[1]
Save your work as a .do file and keep a .log file for your records.
Using the MDG.dta file (remember to unzip the file once it’s downloaded), answer the following questions:
- `ratiofemalener` is the ratio of female to male primary enrollment. Generate a variable that is equal to 1 if this ratio is over 100%, and equal to 0 if it is below 100%. What proportion of countries have a ratio that is over 100%?
- `gni` is a country’s total GNI, while `gnipc` is its GNI per capita. Using only `gnipc` and `population`, generate a new variable for total GNI. Check your work by comparing your new variable to `gni` (they should be roughly equivalent).
4. Conclusion
Summative Questions (Part 4)
Using the MDG.dta file (remember to unzip the file once it’s downloaded), answer the following questions:
- What proportion of countries in the sample are located in the Latin America and Caribbean (LAC) region?
- What proportion of countries are classified as “low income”?
- What proportion of these “low income” countries are in the Sub-Saharan Africa (SSA) region?
- What are the mean and median of country GNI per capita in the dataset?
- Generate a variable that takes value 1 for “high income” countries and 0 otherwise (upper middle, lower middle, low income). What is the difference in the average GNI per capita of the two groups?
- What are the mean and median literacy rates for “high income” countries versus all others (use the variable that you created in the previous question)?
5. Survey
- Loading a dataset:
Most of the examples given in the tutorial use what are called Example Datasets – datasets that are bundled with the Stata program for the purposes of training and experimentation. The command to load one of these datasets is `sysuse`, as you’ll see below. However, the exercises at the end of each section will require you to load the MDG dataset, which is a separate file.
To load this dataset, you can either double-click on the file once it’s downloaded, or in the Stata interface click on “File > Open…” and select the dataset. If you get any errors, click on the Stata command window, type `clear`, and press enter. This will make sure that you’re starting with a fresh dataset, regardless of what you were doing before.
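The same steps can also be written as commands, which is convenient inside a .do file. This is a sketch under one assumption: that the unzipped `MDG.dta` sits in Stata's current working directory. If it is stored elsewhere, replace the file name with its full path.

```stata
* Start from an empty workspace
clear

* Bundled example dataset: sysuse finds it inside the Stata installation
sysuse auto
display _N

* Separate file on disk: use reads a .dta file by name or path.
* The clear option drops whatever data is already in memory.
* "MDG.dta" is assumed to be in the working directory (check with pwd).
capture confirm file "MDG.dta"
if _rc == 0 {
    use "MDG.dta", clear
}
else {
    display "MDG.dta not found here; cd to its folder or give the full path"
}
```

The `capture confirm file` check is optional; it only keeps the sketch from stopping with an error when the file is not where it is expected.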