API-209 Stata Tutorial

1. Introduction to Stata

Overview of Stata interface

Reading: Complete parts 1.1.1–1.1.3 of Prof. Rodriguez’s tutorial.

Goals: By the end of this section, you should be comfortable with

• The basics of the Stata interface
• The following commands:
    • Basic math: display
    • Get help for a command: help
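As a quick illustration of these two commands, here is a minimal sketch you can type into the Stata command window one line at a time. The numbers are arbitrary examples chosen for illustration, not taken from the tutorial.

```stata
* display works like a calculator: it evaluates an expression and prints the result
display 2 + 2
display (1 + 2)^2 / 3
display sqrt(16)

* help opens the documentation for any command, including display itself
help summarize
```

Lines that begin with an asterisk are comments, so Stata ignores them.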

Do files and Log files

Reading: Complete all of part 1.2 of Prof. Rodriguez’s tutorial.

Goals: By the end of this section, you should be comfortable with

• The concept of a .do file and its purpose
• The concept of a .log file and its purpose
• Structure of Stata commands: command varlist, options
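To make these ideas concrete, here is a sketch of what a very short .do file might contain. The log file name tutorial.log is a hypothetical choice for illustration, and the example uses the auto dataset that ships with Stata.

```stata
* Close any log that is still open from earlier work (no error if none is open)
capture log close

* Start recording commands and output in a plain-text log file
* ("tutorial.log" is a hypothetical name; pick your own)
log using "tutorial.log", replace text

* Load a built-in example dataset, discarding any data already in memory
sysuse auto, clear

* command varlist, options:
*   command = summarize, varlist = price mpg, options = detail
summarize price mpg, detail

* Stop recording
log close
```

If you saved these lines as, say, tutorial.do, you could rerun the whole analysis at once with do tutorial.do, and the log file would hold a record of the results.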

Anatomy of a Stata command: Questions (Part 1)

1. Provide the correct Stata syntax to perform the following actions:
• Calculate (3*4)^2 - 18^4
• Get more information about the command display

2. Describing data

Descriptive Statistics: Part 1

Reading: Complete parts 1.1.4–1.1.5 of Prof. Rodriguez’s tutorial and the supplementary section on browsing (see below).

Goals: By the end of this section, you should be comfortable with

• Browsing your dataset or listing particular observations
• Getting basic summary statistics for your dataset
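The commands below are a minimal sketch of a first look at a dataset. They assume the lifeexp example dataset bundled with Stata, which contains the lexp and gnppc variables used in this handout. The same commands work on any dataset once it is loaded.

```stata
* Load the bundled life expectancy dataset, discarding any data in memory
sysuse lifeexp, clear

* List every variable with its type and descriptive label
describe

* Summary statistics (observations, mean, sd, min, max) for all variables
summarize

* The same, restricted to two variables
summarize lexp gnppc

* Number of observations in the dataset
count

* Number of observations with a non-missing value of gnppc
count if !missing(gnppc)
```

Comparing the last two counts is a quick way to see how many observations lack data for a given variable.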

Browsing

Section 1.1.5 of Prof. Rodriguez’s tutorial describes how to use the list command to view observations that match certain criteria. Instead of the list command, we can also use browse. This is very similar to list, except that Stata will open a separate window that is reminiscent of an Excel spreadsheet. If we are looking at only a few observations, there is not much difference between these two commands. However, just entering browse without any other criteria will give you a view of the entire dataset – all observations, all variables. Use browse if you would like to get a view of the raw data that you are working with.
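Here is a short sketch comparing the two commands, again assuming the lifeexp example dataset. The cutoff of 60 years is an arbitrary value chosen for illustration.

```stata
sysuse lifeexp, clear

* Print matching observations in the Results window
list country lexp if lexp < 60

* Show the same observations in a spreadsheet-style window
browse country lexp if lexp < 60

* Print only the first five observations
list in 1/5

* Open the entire dataset: all observations, all variables
browse
```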

Descriptive Statistics: Part 2

Goals: By the end of this section, you should be comfortable with

• Getting more advanced summary statistics such as medians and percentiles
• Conducting one-way and two-way tabulations

Summarize, detail

Now that you have some familiarity with the Stata command summarize, let’s see how we can further explore our data.

Repeat the summarize exercise from section 1.1.5 in the tutorial. This time, add detail as an option to your command:

summarize lexp gnppc, detail


You can now explore the variables in your dataset in much more…detail. The column of percentages shows the values at different percentiles of each variable, with 50% as the median. We’ll learn much more about using medians once class starts, but in the meantime, think about this question:

Why might we be interested in summarizing data using medians and percentiles rather than only the information that we get from a simple summarize command?
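If you would like to put the mean and the median side by side while you think about this, the sketch below is one way to do it. It goes slightly beyond the tutorial reading: after summarize with the detail option, Stata keeps the results in memory under names such as r(mean) and r(p50), and display can print them. The lifeexp dataset is assumed.

```stata
sysuse lifeexp, clear

* Detailed summary for GNP per capita
summarize gnppc, detail

* Print the stored mean and the stored 50th percentile (the median)
display "mean   = " r(mean)
display "median = " r(p50)
```

Try the same comparison for lexp and see whether the gap between the two numbers looks similar.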

Tab

tab is another command that’s extremely useful when exploring a dataset for the first time. Put simply, this command tabulates the unique values of a given variable, and shows how many observations have that particular value. It’s much easier to see the utility of this in practice, so enter the command:

sysuse auto


This will load a different example dataset that’s built into Stata, one with many more variables. This dataset describes 74 car models that were available in 1978. One of these variables records whether or not the car was foreign made. To see how many of these cars are foreign made, enter the command:

tab foreign


You should see that 22 cars are foreign made – about 30% of our observations. This is called a one-way tabulation, as it shows the frequency counts for a single variable.

Suppose, though, that we would like to see how the repair record (a scale from 1–5, with 5 being the best) of the cars is split between foreign and domestic cars. For this we need a two-way tabulation, since we want to see the frequency counts of one variable, categorized by a second variable. Enter the command:

tab rep78 foreign


You can think of this command as saying “show me a tabulation for repair status, grouped by whether a car was made domestically or abroad.” Of course, this can get pretty unwieldy when we have lots of possible values (try tab weight foreign as an example), so this command is more useful when we have just a few categories.
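Raw counts can be hard to compare when the groups are of different sizes. As a hedged extension of the example above, the tabulate options column, row, and missing add percentages or include missing values:

```stata
sysuse auto, clear

* Add column percentages: the share of each repair score within
* domestic cars and within foreign cars
tab rep78 foreign, column

* Add row percentages: the share of domestic and foreign cars
* within each repair score
tab rep78 foreign, row

* Include cars with no repair record as their own row
tab rep78 foreign, missing
```

By default, tab leaves out observations with missing values, so the totals in the first two tables are smaller than the 74 cars in the dataset.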

Questions (Part 2)

Save your work as a .do file and keep a .log file for your records.

1. How many variables are in this dataset and what are the variables? Give a detailed description, not just the abbreviated variable names.
2. How many observations are in the dataset?
3. For how many countries do we have data on the dollar-a-day poverty rate in 2000? For how many countries do we have the year 2000 literacy rate?
4. Fill out the table below with the mean, median, minimum, maximum, and standard deviation of GNI, population size, and the poverty rate for the world:

| Indicator (1) | Mean (2) | Median (3) | Min (4) | Max (5) | SD (6) |
|---|---|---|---|---|---|
| GNI | | | | | |
| Population size | | | | | |
| Poverty rate ($1) | | | | | |

3. Creating new variables

Generating and replacing variables

Reading: Complete all of part 2.4 of Prof. Rodriguez’s tutorial.

Goals: By the end of this section, you should be comfortable with

• Manipulating existing variables and generating new variables
• The following commands:
    • Generate a new variable from existing variables: gen
    • Replace the values of a variable using operators: replace (e.g., replace var1 = 1 if var2 > 10)
    • Operators and expressions
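The sketch below shows generate and replace working together on the built-in auto dataset. The variable name good_repair and the cutoff of 4 are hypothetical choices made for illustration.

```stata
sysuse auto, clear

* Start with 0 for every car
generate good_repair = 0

* Switch to 1 where the repair record is 4 or 5
replace good_repair = 1 if rep78 >= 4

* Stata treats a missing value as larger than any number, so the line
* above also coded cars with no repair record as 1. Set those back to missing.
replace good_repair = . if missing(rep78)

* Check the result, showing missing values as their own row
tab good_repair, missing
```

The third step is easy to forget. Whenever you use > or >= in a condition, it is worth checking how observations with missing values were handled.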

Questions (Part 3)

Save your work as a .do file and keep a .log file for your records.

1. ratiofemalener is the ratio of female to male primary enrollment. Generate a variable that is equal to 1 if this ratio is over 100%, and equal to 0 if it is below 100%. What proportion of countries have a ratio that is over 100%?
2. gni is a country’s total GNI, while gnipc is its GNI per capita. Using only gnipc and population, generate a new variable for total GNI. Check your work by comparing your new variable to gni (they should be roughly equivalent).

4. Conclusion

Summative Questions (Part 4)

1. What proportion of countries in the sample are located in the Latin America and Caribbean (LAC) region?
2. What proportion of countries are classified as “low income”?
3. What proportion of these “low income” countries are in the Sub-Saharan Africa (SSA) region?
4. What are the mean and median of country GNI per capita in the dataset?
5. Generate a variable that takes the value 1 for “high income” countries and 0 otherwise (upper middle, lower middle, low income). What is the difference in the average GNI per capita of the two groups?
6. What are the mean and median literacy rates for “high income” countries versus all others (use the variable that you created in the previous question)?

Most of the examples given in the tutorial use what are called example datasets – datasets that are bundled with Stata for the purposes of training and experimentation. The command to load one of these datasets is sysuse, as the examples in this tutorial show. However, the exercises at the end of each section require you to load the MDG dataset, which is a separate file.
To load this dataset, you can either double-click on the file once it’s downloaded, or click on “File > Open…” in the Stata interface and select the dataset. If you get any errors, click on the Stata command window, type clear, and press Enter. This makes sure that you’re starting with a fresh dataset, regardless of what you were doing before.
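The same steps can also be typed as commands, which is handy inside a .do file. This is a minimal sketch. The file name mdg.dta is a hypothetical placeholder, so that line is left as a comment for you to adapt.

```stata
* Drop whatever data is currently in memory
clear

* Load a bundled example dataset by name
sysuse auto

* Confirm what was loaded: number of observations and variables
describe, short

* A downloaded file is loaded with use and a path instead of sysuse.
* "mdg.dta" is a hypothetical file name; replace it with the location
* where you saved the MDG dataset, then remove the asterisk:
* use "mdg.dta", clear
```

Adding the clear option to use or sysuse has the same effect as typing clear first.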