3 Reading in Data

#| echo: false
clear all

Unknown #command

In this chapter we’ll discuss how to get data sets into Stata.

3.1 File Systems

Most data sets you’ll work with will come from files that are stored in a file system. You may never have given file systems much thought—most applications these days try to keep you from having to. But research projects with their many interdependent files require that you pay attention to where files are located and how they’re organized.

A file system consists of folders (also known as directories) that can contain files and/or other folders, all organized in a hierarchy or “tree.” A folder that is contained in another folder is also called a subfolder. In Windows the root of the tree is usually a drive letter, like C: or U:. In Linux or on a Mac, it’s just called the root directory and denoted by a forward slash (/).

Stata keeps track of a working directory, which is shown in the lower left corner of the Stata window. Think of it as your current location in the file system. If you try to load a file without specifying its location, Stata will look for it in the working directory. If you save a file without specifying a location, Stata will put it in the working directory. You can change the working directory with the cd (change directory) command:

cd path

where path should be replaced by the path to the folder you want to make the new working directory.

In Windows, if you start Stata by double-clicking on a Stata file, the working directory will be set to the location of that file, making this easiest way to get to work in Stata.

3.1.1 Absolute Paths

A path specifies the location of a file or folder. An absolute path starts at the root and lists all the folders and subfolders needed to get from there to the file or folder.

In Windows an absolute path typically starts with a drive letter followed by a colon, and then puts a backslash between each folder. For example:

C:\users\bbadger\dissertation\data\raw_data.dta

An absolute path in Linux or MacOS starts with a forward slash, representing the root directory, and puts another forward slash between each folder. For example:

/users/bbadger/dissertation/data/raw_data.dta

3.1.2 Relative Paths

A relative path starts with just the name of a file or folder, which is assumed to be in the working directory. If the working directory is c:\users\bbadger\dissertation\data then the relative path to the same file is just:

raw_data.dta

This would also work in Linux or on a Mac if the working directory were /users/bbadger/disstertation/data.

If the working directory is c:\users\bbadger\dissertation\ then the relative path to the file is:

data\raw_data.dta

data/raw_data.dta

in Linux or on a Mac. Actually, Stata will automatically convert forward slashes and backslashes in paths to the form needed by the computer it’s running on, so either data\raw_data.dta or data/raw_data.dta will work on any computer. However, Stata can’t automatically convert root directories (/ may or may not be the same as c:\) so absolute paths are specific to Windows or Linux/MacOS.

In specifying a path, .. means “go up one level.” Suppose you put your do files, like my_analysis.do, in c:\users\bbadger\dissertation\dofiles and you make that the working directory when you’re working on them. Then if my_analysis.do needs to use raw_data.dta the command would be:

use ..\data\raw_data.dta

This path means “go up one level, from c:\users\bbadger\dissertation\dofiles to c:\users\bbadger\dissertation, then go down into the data folder to find the raw_data.dta file.”

A single period, . represents the working directory. In Linux or MacOS, ~ is short for your home directory.

Suppose c:\users\bbadger\dissertation\dofiles\my_analysis.do takes too long to run on your laptop so you want to use the SSCC’s Linux servers, where it will run using Stata MP and a lot more computing power. You pick up the entire c:\users\bbadger\dissertation folder and put it in your Linux home directory, which is mapped on SSCC Windows computers as the Z: drive. When you log into Linstat you start in your home directory, so to make dofiles the working directory you type:

cd dissertation/dofiles

You can then start Stata, run my_analysis.do, and it will just work. ..\data\raw_data.dta now means “go up one level, from /home/b/bbadger/dissertation/dofiles to /home/b/bbadger/dissertation, then go down into the data folder to find the raw_data.dta file” but those differences don’t matter because it’s a relative path. Your code will just work, even though it’s now running in a new location and on a different operating system. However, if you use absolute paths then you’ll need to change c:\users\bbadger\ to /home/b/bbadger wherever it appears in your code.

I suggest putting all the files related to a project under a single folder, having a standard place within that folder you use as the working directory, and making all your paths relative. Then your project will be portable. In addition to it just working if you move it to a different computer, it will also just work if you send it to someone or they download it and put it on their computer. That’s part of reproducibility.

3.1.3 File and Directory Names

If any part of a path contains spaces, then you must put the entire path in quotes. Stata uses spaces to separate syntax elements, so it will think raw data.dta is two separate things, not a single file. The command to use it (from the dofiles directory) would be:

use "..\data\raw data.dta"

In time, you’ll probably find yourself avoiding spaces so you don’t have to use quotes.

Exercise 1

Decide where you want to put a folder containing the example files for this class, then in Stata use the cd command to make that the working directory. Create a folder called dws by running:

mkdir dws

Make that folder the working directory. Then run:

net get dws, from(https://ssc.wisc.edu/sscc/stata/)

to get the files and put them in that folder.

Solution

If I wanted to put the example files in a folder called dws on my desktop, and my user name were bbadger on a Windows computer, I would run:

cd c:\users\bbadger\desktop
mkdir dws
cd dws
net get dws, from(https://ssc.wisc.edu/sscc/stata/)

On a Mac I would run:

cd /users/bbadger/desktop
mkdir dws
cd dws
net get dws, from(https://ssc.wisc.edu/sscc/stata/)

3.2 Stata Data Files

Stata has its own format for storing data sets, .dta files. These are highly convenient for Stata users because Stata can use them immediately without any need for interpretation. They can also contain metadata such as value labels. The disadvantage is that non-Stata users may or may not be able to use them, depending on the tools they have available.

The command to load a Stata data set into memory so Stata can use it is simply:

use dataset

where dataset should be replaced by the actual name of your data set. The data set specification can include the path to the file; if it does not Stata will assume it is in the working directory.

Stata will assume files you try to use are in Stata format. Thus you don’t need to put .dta at the end of the file name, but it won’t hurt if you do.

Load one of the example data sets, acs.dta with:

use acs

The above command will work if you downloaded the example data and the working directory is set to the location of the example files. If it fails, you probably need to use the cd command to change the working directory.

This data set is a sample from the 2000 American Community Survey. We’ll do a lot more with the ACS in the next section.

Stata can also load data sets directly from the web if you give the use command a URL. For example, the auto data set that comes with Stata is also available by running:

use http://www.stata-press.com/data/r18/auto.dta

Stata will not allow you to load a new data set if there is a data set in memory that has unsaved changes. You can first run a clear command to remove the data set from memory, or add the clear option to your use command (use dataset, clear).

Exercise 2

Load the data set 2000_acs_sample.dta, which contains the ACS “source” data created by the Census Bureau.

Solution

use 2000_acs_sample

3.3 Delimited Files and Excel Spreadsheets

Delimited files and Excel spreadsheets are very popular ways of storing data, but importing them into Stata sometimes requires telling Stata how to interpret them. Since the same issues arise with both kinds of files, we’ll address them together.

A delimited file is a text file that contains data, where a special character (the delimiter) separates one value from the next. The most common form is CSV (comma separated value) files, where the delimiter is a comma, but the delimiter could be a space, a tab, or in theory just about any other character or set of characters. Your computer may import CSV files into Excel by default, but CSV files are not really Excel spreadsheets.

The command to import a delimited file into Stata is import delimited, and the command to import an Excel spreadsheet is import excel. Sometimes it takes some experimentation to find the right options to properly read in a given file. This is a case where Stata’s graphical user interface can help you: it includes a preview of what the data set will look like after it is imported as well as the options for reading it. The preview will be updated as you choose different options, so you can experiment until it looks right. Then you can actually carry out the import, look over the results to make sure it was successful, and then copy the import command that was generated into your do file.

The two most common issues you’ll have to deal with in importing delimited files or Excel spreadsheets are header rows containing variable names and content that is not actually data. Stata will try to identify whether the file has a header row or not, but may get confused. And it will definitely be confused if it tries to import content that is not data.

If you have Excel on your computer, open sscc_training.xlsx from the example files, which contains data on the number of graduate students trained by SSCC in FY 2018. Note that row 1 is a title, while row 2 contains the variable names. The data end with row 14, as the rows after that contain summary statistics and a note.

Now return to Stata and click File, Import, Text Data. Then click the folder button to the right of File to import: and find sscc_training.csv, a CSV version of the same data. Looking at the preview, you’ll see that the presence of the title prevented Stata from recognizing that row 2 contains variable names. It also thinks it should interpret rows 15-19 as data. Note that all the data are in red, indicating that they are strings (text). Because Stata believes the text in row 2 is data, it must make all the variables string variables to store the values in row 2.

For First row as variable names:, choose Custom and then enter 2, meaning that row 2 contains the variable names. Most of the variables will change from red to black, meaning that Stata now recognizes they are numeric. However, workshoplength remains red because Stata thinks the note in row 17 is data.

Correct this by clicking on Override row and column ranges…. Under Rows check the box next to Last and enter 14. This tells Stata to ignore everything after row 14. It will also ignore row 1 since it comes before the row with the variable names. Now all the variables except workshopname are in black (workshopname really is text and should be in red). Click Okay and Stata will successfully import the data.

The command Stata used will to do so will appear in both the History window and the Results Window. It will look something like:

import delimited "c:\users\bbadger\dws\sscc_training.csv", varnames(2) rowrange(:14)

The next step would be to copy this command into your do file. Since it uses an absolute path (and, less importantly, unnecessary quotes), I’d suggest editing it to:

import delimited sscc_training.csv, varnames(2) rowrange(:14)

(encoding automatically selected: ISO-8859-1)
(6 vars, 12 obs)

Open the data browser and take a look at the results. You’ll get a much better sense of the data if you do that than if you rely on the following list:

list in 1/3, ab(30)


     +------------------------------------------------------------------------+
  1. |                          workshopname | numberofgradstudentsregistered |
     | Dynamic Documents with Stata Markdown |                             11 |
     |------------------------------------------------------------------------|
     |  workshoplengthhours   |  numberoftimesoffered   |   instructorhours   |
     |                  1.5   |                     1   |               1.5   |
     |------------------------------------------------------------------------|
     |                              studenthours                              |
     |                                      16.5                              |
     +------------------------------------------------------------------------+

     +------------------------------------------------------------------------+
  2. |                          workshopname | numberofgradstudentsregistered |
     |                     Introduction to R |                             85 |
     |------------------------------------------------------------------------|
     |  workshoplengthhours   |  numberoftimesoffered   |   instructorhours   |
     |                  3.5   |                     3   |              10.5   |
     |------------------------------------------------------------------------|
     |                              studenthours                              |
     |                                     297.5                              |
     +------------------------------------------------------------------------+

     +------------------------------------------------------------------------+
  3. |                          workshopname | numberofgradstudentsregistered |
     |                   Introduction to SAS |                             17 |
     |------------------------------------------------------------------------|
     |  workshoplengthhours   |  numberoftimesoffered   |   instructorhours   |
     |                    3   |                     1   |                 3   |
     |------------------------------------------------------------------------|
     |                              studenthours                              |
     |                                        51                              |
     +------------------------------------------------------------------------+

Now try importing the Excel file sscc_training.xlsx (File, Import, Excel Spreadsheet, then Browse and find the file). Click the button to the right of Cell Range and set the Row of the Upper-left cell to 2 and the Row of the Lower-right cell to 14, leaving the colums as-is.Then check Import first row as variable names (the “first row” is now row 2 since that’s where we told it to start reading). Let Stata replace the data currently in memory when it asks.

The resulting command will again use an absolute path, so I suggest editing it to:

import excel sscc_training.xlsx, sheet("Sheet1") cellrange(A2:F14) firstrow clear
list in 1/3, ab(30)

(6 vars, 12 obs)

     +----------------------------------------------------------------------+
  1. |                          WorkshopName | NumberofGradStudentsRegister |
     | Dynamic Documents with Stata Markdown |                           11 |
     |----------------------------------------------------------------------|
     |  WorkshopLengthHours  |  NumberofTimesOffered   |  InstructorHours   |
     |                  1.5  |                     1   |              1.5   |
     |----------------------------------------------------------------------|
     |                             StudentHours                             |
     |                                     16.5                             |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. |                          WorkshopName | NumberofGradStudentsRegister |
     |                     Introduction to R |                           85 |
     |----------------------------------------------------------------------|
     |  WorkshopLengthHours  |  NumberofTimesOffered   |  InstructorHours   |
     |                  3.5  |                     3   |             10.5   |
     |----------------------------------------------------------------------|
     |                             StudentHours                             |
     |                                    297.5                             |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. |                          WorkshopName | NumberofGradStudentsRegister |
     |                   Introduction to SAS |                           17 |
     |----------------------------------------------------------------------|
     |  WorkshopLengthHours  |  NumberofTimesOffered   |  InstructorHours   |
     |                    3  |                     1   |                3   |
     |----------------------------------------------------------------------|
     |                             StudentHours                             |
     |                                       51                             |
     +----------------------------------------------------------------------+

So what happens if you just import one of these files without setting the proper options for the import?

import excel sscc_training.xlsx, clear
list in 1/3, ab(30)

(6 vars, 19 obs)

     +--------------------------------------------------------------+
  1. |                                                A             |
     |                                          FY 2018             |
     |--------------------------------------------------------------|
     |                                  B |                       C |
     |                                    |                         |
     |--------------------------------------------------------------|
     |                       D |                E  |             F  |
     |                         |                   |                |
     +--------------------------------------------------------------+

     +--------------------------------------------------------------+
  2. |                                                A             |
     |                                    Workshop Name             |
     |--------------------------------------------------------------|
     |                                  B |                       C |
     | Number of Grad Students Registered | Workshop Length (Hours) |
     |--------------------------------------------------------------|
     |                       D |                E  |             F  |
     | Number of Times Offered | Instructor-Hours  | Student-Hours  |
     +--------------------------------------------------------------+

     +--------------------------------------------------------------+
  3. |                                                A             |
     |            Dynamic Documents with Stata Markdown             |
     |--------------------------------------------------------------|
     |                                  B |                       C |
     |                                 11 |                     1.5 |
     |--------------------------------------------------------------|
     |                       D |                E  |             F  |
     |                       1 |              1.5  |          16.5  |
     +--------------------------------------------------------------+

Stata correctly deduced that the first row didn’t contain variable names, but didn’t realize they were in the second row instead. So it turned the second row into an observation. In order to store that observation, which consists entirely of text, it had to make all the variables strings instead of numbers. If you run the code and look in the data browser, you’ll see it also turned the notes at the bottom of the file into observations.

You could fix all these problems. But it’s much easier to read in the data properly to begin with.

Exercise 3

qualtrics_survey.csv is a data file created by the survey program Qualtrics. It was a very simple survey with just four questions (labeled by Qualtrics as Q1, Q17, Q3, and Q4) but Qualtrics includes lots of other information about the respondent and their experience taking the survey that you don’t care about. It also puts information about each question in rows 2-3. Import just the columns containing the questions and the rows containing the responses. Yes, you’ll have to count all the columns you want to ignore. Make sure the variable names are the question numbers (Q1, Q17, Q3, and Q4) and not the question text.

Solution

import delimited qualtrics_survey.csv, varnames(1) rowrange(4) colrange(18:21) clear
list in 1/3, ab(30)

(encoding automatically selected: ISO-8859-1)
(4 vars, 69 obs)

     +--------------------------+
     | q1         q17   q3   q4 |
     |--------------------------|
  1. | 20   Wisconsin    2    0 |
  2. | 21   Wisconsin    3    1 |
  3. | 20       WISCO    3    0 |
     +--------------------------+

Note that this time, Stata doesn’t notice that q3 and q4 only contain numbers so it still imports them as text (strings). You could fix that, but you’ll learn how to convert string variables that contain numbers to numeric variables in the next chapter. You’ll also learn how to detect that q1 is mostly numbers but has one text value, and how to fix it.

3.4 Fixed Format Files

Fixed format files are text files, but unlike CSV files there is no separator between variables. Instead, a data dictionary tells you which variables are stored in which columns (the variables always appear in the same columns, thus they are fixed). The advantage of fixed format files is that they are smaller–no space is wasted on separators. The disadvantage is that they are completely useless without the corresponding data dictionary. For this reason they’ve become less popular as data storage has gotten cheaper and we won’t cover how to import them, though you will see an example momentarily. If you need to read a fixed format file, read the help file for infix. It’s not hard, but translating a data dictionary into specifications for infix can be tedious.

3.5 Do Files Generated by the Data Provider

Some data providers, like IPUMS, give you data in text format plus a program you can run to read it into Stata. These generally work, but often challenge new Stata users.

The key to using them is to read the instructions. Often these are placed in comments inside the do file itself. For example, here are the first ten lines of such a do file from IPUMS:

* NOTE: You need to set the Stata working directory to the path
* where the data file is located.

set more off

clear
quietly infix             ///
  int     year     1-4    ///
  byte    datanum  5-6    ///
  double  serial   7-14   ///

Later on the do file refers to the data file by name without specifying a location, so you really do need to set the working directory to the location of the data file before running this or it will not work. Other do files may require you to edit the do file and specify the location of the data file or make other changes.

Then there’s what this do file does not include (and this is common): a save command. When the do file finishes running the data will have been imported into Stata, but not saved as a Stata data set. You could just tell Stata to save it, but since you want your workflow to be completely reproducible you should add a save command to the do file they gave you instead.

IPUMS data files are in fixed format, so this also demonstrates what it looks like to import fixed format data. The last three lines of the code displayed specify that columns 1-4 in the text data contain the year variable, columns 5-6 contain a variable called datanum, and columns 7-14 contain a variable called serial.

3.6 Files in the Formats of Other Statistical Software

If you’ve got an SPSS .sav file or a SAS .sas7bdat file, Stata can import them with import spss or import sas. If you have a file in the format of a different statistical program, Stat/Transfer can probably convert it to Stata format for you. Stat/Transfer is available on Winstat and Linstat. It’s very easy to use, but there are instructions in the SSCC Knowledge Base. Please do not use Stat/Transfer for files that can be read in other ways: it’s quite expensive and we have to pay for a license for every person who runs the program, even once.

If you have a choice of file formats, avoid SAS. SAS stores value labels separately from the data they label, which complicates converting them.

3.7 Saving Files in Stata Format

Reading files in Stata format will always be faster than importing other types of files. So unless your data set is small, you don’t want to import it over and over again. If your research project starts with a big file in text format, your first do file should probably just import the data, drop any parts of it you won’t actually use (more on that in the next chapter), and then save the result as a Stata data file. The next do file can then read that Stata data file and go to work. For example, if you were working with the SSCC training data your complete first do file might be:

capture log close
log using read.log, replace

clear all
import excel sscc_training.xlsx, sheet("Sheet1") cellrange(A2:F14) firstrow

save sscc_training, replace
log close