Data Wrangling in Stata

Author

Russell Dimond

Published

January 12, 2023

Introduction

Most data sets need to be transformed in some way before they can be analyzed, a process that’s come to be known as “data wrangling.” This book will introduce you to the key concepts, tools, and skills of data wrangling, implementing them in Stata. The book is meant to complement coursework in statistical analysis, completing the skillset you’ll need to do research with real-world data. Its primary audience is graduate students in the social sciences, but anyone who wants to do data-driven research in Stata should find it useful.

The SSCC’s Introduction to Stata provides an introduction to the Stata language itself. If you are taking coursework that uses Stata but don’t intend to do quantitative research (or don’t intend to do quantitative research yet) the Introduction may be all you need. We’ll start this book by very briefly reviewing some basic Stata concepts that should be familiar to you, but if they’re not, Introduction to Stata will do a much better job of teaching them to you.

To get the most out of Data Wrangling in Stata you need to be an active participant. Open Stata, and type in and run the example code yourself. This will help you retain more, and ensure you get all the details right—Stata is always happy to tell you when you’re wrong. Do the exercises (some of them are straightforward applications of what you just learned; others will require more creativity). Data wrangling is not something you read and understand—it’s a skill you must practice.

Setting Up

The example files for this class are available on the SSCC’s web server. The example code will download them directly from there. For real work, you’ll usually download your data and work with it locally, so if you plan to work through the entire book we suggest you download the example files and work with them locally too. They can be obtained within Stata by running:

net get dws, from(https://ssc.wisc.edu/sscc/stata/)

If this fails on your computer, try:

net get dws, from(http://ssc.wisc.edu/sscc/stata/)

This will put the example files in your current working directory. If you are comfortable doing so, create a folder for the example files, make that Stata’s working directory, and run the command above. If not, we’ll talk about how to do all this in the Reading in Data chapter and you can get the example files then.
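For readers who want to try it now, here is a minimal sketch of those steps. It assumes you are happy to put the files in a new folder called dws inside Stata’s current working directory; the folder name is only an illustration, so substitute any location you prefer.

```stata
* Show Stata's current working directory
pwd

* Create a folder for the example files (the name dws is an arbitrary choice)
mkdir dws

* Make the new folder Stata's working directory
cd dws

* Download the example files into the working directory
net get dws, from(https://ssc.wisc.edu/sscc/stata/)
```

If mkdir reports that the folder already exists, skip that line and go straight to cd.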

If you do download the files, you can replace commands like

use https://sscc.wisc.edu/sscc/stata/dws/acs.dta

with just:

use acs

Technical Notes

This web book was written in JupyterLab using nbstata and rendered by Quarto.

I am a firm believer that looking at your data will help you understand it, but a web book cannot open Stata’s data browser, so this book uses list instead. The list command is not one you really need to master, but you’ll frequently see the ab(30) option, which tells list to abbreviate variable names only if they exceed 30 characters. I use it to keep variable names from being abbreviated.

The in qualifier is another tool that’s more useful for this book than for real work. It works like if but lets you specify subsets by observation number. A range of observations is specified as start/end, so:

list in 1/5, ab(30)

means “list the first five observations without abbreviating the variable names.” In your own work, you can just open the data browser instead.
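To experiment with ab() and in before downloading anything, here is a short sketch using the auto data set that ships with Stata. Using sysuse auto is my assumption for illustration; it is not one of this book’s example files. In a range, f and l stand for the first and last observation, and negative numbers count back from the end.

```stata
* Load a data set that ships with Stata (used here only for illustration)
sysuse auto, clear

* Default listing: long variable names may be abbreviated in the header
list in 1/5

* Same observations, with names abbreviated only if longer than 30 characters
list in 1/5, ab(30)

* The last five observations: negative numbers count from the end, l means last
list make price in -5/l, ab(30)

* in can be combined with if: foreign cars among the first 60 observations
list make price foreign if foreign == 1 in 1/60, ab(30)
```

Comparing the headers of the first two listings should show what ab(30) changes for longer names such as displacement and gear_ratio.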