Data Wrangling in Stata
Introduction
Most data sets need to be transformed in some way before they can be analyzed, a process that’s come to be known as “data wrangling.” This book will introduce you to the key concepts, tools, and skills of data wrangling, implementing them in Stata. The book is meant to complement coursework in statistical analysis, giving you a critical but often neglected piece of the skillset you’ll need to do research with real-world data. Its primary audience is graduate students in the social sciences, but anyone who wants to do data-driven research in Stata should find it useful.
The SSCC’s Introduction to Stata provides an introduction to the Stata language itself. If you are taking coursework that uses Stata but don’t intend to do quantitative research (or don’t intend to do quantitative research yet) the Introduction may be all you need. We’ll start this book by very briefly reviewing some basic Stata concepts that should be familiar to you, but if they’re not, Introduction to Stata will do a much better job of teaching them to you.
The first part of this book, Data Preparation, covers reading data into Stata and doing basic clean-up. These are steps you should take with every data set you work with. The second part, Variable Transformations, talks about various ways of creating and changing variables. Much of it discusses how to do so in hierarchical data sets, like individuals living in households. The third part, Data Set Transformations, covers tasks like combining data sets or converting data sets with one row per subject per time period to just one row per subject.
The first part covers tasks that every quantitative researcher needs to be able to perform. The tasks covered in the second and third parts were chosen based on some combination of being frequently needed, teaching skills that can be applied to other tasks, and being confusing to new Stata users. The book is also designed to give you experience working in Stata, so that by the end you’ll be ready to apply what you’ve learned to your research.
To get the most out of Data Wrangling in Stata you need to be an active participant. Open Stata, and type in and run the example code yourself. This will help you retain more, and ensure you get all the details right—Stata is always happy to tell you when you’re wrong. Do the exercises: some of them are straightforward applications of what you just learned; others will require more creativity. Data wrangling is not something you read and understand—it’s a skill you must practice.
Technical Notes
This web book was written in JupyterLab using nbstata and rendered by Quarto. My thanks to the developers of both.
The example files used by this book are available from the SSCC web site. The example code loads them directly from the web (for example, use https://sscc.wisc.edu/sscc/pubs/dws/acs.dta). The command to do so will always be in a code block, and if you put your mouse over the code block, a clipboard button will appear that you can use to copy the command and paste it into your do file. Don’t do this for everything, as you’ll learn a lot by typing code. But you won’t learn much from typing URLs.
Alternatively, you can download all the example files using net get. This is explained in the chapter on Reading in Data. If you do that, you can load them with, for example, use acs. For real work you’ll almost always download your data once and then work with it locally.
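As a rough sketch of the two ways of working described above, the following loads the same file first from the web and then from a local copy. The clear option is an addition so the commands also run when data is already in memory. The copy step is an assumption for illustration only: it is a generic way to download one file and is not the net get approach covered in Reading in Data.

```stata
* Option 1: load the example file directly from the web, as the example code does
use https://sscc.wisc.edu/sscc/pubs/dws/acs.dta, clear

* Option 2 (illustrative assumption, not the book's net get method):
* download the file once into the current working directory...
copy https://sscc.wisc.edu/sscc/pubs/dws/acs.dta acs.dta, replace

* ...and from then on load the local copy
use acs, clear
```

The second form only works if acs.dta is in Stata's current working directory, which you can check with pwd.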
I am a firm believer that looking at your data will help you understand it, but a web book cannot open Stata’s data browser, so this book uses list instead. The list command is not one you really need to master, but you’ll frequently see the ab(30) option. It means that variable names should only be abbreviated if they exceed 30 characters. Stata variable names can be at most 32 characters long, so in practice it prevents them from being abbreviated.
The in qualifier is another tool that’s more useful for this book than for real work. It works like if but allows you to specify subsets of the data by observation number. A range of observations can be specified with start/end, so:
list in 1/5, ab(30)
means “list the first five observations without abbreviating the variable names.” When you’re working in Stata yourself, you can just open the data browser instead.
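To try list, ab(), in, and if together, here is a small sketch. It assumes the auto data set that is installed with Stata, which is not one of this book's example files. The variables and the price cutoff were chosen only for illustration.

```stata
* auto ships with Stata; it is used here only so the example is self-contained
sysuse auto, clear

* With the default abbreviation length, longer names such as displacement are shortened
list make displacement gear_ratio in 1/5

* ab(30) displays the full variable names
list make displacement gear_ratio in 1/5, ab(30)

* Negative numbers count from the end, and l (lowercase L) stands for the last observation
list make price in -5/l, ab(30)

* if selects observations by condition rather than by position
list make price if price > 12000, ab(30)
```

Comparing the first two list commands shows what ab(30) changes. The last two show the difference between selecting by observation number and selecting by value.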