Producing statistical graphs in Stata revolves around the graph commands. Type help graph in Stata to see a quick overview of these commands.
In order to draw any graph in Stata you need to specify three things: what graphical elements you want to use in your graph, how these elements will be related to your data, and what kind of scales will be used to position them on the page.
I've stated this abstractly, but in practice it is pretty easy: you pick the appropriate graph command and specify the appropriate variable names. Typical examples might look like
graph bar var
graph twoway scatter yvar xvar
graph box zvar
While there are many other features of graphs that we might want to specify or customize, these three concepts are what it takes to get started: graphical element, data, and scale (level of measurement). Everything else about a graph has some default value that we can come back and consider later.
At a basic level, graphical elements are just things like points, line segments, or bounded areas (like polygons). More complicated graphical objects can be constructed out of these basic elements. Stata's graph commands make it easy to specify simple elements, and also to treat more complicated objects, like histograms and boxplots, as fundamental.
The graphs we will consider in Stata are all two-dimensional representations of data. Sometimes elements like points are positioned in the graph by Cartesian coordinates given by the data values themselves, but other times a point might be given its position by some summary of the data, like a group mean. So it can be useful to distinguish between the data set and the graph data set. For things like simple scatter plots, these will be one and the same.
In order to position graphical elements on a page or screen we need some sort of coordinate system. This mainly means Cartesian coordinates. However, Stata will also allow us to distinguish between continuous (Cartesian) scales and categorical scales. Again, this sounds a little abstract, but in practice it is pretty easy.
All of this will be a little more concrete if we look at some examples. We'll start by setting up a familiar data set, auto.
sysuse auto, clear
* Create a categorical variable
generate maker = substr(make, 1, strpos(make, " ")-1)
replace maker = make if strpos(make, " ")==0
label variable maker "Manufacturer"
Consider two graphs. Both use points (dots) as graphical elements to visually represent the data.
graph twoway scatter price weight
* scatter price weight // abbreviated version
graph dot price, over(maker)
In the first graph, the graphical elements are the points. The position of each point is determined by a pair of data values, the car weight value and the car price value. These data values are used "as is", as they occur in the data set, untransformed. The number of points is (in principle) the same as the number of observations. Both the x- and the y-values are plotted along continuous scales.
In the second graph, the graphical elements are again points. However, the vertical position of each point is given by a distinct category, the car maker. The horizontal position of each point is given by a summary statistic, the mean of the prices of cars from a given maker. The number of points is the number of car makers, not the number of observations. The x-values are on a continuous scale, while the y-values are on a categorical scale (as we will see later, Stata switches the "x-" and "y-" nomenclature).
In order to plot the points, the software generates a graph data set (which we never actually see).
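Although Stata never shows us its graph data set, we can approximate one by hand. The sketch below assumes that the graph data set behind the dot plot is simply one row per maker holding the mean price. The variable name mean_price is chosen here for illustration, and collapse is a stand-in for whatever graph dot does internally.

```stata
* Rebuild the working data so the example is self-contained
sysuse auto, clear
generate maker = substr(make, 1, strpos(make, " ")-1)
replace maker = make if strpos(make, " ")==0

* Approximate the hidden graph data set: one row per maker
preserve
collapse (mean) mean_price = price, by(maker)
list maker mean_price, noobs
restore
```

After restore, the original observations are back in memory, so the collapsed data never replaces the working data set.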
Use help graph to find commands you may need.
In thinking about Stata's graph commands, perhaps the most fundamental distinction to be made is between commands that use continuous-by-continuous scales (the many graph twoway commands) and commands that use categorical-by-continuous (or continuous-by-categorical) scales: graph bar, graph dot, and graph box. Most other graphing commands call on graph twoway behind the scenes.
In general, categorical-by-continuous graphs use a single graphical element, while twoway graphs may be layered together to create graphs composed of several different elements.
graph bar (percent), over(rep78)
graph twoway (scatter price weight)(lfit price weight)
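Layers in a twoway graph can also be separated with || rather than parentheses, and more than two layers can be stacked. The following is a small sketch of both forms. The lowess layer is an addition chosen here for illustration.

```stata
sysuse auto, clear

* Same two layers as the parenthesized form, written with ||
twoway scatter price weight || lfit price weight

* Three layers: points, a linear fit, and a lowess smooth
twoway (scatter price weight) (lfit price weight) (lowess price weight)
```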
The bare graph commands allow us to represent categories in two ways. First, where each unit of observation in our data set represents a category (as in the auto data), every observation may locate a graphical element for a category. It may seem trivial to point out that every point in the scatterplots above represents a category of car, but we will find this concept useful later.
Second, as we have seen with dot and bar graphs, relative position along a continuum can represent a category - Stata picks these positions based on the space available, the number of categories, and some sorting order.
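The sorting order can be changed through suboptions of over(). As a brief sketch, the commands below place the makers first in their default order, then by mean price, and then by mean price in descending order.

```stata
* Rebuild the working data so the example is self-contained
sysuse auto, clear
generate maker = substr(make, 1, strpos(make, " ")-1)
replace maker = make if strpos(make, " ")==0

* Default ordering of the categories
graph dot price, over(maker)

* sort(1) orders categories by the first plotted statistic (mean price)
graph dot price, over(maker, sort(1))

* descending reverses that order
graph dot price, over(maker, sort(1) descending)
```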
There are two other common ways of representing categories: as subplots (panels) within the overall graph, and as elements with different aesthetic values (color, shape, etc.).
graph dot price, over(rep78) by(foreign)
Notice here that the over() option defines a categorical axis, while the by() option defines categorical subplots.
separate price, by(rep78)
scatter price1 price2 price3 price4 price5 weight
Here we divide our data into separate variables for separate categories, then combine them into one graph as repeated y variables.
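Another way to give categories different aesthetic values, without creating new variables, is to layer one scatter per category using if conditions. This is a sketch of that idiom. The marker symbols and legend labels are choices made here for illustration, and the /// line continuations assume the commands are run from a do-file.

```stata
sysuse auto, clear

* One layer per category, each with its own marker symbol
twoway (scatter price weight if foreign==0, msymbol(O)) ///
       (scatter price weight if foreign==1, msymbol(T)), ///
       legend(order(1 "Domestic" 2 "Foreign"))
```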
The over() option is unique to categorical-by-continuous graphs, but by() and repeated y variables work in either group of graphing commands.
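To illustrate, here is a short sketch that uses by() with a twoway graph and with a categorical graph, and then uses repeated y variables in a categorical graph.

```stata
sysuse auto, clear

* by() with a twoway graph: one scatter panel per origin
scatter price weight, by(foreign)

* by() with a categorical graph: one bar panel per origin
graph bar price, over(rep78) by(foreign)

* Repeated y variables in a categorical graph: two bars per category
graph bar price weight, over(foreign)
```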
Some graph commands use the data set "as-is", while other commands perform some transformation of the data. Additionally, some graph commands require estimation results, and some commands do not require any data set.
Given the examples we have looked at so far, it might be tempting to think that twoway commands always use data "as-is", while categorical commands always summarize data, but this would be an oversimplification.
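A few counterexamples make the point. The sketch below shows twoway plot types that transform the data, a twoway plot that needs no data at all, a categorical graph that plots values as-is, and a graph built from estimation results. The particular model and the restriction to foreign cars are choices made here for illustration, and marginsplot assumes a reasonably recent version of Stata.

```stata
sysuse auto, clear

* twoway plot types that transform the data before drawing
twoway histogram price
twoway lfit price weight

* A twoway plot that does not use the data set at all
twoway function y = x^2, range(0 10)

* A categorical graph that plots values as-is (one observation per make)
graph dot (asis) price if foreign, over(make)

* A graph that requires estimation results
regress price weight i.foreign
margins foreign
marginsplot
```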