clear all
use acs
1 Tables
The table
command allows you to create simple tables very easily, and fairly complex ones with more effort. Learning to use table
is a great introduction to the concepts used by collect
, but even if you never go on to collect
you’ll find learning table
worthwhile.
Start up Stata, open a new do file, and have it open the ACS data set:
This data set is a small extract from the 2000 American Community Survey. We’ll use it for examples throughout this book.
1.1 Table Structure
The starting point for any table is its structure. Almost every table will contain rows and/or columns. The simplest tables associate variables with the rows or columns, so you get a row or column for each level of that variable. (These will generally be categorical variables, or your table is likely to be unwieldy.)
To create a table with rows defined by a variable, use table
followed by the variable name in parentheses:
table (marital_status)
----------------------------
| Frequency
----------------+-----------
Marital Status |
Now married | 11,643
Widowed | 1,405
Divorced | 2,177
Separated | 435
Never married | 5,606
Total | 21,266
----------------------------
By default the table cells will contain frequencies, but table
can do much more.
To add columns, add the name of the variable that defines the columns, again in parentheses:
table (marital_status) (female)
-------------------------------------------
| Female
| Male Female Total
----------------+--------------------------
Marital Status |
Now married | 5,844 5,799 11,643
Widowed | 262 1,143 1,405
Divorced | 896 1,281 2,177
Separated | 166 269 435
Never married | 3,012 2,594 5,606
Total | 10,180 11,086 21,266
-------------------------------------------
To create a table with multiple columns but just one row, use empty parentheses for the row:
table () (female)
-------------------------------------
| Female
| Male Female Total
----------+--------------------------
Frequency | 13,326 14,084 27,410
-------------------------------------
Exercise 1
Make a table containing frequencies (crosstabs) for hispanic
and race
. Which variable works best as the row variable? (Solution)
1.2 Dimensions and Levels
You can generalize this concept by thinking of the rows and columns of the table as dimensions. So far we have specified tables with one or two dimensions; what happens if we add more?
table (marital_status) (female) (hispanic)
Person is Hispanic = Not Hispanic
------------------------------------------
| Female
| Male Female Total
----------------+-------------------------
Marital Status |
Now married | 5,191 5,180 10,371
Widowed | 244 1,087 1,331
Divorced | 821 1,171 1,992
Separated | 137 217 354
Never married | 2,528 2,237 4,765
Total | 8,921 9,892 18,813
------------------------------------------
Person is Hispanic = Hispanic
-----------------------------------------
| Female
| Male Female Total
----------------+------------------------
Marital Status |
Now married | 653 619 1,272
Widowed | 18 56 74
Divorced | 75 110 185
Separated | 29 52 81
Never married | 484 357 841
Total | 1,259 1,194 2,453
-----------------------------------------
Person is Hispanic = Total
-------------------------------------------
| Female
| Male Female Total
----------------+--------------------------
Marital Status |
Now married | 5,844 5,799 11,643
Widowed | 262 1,143 1,405
Divorced | 896 1,281 2,177
Separated | 166 269 435
Never married | 3,012 2,594 5,606
Total | 10,180 11,086 21,266
-------------------------------------------
The third dimension is used to create separate tables. (Sorry, no four dimensional tables!)
Each dimension has a set of levels associated with it. Take the table:
table (marital_status) (female)
-------------------------------------------
| Female
| Male Female Total
----------------+--------------------------
Marital Status |
Now married | 5,844 5,799 11,643
Widowed | 262 1,143 1,405
Divorced | 896 1,281 2,177
Separated | 166 269 435
Never married | 3,012 2,594 5,606
Total | 10,180 11,086 21,266
-------------------------------------------
This has the dimensions marital_status
and female
. (It also has other dimensions that you’ll learn about later.) The layout of the table specifies that marital_status
is the rows and female
the columns.
The marital_status
dimension has one level for each level of the variable marital_status
, plus one for the totals. The level names come from the values of marital_status
rather than the value labels, so they are 1, 2, 3, 4, and 5, plus .m for the totals. (You’ll learn later how to find this out for yourself.)
You can refer to a specific level of a specific dimension with the syntax dimension[level]
, for example marital_status[1]
. You can also list multiple levels in the brackets. One use for this syntax is to specify that you only want to include certain levels in the table. For example, if you only want to include “Now married” and “Never married” in the table, you can do so with:
table (marital_status[1 5]) (female)
------------------------------------------
| Female
| Male Female Total
----------------+-------------------------
Marital Status |
Now married | 5,844 5,799 11,643
Never married | 3,012 2,594 5,606
------------------------------------------
The totals for marital_status
are omitted because they’re just another level and not in the list of levels you asked for.
If you just want to omit the totals, you can do so with the nototal
option:
table (marital_status) (female), nototal
---------------------------------
| Female
| Male Female
----------------+----------------
Marital Status |
Now married | 5,844 5,799
Widowed | 262 1,143
Divorced | 896 1,281
Separated | 166 269
Never married | 3,012 2,594
---------------------------------
Alternatively, you can use the total()
option to specify you want totals for one dimension and then they will be omitted for the other:
table (marital_status) (female), total(female)
----------------------------------
| Female
| Male Female
----------------+-----------------
Marital Status |
Now married | 5,844 5,799
Widowed | 262 1,143
Divorced | 896 1,281
Separated | 166 269
Never married | 3,012 2,594
Total | 10,180 11,086
----------------------------------
Note how totals calculated across female
add a level to the marital_status
dimension.
Exercise 2
Create a crosstab of race
and hispanic
again, but this time only include White, Black, Asian, and Other. Include a total row but not a total column. (Solution)
1.3 Super Rows and Super Columns
Super rows and super columns are rows and columns that contain other rows and columns. You specify them by listing two variables for a dimension. For example:
table (marital_status) (hispanic female), nototal
-------------------------------------------------
| Person is Hispanic
| Not Hispanic Hispanic
| Female Female
| Male Female Male Female
----------------+--------------------------------
Marital Status |
Now married | 5,191 5,180 653 619
Widowed | 244 1,087 18 56
Divorced | 821 1,171 75 110
Separated | 137 217 29 52
Never married | 2,528 2,237 484 357
-------------------------------------------------
The first variable listed becomes “super.” Changing the order gives the same results, but organized differently:
table (marital_status) (female hispanic), nototal
--------------------------------------------------------------------
| Female
| Male Female
| Person is Hispanic Person is Hispanic
| Not Hispanic Hispanic Not Hispanic Hispanic
----------------+---------------------------------------------------
Marital Status |
Now married | 5,191 653 5,180 619
Widowed | 244 18 1,087 56
Divorced | 821 75 1,171 110
Separated | 137 29 217 52
Never married | 2,528 484 2,237 357
--------------------------------------------------------------------
Choose the organization that puts the numbers your readers are most likely to compare close to each other.
Tables with super rows can get long quickly:
table (marital_status hispanic) (female), nototal
----------------------------------------
| Female
| Male Female
-----------------------+----------------
Marital Status |
Now married |
Person is Hispanic |
Not Hispanic | 5,191 5,180
Hispanic | 653 619
Widowed |
Person is Hispanic |
Not Hispanic | 244 1,087
Hispanic | 18 56
Divorced |
Person is Hispanic |
Not Hispanic | 821 1,171
Hispanic | 75 110
Separated |
Person is Hispanic |
Not Hispanic | 137 217
Hispanic | 29 52
Never married |
Person is Hispanic |
Not Hispanic | 2,528 2,237
Hispanic | 484 357
----------------------------------------
1.4 Adding Statistics
By default the table command puts frequencies in the table cells, i.e. the number of observations that have that particular combination of values. You can change that with the stat()
option, including calculating multiple statistics. Start with a table that has female
in the column dimension (we’ll add rows soon):
table () (female)
-------------------------------------
| Female
| Male Female Total
----------+--------------------------
Frequency | 13,326 14,084 27,410
-------------------------------------
This is equivalent to using the stat(frequency)
option:
table () (female), stat(frequency)
-------------------------------------
| Female
| Male Female Total
----------+--------------------------
Frequency | 13,326 14,084 27,410
-------------------------------------
Now add percentages by adding stat(percent)
:
table () (female), stat(frequency) stat(percent)
-------------------------------------
| Female
| Male Female Total
----------+--------------------------
Frequency | 13,326 14,084 27,410
Percent | 48.62 51.38 100.00
-------------------------------------
While the frequency
and percent
statistics are based on female
itself, you can also include statistics for other variables. To do so, use the stat()
option and specify a statistic like before, but then include the variable or variables the statistic is to be calculated for in the parentheses as well. For example, you can add the means of age and income with stat(mean age income)
and their standard deviations with stat(sd mean income)
. This command is getting long, so put one stat()
option per line for readability:
table () (female), ///
frequency) ///
stat(percent) ///
stat(mean age income) ///
stat(sd age income) stat(
----------------------------------------------------
| Female
| Male Female Total
-------------------+--------------------------------
Frequency | 13,326 14,084 27,410
Percent | 48.62 51.38 100.00
Mean |
Age | 34.52499 36.76186 35.67435
Income | 36763.2 19508.31 27766.47
Standard deviation |
Age | 21.7909 22.80731 22.34652
Income | 47557.12 26929 39176.14
----------------------------------------------------
(Yes, we’d rather have the mean and standard deviation of each variable together. We’ll learn how soon.)
You can add frequencies for another variable with the fvfreq
statistic, or “factor variable frequency.” This is distinguishes it from frequency
(or just freq
) which refers to the variable defining the dimension.
table () (female), ///
frequency) ///
stat(percent) ///
stat(mean age income) ///
stat(sd age income) ///
stat( stat(fvfreq marital_status)
----------------------------------------------------------------
| Female
| Male Female Total
-------------------------------+--------------------------------
Frequency | 13,326 14,084 27,410
Percent | 48.62 51.38 100.00
Mean |
Age | 34.52499 36.76186 35.67435
Income | 36763.2 19508.31 27766.47
Standard deviation |
Age | 21.7909 22.80731 22.34652
Income | 47557.12 26929 39176.14
Factor-variable frequency |
Marital Status=Now married | 5,844 5,799 11,643
Marital Status=Widowed | 262 1,143 1,405
Marital Status=Divorced | 896 1,281 2,177
Marital Status=Separated | 166 269 435
Marital Status=Never married | 3,012 2,594 5,606
----------------------------------------------------------------
Now this is starting to look like “Table 1” from a paper!
Note that while we left the row dimension blank, our table not only has multiple rows, it has rows and super rows. What are they, and where did they come from? When you use the stat()
option, table
automatically creates a new dimension called result
, with one level for each result that’s calculated. If your stat
() option includes multiple variables, an additional dimension is created called var
that contains the variables. For fvfreq
this includes indicators for each level of the variable (1.marital_status, 2.marital_status, etc.).
table
then assumes that you want the result
dimension and the var
dimension in your table and finds a place for them: in this case the empty row dimension. It also made result
the super row. But you can take control by specifying where to put them. What we did before is equivalent to:
table (result var) (female), ///
frequency) ///
stat(percent) ///
stat(mean age income) ///
stat(sd age income) ///
stat( stat(fvfreq marital_status)
----------------------------------------------------------------
| Female
| Male Female Total
-------------------------------+--------------------------------
Frequency | 13,326 14,084 27,410
Percent | 48.62 51.38 100.00
Mean |
Age | 34.52499 36.76186 35.67435
Income | 36763.2 19508.31 27766.47
Standard deviation |
Age | 21.7909 22.80731 22.34652
Income | 47557.12 26929 39176.14
Factor-variable frequency |
Marital Status=Now married | 5,844 5,799 11,643
Marital Status=Widowed | 262 1,143 1,405
Marital Status=Divorced | 896 1,281 2,177
Marital Status=Separated | 166 269 435
Marital Status=Never married | 3,012 2,594 5,606
----------------------------------------------------------------
Reversing the order of result
and var
will make a super row for each variable and put their mean and standard deviation together–almost certainly what you want:
table (var result) (female), ///
frequency) ///
stat(percent) ///
stat(mean age income) ///
stat(sd age income) ///
stat( stat(fvfreq marital_status)
--------------------------------------------------------------
| Female
| Male Female Total
-----------------------------+--------------------------------
Frequency | 13,326 14,084 27,410
Percent | 48.62 51.38 100.00
Age |
Mean | 34.52499 36.76186 35.67435
Standard deviation | 21.7909 22.80731 22.34652
Income |
Mean | 36763.2 19508.31 27766.47
Standard deviation | 47557.12 26929 39176.14
Marital Status=Now married |
Factor-variable frequency | 5,844 5,799 11,643
Marital Status=Widowed |
Factor-variable frequency | 262 1,143 1,405
Marital Status=Divorced |
Factor-variable frequency | 896 1,281 2,177
Marital Status=Separated |
Factor-variable frequency | 166 269 435
Marital Status=Never married |
Factor-variable frequency | 3,012 2,594 5,606
--------------------------------------------------------------
On the other hand it makes a real mess out of fvfreq
. There are settings you can apply that will fix it, but many such settings have been collected into a style called table-1
. It’s designed for the tables of summary statistics that make up the “Table 1” of so many papers. Apply it with the style()
option:
table (var result) (female), ///
frequency) ///
stat(percent) ///
stat(mean age income) ///
stat(sd age income) ///
stat(///
stat(fvfreq marital_status) table-1) style(
-------------------------------------------------
| Female
| Male Female Total
----------------+--------------------------------
| 13,326 14,084 27,410
| 48.62 51.38 100.00
|
Age | 34.52499 36.76186 35.67435
| 21.7909 22.80731 22.34652
|
Income | 36763.2 19508.31 27766.47
| 47557.12 26929 39176.14
|
Marital Status |
Now married | 5,844 5,799 11,643
Widowed | 262 1,143 1,405
Divorced | 896 1,281 2,177
Separated | 166 269 435
Never married | 3,012 2,594 5,606
-------------------------------------------------
Better, but it still needs some polishing. You’ll learn how to change the appearance of a table in the Appearance chapter.
If you want to use the above table in a Word document, you can save it in Word format by exporting the collection that the table command created automatically. We’ll have lots more to say about collections in the next chapter.
export table1.docx, replace collect
(collection Table exported to file table1.docx)
You can export tables to many other formats as well, including Excel, HTML, PDF, and LaTeX.
Exercise 3
Create a similar table with hispanic
for the columns, and summary statistics for age
, income
, and edu
. For income
and age
, report the 10th, 25th, 50th, 75th, and 90th percentiles rather than the mean and standard deviation, but don’t worry about labeling them. (Hint: in the stat()
option, p#
requests the #th percentile.) Report overall frequencies but not percentages. (Solution)