1 Tables

The table command allows you to create simple tables very easily, and fairly complex ones with more effort. Learning to use table is a great introduction to the concepts used by collect, but even if you never go on to collect you’ll find learning table worthwhile.

Start up Stata, open a new do file, and have it open the ACS data set:

clear all
use acs

This data set is a small extract from the 2000 American Community Survey. We’ll use it for examples throughout this book.

1.1 Table Structure

The starting point for any table is its structure. Almost every table will contain rows and/or columns. The simplest tables associate variables with the rows or columns, so you get a row or column for each level of that variable. (These will generally be categorical variables, or your table is likely to be unwieldy.)

To create a table with rows defined by a variable, use table followed by the variable name in parentheses:

table (marital_status)


----------------------------
                |  Frequency
----------------+-----------
Marital Status  |           
  Now married   |     11,643
  Widowed       |      1,405
  Divorced      |      2,177
  Separated     |        435
  Never married |      5,606
  Total         |     21,266
----------------------------

By default the table cells will contain frequencies, but table can do much more.

To add columns, add the name of the variable that defines the columns, again in parentheses:

table (marital_status) (female)


-------------------------------------------
                |           Female         
                |    Male   Female    Total
----------------+--------------------------
Marital Status  |                          
  Now married   |   5,844    5,799   11,643
  Widowed       |     262    1,143    1,405
  Divorced      |     896    1,281    2,177
  Separated     |     166      269      435
  Never married |   3,012    2,594    5,606
  Total         |  10,180   11,086   21,266
-------------------------------------------

To create a table with multiple columns but just one row, use empty parentheses for the row:

table () (female)


-------------------------------------
          |           Female         
          |    Male   Female    Total
----------+--------------------------
Frequency |  13,326   14,084   27,410
-------------------------------------

Exercise 1

Make a table containing frequencies (crosstabs) for hispanic and race. Which variable works best as the row variable? (Solution)

1.2 Dimensions and Levels

You can generalize this concept by thinking of the rows and columns of the table as dimensions. So far we have specified tables with one or two dimensions; what happens if we add more?

table (marital_status) (female) (hispanic)


Person is Hispanic = Not Hispanic
------------------------------------------
                |           Female        
                |   Male   Female    Total
----------------+-------------------------
Marital Status  |                         
  Now married   |  5,191    5,180   10,371
  Widowed       |    244    1,087    1,331
  Divorced      |    821    1,171    1,992
  Separated     |    137      217      354
  Never married |  2,528    2,237    4,765
  Total         |  8,921    9,892   18,813
------------------------------------------

Person is Hispanic = Hispanic
-----------------------------------------
                |          Female        
                |   Male   Female   Total
----------------+------------------------
Marital Status  |                        
  Now married   |    653      619   1,272
  Widowed       |     18       56      74
  Divorced      |     75      110     185
  Separated     |     29       52      81
  Never married |    484      357     841
  Total         |  1,259    1,194   2,453
-----------------------------------------

Person is Hispanic = Total
-------------------------------------------
                |           Female         
                |    Male   Female    Total
----------------+--------------------------
Marital Status  |                          
  Now married   |   5,844    5,799   11,643
  Widowed       |     262    1,143    1,405
  Divorced      |     896    1,281    2,177
  Separated     |     166      269      435
  Never married |   3,012    2,594    5,606
  Total         |  10,180   11,086   21,266
-------------------------------------------

The third dimension is used to create separate tables. (Sorry, no four dimensional tables!)

Each dimension has a set of levels associated with it. Take the table:

table (marital_status) (female)


-------------------------------------------
                |           Female         
                |    Male   Female    Total
----------------+--------------------------
Marital Status  |                          
  Now married   |   5,844    5,799   11,643
  Widowed       |     262    1,143    1,405
  Divorced      |     896    1,281    2,177
  Separated     |     166      269      435
  Never married |   3,012    2,594    5,606
  Total         |  10,180   11,086   21,266
-------------------------------------------

This has the dimensions marital_status and female. (It also has other dimensions that you’ll learn about later.) The layout of the table specifies that marital_status is the rows and female the columns.

The marital_status dimension has one level for each level of the variable marital_status, plus one for the totals. The level names come from the values of marital_status rather than the value labels, so they are 1, 2, 3, 4, and 5, plus .m for the totals. (You’ll learn later how to find this out for yourself.)

You can refer to a specific level of a specific dimension with the syntax dimension[level], for example marital_status[1]. You can also list multiple levels in the brackets. One use for this syntax is to specify that you only want to include certain levels in the table. For example, if you only want to include “Now married” and “Never married” in the table, you can do so with:

table (marital_status[1 5]) (female)


------------------------------------------
                |           Female        
                |   Male   Female    Total
----------------+-------------------------
Marital Status  |                         
  Now married   |  5,844    5,799   11,643
  Never married |  3,012    2,594    5,606
------------------------------------------

The totals for marital_status are omitted because they’re just another level and not in the list of levels you asked for.

If you just want to omit the totals, you can do so with the nototal option:

table (marital_status) (female), nototal


---------------------------------
                |      Female    
                |   Male   Female
----------------+----------------
Marital Status  |                
  Now married   |  5,844    5,799
  Widowed       |    262    1,143
  Divorced      |    896    1,281
  Separated     |    166      269
  Never married |  3,012    2,594
---------------------------------

Alternatively, you can use the total() option to specify you want totals for one dimension and then they will be omitted for the other:

table (marital_status) (female), total(female)


----------------------------------
                |       Female    
                |    Male   Female
----------------+-----------------
Marital Status  |                 
  Now married   |   5,844    5,799
  Widowed       |     262    1,143
  Divorced      |     896    1,281
  Separated     |     166      269
  Never married |   3,012    2,594
  Total         |  10,180   11,086
----------------------------------

Note how totals calculated across female add a level to the marital_status dimension.

Exercise 2

Create a crosstab of race and hispanic again, but this time only include White, Black, Asian, and Other. Include a total row but not a total column. (Solution)

1.3 Super Rows and Super Columns

Super rows and super columns are rows and columns that contain other rows and columns. You specify them by listing two variables for a dimension. For example:

table (marital_status) (hispanic female), nototal


-------------------------------------------------
                |        Person is Hispanic      
                |   Not Hispanic       Hispanic  
                |      Female           Female   
                |   Male   Female   Male   Female
----------------+--------------------------------
Marital Status  |                                
  Now married   |  5,191    5,180    653      619
  Widowed       |    244    1,087     18       56
  Divorced      |    821    1,171     75      110
  Separated     |    137      217     29       52
  Never married |  2,528    2,237    484      357
-------------------------------------------------

The first variable listed becomes “super.” Changing the order gives the same results, but organized differently:

table (marital_status) (female hispanic), nototal


--------------------------------------------------------------------
                |                        Female                     
                |            Male                     Female        
                |     Person is Hispanic        Person is Hispanic  
                |  Not Hispanic   Hispanic   Not Hispanic   Hispanic
----------------+---------------------------------------------------
Marital Status  |                                                   
  Now married   |         5,191        653          5,180        619
  Widowed       |           244         18          1,087         56
  Divorced      |           821         75          1,171        110
  Separated     |           137         29            217         52
  Never married |         2,528        484          2,237        357
--------------------------------------------------------------------

Choose the organization that puts the numbers your readers are most likely to compare close to each other.

Tables with super rows can get long quickly:

table (marital_status hispanic) (female), nototal


----------------------------------------
                       |      Female    
                       |   Male   Female
-----------------------+----------------
Marital Status         |                
  Now married          |                
    Person is Hispanic |                
      Not Hispanic     |  5,191    5,180
      Hispanic         |    653      619
  Widowed              |                
    Person is Hispanic |                
      Not Hispanic     |    244    1,087
      Hispanic         |     18       56
  Divorced             |                
    Person is Hispanic |                
      Not Hispanic     |    821    1,171
      Hispanic         |     75      110
  Separated            |                
    Person is Hispanic |                
      Not Hispanic     |    137      217
      Hispanic         |     29       52
  Never married        |                
    Person is Hispanic |                
      Not Hispanic     |  2,528    2,237
      Hispanic         |    484      357
----------------------------------------

1.4 Adding Statistics

By default the table command puts frequencies in the table cells, i.e. the number of observations that have that particular combination of values. You can change that with the stat() option, including calculating multiple statistics. Start with a table that has female in the column dimension (we’ll add rows soon):

table () (female)


-------------------------------------
          |           Female         
          |    Male   Female    Total
----------+--------------------------
Frequency |  13,326   14,084   27,410
-------------------------------------

This is equivalent to using the stat(frequency) option:

table () (female), stat(frequency)


-------------------------------------
          |           Female         
          |    Male   Female    Total
----------+--------------------------
Frequency |  13,326   14,084   27,410
-------------------------------------

Now add percentages by adding stat(percent):

table () (female), stat(frequency) stat(percent)


-------------------------------------
          |           Female         
          |    Male   Female    Total
----------+--------------------------
Frequency |  13,326   14,084   27,410
Percent   |   48.62    51.38   100.00
-------------------------------------

While the frequency and percent statistics are based on female itself, you can also include statistics for other variables. To do so, use the stat() option and specify a statistic like before, but then include the variable or variables the statistic is to be calculated for in the parentheses as well. For example, you can add the means of age and income with stat(mean age income) and their standard deviations with stat(sd mean income). This command is getting long, so put one stat() option per line for readability:

table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income)


----------------------------------------------------
                   |              Female            
                   |      Male     Female      Total
-------------------+--------------------------------
Frequency          |    13,326     14,084     27,410
Percent            |     48.62      51.38     100.00
Mean               |                                
  Age              |  34.52499   36.76186   35.67435
  Income           |   36763.2   19508.31   27766.47
Standard deviation |                                
  Age              |   21.7909   22.80731   22.34652
  Income           |  47557.12      26929   39176.14
----------------------------------------------------

(Yes, we’d rather have the mean and standard deviation of each variable together. We’ll learn how soon.)

You can add frequencies for another variable with the fvfreq statistic, or “factor variable frequency.” This is distinguishes it from frequency (or just freq) which refers to the variable defining the dimension.

table () (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)


----------------------------------------------------------------
                               |              Female            
                               |      Male     Female      Total
-------------------------------+--------------------------------
Frequency                      |    13,326     14,084     27,410
Percent                        |     48.62      51.38     100.00
Mean                           |                                
  Age                          |  34.52499   36.76186   35.67435
  Income                       |   36763.2   19508.31   27766.47
Standard deviation             |                                
  Age                          |   21.7909   22.80731   22.34652
  Income                       |  47557.12      26929   39176.14
Factor-variable frequency      |                                
  Marital Status=Now married   |     5,844      5,799     11,643
  Marital Status=Widowed       |       262      1,143      1,405
  Marital Status=Divorced      |       896      1,281      2,177
  Marital Status=Separated     |       166        269        435
  Marital Status=Never married |     3,012      2,594      5,606
----------------------------------------------------------------

Now this is starting to look like “Table 1” from a paper!

Note that while we left the row dimension blank, our table not only has multiple rows, it has rows and super rows. What are they, and where did they come from? When you use the stat() option, table automatically creates a new dimension called result, with one level for each result that’s calculated. If your stat() option includes multiple variables, an additional dimension is created called var that contains the variables. For fvfreq this includes indicators for each level of the variable (1.marital_status, 2.marital_status, etc.).

table then assumes that you want the result dimension and the var dimension in your table and finds a place for them: in this case the empty row dimension. It also made result the super row. But you can take control by specifying where to put them. What we did before is equivalent to:

table (result var) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)


----------------------------------------------------------------
                               |              Female            
                               |      Male     Female      Total
-------------------------------+--------------------------------
Frequency                      |    13,326     14,084     27,410
Percent                        |     48.62      51.38     100.00
Mean                           |                                
  Age                          |  34.52499   36.76186   35.67435
  Income                       |   36763.2   19508.31   27766.47
Standard deviation             |                                
  Age                          |   21.7909   22.80731   22.34652
  Income                       |  47557.12      26929   39176.14
Factor-variable frequency      |                                
  Marital Status=Now married   |     5,844      5,799     11,643
  Marital Status=Widowed       |       262      1,143      1,405
  Marital Status=Divorced      |       896      1,281      2,177
  Marital Status=Separated     |       166        269        435
  Marital Status=Never married |     3,012      2,594      5,606
----------------------------------------------------------------

Reversing the order of result and var will make a super row for each variable and put their mean and standard deviation together–almost certainly what you want:

table (var result) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status)


--------------------------------------------------------------
                             |              Female            
                             |      Male     Female      Total
-----------------------------+--------------------------------
Frequency                    |    13,326     14,084     27,410
Percent                      |     48.62      51.38     100.00
Age                          |                                
  Mean                       |  34.52499   36.76186   35.67435
  Standard deviation         |   21.7909   22.80731   22.34652
Income                       |                                
  Mean                       |   36763.2   19508.31   27766.47
  Standard deviation         |  47557.12      26929   39176.14
Marital Status=Now married   |                                
  Factor-variable frequency  |     5,844      5,799     11,643
Marital Status=Widowed       |                                
  Factor-variable frequency  |       262      1,143      1,405
Marital Status=Divorced      |                                
  Factor-variable frequency  |       896      1,281      2,177
Marital Status=Separated     |                                
  Factor-variable frequency  |       166        269        435
Marital Status=Never married |                                
  Factor-variable frequency  |     3,012      2,594      5,606
--------------------------------------------------------------

On the other hand it makes a real mess out of fvfreq. There are settings you can apply that will fix it, but many such settings have been collected into a style called table-1. It’s designed for the tables of summary statistics that make up the “Table 1” of so many papers. Apply it with the style() option:

table (var result) (female), ///
    stat(frequency) ///
    stat(percent) ///
    stat(mean age income) ///
    stat(sd age income) ///
    stat(fvfreq marital_status) ///
    style(table-1)


-------------------------------------------------
                |              Female            
                |      Male     Female      Total
----------------+--------------------------------
                |    13,326     14,084     27,410
                |     48.62      51.38     100.00
                |                                
            Age |  34.52499   36.76186   35.67435
                |   21.7909   22.80731   22.34652
                |                                
         Income |   36763.2   19508.31   27766.47
                |  47557.12      26929   39176.14
                |                                
 Marital Status |                                
  Now married   |     5,844      5,799     11,643
      Widowed   |       262      1,143      1,405
     Divorced   |       896      1,281      2,177
    Separated   |       166        269        435
Never married   |     3,012      2,594      5,606
-------------------------------------------------

Better, but it still needs some polishing. You’ll learn how to change the appearance of a table in the Appearance chapter.

If you want to use the above table in a Word document, you can save it in Word format by exporting the collection that the table command created automatically. We’ll have lots more to say about collections in the next chapter.

collect export table1.docx, replace

(collection Table exported to file table1.docx)

You can export tables to many other formats as well, including Excel, HTML, PDF, and LaTeX.

Exercise 3

Create a similar table with hispanic for the columns, and summary statistics for age, income, and edu. For income and age, report the 10th, 25th, 50th, 75th, and 90th percentiles rather than the mean and standard deviation, but don’t worry about labeling them. (Hint: in the stat() option, p# requests the #th percentile.) Report overall frequencies but not percentages. (Solution)