1 Exploring Linear Models with Stata

This is mostly about linear models, and a little about their implementation in Stata.

Linear models are used to summarize and describe data, to predict new data, and to test hypotheses about a population based on the data.

1.1 Some Regression Notation

A simple regression model may be written as

\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\] an expression that applies to each row of our data, \(i=1, ..., N\).

Here, each observed response, \(y_i\), is a combination of a constant, \(\beta_0\) (also called the “intercept”), some multiple or fraction of each observed independent variable, \(\beta_1 \times x_i\), and an individual random component, \(\epsilon_i\) (occasionally called an “error” term, although this word is often misleading).

The random elements, \(\epsilon_i\), are from a Normal distribution with a mean of 0 and a standard deviation of \(\sigma\). This is often expressed as \[\epsilon_i \sim N(0, \sigma^2)\]

It is also common to see regressions expressed in matrix form

\[Y=XB+E\]

This expression describes all of our data.

Here, \(Y\) is a column vector of observed responses, \(X\) is a matrix of independent data (including a constant column), \(B\) is a vector of coefficients (\(\beta_0\) and \(\beta_1\) in our example), and \(E\) is a vector of random elements.
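
Written out for the simple regression, these pieces are

\[Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}, \quad B = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad E = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix}\]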

We can use the very same matrix notation for the cases where we have more than one \(x\).

Because mathematicians often value brevity in notation, this may be expressed succinctly as

\[ y \sim N(\beta_0+\beta_1x, \sigma^2) \]

or, in matrix terms,

\[Y \sim N(XB,\sigma^2)\]

A variant of this model is one where \(x\) is an indicator variable. This special case is often called an ANOVA model, and is often expressed with reference to group means, \(\bar{y}_j(=\mu_j)\) rather than \(\beta_k\). Common notation here is

\[y_{ji} = \mu_j + \epsilon_{ji}\]

Then

\[\beta_0 = \bar{y}_0=\mu_0\] \[\beta_k = \bar{y}_k-\bar{y}_0=\mu_k-\mu_0\]

for \(k>0\), and

\[y \sim N(\bar{y}_j, \sigma^2)\]
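
To see why the regression and group-mean forms agree, substitute a 0/1 indicator \(x\) into the regression model:

\[x=0: \quad E(y) = \beta_0 = \mu_0\] \[x=1: \quad E(y) = \beta_0 + \beta_1 = \mu_0 + (\mu_1 - \mu_0) = \mu_1\]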

1.1.1 Stata Regression

Let’s use the ubiquitous auto data set.

A simple regression might predict a vehicle’s gas mileage (\(y=`mpg`\)) using the vehicle’s weight (\(x=`weight`\)).

\[mpg_i=\beta_0+\beta_1 weight_i+\epsilon_i\]

or

\[mpg_i \sim N(\beta_0+\beta_1weight_i, \sigma^2)\]

sysuse auto
regress mpg weight
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =    134.62
       Model |   1591.9902         1   1591.9902   Prob > F        =    0.0000
    Residual |  851.469256        72  11.8259619   R-squared       =    0.6515
-------------+----------------------------------   Adj R-squared   =    0.6467
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.4389

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0060087   .0005179   -11.60   0.000    -.0070411   -.0049763
       _cons |   39.44028   1.614003    24.44   0.000     36.22283    42.65774
------------------------------------------------------------------------------

In this example, we estimate \(\beta_0=39.44\), \(\beta_1=-.006\), and \(\sigma = 3.44\) (the Root MSE).
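
As a check on the matrix notation from earlier, the least-squares coefficients can be rebuilt by hand as \(\hat{B}=(X'X)^{-1}X'Y\). The following Mata sketch is not part of the original example; it assumes the auto data are still in memory.

mata:
y = st_data(., "mpg")                          // column vector of responses
X = (st_data(., "weight"), J(rows(y), 1, 1))   // weight plus a constant column
b = invsym(X'*X) * (X'*y)                      // (X'X)^-1 X'Y
b                                              // should match the regress coefficients
end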

1.1.2 Stata ANOVA

A simple ANOVA might predict a vehicle’s price (\(y=`price`\)) by the categories of domestic or foreign vehicle (\(x=`foreign`\)).

\[price \sim N(\overline{price}_k, \sigma^2)\] where \(k \in \{0,1\}\).

oneway price foreign, mean
            | Summary of
            |    Price
   Car type |        Mean
------------+------------
   Domestic |   6,072.423
    Foreign |   6,384.682
------------+------------
      Total |   6,165.257

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1507382.66      1   1507382.66      0.17     0.6802
 Within groups       633558013     72   8799416.85
------------------------------------------------------------------------
    Total            635065396     73   8699525.97

Bartlett's test for equal variances:  chi2(1) =   0.7719  Prob>chi2 = 0.380

In this example, we estimate \(\bar{y}_0=6072.42\), \(\bar{y}_1=6384.68\), and \(\sigma=2966.38\), where \(\sigma\) is the square root of the within-groups mean square (8799416.85) in the table above.
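
If you would rather have Stata do that arithmetic, here is a short sketch (not part of the original example). The first line simply takes the square root of the within-groups mean square; the alternative refits the same one-way model with anova, an estimation command that stores the root mean squared error in e(rmse).

display "sigma = " sqrt(8799416.85)    // square root of the within-groups MS above
anova price foreign                    // the same one-way model, fit as an estimation command
display "sigma = " e(rmse)             // root MSE for the model just fit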

1.1.3 Exercise

  • Based on the last Stata example, calculate (by hand or with a calculator) the \(\beta_0\) and \(\beta_1\) you expect to see in a regression of price on foreign.
  • Check your work by running the regression.

1.2 Hypothesis Testing

Once a regression/ANOVA model has been estimated (“fit”), it may be used as a basis for hypothesis testing: providing probabilistic evidence for or against some proposition.

We typically consider:

  • An overall F test (an ANOVA, analysis of variance). Does knowing \(x\) (or all the \(x_j\)) give us a better prediction of \(y\) than simply using the mean, \(\bar{y}\)?
  • Partitioned F tests, typically partitioned by collections of related indicator variables. These can be formulated in several different ways, with somewhat different interpretations.
  • Parameter t-tests (Wald tests). Is there evidence that a specific \(\beta_j\) is not zero?
  • Linear combinations of parameters (F tests). For custom hypothesis testing.

Some hypotheses can be framed in several equivalent ways. In addition to these tests, we can frame comparisons of models in the form of likelihood ratio tests.
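
As a concrete sketch of how these tests are issued in Stata, reusing the mpg-on-weight model from above (the particular hypotheses and the weight value of 3000 are only illustrations):

regress mpg weight
test weight = 0                   // Wald (t/F) test that the slope on weight is zero
lincom _cons + 3000*weight        // a linear combination: expected mpg at weight==3000
estimates store full
quietly regress mpg               // intercept-only model, for comparison
estimates store null
lrtest full null                  // likelihood-ratio comparison of the nested models

The overall F test appears in the header of the regress output itself.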

1.3 Diagnostics