Linear Model Review
1 Exploring Linear Models with Stata
This is mostly about linear models, and a little about their implementation in Stata.
Linear models are used to summarize and describe data, to predict new data, and to test hypotheses about a population based on the data.
1.1 Some Regression Notation
A simple regression model may be written as
\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\] an expression that applies to each row of our data, \(i=1, ..., N\).
Here, each observed response, \(y_i\), is a combination of a constant, \(\beta_0\) (also called the “intercept”), some multiple or fraction of each observed independent variable, \(\beta_1 \times x_i\), and an individual random component, \(\epsilon_i\) (occasionally called an “error” term, although this word is often misleading).
The random elements, \(\epsilon_i\), are from a Normal distribution with a mean of 0 and a standard deviation of \(\sigma\) (variance \(\sigma^2\)). This is often expressed as \[\epsilon_i \sim N(0, \sigma^2)\]
It is also common to see regressions expressed in matrix form
\[Y=XB+E\]
This expression describes all of our data.
Here, \(Y\) is a column vector of observed responses, \(X\) is a matrix of independent data (including a constant column), \(B\) is a vector of coefficients (\(\beta_0\) and \(\beta_1\) in our example), and \(E\) is a vector of random elements.
We can use the very same matrix notation for the cases where we have more than one \(x\).
Because mathematicians often value brevity in notation, this may be expressed succinctly as
\[ y \sim N(\beta_0+\beta_1x, \sigma^2) \]
\[Y \sim N(XB,\sigma^2)\]
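The matrix form maps directly onto computation. As a quick sketch (in Python with NumPy rather than Stata, and with made-up numbers), we can build \(X\) with a constant column and solve \(Y = XB\) by least squares:

```python
import numpy as np

# Hypothetical data: y = 5 + 2*x exactly, so least squares
# should recover B = (5, 2) with no residual.
x = np.array([1.0, 2.0, 3.0, 4.0])
Y = 5 + 2 * x

# X includes a constant column (for beta_0) and the x column.
X = np.column_stack([np.ones_like(x), x])

# Solve Y = XB in the least-squares sense.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(B)  # approximately [5., 2.]
```

With more than one \(x\), nothing changes except that \(X\) gains extra columns and \(B\) gains extra rows.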
A variant of this model is one where \(x\) is an indicator variable. This special case is often called an ANOVA model, and is often expressed with reference to group means, \(\bar{y}_j(=\mu_j)\) rather than \(\beta_k\). Common notation here is
\[y_{ji} = \mu_j + \epsilon_{ji}\]
Then
\[\beta_0 = \bar{y}_0=\mu_0\] \[\beta_k = \bar{y}_k-\bar{y}_0=\mu_k-\mu_0\]
for \(k>0\), and
\[y \sim N(\bar{y}_j, \sigma^2)\]
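The claim that the intercept equals the first group's mean and the remaining coefficients equal differences of means is easy to check numerically. A small sketch, in Python rather than Stata, with hypothetical two-group data:

```python
import numpy as np

# Hypothetical two-group data.
y0 = np.array([10.0, 12.0, 14.0])   # group 0, mean 12
y1 = np.array([20.0, 22.0, 24.0])   # group 1, mean 22
y = np.concatenate([y0, y1])

# Indicator x: 0 for group 0, 1 for group 1.
x = np.concatenate([np.zeros(3), np.ones(3)])
X = np.column_stack([np.ones_like(x), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta_0 is the group-0 mean; beta_1 is the difference of means.
print(beta)  # approximately [12., 10.]
```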
1.1.1 Stata Regression
Let’s use the ubiquitous `auto` data set.
A simple regression might predict a vehicle’s gas mileage (\(y=`mpg`\)) using the vehicle’s weight (\(x=`weight`\)).
\[mpg_i=\beta_0+\beta_1weight_i+\epsilon_i\]
or
\[mpg_i \sim N(\beta_0+\beta_1weight_i, \sigma^2)\]
sysuse auto
regress mpg weight
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(1, 72) = 134.62
Model | 1591.9902 1 1591.9902 Prob > F = 0.0000
Residual | 851.469256 72 11.8259619 R-squared = 0.6515
-------------+---------------------------------- Adj R-squared = 0.6467
Total | 2443.45946 73 33.4720474 Root MSE = 3.4389
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0060087 .0005179 -11.60 0.000 -.0070411 -.0049763
_cons | 39.44028 1.614003 24.44 0.000 36.22283 42.65774
------------------------------------------------------------------------------
In this example, we estimate \(\beta_0=39.44\), \(\beta_1=-0.006\), and \(\sigma = 3.44\) (the Root MSE).
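To interpret these estimates, plug them into the fitted equation. For example, the predicted mileage for a hypothetical 3,000-pound car, computed here in Python from the coefficients reported above:

```python
b0 = 39.44028      # _cons from the regression output
b1 = -0.0060087    # weight coefficient

weight = 3000      # a hypothetical 3,000-lb vehicle
mpg_hat = b0 + b1 * weight
print(round(mpg_hat, 2))  # about 21.41
```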
1.1.2 Stata ANOVA
A simple ANOVA might predict a vehicle’s price (\(y=`price`\)) by the categories of domestic or foreign vehicle (\(x=`foreign`\)).
\[price \sim N(\overline{price}_k, \sigma^2)\] where \(k \in \{0,1\}\).
oneway price foreign, mean
display "sigma = " sqrt(r(rss)/r(df_r))

            | Summary of
            |    Price
   Car type |       Mean
------------+------------
   Domestic |  6,072.423
    Foreign |  6,384.682
------------+------------
      Total |  6,165.257

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1507382.66      1   1507382.66      0.17     0.6802
 Within groups       633558013     72   8799416.85
------------------------------------------------------------------------
    Total            635065396     73   8699525.97

Bartlett's test for equal variances: chi2(1) = 0.7719  Prob>chi2 = 0.380

sigma = 2966.3811
In this example, we estimate \(\bar{y}_0=6072.42\), \(\bar{y}_1=6384.68\), and \(\sigma=2966.38\).
1.1.3 Exercise
- Based on the last Stata example, calculate (by hand or with a calculator) the \(\beta_0\) and \(\beta_1\) you expect to see in a regression of `price` on `foreign`.
- Check your work by running the regression.
1.2 Hypothesis Testing
Once a regression/ANOVA model has been estimated (“fit”), it may be used as a basis for hypothesis testing: providing probabilistic evidence for or against some proposition.
We typically consider:
- An overall F test (an ANOVA, analysis of variance). Does knowing \(x\) (or all the \(x_j\)) give us a better prediction of \(y\) than simply using the mean, \(\bar{y}\)?
- Partitioned F tests, typically partitioned by collections of related indicator variables. These can be formulated in several different ways, with somewhat different interpretations.
- Parameter t-tests (Wald tests). Is there evidence that a specific \(\beta_j\) is not zero?
- Linear combinations of parameters (F tests). For custom hypothesis testing.
Some hypotheses can be framed in several equivalent ways. In addition to these tests, we can frame comparisons of models in the form of likelihood ratio tests.
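For instance, the overall F statistic reported by `regress mpg weight` can be reconstructed from the sums of squares in its ANOVA table. A quick check in Python, using the numbers from the Stata output above:

```python
# Sums of squares and degrees of freedom from `regress mpg weight`.
ss_model, df_model = 1591.9902, 1
ss_resid, df_resid = 851.469256, 72

# F is the ratio of the model mean square to the residual mean square.
F = (ss_model / df_model) / (ss_resid / df_resid)
print(round(F, 2))  # 134.62, matching Stata's F(1, 72)
```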