3 Simulating Regression Models

Being able to generate your own example data is an imporant tool. You know where your data came from, which puts you in a position to judge how well your analysis recovered the essential features of the data.

3.1 Simple Regression

You begin with an empty slate, then generate \(x\). This could be from any arbitrary data distribution, so here we will use a continuous uniform distribution. Next we generate the \(\beta\)s. Here we will just pick two arbitrary values, 10 and -0.5. Then we generate the \(\epsilon\) values. These come from a random Normal distribution with a mean of 0 and an arbitrary \(\sigma\), 1.5.

Finally, we combine all of these to generate \(y\). The model for which we are creating data is:

\[ y \sim N(10-0.5x, 1.5^2)\]

clear all
set obs 50 // an arbitrary _N

3.1.1 Generate Data

generate x = runiform(0,20)
generate b0 = 10
generate b1 = -0.5
generate e = rnormal(0, 1.5) // sigma == 1.5

generate y = b0 + b1*x + e

The first five observations look like this:

list * in 1/5, noobs

(Rather than generating so much repeated data, we will find more efficient ways of doing this in future examples.)

3.1.2 Plot the Data

A scatter plot graphing \(y\) with \(x\):

graph twoway scatter y x

3.1.3 Plot the Random Component

We might graph the random element, \(e\), in several different ways.

histogram e, name(h)
kdensity e, name(kd)
qnorm e, name(q)
graph combine h kd q

3.1.4 Plot the Model

Then a graph including both the data and the regression line we are fitting is:

graph twoway (scatter y x) (lfit y x)

3.1.5 `Regress`

We can fit this model with the regress command in Stata:

regress y x
*anova y c.x // would fit the same model

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(1, 48)        =    187.78
       Model |  370.327761         1  370.327761   Prob > F        =    0.0000
    Residual |  94.6603772        48  1.97209119   R-squared       =    0.7964
-------------+----------------------------------   Adj R-squared   =    0.7922
       Total |  464.988138        49  9.48955383   Root MSE        =    1.4043

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -.4440427   .0324037   -13.70   0.000    -.5091948   -.3788906
       _cons |   9.144904    .405187    22.57   0.000      8.33022    9.959587
------------------------------------------------------------------------------

3.1.6 Repeat

We can repeat this example numerous times by using the simulate command. This requires us to set up our data generation and model estimation as a program (essentially, a single Stata command).

program define example1, eclass
    clear
    set obs 25
    generate x = runiform(0,25)
    generate b0 = 10
    generate b1 = -0.5
    generate e = rnormal(0, 1.5) // sigma == 1.5
    generate y = b0 + b1*x + e

    regress y x
end

simulate b0=_b[_cons] b1=_b[x] e=e(rmse), reps(250) nodots: example1
summarize

      command:  example1
           b0:  _b[_cons]
           b1:  _b[x]
            e:  e(rmse)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          b0 |        250    10.02846    .6277865   8.522183   11.82408
          b1 |        250   -.5025731    .0442147  -.6190096   -.356326
           e |        250    1.479487    .2249147   .9509845   2.075752

From here would could look at the distribution of b0 and b1.

3.2 Simple ANOVA (t-test)

A two group ANOVA is the same model as a two-group t-test.

For this example we’ll code \(group \in {0,1}\). The mean of \(y\) for each group is 0 and 2, respectively. And \(y\) has a \(\sigma=2\).

\[y \sim N(0+2*group, 2^2)\]

3.2.1 Generate Data

clear
set obs 50
generate group = _n > 25 // by observation number

generate y = rnormal(group*2, 2)

number of observations (_N) was 0, now 50

3.2.2 Plot the Data

We can plot our data as a scatterplot. It is also common practice to graph this as a boxplot.

twoway (scatter y group) (lfit y group), name(s)
graph box y, over(group) name(b)
graph combine s b

(file ttplot.png written in PNG format)

3.2.3 `regress`

This is a model we could fit with several different commands.

ttest y, by(group)
oneway y group, mean
anova y group

As a regression, we would use

regress y i.group

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(1, 48)        =     14.36
       Model |  56.9630552         1  56.9630552   Prob > F        =    0.0004
    Residual |  190.443253        48  3.96756776   R-squared       =    0.2302
-------------+----------------------------------   Adj R-squared   =    0.2142
       Total |  247.406308        49  5.04910832   Root MSE        =    1.9919

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     1.group |   2.134723   .5633875     3.79   0.000     1.001957     3.26749
       _cons |  -.4549388   .3983751    -1.14   0.259    -1.255926     .346048
------------------------------------------------------------------------------

Note the categorical (“factor”) variable prefix. In this particular example, because group has only the two levels, already coded 0 and 1, we could get away with skipping the prefix.

(The oneway and anova commands both assume group is a categorical variable. You could still add the prefix, for clarity.)