// The Problem of Overfitting, or, Why You Need a Theory

// Although statisticians have been studying the problem of multiple
// testing for ages, the era of big data and data mining has made this
// problem more prominent in our thinking than it once was.
// The problem can be subtle in models with just a few parameters, but
// becomes much more readily discernible in models with many parameters.

// Suppose we have data on 11 variables, and we know the data for each
// variable is produced independently. But we go ahead and run a
// regression model anyway, and keep track of the individual p values
// for each parameter. That is, if we simulate data so that we know
// there are no real relationships between the variables (only the
// random appearance of relationships), we will know that the true
// coefficients estimated in the regression are all zero.

// We'll use a data set with 100 observations, and collect the
// p values from our parameter tests after each regression.
// We'll do this a thousand times, for some perspective.
postfile parms rep x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 F using "notsig.dta", replace
quietly forvalues rep=1/1000 {
	clear
	set obs 100
	// 11 mutually independent uniform variables; the 11th is the outcome
	forvalues i=1/11 {
		generate x`i' = runiform()
	}
	rename x11 y
	regress y x*
	// two-sided p value for each coefficient's t test
	forvalues i=1/10 {
		scalar p`i' = ttail(e(df_r), abs(_b[x`i'])/_se[x`i'])*2
	}
	post parms (`rep') (p1) (p2) (p3) (p4) (p5) ///
		(p6) (p7) (p8) (p9) (p10) (Ftail(e(df_m), e(df_r), e(F)))
}
postclose parms

// We've collected the p values into a data set.
use "notsig.dta", clear

// We'll sort the parameters by their p values within each model. We'll
// use this ordering later.
reshape long x, i(rep) j(var)
rename x p
sort rep p
by rep: generate order = _n

// Having sorted the p values, we'll count how many models had at least
// one "significant" p value. The probability of a model with at least
// one false positive is 1-0.95^10, or about 0.40, so the number of
// models where at least one parameter was significant (order=1) should
// be about 400!
tabulate order if p < 0.05

// We are willing to take the risk of a few false positives, but this
// is a far cry from 0.05!

// A Bonferroni correction adjusts either the p value or the
// significance criterion to account for multiple tests. Here the
// criterion correction is 0.05/10 = 0.005.
tabulate order if p < 0.005
// Which is much more like 5%!

// A Sidak correction is similar: 1-(1-0.05)^(1/10) = 0.0051162.
tabulate order if p < (1-.95^.1)
// In this extreme case, a distinction without a difference. There
// are a number of other approaches that address family-wise error
// at the same level.

// We can take some heart in this extreme situation from the global F
// test, which still holds us to only 5% of models where we falsely
// see something going on in our data.
generate f_sig = F < 0.05 if var==1 // just one test per model, please!
tabulate f_sig

// However, if we had a mix of variables with and without true effects,
// the global F test would no longer be enough of a guide. If our model
// is not totally a "kitchen sink" model, this is going to require more
// thought.

// By throwing the p values on a graph, we can see that overall they
// run the whole range, from nearly 0 to nearly 1, and that some models
// have up to 3 or even 4 parameters with "significance." Pretty
// exciting, considering there's really nothing going on!
scatter p order

// But we also see in this graph that ordering the p values imposes
// some limits on them.
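// To quantify those limits (a minimal sketch, assuming the 10 p values
// are independent Uniform(0,1) draws under the null): the k-th smallest
// of 10 such draws follows a Beta(k, 11-k) distribution with mean k/11.
// Regression p values are only approximately independent, so the
// simulated means below should land near, not exactly on, these values.
forvalues k = 1/10 {
	display "order `k': expected mean p value = " %6.4f `k'/11
}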
// We can guess that a model with all ten parameters giving us false
// positives should be extremely rare. A quick calculation gives us
// 0.05^10, or about 10^-13.
tabstat p, by(order) statistics(mean sd)

preserve
collapse (mean) mean_p=p (sd) sd_p=p, by(order)
label variable order "ordered p values"
// the spikes mark mean +/- 1.96 SD across models, clipped to [0,1];
// this shows the spread of each ordered p value, not a confidence
// interval for its mean
generate lcl = mean_p - 1.96*sd_p
replace lcl = 0 if lcl < 0
generate ucl = mean_p + 1.96*sd_p
replace ucl = 1 if ucl > 1
twoway (rspike lcl ucl order) (scatter mean_p order)
restore

// Using this ordering is a key part of several step-wise approaches to
// family-wise error control, as well as false discovery control.

// Benjamini-Hochberg adjusts the criterion to 0.05*order/10 ...
generate bh = p < 0.05*order/10
table order bh, row

// ... with the added restriction that p values later in the order
// cannot be significant if earlier values failed.
sort rep order
by rep: replace bh = 0 if bh[_n-1]==0
table order bh, row
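// A caveat on the shortcut above (a sketch, not part of the original
// analysis): the textbook Benjamini-Hochberg procedure is "step-up".
// It finds the LARGEST k with p(k) < 0.05*k/10 and rejects hypotheses
// 1 through k, which can rescue an early p value that failed its own
// threshold. The names bh_raw, maxk, and bh_stepup are introduced here
// just for illustration.
sort rep order
generate bh_raw = p < 0.05*order/10
by rep: egen maxk = max(cond(bh_raw, order, 0))
generate bh_stepup = order <= maxk
table order bh_stepup, row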