// The Problem of Overfitting, or, Why You Need a Theory
// Although statisticians have been studying the problem of multiple
// testing for ages, the era of big data and data mining has made this
// problem more prominent in our thinking than it once was.
// The problem can be subtle in models with just a few parameters, but becomes
// much more readily discernible in models with many parameters.
// Suppose we have data on 11 variables, and we know the data for each
// variable is produced independently. But we go ahead and run a regression
// model anyway, and keep track of the individual p values for each parameter.
// That is, because we simulate the data so that there are no real
// relationships between the variables (only the random appearance of
// relationships), we know that the true coefficients estimated by the
// regression are all zero.
// We'll use a data set with 100 observations, and collect up the
// p values from our parameter tests after regression.
// We'll do this a thousand times, for some perspective.
postfile parms rep x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 F using "notsig.dta", replace
quietly forvalues rep=1/1000 {
    clear
    set obs 100
    forvalues i=1/11 {
        generate x`i' = runiform()
    }
    rename x11 y
    regress y x*
    forvalues i=1/10 {
        // two-sided p value from the t statistic, using the model's
        // residual degrees of freedom (100 - 10 - 1 = 89)
        scalar p`i' = ttail(e(df_r),abs(_b[x`i'])/_se[x`i'])*2
    }
    post parms (`rep') (p1) (p2) (p3) (p4) (p5) ///
        (p6) (p7) (p8) (p9) (p10) (Ftail(e(df_m), e(df_r), e(F)))
}
postclose parms
// We've collected the p values into a data set.
use "notsig.dta", clear
// We'll sort the variables by their p values, within each model. We'll
// use this ordering later.
reshape long x, i(rep) j(var)
rename x p
sort rep p
by rep: generate order=_n
// Having sorted the p values, we'll count how many models had at least
// one "significant" p value. The number of models where at least one
// parameter was significant (order=1) should be about 400!
// The probability of a model with at least one false positive is
// 1-0.95^10, or about 0.40.
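// A quick check of that arithmetic:
display 1 - 0.95^10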
tabulate order if p < 0.05
// We are willing to take the risk of a few false positives, but this
// is a far cry from 0.05!
// A Bonferroni correction adjusts either the p-value or the significance
// criterion to account for multiple tests. Here the criterion correction is
// 0.05/10 = 0.005
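// The family-wise error rate implied by this stricter criterion:
display 1 - (1 - 0.05/10)^10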
tabulate order if p < 0.005
// Which is much more like 5%!
// A Sidak correction is similar, 1-(1-0.05)^(1/10) = 0.0051162
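// Computing the Sidak criterion directly, and confirming that it holds
// the family-wise error rate to exactly 5%:
display 1 - 0.95^(1/10)
display 1 - (0.95^(1/10))^10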
tabulate order if p < (1-.95^.1)
// In this extreme case, a distinction without a difference. There
// are a number of other approaches that address family-wise error
// at the same level.
// We can take some heart in this extreme situation from the global F
// test, which still holds us to only 5% of models where we falsely
// see something going on in our data.
generate f_sig = F < 0.05 if var==1 // just one test per model, please!
tabulate f_sig
// However, if we had a mix of variables with and without true effects,
// the global F test would no longer be enough of a guide. If our model
// is not totally a "kitchen sink" model, this is going to require more
// thought.
// By throwing the p values on a graph, we can see that overall they run
// the whole range, from nearly 0 to nearly 1, and that some models have
// up to 3 or even 4 parameters with "significance." Pretty exciting,
// considering there's really nothing going on!
scatter p order
// But we also see in this graph that ordering the
// p values imposes some limits on them. We can guess that a model
// with all ten parameters giving us false positives should be extremely
// rare. A quick calculation gives us 0.05^10 or about 10^-13.
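// That calculation:
display 0.05^10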
tabstat p, by(order) statistics(mean sd)
preserve
collapse (mean) mean_p=p (sd) sd_p=p, by(order)
label variable order "ordered p values"
// bands at the mean plus or minus 1.96 standard deviations of the
// ordered p values, clipped to the [0,1] range
generate lo = mean_p - 1.96*sd_p
replace lo = 0 if lo < 0
generate hi = mean_p + 1.96*sd_p
replace hi = 1 if hi > 1
twoway (rspike lo hi order) (scatter mean_p order)
restore
// Using this ordering is a key part of several step-wise
// approaches to family-wise error control, as well as
// false discovery control.
// Benjamini-Hochberg adjusts the criterion, 0.05*order/10
generate bj = p < 0.05*order/10
table order bj, row
// with the added restriction that we reject every hypothesis up to the
// largest order whose criterion is met (a step-up procedure: a passing
// p value later in the order rescues the earlier values).
sort rep order
by rep: egen kmax = max(cond(bj, order, .))
replace bj = kmax < . & order <= kmax
drop kmax
table order bj, row