// The Problem of Overfitting, or, Why You Need a Theory

// Although statisticians have been studying the problem of multiple
// testing for ages, the era of big data and data mining has made this
// problem more prominent in our thinking than it once was.
// The problem can be subtle in models with just a few parameters, but
// becomes much more readily discernible in models with many parameters.

// Suppose we have data on 11 variables, and we know the data for each
// variable is produced independently. But we go ahead and run a
// regression model anyway, and keep track of the individual p values
// for each parameter. That is, if we simulate data so that we know
// there are no real relationships between the variables (only the
// random appearance of relationships), we will know that the true
// coefficients estimated in the regression are all zero.

// We'll use a data set with 100 observations, and collect the
// p values from our parameter tests after each regression.
// We'll do this a thousand times, for some perspective.
postfile parms rep x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 F using "notsig.dta", replace
quietly forvalues rep=1/1000 {
	clear
	set obs 100
	// 11 mutually independent uniform variables; the 11th is the outcome
	forvalues i=1/11 {
		generate x`i' = runiform()
	}
	rename x11 y
	regress y x*
	// two-sided p value for each coefficient's t test
	forvalues i=1/10 {
		scalar p`i' = ttail(e(df_r), abs(_b[x`i'])/_se[x`i'])*2
	}
	post parms (`rep') (p1) (p2) (p3) (p4) (p5) ///
		(p6) (p7) (p8) (p9) (p10) (Ftail(e(df_m), e(df_r), e(F)))
}
postclose parms

// We've collected the p values into a data set.
use "notsig.dta", clear

// We'll sort the parameters by their p values within each model. We'll
// use this ordering later.
reshape long x, i(rep) j(var)
rename x p
sort rep p
by rep: generate order = _n

// Having sorted the p values, we'll count how many models had at least
// one "significant" p value. The probability of a model with at least
// one false positive is 1-0.95^10, or about 0.40, so the number of
// models where at least one parameter was significant (order=1) should
// be about 400!
tabulate order if p < 0.05

// We are willing to take the risk of a few false positives, but this
// is a far cry from 0.05!

// A Bonferroni correction adjusts either the p value or the
// significance criterion to account for multiple tests. Here the
// criterion correction is 0.05/10 = 0.005.
tabulate order if p < 0.005
// Which is much more like 5%!

// A Sidak correction is similar: 1-(1-0.05)^(1/10) = 0.0051162.
tabulate order if p < (1-.95^.1)
// In this extreme case, a distinction without a difference. There
// are a number of other approaches that address family-wise error
// at the same level.

// We can take some heart in this extreme situation from the global F
// test, which still holds us to only 5% of models where we falsely
// see something going on in our data.
generate f_sig = F < 0.05 if var==1 // just one test per model, please!
tabulate f_sig

// However, if we had a mix of variables with and without true effects,
// the global F test would no longer be enough of a guide. If our model
// is not totally a "kitchen sink" model, this is going to require more
// thought.

// By throwing the p values on a graph, we can see that overall they
// run the whole range, from nearly 0 to nearly 1, and that some models
// have up to 3 or even 4 parameters with "significance." Pretty
// exciting, considering there's really nothing going on!
scatter p order

// But we also see in this graph that ordering the p values imposes
// some limits on them.
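// To quantify those limits (a minimal sketch, assuming the 10 p values
// are independent Uniform(0,1) draws under the null): the k-th smallest
// of 10 such draws follows a Beta(k, 11-k) distribution with mean k/11.
// Regression p values are only approximately independent, so the
// simulated means below should land near, not exactly on, these values.
forvalues k = 1/10 {
	display "order `k': expected mean p value = " %6.4f `k'/11
}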
// We can guess that a model with all ten parameters giving us false
// positives should be extremely rare. A quick calculation gives us
// 0.05^10, or about 10^-13.
tabstat p, by(order) statistics(mean sd)

preserve
collapse (mean) mean_p=p (sd) sd_p=p, by(order)
label variable order "ordered p values"
// the spikes mark mean +/- 1.96 SD across models, clipped to [0,1];
// this shows the spread of each ordered p value, not a confidence
// interval for its mean
generate lcl = mean_p - 1.96*sd_p
replace lcl = 0 if lcl < 0
generate ucl = mean_p + 1.96*sd_p
replace ucl = 1 if ucl > 1
twoway (rspike lcl ucl order) (scatter mean_p order)
restore

// Using this ordering is a key part of several step-wise approaches to
// family-wise error control, as well as false discovery control.

// Benjamini-Hochberg adjusts the criterion to 0.05*order/10 ...
generate bh = p < 0.05*order/10
table order bh, row

// ... with the added restriction that p values later in the order
// cannot be significant if earlier values failed.
sort rep order
by rep: replace bh = 0 if bh[_n-1]==0
table order bh, row
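// A caveat on the shortcut above (a sketch, not part of the original
// analysis): the textbook Benjamini-Hochberg procedure is "step-up".
// It finds the LARGEST k with p(k) < 0.05*k/10 and rejects hypotheses
// 1 through k, which can rescue an early p value that failed its own
// threshold. The names bh_raw, maxk, and bh_stepup are introduced here
// just for illustration.
sort rep order
generate bh_raw = p < 0.05*order/10
by rep: egen maxk = max(cond(bh_raw, order, 0))
generate bh_stepup = order <= maxk
table order bh_stepup, row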