* Coarsened measure
* If the effect is linear, coarsening generally leaves {&beta}{sub:1}
* unbiased but biases {&beta}{sub:0} (exceptions are noted below).
* The direction of the intercept bias depends on the direction in
* which coarsening shifts the measure, on average.
* Uniformly distributed data is the easy case.
* If coarsening lumps the data into bands/categories/classes and each
* value is recorded as one of the band's boundaries (whether the
* floor or the ceiling), the linear effect is unchanged but the
* intercept shifts.
* If values are instead recorded as the midpoint of each band, both
* the linear effect and the intercept are left unbiased.
* Suppose the independent variable is lumpy, with a recurring normal
* distribution; perhaps there are population booms at 5-year intervals.
* Suppose further that the data are coarsened at these 5-year spikes.
* Then the estimated linear effect becomes biased. The more pronounced
* the spikes in the original data, the greater the bias due to
* coarsening (coarsening produces leverage?). Changing the data point
* used to represent each band merely shifts the intercept. Coding data
* at the interval midpoint minimizes the effect of the bias, as the
* true line and the biased line then intersect at the data mean.
* Suppose the data is lumpy with a recurring skewed distribution
* (exponential), and suppose the data is coarsened to the lower bound
* of each band. Here the linear effect is unbiased, but the intercept
* has shifted. Shifting to the midpoint reduces the intercept bias,
* while shifting to the band mean eliminates it (but given coarse data,
* you won't generally know what the band mean was). Sharpness of the
* spike makes no difference.
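* A minimal sketch of the intercept arithmetic for the uniform case
* described above (an illustration, not part of the simulation itself).
* If inc = b0 + b1*age and age = age5 + d, where d is independent of
* age5, then E[inc | age5] = (b0 + b1*E[d]) + b1*age5: the slope is
* unchanged and the intercept moves by b1*E[d].
* The scalar names demo_b1 and demo_width are hypothetical and are
* dropped again immediately.
scalar demo_b1 = 100     // true slope, as in the simulation below
scalar demo_width = 5    // band width in years
display "floor coding:    _cons shifts by " demo_b1*demo_width/2
display "ceiling coding:  _cons shifts by " -demo_b1*demo_width/2
display "midpoint coding: _cons shifts by " 0
scalar drop demo_b1 demo_width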
postfile results sample measure b0 b1 using results, replace
forvalues i = 1/250 {
    clear
    quietly set obs 250
    // lumpy age distribution: five cohorts of 50, spaced 5 years apart
    *generate age = 20 + 5*ceil(_n/50) + runiform(0,5)
    *generate age = 20 + 5*ceil(_n/50) + runiform(-5,5)
    generate age = 20 + 5*ceil(_n/50) + rexponential(5)
    *generate age = 20 + 5*ceil(_n/50) + rnormal(0, 2.5)
    *generate age = 20 + 5*ceil(_n/50) + abs(rnormal(0, 2.5))
    // linear effect
    generate inc = 1000 + 100*age //+ rnormal(0, 500)
    // measure 0: regression on the true age
    quietly regress inc age
    post results (`i') (0) (_b[_cons]) (_b[age])
    // coarsen age measure, 5 year intervals
    *generate age5 = 5*ceil(age/5) // shift age up, shift _cons down
    generate age5 = 5*floor(age/5) // shift age down, shift _cons up
    *generate age5 = 5*floor(age/5) + 2.5 // shift to midpoint
    *replace age5 = 20 if age5 < 20
    *generate age5 = 5*ceil(age/5) - 2.5 // shift to midpoint
    *generate age5 = 5*floor(age/5) + 1.8 // shift exp to band mean
    // measure 1: regression on the coarsened age
    quietly regress inc age5
    post results (`i') (1) (_b[_cons]) (_b[age5])
}
postclose results
use results, clear
reshape wide b0 b1, i(sample) j(measure)
summarize b00 b10 b01 b11
gen b0shift = b01 - b00
label variable b0shift "{&Delta}{&beta}{sub:0}"
gen b1shift = b11 - b10
label variable b1shift "{&Delta}{&beta}{sub:1}"
ttest b0shift==0
ttest b1shift==0
*histogram b0shift, name(b0, replace)
*histogram b1shift, name(b1, replace)
*graph combine b0 b1
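* Sketch (an addition, not part of the original analysis): express the
* intercept shift in age units. When the slope is essentially unchanged
* (b1shift near zero), b0shift is approximately b1 * mean(age - age5),
* so dividing by the slope recovers the implied mean offset between the
* true and the coarsened age. The variable name implied_offset is
* hypothetical. The figure is only meaningful when b1shift is near zero.
generate double implied_offset = b0shift / b10
label variable implied_offset "implied mean of (age - age5), years"
summarize implied_offset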
// quadratic effect
postfile results sample measure b0 b1 b2 using results, replace
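forvalues i = 1/100 {
    clear
    quietly set obs 500
    // lumpy, skewed age distribution: ten cohorts of 50
    generate age = 20 + 5*ceil(_n/50) + rexponential(3)
    generate inc = 1000 + 100*age - 0.5*age^2 + rnormal(0, 500)
    // measure 0: quadratic fit on the true age
    quietly regress inc c.age##c.age
    post results (`i') (0) (_b[_cons]) (_b[age]) (_b[c.age#c.age])
    // coarsen to 5-year bands, coded at ceiling - 3 (= floor + 2)
    generate age5 = 5*ceil(age/5) - 3
    // measure 1: quadratic fit on the coarsened age
    quietly regress inc c.age5##c.age5
    post results (`i') (1) (_b[_cons]) (_b[age5]) (_b[c.age5#c.age5])
}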
postclose results
use results, clear
reshape wide b0 b1 b2, i(sample) j(measure)
summarize b00 b10 b20 b01 b11 b21
gen b0shift = b01 - b00
gen b1shift = b11 - b10
gen b2shift = b21 - b20
summarize b0shift b1shift b2shift
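* Sketch (an addition): mirror the linear case by testing whether each
* mean shift differs from zero. With only 100 replications and noisy
* inc, these tests may have limited power, so treat them as indicative.
ttest b0shift==0
ttest b1shift==0
ttest b2shift==0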