ICPSR Summer Program
MORE TOOLS FOR BETTER MINING
QUESTIONS FROM SECOND LECTURE
If validation “hurts”, then why do you do it?
I use cross-validation to demonstrate that the models I
build using regression indeed work as well as those
produced by more elaborate methods.
The sacrifice of cases in a real analysis can weaken the
model that you are able to find. You’ll find more
predictors with more data.
Fortunately, these are mostly small effects that improve
your model, but not dramatically.
For osteoporosis, using half of the sample yields a
5-predictor model with R² = 39.5%. Fitting all 1232 cases
finds 7 (some different) predictors with R² = 40.0%.
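The trade-off in this answer can be sketched on simulated data. This is only an illustration, not the osteoporosis analysis: the data, the coefficient values, the cutoff |t| > 2, and the helper names (fit_ols, r_squared, count_significant) are all assumptions made for the example. It fits a regression on half of 1232 simulated cases, checks it on the held-back half, and then refits using every case.

```python
import numpy as np

def design(X):
    """Add an intercept column to the predictor matrix."""
    return np.column_stack([np.ones(X.shape[0]), X])

def fit_ols(X, y):
    """Least-squares coefficients (intercept first)."""
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return beta

def r_squared(beta, X, y):
    """R^2 of the fitted coefficients on the cases (X, y)."""
    resid = y - design(X) @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def count_significant(X, y, t_cut=2.0):
    """Count slopes with |t| > t_cut, using classical OLS standard errors."""
    A = design(X)
    beta = fit_ols(X, y)
    resid = y - A @ beta
    sigma2 = (resid @ resid) / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return int(np.sum(np.abs(beta[1:] / se[1:]) > t_cut))

# Simulated data (assumed): a few moderate effects and several small ones.
rng = np.random.default_rng(0)
n, p = 1232, 10
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, 0.4, 0.3, 0.2, 0.1, 0.08, 0.06, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Split the cases at random: half to build the model, half to validate it.
order = rng.permutation(n)
train, test = order[: n // 2], order[n // 2:]

beta_half = fit_ols(X[train], y[train])
r2_train = r_squared(beta_half, X[train], y[train])  # in-sample fit
r2_test = r_squared(beta_half, X[test], y[test])     # honest, out-of-sample fit

beta_full = fit_ols(X, y)
r2_full = r_squared(beta_full, X, y)

n_half = count_significant(X[train], y[train])
n_full = count_significant(X, y)

print(f"half sample: R^2 = {r2_train:.3f} in-sample, {r2_test:.3f} on held-back cases")
print(f"all cases:   R^2 = {r2_full:.3f}")
print(f"slopes with |t| > 2: {n_half} (half sample), {n_full} (all cases)")
```

The held-back half gives an honest check on the fit, at the price of fewer cases for finding the smaller effects. Changing the seed or the sizes of the small coefficients shows how much the counts can move from one split to another.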
Is regression all there is to data mining?
Well, yes and no.
Yes, because properly modified, regression works as
well as anything. Also, regression illustrates the key
properties of any of the techniques (e.g., over-fitting).
No, because you eventually need to modify regression.
Out of the box, the standard implementation runs into
The solution requires more careful standard errors, a
better way to find p-values, and a routine dose of
Animated cost slider in JMP
More about the software
JMP is a commercial statistics program (like SAS), but
only costs about $65 for the version that I use.
It can read SAS transport files.
It handles large data sets, but it is intended mostly for
visual types of data analysis.
It’s very interactive, but you can script it and indeed (as
in the cost slider) program with it as well.
It’s available from Ulrichs.
About the text
I donated a copy of the text Data Mining by David Hand
(noted in the syllabus) to the ICPSR library.
Identify the most promising candidates to receive
advanced training, a limited resource.
We have a dataset of past candidates, with a final
rating of how each turned o