ICPSR Summer Program
DATA MINING WITH REGRESSION
TOPICS FROM FIRST LECTURE
How does that stepwise tool work?
We’ll use stepwise regression and see how it picks the
features to add to a model. It’s just automating
something most of us have done manually from time to time.
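Here is a rough sketch of what a forward stepwise search does. It is an illustration, not the code behind any particular stepwise tool. The function names (forward_stepwise, dot, center), the entry threshold t_enter = 2.0, and the simulated data are all assumptions made for this example. At each step it adds the candidate that most reduces the residual sum of squares, and it stops when the best candidate's t-statistic falls below the threshold.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def forward_stepwise(columns, y, t_enter=2.0):
    """Greedy forward selection (illustrative sketch).

    columns: list of predictor columns, each a list of n numbers.
    y:       response, a list of n numbers.
    t_enter: hypothetical entry threshold on |t| for the next predictor.
    Returns the indices of the chosen columns, in order of entry.
    """
    n = len(y)
    resid = center(y)  # residuals from the intercept-only model
    # Candidates are kept orthogonal to everything already in the model.
    candidates = {j: center(col) for j, col in enumerate(columns)}
    chosen = []
    while candidates:
        # Find the candidate giving the largest drop in residual SS.
        best_j, best_gain = None, 0.0
        for j, z in candidates.items():
            zz = dot(z, z)
            if zz > 1e-12:
                gain = dot(z, resid) ** 2 / zz
                if gain > best_gain:
                    best_j, best_gain = j, gain
        df = n - len(chosen) - 2  # intercept + current model + new term
        if best_j is None or df <= 0:
            break
        rss_after = dot(resid, resid) - best_gain
        # t-statistic the new predictor would have if added.
        t = math.sqrt(best_gain * df / rss_after) if rss_after > 1e-12 else math.inf
        if t < t_enter:
            break
        # Add it: update residuals, then sweep it out of the other candidates.
        z = candidates.pop(best_j)
        zz = dot(z, z)
        slope = dot(z, resid) / zz
        resid = [r - slope * a for r, a in zip(resid, z)]
        candidates = {
            j: [u - (dot(c, z) / zz) * a for u, a in zip(c, z)]
            for j, c in candidates.items()
        }
        chosen.append(best_j)
    return chosen

# Simulated data: only columns 2 and 5 carry signal.
random.seed(0)
n = 200
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(8)]
y = [3 * columns[2][i] + 2 * columns[5][i] + random.gauss(0, 1) for i in range(n)]
chosen = forward_stepwise(columns, y)
print(chosen)
```

With a threshold as loose as |t| > 2, a noise column can sneak in after the real ones. That is the over-fitting problem the Bonferroni rule is meant to control.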
Are predictive methods useful in social science?
It depends on your data.
If you’ve done a randomized experiment, there’s little
need for data mining. If not, it has a powerful role.
For example, build your substantive model, motivated
by your theory. Is that all there is?
If so, data mining should not be able to add anything
to your description. If it does, then it can be useful to
amend your model.
Is that Bonferroni rule for real?
The rule is named for the Bonferroni inequality.1
The concern in data mining is over-fitting. Suppose
you build a model by picking from a collection of, say,
10,000 predictors. (Common in genomics these days.)
If you use a 5% test, you’ll end up with about 500
predictors on average (5% of 10,000), even if there’s no signal.
1 Which is sometimes called Boole’s inequality. Who knows.
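A quick simulation of the 10,000-predictor example. This is a sketch under a simplifying assumption: with no signal, each test's p-value is taken to be an independent Uniform(0, 1) draw, so no regression is actually fit. The variable names are made up for the illustration.

```python
import random

random.seed(1)

p = 10_000      # number of candidate predictors, none with real signal
alpha = 0.05    # per-test level

# Under the null, a p-value is uniform on (0, 1).
p_values = [random.random() for _ in range(p)]

false_positives = sum(pv < alpha for pv in p_values)
expected = p * alpha  # 5% of 10,000

print(false_positives, expected)
```

The count bounces around from run to run, but it stays near 500.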
To control the Type 1 error rate (the chance for a false
positive), we can use the Bonferroni method, among
others. Let Ei mean we goofed on Xi.
\[
P(\text{some goof}) = P(E_1 \text{ or } E_2 \text{ or } E_3 \text{ or } \cdots \text{ or } E_p)
\le P(E_1) + P(E_2) + P(E_3) + \cdots + P(E_p)
\]
If we do each test at level .05/p, we’re protected.
\[
P(\text{some goof}) \le p \times \frac{.05}{p} = .05
\]
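To see the protection at work, here is a small Monte Carlo sketch. It estimates the chance of at least one false positive among p null tests, once at level .05 and once at level .05/p. The function name familywise_error_rate, the choice p = 100, and the number of trials are assumptions for the illustration. The tests are simulated as independent only for convenience; the Bonferroni bound itself does not need independence.

```python
import random

def familywise_error_rate(p, level, trials=2000, seed=2):
    """Estimate P(some goof): the share of trials in which at least one
    of p null tests (uniform p-values) rejects at the given level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < level for _ in range(p)):
            hits += 1
    return hits / trials

p = 100
naive = familywise_error_rate(p, 0.05)           # each test at 5%
bonferroni = familywise_error_rate(p, 0.05 / p)  # each test at .05/p

print(naive, bonferroni)
```

Testing each predictor at 5% makes some goof nearly certain. Testing at .05/p keeps the chance of any goof at or below about .05.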
From the “Why should I care about DM?” dept:
CUSTOM COUPONS. Then there's the loyalty card. Think it's low
tech? It may be, but the data it generates is a thick vein of gold for
retailers. Using the latest data-mining techniques, they can extract
information that will help them stock their stores, discount products,
and woo you back into their shops with sales on the product they
know you like.
Loyalty cards, issued