Robert Stine
ICPSR Summer Program, July 2005
Data Mining Lecture 2

DATA MINING WITH REGRESSION

TOPICS FROM THE FIRST LECTURE

How does that stepwise tool work?
We'll use stepwise regression and see how it picks the features to add to a model. It's just automating something most of us have done manually from time to time.

Are predictive methods useful in social science?
It depends on your data. If you've done a randomized experiment, there's little need for data mining. If not, it has a powerful role. For example, build your substantive model, motivated by your theory. Is that all there is? If so, data mining should not be able to add anything to your description. If it does, then it can be useful to amend your model.

Is that Bonferroni rule for real?
The rule is named for the Bonferroni inequality.[1] The concern in data mining is over-fitting. Suppose you build a model by picking from a collection of, say, 10,000 predictors. (Common in genomics these days.) If you use a 5% test, you'll end up with 500 predictors even if there's no signal.

To control the Type I error rate (the chance of a false positive), we can use the Bonferroni method, among others. Let Ei mean we goofed on Xi. Then

  P(some goof) = P(E1 or E2 or ... or Ep) ≤ P(E1) + P(E2) + ... + P(Ep).

If we do each test at level .05/p, we're protected:

  P(some goof) ≤ .05/p + .05/p + ... + .05/p   (p terms)  = .05.

[1] Which is sometimes called Boole's inequality. Who knows.

From the "Why should I care about DM?" dept: CUSTOM COUPONS

Then there's the loyalty card. Think it's low tech? It may be, but the data it generates is a thick vein of gold for retailers. Using the latest data-mining techniques, they can extract information that will help them stock their stores, discount products, and woo you back into their shops with sales on the products they know you like. Loyalty cards, issued by pharmacy stores like CVS and Duane Reade and supermarkets like Kroger and Albertsons, are one of the most popular gauges of shopping habits. Most of these stores offer special discounts only to card users. And each time you use it, you're helping the store take note of your shopping list. No wonder you've been getting coupons for just the kind of cinnamon cookies or organic yogurt you like -- the store knows that you bought them on a previous visit. And maybe you noticed that the customer toting a baby in front of you got different coupons, for diapers and baby food. It's called targeted marketing -- and it's being used more and more by supermarkets.

And finally, from Lake Wobegon, this summary of 12th-grade opinions in Monitoring the Future: A Continuing Study of American Youth, 2002...

CHALLENGE

Identify those women who are most at risk of osteoporosis, a condition in which bones lose calcium and become brittle. Osteoporosis is not so dangerous in itself (though there are complications with arthritis), but it can lead to severe injury if a person falls and breaks a bone. It is not a serious health risk if treated. Osteoporosis is easily detected by an X-ray of the hip, but these are expensive. Low bone mass (defined as 1.5 SDs below normal) can be treated by a variety of therapies, ranging from diet and exercise to pharmaceuticals.

Problem
Many women (and men) remain unaware of the problem, and the cost of the X-ray deters others.

Solution
Come up with a method to identify those individuals who are most at risk and so justify the cost of an X-ray. If predictions offer a "cost effective" approach, then HMOs might find the result more compelling than someone's suspicion that they have a problem.
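To make the Bonferroni arithmetic above concrete, here is a small simulation sketch: score a response against a large collection of pure-noise predictors and count how many are flagged at the per-test 5% level versus the 0.05/p level. The sample size, the 2,000-predictor count, and the use of simple correlation tests are my own illustrative choices, not part of the lecture.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 2000                       # cases and pure-noise predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                 # response unrelated to every column of X

# two-sided p-value for the simple correlation of each column with y
r = (X - X.mean(0)).T @ (y - y.mean()) / (n * X.std(0) * y.std())
t = r * np.sqrt((n - 2) / (1 - r**2))
pvals = 2 * stats.t.sf(np.abs(t), df=n - 2)

print("flagged at 0.05   :", (pvals < 0.05).sum())      # roughly 0.05 * p false positives
print("flagged at 0.05/p :", (pvals < 0.05 / p).sum())  # almost always 0

With 10,000 candidate predictors the same arithmetic gives the roughly 500 spurious "discoveries" mentioned above.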
DATA

A sample of 1232 post-menopausal women living in nursing homes. What are the implications for extrapolating from these data?

Response
The standardized hip z-score, relative to "young normal". The ideal score is Gaussian, with µ = 0, σ = 1. If your score is -2, you are 2 standard deviations below normal and would be diagnosed as having osteoporosis.

Wide data set
Data collection is an example of "planning by committee." Some want women to fill in a questionnaire. Lab scientists believe biochemical markers are the keys. Doctors believe they know from experience. The result is 127 columns of data.

Data browsing
OSTEO.JMP. It is always interesting to see what we have. (Alas, I cannot distribute this dataset.) Did you notice anything while snooping around the data that you had not thought about? In particular, what common "pain in the butt" characteristic does this data have?

MISSING DATA

Missing data is common in this dataset, as in most real datasets. Virtually every woman is missing some of the columns (other than the response). If we were to use the familiar "listwise" deletion method to handle the analysis, we'd end up with virtually no cases.

What should we do about the missing data?
- Remove columns with "too many" missing cases?
- Exclude women who are missing too many columns?
- Fill in the missing values via an imputation procedure?
- Other suggestions?

An old-fashioned, simple approach is well suited to data mining. For every column that has missing data:
1. Fill in the missing values with a default, such as the average of the cases that are not missing.
2. Add an additional column, an indicator of whether the value is real or filled in.

This virtually doubles the number of columns, from 127 to 208. OSTEO_BIG.JMP has the resulting dataset. Draw scatterplots to see what this approach to missing data does to a regression.

PREPROCESSING THE DATA

Filling in missing data and adding indicator columns like these is an example of preprocessing. Preprocessing refers to the preliminary tasks that have to be done before you ever start modeling the data. It includes tedious chores like merging files, extracting from the data warehouse, aligning events in time, checking definitions, and so on.

Another part of preprocessing concerns feature creation. Are there other columns that you want considered? For example, the column "Fracture" sums the different types of fractures that a woman reports; it's totally collinear with those columns. Should you combine other columns, say as ratios or other transformations? Feature creation requires a great deal of insight into the properties of the data, but it should be done with no reference to the response.

The references offer much more commentary on preprocessing. For many in computer science, preprocessing is data mining.
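A minimal sketch of the fill-in-plus-indicator step described under MISSING DATA, assuming the data live in a pandas DataFrame. The toy column names and values below are placeholders; the actual OSTEO.JMP columns are not distributed with these notes.

import numpy as np
import pandas as pd

# toy stand-in for the osteoporosis data; real columns and values differ
df = pd.DataFrame({
    "AGE":    [71, 68, np.nan, 80],
    "WEIGHT": [np.nan, 140, 155, np.nan],
})

filled = df.copy()
for col in df.columns:
    if df[col].isna().any():
        # step 2: indicator of whether the value was observed (1 = filled in)
        filled[col + "_MISS"] = df[col].isna().astype(int)
        # step 1: replace missing values with the mean of the observed cases
        filled[col] = df[col].fillna(df[col].mean())

print(filled)

Each column that has any missing values gains a companion 0/1 indicator, which is how the 127 original columns grow to 208.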
THE FIVE C'S OF DATA MINING[2]

Clarity
Nothing fouls up data mining like a vague purpose or a fuzzy objective that changes as the analysis proceeds. Such confusion and ambiguity are sure to result in over-fitting or a similar problem.

Cost
What savings does the model produce? In the case of osteoporosis, we have to balance the cost of missing a woman who has a problem against the cost of sending women for needless diagnosis. Someone is going to have to figure out the costs of these errors if we're to know how well the model performs. Does the model achieve the accuracy that you need?

Comparison
How does the model produced by data mining, using an automated search, compare to a simple model or to one motivated by substantive theory?

Calibration
A predictor is calibrated if, for example, among the days when it predicts "the chances of rain are 40%", it rains on 40% of them:

  E[Y | Ŷ = y] = y.

If your predictions are not calibrated, someone can out-predict you by simply calibrating them.

Cross-validation
Use a "hold-back sample" to test the predictions of the model and estimate its accuracy. The hold-back sample is used in neither the model selection nor the estimation stages. We do not need a validation sample to pick a model, only to evaluate the model at the end of the day. Unless you predict new data, it can be very, very hard to know how well your model will perform when applied to new cases. Be careful about taking the results of your own cross-validation too literally. Often the cross-validation is optimistic because the split divides data collected in the same way, at the same point in time. When you go out and get genuinely new data, many other things may have changed.

[2] Inspired by the 5 C's of diamonds (cut, clarity, color, carat, and cost). The list is a bit arbitrary and could be expanded if we wanted it to be a lot longer. This will do for now!

PLANNING FOR VALIDATION

Validation requires a hold-back sample for testing. You must plan for validation from the start rather than gluing it on after you've done the analysis.

Key question
How much data should be used to select and estimate your model, and how much should be saved for validation?

Dilemma
Either get a poor estimate of the accuracy of a good model, or a good estimate of the accuracy of a poor model.

Reversed cross-validation
Estimation error gets smaller at the rate √n:

  SE(ȳ) = σ/√n.

But random noise dominates prediction error, swamping estimation error:

  SD(y_new − ȳ) = σ √(1 + 1/n).

So when I have a lot of data, I validate using more observations than I use for estimation. But this requires a lot of data, so that you still have enough to find a model. For example, with enough data, I might estimate a model using 1/5 of the data and validate it using the other 4/5. This choice means that I think the data-mining process will work well with just 1/5 of the data.

k-fold balanced cross-validation
Re-use the sample rather than doing the validation only once. For example, you can randomly partition your data into 5 "folds", then:
- Estimate the model on the "first" 1/5, then validate on the other 4/5 (or vice versa: fit on 4/5, validate on 1/5).
- Estimate the model on the "second" 1/5, validate on the other 4/5.
- And so on.
Obtain further accuracy by repeating the random splitting of the data into folds multiple times.

Caution
You need enough data in the selection sample to detect the "important" effects. The larger the sample you have for selecting the model, the more complex the model you can find. You have to decide whether there's enough.
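A sketch of these splitting schemes using scikit-learn's utilities. The 50/50 and 1/5-versus-4/5 splits mirror the choices described above; the placeholder data, the fixed random seeds, and the variable names are mine.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1232, 5))        # placeholder for the 208 features
y = rng.normal(size=1232)

# split-sample validation: half for choosing the model, half held back
X_est, X_val, y_est, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

# "reversed" flavor: estimate on 1/5 of the data, validate on the other 4/5
X_est, X_val, y_est, y_val = train_test_split(X, y, train_size=0.2, random_state=1)

# 5-fold scheme: each fold takes a turn being set aside
for big_idx, small_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    pass  # fit on one part and validate on the other; the lecture allows either direction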
BACK TO THE MODELING

Split-sample validation
Randomly select half of the 1232 cases for picking the model, and reserve the rest for validation. We could repeat this process multiple times; I'll show what happens once.

Feature domain
We have 208 features. Should we consider others? Ought we build more features, like the Fracture column, from the offered columns? What about interactions among the columns (especially for dealing with missing data)? That's a problem. Unless we're careful, we're opening up the scope to about

  C(208, 2) = (208 × 207)/2 = 21,528

more features. That's more than packages can handle.[3]

[3] Dean Foster and I routinely fit models that consider 67,000 features, and with newer software we have explored more than 1,000,000 predictors.

OVERVIEW OF STEPWISE REGRESSION

Forward stepwise regression is a "greedy" search procedure that builds up a regression model.
1. Start with an initial model. I usually begin with the empty model, but you can force certain features into the model. You might prefer to start with your own model and see whether stepwise finds anything to add.
2. Find the feature that improves the current model the most. At the first step, this means finding the feature with the highest correlation with the response.
3. Compute the p-value for the identified feature.[4]
   a. If the p-value is less than a threshold (called Prob-to-enter or p-to-enter), add the feature to the model. Then go back to step 2 (the relevant correlation is now a partial correlation) and continue.
   b. If the p-value is too large, stop. You've got your model.

Use the "Step" button to see each step happen. The search is greedy because it picks the feature that offers the largest immediate improvement, without looking ahead to other choices that might ultimately lead to a better model.

[4] Dean Foster and I modify stepwise regression to use a bit more care in finding reliable p-values when searching many predictors with sparse data. See our 2004 paper.

RUNNING STEPWISE

Selection from the 208 base features...[5]
- Forward stepwise selection.
- Bonferroni cut-off (0.05/208 = .00024).
- Restrict to the cases in the estimation sample.[6]

Lots more lines are there, but I clipped the output. Click the "Go" button and let it run, or click the "Step" button to watch each step. The selection process halts when none of the remaining features adds enough to pass the Bonferroni threshold (set in the "Prob to Enter" box).

[5] Use Fit Model > Stepwise Personality, then click the Run button after picking the columns from the dialog. That will get you to the stepwise dialog. The dialog rounds the Prob to Enter to 3 digits, but it's using the value .00024 that I typed in. You can also pick and lock in certain variables that you want to force into the model.

[6] This is easily done in JMP by selecting the cases in the validation group, then using Rows > Exclude to set them aside temporarily.
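For readers without JMP, here is a bare-bones sketch of the forward search just described, not a reproduction of JMP's implementation. At each step it tries each remaining candidate, adds the one whose p-value in the expanded model is smallest (the analogue of picking the largest partial correlation), and stops once that p-value exceeds the p-to-enter threshold. The toy data and the use of statsmodels are my own choices.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X, y, p_to_enter):
    """Greedy forward selection against a p-to-enter threshold."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            design = sm.add_constant(X[selected + [col]])
            pvals[col] = sm.OLS(y, design).fit().pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] > p_to_enter:       # step 3b: nothing passes, stop
            break
        selected.append(best)              # step 3a: add it and repeat the search
        remaining.remove(best)
    return selected

# toy data: only x0 and x3 carry signal
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 10)), columns=[f"x{i}" for i in range(10)])
y = 2 * X["x0"] - X["x3"] + rng.normal(size=200)
print(forward_stepwise(X, y, p_to_enter=0.05 / 10))    # Bonferroni-style cutoff

With the osteoporosis data, p_to_enter would be the Bonferroni cutoff 0.05/208 used above.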
MODEL SUMMARY

The history of the steps shows how each added predictor improves the fit of the model.

Step History
Step  Parameter  Action   "Sig Prob"  Seq SS     RSquare  Cp
1     AGE        Entered  0.0000      234.8446   0.2121   221.55
2     WEIGHT     Entered  0.0000      145.9499   0.3439   84.091
3     NTEL_URC   Entered  0.0000      20.51918   0.3625   66.484
4     YR_POST    Entered  0.0000      19.64329   0.3802   49.714
5     RHEUARTH   Entered  0.0002      15.88204   0.3945   36.539

Significant overall fit, with R2 nearly 40% and strong effects, particularly Age and Weight.

Summary of Fit
RSquare                     0.395
Root Mean Square Error      1.048
Mean of Response           -1.564
Observations (or Sum Wgts)  616

Analysis of Variance
Source    DF   Sum of Squares  Mean Square  F Ratio  Prob > F
Model     5    436.8390        87.3678      79.5023  <.0001
Error     610  670.3500        1.0989
C. Total  615  1107.1891

Parameter Estimates
Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  -0.7397   0.4186     -1.77    0.0777
WEIGHT      0.0139   0.0012     11.41    <.0001
AGE        -0.0369   0.0064     -5.75    <.0001
NTEL_URC   -0.0061   0.0013     -4.61    <.0001
YR_POST    -0.0221   0.0054     -4.09    <.0001
RHEUARTH   -0.6944   0.1827     -3.80    0.0002

WHAT ARE THE COSTS OF THIS MODEL?

Consider the errors of the following rule: "Classify a woman as needing follow-up if ŷ < -1.5."

The calibration plot[7] shows the errors. Points in the "diagonal" quadrants are correctly classified; those in the off-diagonal quadrants are incorrectly classified. Where should the red line go?

[Figure: scatterplot of ZHIP versus Predicted ZHIP, divided into quadrants by the classification cutoff.]

Classification results in tabular form. The error counts are shown in red.[8]

Actual Status by Predicted Status (Count / Row %)
        Say Osteo    Say OK       Total
OK      83 (28.82)   205 (71.18)  288
Osteo   233 (71.04)  95 (28.96)   328
Total   316          300          616

[7] JMP automatically shows you this plot of y on ŷ. I got this one by saving the predicted values from the fitted model, then plotting them myself to have access to the scatterplot tools. If you fit the line of y on x, you'll get a line with intercept 0 and slope 1.

[8] The black row percentages under the counts are the sensitivity and specificity.

Cost analysis
Suppose an X-ray costs $200. Suppose that the probability that an untreated woman with low bone mass falls and breaks something is 10%, and that the hospitalization cost in those cases is $20,000.

Costs of the classification errors:
  83 × $200 + 95 × .10 × $20,000 = $16,600 + $190,000 = $206,600

Get lower costs by changing the rule so that you find more of the women with osteoporosis. This rule "moves" the red line in the plot to the right:
  "Say osteoporosis if ŷ < -1"

Here are the resulting table and cost analysis.

Actual Status by Modified Rule (Count / Row %)
        Say Osteo    Say OK       Total
OK      171 (59.38)  117 (40.62)  288
Osteo   305 (92.99)  23 (7.01)    328
Total   476          140          616

Even though this rule makes more errors (194 vs. 178), it makes fewer expensive errors:
  171 × $200 + 23 × .10 × $20,000 = $34,200 + $46,000 = $80,200

Pay attention to the costs when making decisions![9]

[9] After all, why do we use a 5% cutoff for testing? What are the 2 errors in your test?
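The cost comparison above is simple arithmetic, so it is easy to script and re-use with other cutoffs. The helper function and its argument names below are mine; the counts and dollar figures come from the two tables above.

def rule_cost(n_false_alarms, n_missed, xray=200, p_fall=0.10, hosp=20_000):
    # $200 X-ray for each woman sent unnecessarily, plus the expected
    # cost of a $20,000 hospitalization (10% chance) for each missed case
    return n_false_alarms * xray + n_missed * p_fall * hosp

print(rule_cost(83, 95))     # cutoff at -1.5: 206,600
print(rule_cost(171, 23))    # cutoff at -1.0:  80,200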
CHECKING FOR CALIBRATION

If the costs of the two errors are equal and the predictions are calibrated, the obvious classification rule generates the lowest costs. The question to ask yourself: Is the relationship between the standardized hip score (ZHIP, the response y) and the predicted value captured by a line with intercept 0 and slope 1, or is something left over?

The plot shows that neither smoothing splines nor polynomial adjustments add any value.

[Figure: ZHIP versus Predicted ZHIP with a smoothing spline fit (lambda = 0.004194) and a degree-6 polynomial fit overlaid.]

Conclude that this fitted model is well calibrated.

ANY SIGNS OF OVER-FITTING?

Use the estimated model to predict the cases in the validation sample. Does the model claim more accuracy than it obtains? Here's the summary of the fit in the estimation sample.

Summary of Fit
RSquare                     0.395
Root Mean Square Error      1.048
Observations (or Sum Wgts)  616

When used to predict the cases in the validation sample, I get this summary of the errors.[10]

Means and Std Deviations
Group     Number  Mean      Std Dev
Estimate  616     6e-15     1.04403
Validate  616     0.063043  1.01749

The mean squared error in the validation sample is

  MSE = mean² + SD² = 0.063² + 1.0175² = 1.0393,

compared to 1.044² = 1.090 in the estimation sample. In this example, the model did better in the validation sample than in the data used to choose the model. No evidence of over-fitting.

[10] Yes, this messes up the degrees of freedom. Compare the SD to the RMSE for the regression (1.048 vs. 1.044). But it's so easy that I could not resist.

DID WE GET EVERYTHING?

We fit the model using just the basic columns as supplied. Do interactions improve the fit? Yes: we could have done better with a wider search. A full search of all interactions improves R2 to 50% (a 20% improvement), albeit using a more complicated model with more than 20 predictors.

To illustrate, if we add the following two features to the list considered in the stepwise search, we get this slightly better model:
- Weight × Height interaction (body mass index)
- Fracture? × Cal interaction

The fit is perhaps more sensible because of the presence of BMI rather than weight alone.

Response ZHIP
RSquare                     0.407833
Root Mean Square Error      1.037587
Observations (or Sum Wgts)  616

Analysis of Variance
Source    DF   Sum of Squares  Mean Square  F Ratio  Prob > F
Model     6    451.5482        75.2580      69.9043  <.0001
Error     609  655.6409        1.0766
C. Total  615  1107.1891

Parameter Estimates
Term           Estimate  Std Error  t Ratio  Prob>|t|
Intercept      -0.8768   0.4152     -2.11    0.0351
YR_POST        -0.0206   0.0054     -3.85    0.0001
AGE            -0.0324   0.0064     -5.04    <.0001
RHEUARTH       -0.6784   0.1808     -3.75    0.0002
NTEL_URC       -0.0056   0.0013     -4.33    <.0001
Weight*Height   0.0002   0.0000     11.53    <.0001
Fracture*Cal   -0.5811   0.1521     -3.82    0.0001

HOLDING BACK FOR VALIDATION HURTS

We'd have found different features had we not kept a subset for validation. The following results summarize a model fit to all 1232 cases, including searching over the two added interactions. We discover, for example, that osteoporosis is not such a problem for black women. That's known clinically, and with all of the data in play we can find it as well. We also get more precise estimates of the parameters in the model. Note that the SE for Age has dropped from 0.0064 to less than half that, 0.0031.

Response ZHIP
RSquare                     0.397544
Root Mean Square Error      1.005642
Observations (or Sum Wgts)  1232

Analysis of Variance
Source    DF    Sum of Squares  Mean Square  F Ratio   Prob > F
Model     6     817.4908        136.248      134.7240  <.0001
Error     1225  1238.8613       1.011
C. Total  1231  2056.3521

Parameter Estimates
Term           Estimate  Std Error  t Ratio  Prob>|t|
Intercept      -0.5013   0.2640     -1.90    0.0579
RACE-1         -0.3786   0.0936     -4.04    <.0001
AGE            -0.0397   0.0031     -12.66   <.0001
RHEUARTH       -0.6446   0.1273     -5.06    <.0001
OSTASE         -0.0258   0.0050     -5.22    <.0001
Weight*Height   0.0002   0.0000     17.57    <.0001
Fracture*Cal   -0.6327   0.1016     -6.23    <.0001

WHERE ARE WE?

Stepwise regression, used properly, is a powerful data-mining tool.

The five C's of data mining:
Clarity
Cost
Comparison
Calibration
Cross-validation

WHAT NEXT?

Calibration should not be taken for granted, and logistic regression often improves the calibration of a model. I'll show you why. We'll plant a few trees, too.
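Since the next lecture returns to calibration, here is one last sketch of the check described under CHECKING FOR CALIBRATION: regress the observed response on the saved predictions and ask whether the fitted line has intercept near 0 and slope near 1. The simulated values below merely stand in for the saved ZHIP predictions, which are not distributed.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
y_hat = rng.normal(loc=-1.5, scale=0.8, size=616)   # stand-in for saved predictions
y = y_hat + rng.normal(scale=1.0, size=616)         # calibrated by construction

fit = sm.OLS(y, sm.add_constant(y_hat)).fit()
intercept, slope = fit.params
print(f"intercept {intercept:.2f} (want about 0), slope {slope:.2f} (want about 1)")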