Posted on May 29, 2011
Musings from a few weeks of data mining
Ok, I’m still no expert data miner by a large margin, but I’ve learned a LOT in just a few weeks of playing with the Heritage Health Prize data. The folks on the Kaggle/HHP Chat Board are pretty helpful, and the internet is full of useful information. I’ve taken to using Excel and MySQL far more than any mining-specific tools. I have been interested in R and RapidMiner, and I’ve been able to set up a few basic models with those tools. One thing I’ve been very happy with is the wealth of online tutorials available for just about everything. My resident 16-year-old has been using them for a while to pick up piano and guitar songs, but I hadn’t had much use for them until now; I’m pleased to report that the quality of these free online video and web tutorials is pretty high. I have a list started as a del.icio.us tag set if you want to see what I’ve been watching.
I’ve made 9 submissions (the first two or three of which I don’t count — let’s call those ‘test’ submissions). The 9th actually had a worse score than the 8th. Now that interests me. On my tests, which include several different sampling and cross-validation methods on the two years of available data, my score on each submission improved over the last… not by much in this last case, but enough for me to feel reasonable about submitting the algorithm. Why, then, did my result against the real data using the same algorithm go backwards? One possibility is that I’ve been overfitting the data. Basically, my algorithm makes assumptions that are either unnecessary or only applicable to the sample data, and that don’t hold true for the final data. At the tolerances we’re dealing with, it’s still possible that this is just a random selection bias issue, but it’s still interesting, and it’s a common and very important problem in statistical data mining: how can you know when you’ve overfit? When do you know that you’re “trying too hard”, as it were? 🙂
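To make that check concrete, here’s a minimal sketch of the kind of train-versus-holdout comparison I mean, in Python with numpy. The `fit` and `predict` callables are hypothetical placeholders for whatever model is being tested, not my actual algorithm; the signature is: if training error keeps improving while the held-out error stalls or gets worse, that’s the classic overfitting pattern.

```python
import numpy as np

def rmse(predictions, actuals):
    """Plain root mean squared error."""
    return np.sqrt(np.mean((predictions - actuals) ** 2))

def kfold_scores(X, y, fit, predict, k=5, seed=0):
    """Compare training error vs. held-out error across k folds.

    fit(X, y) -> model and predict(model, X) -> predictions are
    placeholders for whatever algorithm is under test.
    """
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    train_errs, valid_errs = [], []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        train_errs.append(rmse(predict(model, X[train]), y[train]))
        valid_errs.append(rmse(predict(model, X[valid]), y[valid]))
    # A large gap between these two numbers suggests the model is
    # memorizing the sample rather than learning something general.
    return np.mean(train_errs), np.mean(valid_errs)
```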
I also worry about the way my data has been looking. The goal is to predict how long someone is going to spend in the hospital. Most people are going to spend zero days in the hospital, although guessing “zero” for everyone actually isn’t a terribly good estimate. If you have to guess a single number for everyone, it turns out that about 0.189 days is the best guess in terms of achieving a good “score” — the score is a modification of the Root Mean Square Error: roughly, the square root of the average squared difference between each guess and each actual value. The modification the HHP competition uses introduces a logarithm, so that precision on smaller values counts relatively more… one of the other entrants has a nice discussion of the RMSLE algorithm used here. Anyway, my point was that I’m guessing almost no actual zeros, in an effort to improve my score. I wonder if this might have been better if the metric explicitly rewarded correct zeros or integers or something of the like. I mean this for purely functional reasons — knowing that a thousand people each have a very small percentage chance of going to the hospital is less valuable than having a high expectation that a few will go and a high expectation that the rest will not. It’s not that you don’t still get useful information on the larger dataset — some people will have higher “guesses” than others, but in reality my highest estimates are barely above 1. I’d love to see the frontrunners’ results, of course, to see if they have the same situation, but I guess time will tell.
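For the curious, here’s a rough Python sketch of the RMSLE idea as I understand it (the competition’s exact formula may differ in detail), plus where a “best single guess” like that 0.189 comes from: minimizing RMSLE over a constant prediction gives the geometric mean of (actual + 1), minus 1. The toy data below is made up for illustration; the 0.189 figure would come out of the same calculation run on the real claims data.

```python
import numpy as np

def rmsle(predictions, actuals):
    """Root mean squared logarithmic error: RMSE computed on
    log(1 + x), so errors on small values weigh relatively more."""
    return np.sqrt(np.mean((np.log1p(predictions) - np.log1p(actuals)) ** 2))

def best_constant_guess(actuals):
    """The single number that minimizes RMSLE against `actuals`.

    Setting the derivative of mean((log1p(c) - log1p(a))^2) with
    respect to c to zero gives log1p(c) = mean(log1p(a)), i.e.
    c = exp(mean(log1p(a))) - 1: the geometric mean of (a + 1), minus 1."""
    return np.expm1(np.mean(np.log1p(actuals)))

# Toy example: mostly zero-day stays, a few hospitalizations.
days = np.array([0, 0, 0, 0, 0, 0, 0, 2, 5, 1])
c = best_constant_guess(days)
print(c, rmsle(np.full_like(days, c, dtype=float), days))
```

Note that guessing 0.0 everywhere on the toy data scores worse than guessing the constant above, which is the same effect the post describes: the log-based metric pushes you away from literal zeros.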
To give an idea of why I think this is an issue: I was doing the Memorial Day drive (about a 150-mile commute to my hometown for the holiday and my mom’s birthday), and I was musing about Google Maps’ directions. I’d made the trip a few weeks ago, and a friend of mine drove down separately. She’d gotten directions online, and when she called thinking she’d gotten lost, I realized that the directions were nothing like the ones I would have given her. Driving today, I thought about taking the exit the Internet had told her to take, to see why an algorithm might have chosen it. Was it really faster? Shorter? Fewer turns? What exactly are these algorithms trying to optimize? When I give directions to my house, I leave out a few turns that I take myself in favor of easy-to-identify roads with fewer instructions. Easy-to-follow instructions are worth a quarter mile or so most of the time; I’m pretty sure this is common practice. Much like the goal of the HHP competition is to come up with actionable predictions, and the error scoring is a measure that hopefully correlates with that goal, the goal of directions is to get someone to their destination efficiently; distance and time are only two measures that correlate with the goal… annoyance at getting lost and taking unnecessary turns is another important metric.
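Just to make the competing-objectives idea concrete, here’s a toy route-scoring function in Python. The weights and numbers are entirely made up for illustration (this is certainly not how any real routing engine works); the point is simply that once “number of turns” carries a cost of roughly a quarter mile each, the “best” route can change.

```python
def route_cost(distance_miles, minutes, turns,
               w_distance=1.0, w_time=0.5, w_turn=0.25):
    """Toy objective: one extra turn costs about a quarter mile,
    per the rule of thumb above. All weights are invented."""
    return (w_distance * distance_miles
            + w_time * minutes
            + w_turn * turns)

# Two hypothetical routes to the same destination:
shortcut = route_cost(10.0, 18, 9)    # shorter but twisty
main_roads = route_cost(10.5, 19, 4)  # slightly longer, fewer turns
print(shortcut, main_roads)  # the easier-to-follow route wins here
```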
I dig algorithm design… it’s really always been what I’ve enjoyed most about programming (in some sense, programming is nothing but algorithm design, of course). This project has really been engrossing, and I’m excited about what I’m learning. I hope to share some more specific information, but in the meantime there are a few other people working on the Heritage Health Prize who are also blogging — Allan Engelhardt from CYBAEA is very active and helpful on the forums, and a blog called “Another Data Mining Blog” is being written by someone who goes by Sali Mali, who seems to be exploring the Heritage data in much the same way I am, so it’s been fun to follow along. Keep up the nice blogging work, guys!