Posted on June 16, 2013
R Slopegraphs and the HHP Leaderboard
I’m still working on my visualization-fu, so when the Heritage Health Prize finally got announced, the final scores provided a simple source of data that I wanted to investigate.
I’ve written about the HHP before.  After spending three years with the competition, the winners were announced at Health Datapalooza just a few days ago.  Prior to the announcement, the teams had been ranked based on a 30% sample of the final data, so it was of some interest to see what happened to the scores against the full 100%.  For one thing, I personally dropped from 80th place to 111th, and the winners of the $500,000 prize jumped from 4th place to take the prize… not an unheard of jump, but given the apparent lead of the top 3 teams it was somewhat unexpected.  The results were published on the HHP site, but I scraped them manually into a .csv format for a little simpler manipulation.  An Excel file with the raw and manipulated data is attached here:  HHP Final Standings for convenience.
A decent visualization for this before-and-after style information is the slopegraph.   Here’s an example:
The top 50 teams from the pre-announcement list and the top 50 teams from the final standings are shown (a total of 74 teams, due to teams moving into or out of the top 50).  Lower scores are better, so the first place teams are at the bottom.  You notice a few things quickly from this chart.  First, the top three teams from the public leaderboard (the 30% dataset) had scores that worsened significantly, allowing the fourth place team to overtake them on the right (100% dataset).  It’s possible this is due to “overtraining”… where, through hundreds of test submissions, scores are artificially improved by making changes customized to the 30% and losing generality in the process.  There’s no guarantee that’s what happened here, but it’s a reasonable possibility.
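To make the "top 50 on either side" selection concrete, here is a minimal sketch on a tiny made-up data frame. The column names Public.Score and Final.Score match my .csv, but the teams and scores below are invented purely for illustration, and n is set to 3 instead of 50 so the result is easy to check by eye.

```r
# Synthetic scores for illustration only; these are NOT the real HHP standings.
toy <- data.frame(
  Team         = c("A", "B", "C", "D", "E", "F"),
  Public.Score = c(0.440, 0.445, 0.450, 0.455, 0.460, 0.465),
  Final.Score  = c(0.462, 0.458, 0.470, 0.457, 0.463, 0.461)
)

# Lower scores are better, so rank 1 is the best team on each leaderboard.
toy$prerank <- rank(toy$Public.Score)
toy$rank    <- rank(toy$Final.Score)

# Keep every team that is in the top n on either leaderboard.
n <- 3
selected <- toy[toy$prerank <= n | toy$rank <= n, ]

# Positive change means the team lost places between public and final.
selected$change <- selected$rank - selected$prerank
print(selected[, c("Team", "prerank", "rank", "change")])
```

With n = 3 this keeps five of the six teams, because two teams move into the top 3 and two move out. The same effect is why the top-50 chart ends up with 74 lines rather than 50.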
The other blatant result is the upward slope of practically every entry.  This would be expected if everyone overtrained in some way, but it could also be due to some data bias between the 30% and 70% that made the held-back data harder to predict.  To get a bit more of an idea, we expand the list to include the top 500 teams on each side in the same manner:
I could have fixed the upper and lower limits of the y-axis, but I chose not to.  You still see that the upward trend holds pretty consistently.  There appear to be two different typical slopes, one steeper than the other, but I haven’t gone to any lengths to prove that mathematically.  Alas, I seem to be in the steeper group, which means I lost more ground than those in the shallower group, thus my 30+ position drop.
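I haven’t run this on the real standings, but one crude way to start checking the "two slopes" impression would be to compute each team's slope and see whether the values fall into two clumps. Since the two columns of the chart sit one unit apart on the x-axis, the slope is just the final score minus the public score. The sketch below uses synthetic data with two groups deliberately built in, and k-means with two centers is my own choice of a quick-and-dirty split, not a proof of anything.

```r
set.seed(1)

# Synthetic data for illustration only: two made-up groups of 100 teams whose
# scores worsen by about 0.005 and about 0.012 respectively.
public     <- runif(200, 0.45, 0.50)
worsening  <- c(rnorm(100, mean = 0.005, sd = 0.0005),
                rnorm(100, mean = 0.012, sd = 0.0005))
toy <- data.frame(Public.Score = public, Final.Score = public + worsening)

# The x positions are 1 and 2, so slope = change in score.
toy$slope <- toy$Final.Score - toy$Public.Score

# Split the slopes into two clusters and look at the cluster centers and sizes.
groups  <- kmeans(toy$slope, centers = 2, nstart = 10)
centers <- sort(as.numeric(groups$centers))
print(centers)
print(table(groups$cluster))
```

On real data the clusters would not be this clean, so a histogram of the slopes would be a sensible first look before trusting any two-group split.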
One last interesting tidbit:  while in the first graph it appears that the overall range of scores is reduced (the left side of the graph is much taller than the right side), that trend is far less pronounced in the second chart.  This is a good example of the risk of limiting the analysis to only the top 50 scores, and of how the obvious outliers (the top 3 public scores) can alter how a chart is perceived.  Then again, some reduction was necessary.  For completeness and comparison, here is the completely unfiltered chart:
Obviously this paints a very flat and oddly distributed picture.  The range of outliers at the top (the very bad entries, probably made by teams just experimenting and making single near-random entries) squishes the important detail near the peloton until it is difficult to discern.
So, a quick little chart analysis that led to some interesting insights.  For those that are interested, here’s the R code for the top-50 chart (which is easy to modify to produce the others):
hhp <- read.csv("HHP Final Standings Sparkline Source.csv")
# Strip odd characters from team names (keep letters, digits, slashes, apostrophes, spaces)
hhp$Team.Name. <- gsub("[^[:alnum:]///' ]", "", hhp$Team.Name.)
# x positions of the two columns of the slopegraph
hhp$left <- 1
hhp$right <- 2
# Rank teams on each leaderboard (lower score is better)
hhp$rank <- rank(hhp$Final.Score)
hhp$prerank <- rank(hhp$Public.Score)

png(file="hhp_slopegraph_full.png", width=6, height=6, units="in", res=600)
palette(grey(0:75/75))
# Keep any team that is in the top 50 of either leaderboard
with(hhp[hhp$rank<=50 | hhp$prerank<=50,], {
  xrange <- c(.75, 2.25)
  yrange <- range(c(Public.Score, Final.Score))
  # Empty plot first, then grid, axis labels and one line per team
  plot(xrange, yrange, type="n", ylab="score", xaxt="n", xlab="")
  grid()
  axis(1, at=1:2, labels=c("public","final"))
  # Each segment is shaded by the team's final rank
  segments(left, Public.Score, right, Final.Score, col=rank)
})
dev.off()
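As a sketch of the "easy to modify" part, the same steps can be wrapped in a small function that takes the cutoff as an argument. The name slopegraph_top_n is hypothetical (it is not in my original script), the data frame is synthetic so the example runs without the .csv, and the demo draws to a throwaway device instead of a .png file.

```r
# Hypothetical helper (not part of the original script): draw a slopegraph of
# every team in the top n of either leaderboard on the current graphics device,
# and return the plotted subset invisibly.
slopegraph_top_n <- function(scores, n) {
  scores$rank    <- rank(scores$Final.Score)
  scores$prerank <- rank(scores$Public.Score)
  keep <- scores[scores$rank <= n | scores$prerank <= n, ]
  plot(c(.75, 2.25), range(c(keep$Public.Score, keep$Final.Score)),
       type = "n", ylab = "score", xaxt = "n", xlab = "")
  grid()
  axis(1, at = 1:2, labels = c("public", "final"))
  segments(1, keep$Public.Score, 2, keep$Final.Score, col = keep$rank)
  invisible(keep)
}

# Synthetic stand-in for the .csv, for illustration only.
set.seed(42)
toy <- data.frame(Public.Score = runif(1000, 0.44, 0.60))
toy$Final.Score <- toy$Public.Score + runif(1000, 0.002, 0.015)

pdf(file = NULL)            # throwaway device for the demo; use png(...) for a real file
palette(grey(0:75/75))      # same grey palette as the original script
top50  <- slopegraph_top_n(toy, 50)
top500 <- slopegraph_top_n(toy, 500)
dev.off()
```

For the unfiltered chart, passing n = nrow(toy) keeps every team.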