Shooting Flying Bees is Hard

The flower patch was in full bloom today. I think we’ve finally seen everything that was in the yard when we bought it crop up (although July and August may still have some surprises). For the bees, however, this is prime pollen season, and they were very excited about it today. So I did what anyone who was in precisely my position and state of mind would do: I grabbed my camera.

Now, I have no idea if I used the right lens. The Canon EF 70-200mm f/2.8L IS II USM Telephoto Zoom Lens for Canon SLR Cameras is a magnificent lens, but it’s big and has a long minimum focusing distance. Of course, I didn’t WANT to be much closer, since getting close seemed to agitate the bees, so you never know.

Catching a few that were sitting on the flowers was easy enough:

[Image: Nice profile bee on flower]

[Image: This bee has some pollen on it (zoomed)]

R Slopegraphs and the HHP Leaderboard

I’m still working on my visualization-fu, so when the final Heritage Health Prize results were announced, the final scores provided a simple source of data that I wanted to investigate.

I’ve written about the HHP before.  After three years of competition, the winners were announced at Health Datapalooza just a few days ago.  Prior to the announcement, the teams had been ranked based on a 30% sample of the final data, so it was of some interest to see what happened to the scores against the full 100%.  For one thing, I personally dropped from 80th place to 111th, and the eventual winners jumped up from 4th place to take the $500,000 prize… not an unheard-of jump, but given the apparent lead of the top 3 teams it was somewhat unexpected.  The results were published on the HHP site, but I scraped them manually into a .csv format for slightly simpler manipulation.  An Excel file with the raw and manipulated data is attached here for convenience:  HHP Final Standings.

A decent visualization for this before-and-after style of information is the slopegraph.  Here’s an example:

[Image: hhp_slopegraph_top_50, Top 50 Teams]
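As a rough illustration of what a slopegraph encodes, here is a minimal text-mode sketch in Python (the actual graphic above was built in R; the team names and ranks below are made up, not the real leaderboard):

```python
def slope_rows(teams):
    """Pair each team's before/after rank and note the direction of change."""
    rows = []
    for name, before, after in teams:
        # Lower rank number is better, so after < before means the team moved up
        direction = 'up' if after < before else ('down' if after > before else 'even')
        rows.append((before, name, after, direction))
    return rows

# Made-up stand-ins for the 30% and 100% leaderboard ranks
teams = [('Team A', 1, 2), ('Team B', 2, 1), ('Team C', 4, 3)]

for before, name, after, direction in slope_rows(teams):
    print(f'{before:>3}  {name:<8} -> {after:>3}  ({direction})')
```

Each printed row is one “slope”: the left column is the provisional rank, the right column the final rank.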


Quick Python for ICD-10 XML Parsing

ICD-10 coding is a hot topic in medical data circles this year.  The short version is that, when you visit a doctor, there is a standard set of codes for both the Diagnoses and the Procedures relevant to your visit.  ICD, which stands for “International Classification of Diseases”, has been around since 1900… that’s right, 113 years of standard medical coding and we still have a mess of healthcare data.  Ugh.  But ICD-9, which was the first to formally include Procedure codes (as ICPM) and not just Diagnoses, started in 1979 and is due for a facelift.

ICD-10 is that facelift, and it’s a pretty large overhaul.  Where ICD-9 had over 14,000 diagnosis codes, ICD-10 has over 43,000.  Many U.S. laws (mostly those touched by HIPAA) require adherence to ICD-10 by October 2014, spawning a flurry of headless chickens, and a rich field for consulting and the spending of lots of money.

Enter my job.  I’m trying to graft the “official” ICD9/10 crosswalk and code data into a Data Warehouse, in preparation for the analysis that needs to follow.  Naturally, I go and download the official data from here:  (Broken link, see update) and set off in SSIS to get things moving, because that’s what we use here.

UPDATE (2016-08):  The cms.gov links change annually, and the old ones apparently die… the latest is here: https://www.cms.gov/Medicare/Coding/ICD10/2017-ICD-10-CM-and-GEMs.html but I’m not going to keep updating it.  Search Google for cms.gov ICD10.  Also, a very nice SEO person from zogmedia.com pointed this out, in a bit of a linkbaiting message, but, hey, they have a point, and they were cool about it… they wanted me to link to this site which may be better updated:  http://www.nuemd.com/icd-10/codes.

SSIS is plagued with issues.  I really must say that I don’t like it.  Having worked with everything from Informatica (obnote: I own some INFA stock) to mysqlimport via bash shell for ETL, SSIS is low on my list.  In particular, for this project, when trying to load the XML files provided by CMS, SSIS complained that it couldn’t handle XML with mixed content in the XMLSource widget.  Once I tweaked the .xsd (which I shouldn’t have to do) to get around this, it complained about special characters in fields and got too frustrating to deal with.  Yes, there are alternatives in SSIS, but most involve coding in Visual Basic or C# and STILL using the SSIS tool.  This is a monolithic hammer for a very simple problem.

Look, all I really want is a list of codes and descriptions from the XML document.  There is a LOT of other useful metadata in there, but for now, it can wait.  Here’s a simple (not robust) solution in a handful of python lines:


import csv
import xml.etree.ElementTree as ET

tree = ET.parse('ICD10CM_FY2013_Full_XML_Tabular.xml')
root = tree.getroot()

# newline='' keeps the csv module from writing blank rows on Windows
with open('diagnostics.csv', 'w', newline='', encoding='utf-8') as f:
    csvwriter = csv.writer(f)
    for diag in root.iter('diag'):        # Loop through every diagnostic element
        name = diag.findtext('name')      # Extract the diag code
        desc = diag.findtext('desc')      # Extract the description
        csvwriter.writerow((name, desc))  # Write one row to the .csv

And there we have a .csv which is much easier to load with whatever tool we want. This also works for the other XML files, such as the DIndex and EIndex files, except that for some reason they use different, optional tags for their hierarchies… “mainTerm”s are the parent diagnostic codes and “term”s are the optional children. I’ll leave that as an exercise, though; it’s not too bad. 😉
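For what it’s worth, that exercise can be sketched with a small recursive walk. Note the assumptions here: the “mainTerm” and “term” tag names come from the post above, but the “title” child tag and the sample fragment are my guesses, not taken from the actual CMS files:

```python
import xml.etree.ElementTree as ET

def walk_terms(elem, depth=0, rows=None):
    """Recursively collect (depth, title) pairs from a term hierarchy."""
    if rows is None:
        rows = []
    title = elem.findtext('title')        # assumed child tag holding the term text
    rows.append((depth, title))
    for child in elem.findall('term'):    # optional nested <term> children
        walk_terms(child, depth + 1, rows)
    return rows

# Hypothetical fragment mimicking the index-file structure
sample = '''<letter>
  <mainTerm><title>Abandonment</title>
    <term><title>child</title></term>
  </mainTerm>
</letter>'''

root = ET.fromstring(sample)
for main in root.iter('mainTerm'):
    for depth, title in walk_terms(main):
        print('  ' * depth + title)
```

Because the recursion follows every nested “term”, arbitrarily deep (and optional) children fall out for free.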

Two Years with the Heritage Health Prize

ED: I spoke to a reporter yesterday for a half hour or so, discussing the final stretch of the Heritage Health Prize data mining competition I’ve competed in for the past couple of years. Her article came out today and is posted here: 3-Million-Health-Puzzler-Draws-to-a-Close. I’m quoted as saying only: “They set the bar too high”. I probably said that; I said a lot of things, and I don’t want to accuse Cheryl of misquoting me (she was quite nice, and her article is helpful, well written, and correct), but I feel like a lot of context around my comment was missed, so I’m just going to write an article of my own that helps explain my perspective… I’ve been meaning to blog more anyway. 🙂

On April 4th 2011, a relatively unknown company called “Kaggle” opened a competition with a $3 Million bounty to the public. The competition was called the “Heritage Health Prize”, and it was designed to help healthcare providers determine which patients would benefit most from preventive care, hopefully saving the patients from a visit to the hospital, and saving money at the same time. And not just a little money either … the $3 Million in prize money pales in comparison to the billions of dollars that could be saved by improving preventive care. The Trust for America’s Health estimates that spending $10 in preventive care per person could save $16 billion per year, which is still just the tip of the iceberg for soaring health care prices in the United States.

Big Data: It’s not the size that matters

Many an article has been spent defining “Big Data”… everyone agrees that “Big Data” must be, well, large, and made up of data. There are seemingly new ways of handling big data: tools such as Hadoop and R (my personal favorite), concepts like NoSQL databases, and an explosion of data from new collection tools: faster and more prolific sensors, higher-quality video, and social websites. Large companies with the wherewithal to build petabyte and larger data centers are learning to collect and mine this data fairly effectively, and that’s all very exciting — there’s a wealth of knowledge to be gleaned from all this data. But what about the rest of us?

The thing is, it’s not really a matter of collecting and hoarding a large amount of data yourself. It’s how you use and take advantage of the data you do have available that is at the core of these new trends.