## When less is more

Recently the Centers for Medicare & Medicaid Services (CMS) released provider utilization and payment data. This is part of the government's ongoing push for transparency into medical services and costs. You may remember that last year they released hospital-specific Medicare data. This time the data is practitioner-specific.

This large dataset (over 9 million records) has a wealth of information if you're interested in how much doctors charge and receive from Medicare, for which treatments, in which locations, and so on. And frankly, who isn't?

There's plenty of good analysis of this stuff out on the internets, but one post that caught my eye was the one from the Brookings Institution. Part of what caught my eye was their choropleth. Yes, that.

Two things struck me:

- Graphics these days should be interactive, fun and engaging. This one wasn't.
- More analytically, why did they choose to segment the map into 6 different shades of color (and hence 6 categories)?

Is it because they looked at the range of Medicare payment amounts, split it into 6 equal-width buckets, and placed each state into a bucket? Or did they split all 50 states into 6 equally sized groups and then figure out what the Medicare payment cutoffs would be?
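To make the difference between those two approaches concrete, here is a minimal sketch in Python. The function names and the payment figures are made up for illustration (they are not CMS data, and this is not necessarily how Brookings built their map). The first function cuts the payment range into equal-width intervals; the second ranks the states and gives each bucket roughly the same number of them.

```python
def equal_width_buckets(values, n_buckets):
    """Split the value range into n_buckets intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bucket n_buckets, so clamp it.
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]


def equal_count_buckets(values, n_buckets):
    """Rank the values and give each bucket roughly the same number of items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_buckets // len(values)
    return labels


# Made-up payment figures for 12 hypothetical states, NOT CMS data.
payments = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 40]

print(equal_width_buckets(payments, 6))
print(equal_count_buckets(payments, 6))
```

With skewed data like this, equal-width buckets pile most states into the lowest bucket and leave some buckets empty, while equal-count buckets spread the states evenly but can split near-identical values into different shades.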

There are multiple ways to segment the states. I wanted to nerdify this process a bit, so I ran k-means clustering on all 50 states. I clustered them on Medicare payments, and I let the data tell me how many clusters are necessary, based on the share of the total variance in the data that is explained by the clusters (the between-cluster sum of squares divided by the total sum of squares).

The idea is that I want to know the minimum number of clusters I need to segment my data while still explaining most of the variation. For example, having 1 cluster explains 0% of the variation. Having 50 clusters (i.e. one per state) explains 100% of the variation. But what's the sweet spot? 5? 10? 20? And how much of the variation is explained?
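For readers who want to see the mechanics, here is a rough sketch of that calculation in plain Python. It is an illustration under stated assumptions, not the code behind this post: `kmeans_1d` is a bare-bones Lloyd's algorithm with a deterministic start, `explained_variance` computes one minus the within-cluster sum of squares over the total sum of squares, and the payment values are invented.

```python
def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on a list of numbers; returns one cluster label per value."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: centers at evenly spaced positions in the sorted data.
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        new_labels = [min(range(k), key=lambda j: (v - centers[j]) ** 2) for v in values]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = sum(members) / len(members)
    return labels


def explained_variance(values, labels):
    """Share of total variance explained: 1 - within-cluster SS / total SS."""
    mean = sum(values) / len(values)
    total_ss = sum((v - mean) ** 2 for v in values)
    within_ss = 0.0
    for lab in set(labels):
        members = [v for v, l in zip(values, labels) if l == lab]
        center = sum(members) / len(members)
        within_ss += sum((v - center) ** 2 for v in members)
    return 1 - within_ss / total_ss


# Invented per-state payment figures (arbitrary units), NOT CMS data.
payments = [1.0, 1.2, 1.5, 5.0, 5.3, 5.6, 9.8, 10.1, 10.4]

for k in range(1, len(payments) + 1):
    share = explained_variance(payments, kmeans_1d(payments, k))
    print(k, round(share, 3))
```

One cluster explains nothing and one cluster per point explains everything, which matches the two extremes described above. In practice a library implementation with several random restarts would be a safer choice than this single deterministic run.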

Let's see...

This chart shows that roughly 90% of the variation is explained by only 3 clusters! And as I increase the number of clusters, the amount of extra information that is explained is only marginally higher. So being parsimonious, I choose 3 clusters.
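The "being parsimonious" step can be written down as a simple rule. The sketch below assumes one possible rule, picking the smallest number of clusters whose explained share reaches a chosen threshold, and also lists the marginal gain from each extra cluster. The `curve` values are made up to have a similar shape to the chart; they are not the actual numbers from this analysis.

```python
def smallest_k(explained, threshold=0.9):
    """explained[i] is the share of variance explained by i + 1 clusters."""
    for k, share in enumerate(explained, start=1):
        if share >= threshold:
            return k
    return len(explained)  # threshold never reached: fall back to the largest k


def marginal_gains(explained):
    """Extra share of variance explained by adding one more cluster."""
    return [b - a for a, b in zip(explained, explained[1:])]


# Illustrative curve for k = 1..6, made up for this example.
curve = [0.0, 0.62, 0.90, 0.94, 0.96, 0.97]

print(smallest_k(curve))
print([round(g, 2) for g in marginal_gains(curve)])
```

The threshold itself is a judgment call; the point is only that once the gains flatten out, extra clusters add shades to the map without adding much information.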

Equipped with this information, I can now draw my (interactive) choropleth:

Right. So 3 clusters tell me that some states receive a lot of Medicare payments, and others get much less*. Whether or not this is the best way to segment my data is another conversation, but at least now I have a powerful and scientific explanation for the breakdown. And that simple analysis explains 90% of the variation in total payment discrepancy. I could add more clusters, but the marginal information they explain may not be worth having more categories on my map.

After all, the point of performing data analysis is to make things simpler and more digestible for the consumer.

And in this case, 3 >> 6.

*Note that this doesn't answer the question of "why" certain states receive more Medicare payments. This is likely driven by the distribution of procedures performed within each state, and the demographics of the population.