The goal of Tortured Data is to make wise data consumers out of our over-eager, data-loving culture. Don't believe me when I say we have a sometimes unhealthy reliance on data? How about the last time you wanted to talk about labeling fruit as GMO? Don't tell me nobody brought out the statistics on how much happier blue whales were before we introduced genetically modified zooplankton off the California coast. When you were shown that chart (and it invariably looked like the chart below), if you were a data ninja (as you will be shortly), you repeated the hypnotic chant: "Association is Not Causation." (I'm thinking of Master Splinter when he hypnotized Michelangelo into rejecting pizza. Splinter = me, Michelangelo = you.)
This data comes from the United Nations Office on Drugs and Crime (UNODC). I took a sample of countries similar to the U.S. in terms of per capita GDP, political structure, etc.; this is essentially Western Europe, North America, Australia and Japan. If you're for gun control, you focus on Japan and the U.S. (circled in blue) and draw a line roughly like the blue line shown (an actual regression line looks similar to this). If you're against gun control, you look at Turkey and Iceland (circled in red) and draw a line like the red line shown (this is roughly what you'd get in a regression if you dropped the U.S.).
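To see how much one country can steer a regression line, here is a minimal sketch in Python. The numbers are invented for illustration and are NOT the UNODC figures; the country labels, the `ols_slope` helper, and the units are all assumptions. The only point is that a single far-out point can flip the sign of a least-squares slope.

```python
# Hypothetical (guns per 100 people, homicides per 100k people) pairs.
# These values are made up for illustration; they are not the UNODC data.
points = {
    "A": (5.0, 1.6),
    "B": (10.0, 1.4),
    "C": (15.0, 1.2),
    "D": (25.0, 1.0),
    "E": (30.0, 0.8),
    "Outlier": (90.0, 5.0),  # one country far from the rest
}


def ols_slope(pairs):
    """Slope of the ordinary least-squares line through (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


slope_all = ols_slope(list(points.values()))
slope_without = ols_slope(
    [pair for name, pair in points.items() if name != "Outlier"]
)

print(f"slope with the outlier:    {slope_all:+.4f}")
print(f"slope without the outlier: {slope_without:+.4f}")
```

With these made-up numbers the slope is positive when the outlier is included and negative when it is dropped, which is the same tug-of-war as the blue line versus the red line.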
(Disclaimer: I really don't care about gun politics right now. See the forest, not the tree, young padawan.)
Let's ignore the red now. As I said, an actual regression line looks like the blue line shown. The implication is more guns = more crime, and the obvious corollary is fewer guns = less crime. So let us work to remove guns in any way possible. We could restrict the sale of guns in the U.S. so that we have no more than 10 guns per 100k people. That would put us right near Spain and Italy, and according to this infallible formula for crime reduction, we would have roughly 1.25 crimes per 100k people. That's a near 75% reduction in crime!
("But Aaron, the 2nd Amendment, the right to bear arms-" No, grasshopper! Listen to your sensei, and focus!)
I've created a formula that would single-handedly produce the greatest reduction in crime in the history of the US! Though this beautiful graph is so compelling, we must remember: association is not causation. If you're not saying that aloud, you can never expect to be Michelangelo! Now repeat! "Association is not causation!" Shout it till you mean it!
There may be a statistical association between firearms per 100k people and crime per 100k people, but that does not mean that firearms cause greater crime. More guns are associated with more crime, but we cannot say from this data alone that more guns cause more crime. Establishing causality is hard but possible in some cases with tricky math and other cool things (technical term).
In this gun scenario, we could just as easily say that crime causes the number of guns per 100k people to go up. Thus if we want to remove guns from a nation, we simply have to lower crime. BAM! *That was logic hitting you square in the face!*
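Here is a small sketch of why the data alone can't tell you which way the arrow points. The figures are hypothetical (not the UNODC data) and the `pearson` helper is just a hand-rolled correlation coefficient. Swap the two variables and you get exactly the same number, so "guns cause crime" and "crime causes guns" fit the chart equally well.

```python
# Hypothetical figures for six countries; invented for illustration only.
guns = [5.0, 10.0, 15.0, 25.0, 30.0, 90.0]   # guns per 100 people (assumed unit)
crime = [0.9, 1.1, 1.0, 1.6, 1.4, 5.0]       # homicides per 100k people


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


r_guns_crime = pearson(guns, crime)   # "guns explain crime"
r_crime_guns = pearson(crime, guns)   # "crime explains guns"

print(f"corr(guns, crime) = {r_guns_crime:.4f}")
print(f"corr(crime, guns) = {r_crime_guns:.4f}")
```

The correlation carries no direction. Any arrow you draw on top of it comes from somewhere other than this chart.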
To finish this up, another fun chart (real data).
As you can see, perhaps an even more viable way to reduce violent crime in the U.S. would be to eliminate all use of Internet Explorer. Bet you didn't know that Firefox and Google Chrome were the greatest crime deterrents we know of! But before you write your representative and call the White House, let's have a lesson on logic. As you now know, association is not causation. Have a look at my awesome diagram.
When we see an associative relationship, there are three possible general scenarios. One is that A causes B (causation shown by solid arrow). Similarly, another is that B causes A. As you can see, these fall under "Causation". However, the third option is the kicker. There may be a third element C that we can't measure or haven't accounted for, where C causes both A and B. Thus whenever we have C, we will see both A and B, and it may also be the case that whenever we don't have C, we see less of A and B. This would make A and B associated in our data even though neither one causes the other. (The other possible explanation, which I haven't included, is simply that the association is purely coincidental, as is likely the case with Internet Explorer market share and murders in the U.S., especially if the scale is messed with.)
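The third scenario is easy to fake on a computer. Below is a minimal simulation sketch, with everything in it assumed for illustration: C is random noise, and A and B are each built as C plus their own independent noise, so by construction neither A nor B causes the other. The variable names and noise levels are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# C is the hidden common cause; A and B each depend on C only.
c = [random.gauss(0, 1) for _ in range(n)]
a = [ci + random.gauss(0, 0.5) for ci in c]
b = [ci + random.gauss(0, 0.5) for ci in c]

# A and B look strongly associated...
r_ab = pearson(a, b)

# ...but once C's contribution is removed (we know it is exactly ci here,
# because we built the data that way), only independent noise is left.
a_without_c = [ai - ci for ai, ci in zip(a, c)]
b_without_c = [bi - ci for bi, ci in zip(b, c)]
r_ab_without_c = pearson(a_without_c, b_without_c)

print(f"corr(A, B)               = {r_ab:.3f}")
print(f"corr(A, B) with C removed = {r_ab_without_c:.3f}")
```

In real data you rarely get to subtract C this cleanly, because you usually can't measure it. That is the whole problem.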
Less abstractly, in the case of guns and crime rates, it may be the case that more guns cause more crime, or that more crime causes there to be more guns. If either of these causal relationships is established, it's much easier to know what to do if our goal is to lower crime or lower gun ownership. However, there may be another element that we're not accounting for or can't measure. An example would be cultural factors that lead to increased crime (disproportionate income distribution in urban areas, racial conflicts, prevalence of gangs, etc.). Those cultural factors may lead to more U.S. citizens owning guns for protection while also being a cause of more crime. If this is the case, it would mean that increased gun ownership is not what is causing the violence.
Another example of an unknown variable could be that we have an extremely happy blue whale population off the coast of California (because we labeled all GMO food for blue whales with a bright yellow sticker and, health conscious as blue whales are, they stopped eating it), and happier blue whales means less krill but more food for other sea life, which means Americans eat more seafood, and since everybody knows that gangs and mobs eat a ton of seafood, we have a ton of gangs and mobs, and thus more violence. Seafood also causes extreme paranoia (because of the mercury) so Americans feel a stronger need to own guns (breathe) so clearly, happier blue whales have caused more crime and more gun ownership. Damn those yellow labels! Monsanto was right all along!





I think you have misinterpreted your Internet Explorer market share data. The more likely explanation is that the murder rate has dropped for reasons totally unrelated to browser usage. However, Chrome and Firefox users are actually just way more likely to be the victims of murder, so as murder has decreased, more non-IE users are living longer, diluting Microsoft's market share.
The student has become the master. You are right, that is the more likely explanation.
More serious note. This was in your lead on g+: "Don't believe me when I say we have a sometimes unhealthy reliance on data?". Did you just mean to say "we ascribe too much value to misinterpreted data" or what you actually said?
More a mix of the two sentiments. I do think that we often feel data is telling us more than it really is, at least as far as economic data goes. But I also think we are too easily susceptible and willing to listen to somebody as soon as they give us numbers, regardless of how accurate or well-interpreted those numbers are. The first sentiment I think is hard for engineers to understand because for them, data is like measuring the temperature: aside from very negligible measurement error, the temperature is the temperature and there's nothing more to be said about it. But in Economics, take even the most basic measurements like GDP, unemployment, or even population. All of these are difficult to measure, come with important disclaimers, and are frequently revised. That's not to say they're bad or shouldn't be used, only that we almost always need more than one indicator to understand the reality of the economic situation, and to correctly interpret the other indicators. That'd be like saying we need wind speed to understand temperature. We need personal income measures, unemployment, labor force, employment, housing prices, inflation, business starts, etc. just to have an honest idea of the state of the economy. This is because any one of these measures alone is relatively meaningless. In other words, economic indicators are low-entropy compared to physical measurements.
Super main point? A statement like, "The unemployment rate has improved more under President Obama than under any other president since FDR," is taken even more seriously when it's stated as, "The unemployment rate has improved 22%, which is the greatest single-term improvement since FDR," even though both statements are nearly meaningless. Economic data is very politically motivated and very easy to manipulate, hence it very often needs an educated eye to understand correctly. Part of that education begins with realizing how limited the information from this data can be.
In the end, when President Obama's staff gets up and goes on and on about how much the economy is improving because the unemployment rate fell and GDP grew at historic norms and gas prices are low etc., those are meaningless if you're at home and your previously working wife is no longer looking for a job (because she couldn't find one), and though gas prices are lower than they were a year ago, they're twice as high as they were two years ago, and though GDP grew at a historic 5%, that's from a historically low GDP. That's what I mean by having an unhealthy reliance on data. It's like we let data tell us something different than what our eyes are seeing, as if the data knew better than reality.
When data is presented in full and accurately and interpreted honestly, it's meaningful. Economic data rarely meets those criteria.
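The point above about the wife who stopped looking for work can be checked with a little arithmetic. This sketch uses the standard definition of the unemployment rate (unemployed people divided by the labor force, where the labor force counts only people working or actively looking). The head counts are hypothetical round numbers, and `unemployment_rate` is just an illustrative helper.

```python
def unemployment_rate(unemployed, employed):
    """Unemployment rate in percent: unemployed / (unemployed + employed)."""
    labor_force = unemployed + employed
    return 100.0 * unemployed / labor_force


# Hypothetical town: 90 people working, 10 looking for work.
employed = 90
rate_before = unemployment_rate(unemployed=10, employed=employed)

# One job seeker gives up looking. Nobody new was hired, but she no longer
# counts as unemployed or as part of the labor force.
rate_after = unemployment_rate(unemployed=9, employed=employed)

# The same change, reported two ways.
point_drop = rate_before - rate_after              # percentage points
relative_drop = 100.0 * point_drop / rate_before   # percent of the old rate

print(f"rate before: {rate_before:.2f}%")
print(f"rate after:  {rate_after:.2f}%")
print(f"drop: {point_drop:.2f} points, or {relative_drop:.2f}% 'improvement'")
```

The rate falls even though not a single job was created, and the relative figure sounds bigger than the percentage-point figure. Which one gets quoted tends to depend on who is doing the quoting.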