Tortured Data

If you torture the data long enough, it will confess

Month: September 2014

Twitter Clouds

I know I mentioned in the last post that I was planning to keep working on the Chicago data, but I got distracted by a cool bit of code that April Chen sent me, which pulls posts from Twitter into R for text analysis.  You can enter a specific search term, or pull in a specific user’s past posts.  The API is a bit limited: you can only receive up to 199 tweets per request, and it also seems to only reach back about one week.  I was still able to get decent sample sizes of roughly 1,000 tweets per search term by requesting a sample from each day of the past week.
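The day-by-day collection trick can be sketched as follows. The original code was in R; this is a minimal Python sketch, and `fetch` is a hypothetical stand-in for the actual API call (not a real library function):

```python
from datetime import date, timedelta

def collect_week(search_term, fetch, per_day=199):
    """Work around the 199-tweets-per-request cap by issuing one
    request per day for each of the past seven days."""
    tweets = []
    today = date.today()
    for offset in range(7, 0, -1):
        since = today - timedelta(days=offset)
        until = since + timedelta(days=1)
        tweets.extend(fetch(search_term, since, until, per_day))
    return tweets

# A dummy fetch that returns placeholder strings, just to show the call shape.
def dummy_fetch(term, since, until, n):
    return [f"{term} tweet {i} on {since}" for i in range(n)]

sample = collect_week("#datascience", dummy_fetch)
print(len(sample))  # 7 days x 199 tweets = 1393
```

With the real API in place of `dummy_fetch`, each day's request returns at most 199 tweets, which is how the ~1,000-tweet samples above were assembled.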

Here are a few word clouds I created which show which other words are commonly used with certain hashtags:


[Word clouds for #datascience and #blessed]


I also looked at a few comparison clouds, which show the differences in how often words are used between searches.

Here is #liberal vs. #conservative:


And here is a cloud showing the differences between tweets from the Progressive Insurance corporate Twitter account and tweets from customers who mention @Progressive:


This cloud shows that Flo is a popular term among customers, which at first glance would lead me to believe that customers love Flo.  However, the word “bitch” also shows up in the cloud; could the two be related?


The chart above displays the top 15 words in the cloud, and the connections between the words illustrate which words are correlated with each other. Indeed, customers think Flo is a bitch 🙁

Chicago Food Inspections

The City of Chicago makes many government data sets publicly available through its data portal.  One of those data sets contains the records from all food inspections conducted throughout the city from January 2010 forward, and it is updated weekly (the version I worked from was last updated August 21st, 2014).

There is a column in this data set that explains what prompted each inspection, with possible values including “canvass” (essentially random selection), “complaint” (a patron of the venue complained), and “sickness” (a special case of complaint, where a patron believes he or she contracted food poisoning from the venue).  I thought it might be interesting to use this column to find out which specific violations are correlated with food poisoning.

The data set lists 45 different violations a restaurant could have.  I condensed these 45 separate violations into a few larger categories, then looked at which categories are more prevalent in inspections prompted by sickness than in random canvass inspections.  Here is a chart showing my findings:
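The comparison amounts to computing each category's share of violations within each inspection type and then taking the ratio.  A minimal Python sketch, using toy records with illustrative category names (not the portal's actual labels):

```python
from collections import Counter

# Toy (inspection_type, violation_category) records; real rows would come
# from the Chicago data portal's food-inspections data set.
records = [
    ("sickness", "hygiene"), ("sickness", "hygiene"), ("sickness", "food temp"),
    ("canvass", "hygiene"), ("canvass", "pests"),
    ("canvass", "food temp"), ("canvass", "pests"),
]

def prevalence(records, inspection_type):
    """Share of each violation category within one inspection type."""
    cats = Counter(cat for typ, cat in records if typ == inspection_type)
    total = sum(cats.values())
    return {cat: n / total for cat, n in cats.items()}

sick = prevalence(records, "sickness")
canvass = prevalence(records, "canvass")

# Ratios above 1 mean the category is over-represented in
# sickness-prompted inspections relative to random canvassing.
for cat in sick:
    print(cat, round(sick[cat] / canvass.get(cat, float("nan")), 2))
```

A ratio well above 1 for a category (as hygiene shows in the toy data) is the signal the chart below is looking for.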


Probably not surprisingly, the violations that seem most likely to cause sickness are hygiene violations (not washing hands, not wearing hairnets, etc.).  From this, I was curious which venues are the most and least likely to make people sick.  Here is a bar chart ranking venue types from most to fewest hygiene violations:


 And here is a chart showing hygiene violations by zip code (the dark regions have the most, light areas have the least):


Link to Viz (at this link you can hover over each region to see its zip code and the percentage of violations there that were due to hygiene)
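The per-region figure in the viz is just the hygiene share of each zip code's violations.  A short Python sketch with toy rows (the zip codes and counts are illustrative, not the real results):

```python
from collections import defaultdict

# Toy (zip_code, violation_category) rows; real rows would come from the
# Chicago food-inspections data set joined on zip code.
rows = [
    ("60641", "hygiene"), ("60641", "pests"),
    ("60642", "hygiene"), ("60642", "hygiene"),
    ("60642", "hygiene"), ("60642", "food temp"),
]

totals = defaultdict(int)
hygiene = defaultdict(int)
for zip_code, cat in rows:
    totals[zip_code] += 1
    if cat == "hygiene":
        hygiene[zip_code] += 1

# Percentage of each zip code's violations that were hygiene-related.
pct = {z: 100 * hygiene[z] / totals[z] for z in totals}
print(pct)  # {'60641': 50.0, '60642': 75.0}
```

Mapped across all zip codes, these percentages are what drive the shading in the choropleth above.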

Given this, it looks like you are safer from food poisoning eating at a gas station in Portage Park than at a coffee shop near Goose Island.  Who would have thought?  I think it’s also a bit surprising how high health-focused venues (like juice bars and cafes in fitness centers) rank in hygiene violations; perhaps not so healthy after all!

For my next post I’m hoping to merge a few other datasets into this data keyed on zip code in an attempt to “profile” each neighborhood based on characteristics like number of 311 calls, types of restaurants, and whatever else I can pull from the data portal, so stay tuned!
