After a long hiatus, I’m back with a new post! I really wanted to write about Chicago traffic, prompted by my recent observation of the new “bus only” lanes running down Madison and Clinton in the Loop, which I feel would have been better utilized elsewhere (for example LaSalle, which has four bus routes and takes 20 minutes to travel four blocks during rush hour). On a related note, there is a project in the works for a new “flyover” L track at the Belmont stop to reduce train delays caused by the Red and Brown Lines crossing, but the project has downsides, including cost and the aesthetics of the neighborhood. Ideally I was hoping to find data sets that would let me analyze ridership and traffic for both the Loop Link and the Flyover projects, but unfortunately the necessary data is not currently available. I made a request which hasn’t been moderated yet; once it is posted, you can help me get the data by signing in and up-voting my request here.
Since my data set of choice was not an option, I needed a new topic for analysis, and I approached the decision in a very scientific manner (googling “interesting data sets”). Luckily I found an amazing resource, 100+ Interesting Data Sets for Statistics. The second data set on the list is a compilation of the last statements of every inmate executed in Texas since the 1980s. That seemed like an interesting corpus for some exploratory text analysis, so I decided to work with it.
Getting the data proved to be a bit of a challenge. On the website’s main page there is a chart with some demographic information about each inmate, and each row links to another page with the text of that inmate’s last words. There are 500+ rows and links, so I definitely did not want to click through each one and copy/paste all that data. This gave me an opportunity to test out a tool I found a while ago called import.io. I provided the web addresses of the data I wanted, and the tool sifted through each link and harvested the text into a file I could download as a .csv.
Once I had all the data in a flat file format, I started familiarizing myself with the content. Reading through a few of the last statements, I noticed certain recurring themes. Some offenders ask for forgiveness. Some maintain innocence, or speak to mitigating circumstances around their crime. Some talk about their faith. Many say thank you to family and friends. I thought it might be interesting to see whether these themes correlate with the demographic data in the data set, like age, race and location. Since the themes are not in the data set, however, I would need to create them. I decided to use topic modeling to score the statements for themes, and then use the results of that model to do the demographic analysis.
Of the 512 records that I was able to import (19 of the 531 on the website failed to import), 153 inmates either declined to make a statement or made one of fewer than 15 words, so these were excluded from the data set. That left me with 359 last statements, along with the race, age, date of execution and county of the prisoners who made them.
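As a rough illustration of this filtering step (a sketch in Python with made-up column names and records, not the actual pipeline), dropping empty or very short statements might look like:

```python
# Hypothetical sketch of the record filter: drop rows whose statement is
# missing, declined, or under 15 words. Column names are assumptions.
import csv
from io import StringIO

MIN_WORDS = 15

def usable(statement):
    """Keep a record only if its statement has at least MIN_WORDS words."""
    text = (statement or "").strip()
    if not text or text.lower() in {"none", "declined", "no statement"}:
        return False
    return len(text.split()) >= MIN_WORDS

# Tiny stand-in for the import.io export.
raw = StringIO(
    "race,age,county,statement\n"
    "White,34,Harris,No statement\n"
    'Black,41,Dallas,"I want to thank my family and friends for standing '
    'by me all these years and I love every one of you"\n'
)
rows = [r for r in csv.DictReader(raw) if usable(r["statement"])]
```

Only the second record survives the filter; the first is a declined statement.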
The next step was to clean the text data and put it in a format suitable for analysis. I cleaned the text of punctuation, numbers and extra whitespace, removed stopwords (words like “a”, “and”, “but”, etc.) and performed stemming (grouping words that share a root but have different endings, for example run, runs and running). From there I created a term-document matrix, which transforms the data from a collection of documents into a matrix containing the number of times each word appears in each document.
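A minimal sketch of this cleaning pipeline (the stopword list and the crude suffix-stripping “stemmer” below are stand-ins for the full lists and the Porter-style stemmer a real analysis would use):

```python
# Sketch: clean text, remove stopwords, stem, and build a term-document
# matrix as one term->count mapping per document.
import re
from collections import Counter

STOPWORDS = {"a", "all", "am", "and", "but", "for", "i", "is",
             "my", "of", "the", "to", "you"}

def stem(word):
    # Very crude stemmer: strip a few common endings.
    for suffix in ("ing", "ness", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(doc):
    doc = re.sub(r"[^a-z\s]", " ", doc.lower())  # drop punctuation/numbers
    tokens = [t for t in doc.split() if t not in STOPWORDS]
    return [stem(t) for t in tokens]

docs = ["I love my family and thank God.",
        "Thank you all, I am hoping for forgiveness."]

# Term-document matrix: one Counter (term -> count) per document.
tdm = [Counter(clean(d)) for d in docs]
totals = sum(tdm, Counter())  # corpus-wide term frequencies
```

Summing the per-document counters gives the corpus-wide frequencies used to spot the most common words.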
Looking at this matrix, I can see that by far the most frequently used word in an offender’s last statement is love. Other frequently used words include family, thank, forgive, god/lord, and hope (some of the words have unexpected endings, like familia instead of family, due to the stem completion function).
Next I transformed the values in the matrix from raw counts into term frequency-inverse document frequency (tf-idf) values. This statistic gives higher weight to terms that are frequent in a particular document but rare in the rest of the corpus; words that are frequent across the corpus overall get lower weights.
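There are several tf-idf variants; a sketch of a common one (raw term frequency times log(N/df)) on hand-made token lists:

```python
# Sketch of tf-idf weighting: tf(t, d) * log(N / df(t)), where N is the
# number of documents and df(t) is how many documents contain term t.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()                       # documents containing each term
    for c in counts:
        df.update(c.keys())
    return [{t: tf * math.log(n / df[t]) for t, tf in c.items()}
            for c in counts]

docs = [["love", "family", "thank"],
        ["love", "god", "thank"],
        ["innocent", "truth"]]
weights = tfidf(docs)
```

“love” appears in two of the three documents, so it gets a lower weight than “family”, which appears in only one.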
With the matrix prepared, I could finally perform the topic analysis. I used a package called “topicmodels”, which contains a function to perform LDA (latent Dirichlet allocation), a type of topic modeling. In LDA, the terms in the documents are tested for co-occurrence, and words that often occur together are grouped into a topic. This method is unsupervised, meaning the outcome data is not provided by the user; instead, the user must review the words in each group to give the topics meaning.
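The analysis itself used the R “topicmodels” package; as a rough Python equivalent of the same step (an assumption on my part, not the author’s code), scikit-learn’s LatentDirichletAllocation can be fit on raw word counts:

```python
# Sketch of the LDA step using scikit-learn on a toy corpus.
# Note: LDA is fit on raw counts, not the tf-idf matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

statements = [
    "I love my family and thank you for your support",
    "I ask the Lord for forgiveness and pray for peace",
    "I am innocent and the truth will come out some day",
    "Thank you to my family I love you stay strong",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(statements)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents, cols: topic shares
```

Each row of `doc_topics` is a probability distribution over topics; `lda.components_` holds the per-topic word weights that a human then inspects to label the topics.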
This package requires that the number of topics be specified. To choose a number of topics, I used three different measures: log-likelihood (how well the model generalizes to a test sample), entropy (how the topics are distributed across the documents), and human judgement of semantic meaning. First, I ran the LDA model for 2 through 20 topics (for each topic count I ran 20 iterations and averaged the measures) and collected the log-likelihood and entropy measurements. I then plotted these:
The plots on the left show the average log-likelihood (top, higher is better) and entropy (bottom, lower is better) for each topic count. The plots on the right are the distance of each point from the diagonal, which is helpful for determining the point of diminishing returns. From these, it appears that log-likelihood is optimized at 8 topics, and entropy is optimized at 7. From there I ran models with 7 and 8 topics, and used my judgement to determine which version produced topics with the most semantic meaning. I ended up selecting a model with 8 topics:
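The topic-count search could be sketched like this (again in Python with scikit-learn rather than the R package actually used, and with a much smaller range and fewer restarts than the 2 through 20 topics and 20 iterations described above):

```python
# Sketch: fit LDA for a range of topic counts k, averaging the approximate
# log-likelihood over several random restarts per k.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

statements = [
    "I love my family and thank you for your support",
    "I ask the Lord for forgiveness and pray for peace",
    "I am innocent and the truth will come out some day",
    "Thank you to my family I love you stay strong",
]
counts = CountVectorizer(stop_words="english").fit_transform(statements)

def avg_loglik(k, n_runs=3):
    scores = []
    for seed in range(n_runs):
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(counts)
        scores.append(lda.score(counts))  # approximate log-likelihood
    return sum(scores) / n_runs

results = {k: avg_loglik(k) for k in range(2, 5)}
```

The entropy measure would be computed similarly from each model’s document-topic distributions; the chosen k is then a judgement call over both curves plus a reading of the topics themselves.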
Here are a few examples of statements in the corpus, and how the topic model scored them:
Now that I’ve created the topics, given them meaningful labels based on the common words and verified that these categories make sense by looking through a few examples, I want to visualize the documents and topics in the data set to see how they relate to each other:
In this diagram each node is a document, and the nodes are colored to demonstrate which topic was most prevalent in that particular document. The nodes are connected to other nodes that are most similar in topical makeup, so this allows you to see which topics are often mentioned together, and which are more discrete. It appears that topics 7 (Encouragement) and 2 (Our Father) are often mentioned together, while topics 1 (Mitigating Circumstances) and 8 (Message to Family) are rarely grouped with any other topic.
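One way such a network can be built (a sketch with invented topic-share vectors, not the data behind the actual diagram) is to link each document to its nearest neighbor by cosine similarity of topic distributions:

```python
# Sketch: connect each document to its most similar neighbor by cosine
# similarity of topic-share vectors. The vectors below are made up.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

doc_topics = {
    "doc1": [0.7, 0.2, 0.1],  # mostly topic 1
    "doc2": [0.6, 0.3, 0.1],  # similar mix to doc1
    "doc3": [0.1, 0.1, 0.8],  # mostly topic 3
}

def nearest(name):
    others = [(other, cosine(doc_topics[name], v))
              for other, v in doc_topics.items() if other != name]
    return max(others, key=lambda pair: pair[1])[0]

edges = {d: nearest(d) for d in doc_topics}
```

Coloring each node by the argmax of its topic vector then gives the “most prevalent topic” view described above.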
Next I would like to look at the breakdown of the topics by a few variables that I have in my data set.
Here are two charts showing the topics through the two time variables, date and age:
There aren’t many clear trends across time. It looks like topics 3 and 8 have generally decreased, and topics 4 and 6 have generally increased, but there is so much variability that it is difficult to draw conclusions from these charts.
Looking at the prevalence of the topics by offender age, it does seem that two of the topics have general patterns. As an offender gets older, there is a clear increase in topic 1 and a decrease in topic 7. This is interesting as these are the only two topics that included the word “innocent” with high probability. It seems that the number of offenders who mention innocence (the combination of these two topics) remains relatively constant, but the messaging changes with age. When an offender is younger, they state their innocence along with pleas to their friends and family to stay strong and continue the fight for justice. As the age of the offenders increases, they use this messaging less and instead detail their own circumstances and how it is that they ended up where they are.
Next I looked into regional differences. I had county data in my data set, which I was able to tie to larger public health regions to reduce the number of groups in the analysis.
From there I combined regions with sparse data into larger areas to bolster sample size, which resulted in 6 regions for the analysis:
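The rollup could be sketched as a simple two-stage lookup (the county names and region assignments below are invented placeholders, not the real Texas public-health-region mapping):

```python
# Sketch: map counties to regions, then fold sparse regions into larger
# combined areas. All assignments here are illustrative placeholders.
county_to_region = {
    "Harris": "Southeast",
    "Dallas": "Northeast",
    "Loving": "West",
}
merge_sparse = {"West": "West+Panhandle"}  # combine low-count regions

def region(county):
    r = county_to_region.get(county, "Unknown")
    return merge_sparse.get(r, r)
```

Counties in well-populated regions keep their region; sparse regions collapse into a combined area, boosting sample size per group.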
The chart above shows the deviation from the mean for each topic in the various regions. The Northeast and Southeast are relatively average in terms of which topics appear in offenders’ last statements. Offenders in the South use topic 6 (Apology and Forgiveness) more than average and topic 1 (Mitigating Circumstances) less, so it seems that this is the most empathetic region. Offenders in the Northwest use topic 6 less than average, so this might be the least empathetic region. Offenders in the East use topic 5 (Religion) less than most, so this seems to be the least religious region. The most variable region is the Central region, in which offenders use topics 2 (Our Father) and 5 (Religion) more than average, and topics 6 (Apology and Forgiveness) and 7 (Encouragement) less than average, which seems to imply that this region is more focused on the afterlife, and less on the present life, than the others.
To check whether topic usage differs by race, I ran ANOVAs on all 8 topics. I found that topics 1 and 3 show marginally significant differences between races (p < .1), and topics 6 and 7 show significant race differences (p < .005). Looking at the two innocence-related topics (1 and 7), both white and black offenders speak of innocence more often than Hispanic offenders, although they use different language: white offenders tend to refer to the circumstances, while black offenders tend to offer words of encouragement to their loved ones. Hispanic offenders apologize and ask for forgiveness (topic 6) far more often than either of the other two groups.
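For reference, the one-way ANOVA F statistic underlying these tests can be computed by hand (the real analysis would use a stats package; the group values below are made up, standing in for one topic’s scores split by race):

```python
# Sketch: one-way ANOVA F statistic, computed from scratch.
# F = (between-group mean square) / (within-group mean square).
def one_way_anova_f(groups):
    k = len(groups)                                # number of groups
    n = sum(len(g) for g in groups)                # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Illustrative topic scores for three groups with clearly different means.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
f_stat = one_way_anova_f(groups)  # large F -> group means differ
```

A large F relative to its F(k-1, n-k) reference distribution yields a small p-value, which is what the p < .1 and p < .005 thresholds above refer to.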
With this methodology, these statements could be reviewed along with other statistics for bias in conviction rates. Specifically for the topics related to innocence, determining whether a specific population (race, age, region) is more likely than others to maintain innocence could help to research whether wrongful convictions are occurring at a higher rate in that population. In this case, I would only use these algorithmically generated topics as a starting point, as the number of documents is relatively low and the most accurate way to determine the content of the statements would be to have human raters score them. Topic modeling, however, allows for expedited initial research, and in cases where the number of documents is too large for human raters, it may be the only way to generate an outcome variable.
I know the subject matter was a bit dark, especially considering my last post was on “The Bachelor”, but I hope that this post was interesting and informative. I welcome any feedback in the comments, and I hope to have a new post out soon about Chicago traffic data!