What Can Text Analysis Tell Us About This Year’s SXSW Keynotes?

Well howdy folks, this week officially kicks off the SXSW craziness for 2014. Here in Austin we look forward to this week every year. To friends and colleagues making their way to our fine, warm state to overindulge in tacos, Shiner Bock, and digital goodness, we hope you get your fill. But beyond all the startups and parties, the conference brings in some serious talent to amaze crowds with great presentations each March, and this year is no exception.

To get ready for SX, I thought I’d check out the Twitter feeds of a few of the keynote speakers for this year’s Interactive Conference. Instead of looking at influence or geolocation, today we’ll dive into some basic text analysis. What topics do they talk about? And more importantly, what topics get the biggest reaction and engagement from their followers? Let’s find out.

I grabbed 3,000 of the most recent Tweets from the Twitter accounts of three of the SXSW Interactive keynote speakers: Neil deGrasse Tyson, Adam Savage, and Chelsea Clinton. Next, I used the great NLTK Python package to perform natural language processing on the text of each speaker’s tweets. Don’t be scared, it’s not as geeky as it sounds. Or maybe it is, but I'll do the hard work for you on this one.

What Can Text Analysis Tell Us?

When dealing with large blocks of text (or a collection of lots of small blocks of text, like we get with a Twitter feed), text analysis can reveal insights from a high-level view. Today we’ll use a few different methods to look at top words and phrases used by each author, and investigate how diverse their vocabulary is. Hopefully by the end of this post, you’ll have a better idea of what topics might come up in each keynote, and amaze your hungover friends with your prediction skills.

In addition to just finding out what words are used the most, we’ll dive into which topics are resonating with each speaker’s audience. Let’s do this, y’all.

What’s the Most Frequently Used Word in Texas?

It’s “Texas”.

Let’s start off by looking at a very simple text analysis measurement: word frequency. First, I remove all “stop words” from each set of tweets. Stop words are boring, commonly used words that don’t tell us much about the topic being discussed (words like “me”, “but”, “am”, etc.). We throw all those out to get to the good stuff. Then, we break all the remaining text into individual words (called “tokenization”). After that, it’s easy to perform a frequency distribution analysis for each speaker.
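For the curious, here is a minimal sketch of those steps. It is not the code behind this post: it swaps NLTK for Python's standard library so it runs anywhere, it tokenizes first and filters stop words second (the counts come out the same), and the stop-word list, the function names, and the sample tweets are all made up for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption for this sketch);
# a real run would use a much fuller one.
STOP_WORDS = {"a", "am", "an", "and", "but", "i", "in", "is", "it", "me",
              "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(tweets):
    """Count every non-stop-word token across a list of tweet strings."""
    counts = Counter()
    for tweet in tweets:
        counts.update(t for t in tokenize(tweet) if t not in STOP_WORDS)
    return counts


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "The moon is in the sky",
    "I am on the moon",
    "Mars and the moon",
]

# Print the two most frequent remaining words with their counts.
print(word_frequencies(sample_tweets).most_common(2))
```

A simple regular expression like this one ignores the quirks of real tweets (URLs, @mentions, hashtags), so treat it as a starting point.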

Here’s Neil deGrasse Tyson’s most frequently used words on Twitter:

So...what you’re saying is that Neil deGrasse Tyson likes to talk about space stuff? Nice job, Einstein. Sure, this is a very simplified look at his vocabulary, but it validates that the text in his Twitter feed aligns with his personal brand. Let’s dive deeper and look at frequent two-word phrases (called “bigrams”), which might tell us a bit more.
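Counting bigrams is only a small step up from counting words. Here's a rough sketch, again using only the standard library rather than NLTK. The function name and sample tweets are hypothetical, and since the post doesn't say whether stop words are dropped before pairing, this version keeps them.

```python
import re
from collections import Counter


def bigram_counts(tweets):
    """Count adjacent word pairs, never pairing words across two tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z0-9']+", tweet.lower())
        # zip(tokens, tokens[1:]) yields each word with the word after it.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up sample tweets, for illustration only.
sample_tweets = ["Space shuttle launch today", "I miss the space shuttle"]
print(bigram_counts(sample_tweets).most_common(1))
```

Pairing words tweet by tweet keeps the last word of one tweet from being glued to the first word of the next.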

Mr. Tyson had a series of Tweets where he jokingly talked to the Mars Curiosity Rover, and he mentions “Cosmic Perspective” and the “Space Shuttle” a lot. He also really, really likes moons. But these are just the most commonly used phrases by count. Let’s use another measurement called Pointwise Mutual Information (PMI), which scores a phrase higher when its words show up together far more often than you’d expect from how often each word appears on its own. That surfaces the most interesting and informative phrases instead of phrases built from words that are simply used a lot. Moving from frequency to PMI is a small step for social, but a giant step for...oh never mind.
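If you want to see the math, PMI for a pair of words is log2( p(x, y) / (p(x) * p(y)) ): how often the pair shows up, divided by how often you'd expect it if the two words had nothing to do with each other. Below is a hedged sketch of that formula in plain Python. It is not the NLTK call used for this post, and the `min_count` cutoff, the function name, and the sample data are assumptions for illustration.

```python
import math
from collections import Counter


def pmi_scores(token_lists, min_count=2):
    """Score each bigram by pointwise mutual information (log base 2).

    token_lists: one list of tokens per tweet.
    min_count:   hypothetical cutoff; pairs seen fewer times are skipped,
                 because PMI tends to overrate very rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Made-up token lists, for illustration only.
sample = [
    ["los", "angeles", "angels"],
    ["the", "moon"],
    ["the", "sun"],
    ["los", "angeles", "again"],
    ["the", "moon"],
    ["the", "end"],
]
for pair, score in sorted(pmi_scores(sample).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this toy data “los angeles” and “the moon” appear equally often, but “los angeles” scores higher because “the” turns up next to plenty of other words while “los” only ever appears before “angeles”.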

Neil deGrasse Tyson’s highest-scoring PMI phrases:

  • “Wednesday Word Arithmetic”: where he plays around with adding text strings together, which is totally meta considering the topic of this post
  • “Los Angeles”: for a series of Tweets about the Los Angeles Angels of Anaheim's weird name
  • “Sound Banquet”: for weekend links to long-form audio and video on science topics
  • “Cartoon Laws”: for understanding the physics of Bugs Bunny
  • “Big Bang Theory”: surprisingly, mostly Tweets about the TV show, not the origin of the Universe

So anytime Mr. Tyson mentions any of the above phrases in his keynote, take a sip from your Mexican Martini and act like you knew NDT before he was cool.

Let’s quickly look at Adam Savage’s language distribution:

He talks about MythBusters a lot (and his co-host, @JamieNoTweet) which makes sense. What do his highest-ranking PMI phrases tell us?

  • “Maker Faire”: a do-it-yourself conference/gathering that he is very involved in
  • “Ping pong”: mostly discussing them firing and destroying ping pong balls on the show
  • “Tape Island”: an episode of MythBusters where they have to survive on a desert island with only a bunch of duct tape
  • “Star Wars”: if you don’t know what this is, please stop reading now, seriously

With Chelsea Clinton, we see the following pattern:

She talks about Clinton-ish stuff a lot, including the CGIU (Clinton Global Initiative University), global political issues, the Clinton Foundation, etc. How about her highest PMI phrases?

  • “Supreme Court”
  • “Childhood obesity”
  • “Poaching crisis”
  • “make a difference”

Well, we’re definitely not going to confuse Chelsea Clinton’s Twitter feed with Neil deGrasse Tyson’s. Her top topics are all around politics and global issues, which makes total sense.

Well Look at the Big Brain on Brad

So of our three keynote speakers, who has the best Twitter vocabulary? Text analysis can tell us this via a measurement called lexical diversity (try dropping that term at the door for your next SX party, you’ll get in guaranteed.) Lexical diversity is a measurement of the range of vocabulary in a block of text. How do our speakers rank in mixing it up?

Lexical Diversity:

  • Chelsea Clinton: .2799
  • Adam Savage: .2281
  • Neil deGrasse Tyson: .2087

Chelsea Clinton is using the largest range of vocabulary in her Tweets, easily beating out Adam Savage and leaving Neil deGrasse Tyson in the cosmic dust. So if you're attending her keynote, be sure to bring your thesaurus.
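The post doesn't spell out the formula behind those scores, so take this as an assumption: one common definition of lexical diversity is the type-token ratio, the number of distinct words divided by the total number of words. A minimal sketch, with a hypothetical function name and made-up tokens:

```python
def lexical_diversity(tokens):
    """Return unique tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up tokens, for illustration only: 3 distinct words out of 4.
print(lexical_diversity(["moon", "mars", "moon", "space"]))
```

Under this definition, a score around .28 would mean roughly 28 distinct words for every 100 words used. The ratio tends to drop as the amount of text grows, so it is fairest when each speaker contributes a similar number of words.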

Topics vs. Response

So enough of just counting stuff. Let’s look at how each speaker’s audience is responding to each topic that they mention. Or said another way, what should each speaker tweet about more?

I’ll take a look at the primary success metrics for engagement with Tweets: retweets and favorites. We could build out a sophisticated scoring algorithm here, but let’s just keep it simple. For each speaker, I just took the top quartile of Tweets that had the highest retweets and favorites, and ran a word frequency analysis on that subset. Or said another way, if we just look at the most successful Tweets for engagement, what words are used the most?
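Here's a rough sketch of that top-quartile approach in plain Python. The tweet layout (a dict with `text`, `retweets`, and `favorites` keys) and the function names are hypothetical, and the sketch ranks by one metric at a time, which is my reading of the retweet and favorite breakdowns that follow.

```python
import re
from collections import Counter


def top_quartile(tweets, metric):
    """Return the 25% of tweets with the highest value for `metric`."""
    ranked = sorted(tweets, key=lambda tweet: tweet[metric], reverse=True)
    # Keep at least one tweet whenever there is any data at all.
    keep = max(1, len(ranked) // 4) if ranked else 0
    return ranked[:keep]


def top_words(tweets, metric, stop_words=frozenset(), n=10):
    """Word frequencies over only the best-performing tweets."""
    counts = Counter()
    for tweet in top_quartile(tweets, metric):
        tokens = re.findall(r"[a-z0-9']+", tweet["text"].lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Made-up tweets and numbers, for illustration only.
sample = [
    {"text": "Please RT this", "retweets": 90, "favorites": 5},
    {"text": "Star Wars forever", "retweets": 10, "favorites": 80},
    {"text": "Lunch was fine", "retweets": 1, "favorites": 2},
    {"text": "Monday again", "retweets": 0, "favorites": 1},
]
print(top_words(sample, "retweets"))
print(top_words(sample, "favorites"))
```

Swapping the `metric` argument is all it takes to move from the most retweeted words to the most favorited ones.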

Most Effective Topics: Neil deGrasse Tyson

Most Retweeted Phrases:

  • “Vaporizing”: Tweets about the Earth being destroyed or what temperature people vaporize at. I’m not kidding about this one
  • “God, sports”: Tweets about athletes thanking someone above for their performance

Most Favorited Phrases:

  • “Have a nice day”: the phrase that he includes after dropping a big, scary fact
  • Mentions of Bill Nye: Mr. Tyson should keep mentioning his buddy Bill Nye, because his followers love it every time he does

Most Effective Topics: Adam Savage

Most Retweeted Phrases:

  • “Please RT”: this may seem like a silly one, but it’s actually a great thing to know. When he asks his followers to retweet something, they do it.
  • “Mythbusters”: mentions of the show get good sharing rates from followers
  • “SOPA”: his stance on the SOPA legislation was shared widely

Most Favorited Phrases:

  • mentioning his co-host, @JamieNoTweet
  • mentioning “Star Wars”: he seems to have found his demographic

Most Effective Topics: Chelsea Clinton

Most Retweeted Phrases:

  • “Full recovery”: a series of Tweets about her mom being in the hospital
  • “$1 men”: while at first this phrase seems to fit better on Snapchat vs. Twitter, these Tweets are actually about income inequality between men and women
  • “10th state”, “11th state”: Tweets about marriage equality in the US

Most Favorited Phrases:

  • “Mom Dad”: her followers love them some Clintons


What Did We Learn?

Analyzing a huge amount of content can be a daunting task. Text analysis can help us look at large volumes of social data from a 10,000-foot level to understand patterns and high-frequency language. But with whatever we do, we also want to bring in success metrics and KPIs to understand not just what is being talked about, but what is resonating with the audience.

I hope you enjoy SX week, whether you’re here in Austin or watching from the sidelines. Based on my calculations, my high-frequency terms this week will be “beer”, “Advil”, and “if you see my family please tell them I love them.”

Chris Kerns

Chris Kerns has spent more than a decade defining digital strategy and is at the forefront of finding insights from digital data. He currently leads Analytics and Research at Spredfast. His research has appeared in The New York Times, Forbes, USA Today and AdWeek, among other publications.