One of the most popular methods of text visualization is the word cloud. These word maps are a great way to quickly get a sense of the major buzzwords around a topic. While these maps can look complicated, they are actually quite easy to make in R using the wordcloud2 package.
In this post we will scrape tweets from a specific hashtag, clean the data, and create a custom word cloud with full control of the colors, size, and shape of the cloud.
Circling the Wagons
For this word cloud let’s scrape some tweets related to a New York football team. Since there is only one football team that actually plays in New York I guess we will use that one. 😉
First we have to get an API key. Inter#acktives has a great, very detailed article on setting this up.
We have the option to collect tweets by user or by hashtag. Let’s use the hashtag #gobills.
library(twitteR)
# Authenticate first with setup_twitter_oauth() and your own API keys
bills_tweets <- searchTwitter("#gobills", n = 5000, lang = "en")
This line gathers 5,000 tweets, written in English, that contain the #gobills hashtag.
Cleaning the Data
Now we have a big ugly list of five thousand tweets. The next step is to extract the text and clean it up so it will play nicely with our word cloud function.
Michael Harper did a phenomenal job developing a text-cleaning pipeline for word clouds, so I will be building on his method here.
library(tidytext)
library(tm)
library(qdapRegex)
library(stringr)
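bills_tweets_df <- twListToDF(bills_tweets)
# Join the tweets with a space so neighboring words do not fuse together
text <- str_c(bills_tweets_df$text, collapse = " ")
text <- text %>%
  str_replace_all("\\n", " ") %>%         # line breaks
  rm_twitter_url() %>%                    # shortened Twitter links
  rm_url() %>%                            # any other URLs
  str_remove_all("#\\S+") %>%             # hashtags
  str_remove_all("@\\S+") %>%             # @ mentions
  removeWords(stopwords("english")) %>%   # common words (a, the, it, ...)
  removeNumbers() %>%
  stripWhitespace() %>%
  removePunctuation()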
Each of the above functions removes an unwanted element from our text. If the %>% is confusing, it is called a pipe, and we will cover it another time. For now, just think of it as a way to apply several functions to the text variable, one after another.
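If you want a quick feel for the pipe before then, here is a minimal sketch on a made-up tweet. The toy_tweet string and the nested and piped names are mine, purely for illustration, and str_squish() is a stringr helper that I am using here only to tidy the leftover spaces. The same steps are written twice, once as nested calls and once with the pipe.

```r
library(stringr)  # stringr also re-exports the %>% pipe

# Hypothetical example tweet, not taken from the real data
toy_tweet <- "Go @BuffaloBills! #gobills   see you Sunday"

# Nested calls: you have to read these from the inside out
nested <- str_squish(str_remove_all(str_remove_all(toy_tweet, "#\\S+"), "@\\S+"))

# Piped calls: each step receives the result of the step before it
piped <- toy_tweet %>%
  str_remove_all("#\\S+") %>%   # drop hashtags
  str_remove_all("@\\S+") %>%   # drop @ mentions
  str_squish()                  # collapse repeated spaces and trim the ends

print(piped)
identical(nested, piped)  # both versions should give the same string
```

Both versions do the same work. The piped one just reads in the order the steps happen.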
Next we need to convert the text into a data frame with two columns. The first column contains the word and the second contains its frequency.
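textCorpus <- Corpus(VectorSource(text)) %>%
  TermDocumentMatrix() %>%
  as.matrix()   # terms in rows, one column of counts
textCorpus <- sort(rowSums(textCorpus), decreasing = TRUE)
textCorpus <- data.frame(word = names(textCorpus),
                         freq = textCorpus, row.names = NULL)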
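To see what this word and frequency table looks like on a small scale, here is a minimal sketch that builds one from a made-up string with the tm package. The toy_text string and the toy_freq name are assumptions for illustration, and I am relying on tm's default settings, which lowercase the text and drop words shorter than three characters.

```r
library(tm)

# Hypothetical toy text standing in for the cleaned tweets
toy_text <- "bills win snow bills playoffs bills snow"

# Build a term-document matrix: one row per word, one column per document
tdm <- TermDocumentMatrix(Corpus(VectorSource(toy_text)))

# Sum across documents and sort so the most frequent word comes first
counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Two-column data frame in the word/freq layout that wordcloud2 expects
toy_freq <- data.frame(word = names(counts), freq = counts, row.names = NULL)
print(toy_freq)
```

With only one document, rowSums() simply returns the single column of counts, but the same code would still work if you fed in several documents.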
If we wanted, we could plug our textCorpus into the word cloud function, but let’s first check the distribution of word frequencies using the boxplot() function.
boxplot(textCorpus$freq, ylab='Word Frequency')
Whoa! This doesn’t even look like a boxplot! It appears that we have some gigantic outliers on the upper end. If we feed this into a word cloud, the words with very large frequencies will dominate it and obscure the smaller words.
We can tune this by setting an upper frequency limit for our word counts. Based on this boxplot let’s set our max count at 150.
textCorpus1 <- textCorpus   # keep the original counts untouched
textCorpus1$freq[textCorpus1$freq > 150] <- 150
This line takes any number in the frequency column that is greater than 150 and changes it into 150.
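If the bracket syntax feels cryptic, base R's pmin() does the same capping in one call. Here is a minimal sketch on made-up counts; the toy_freq vector and the capped_a and capped_b names are hypothetical and are not the real tweet frequencies.

```r
# Hypothetical toy frequencies, not the real tweet counts
toy_freq <- c(bills = 900, buffalo = 400, playoffs = 150, snow = 20)

# Logical-index assignment: overwrite every value above the cap
capped_a <- toy_freq
capped_a[capped_a > 150] <- 150

# pmin() takes the element-wise minimum of each count and the cap
capped_b <- pmin(toy_freq, 150)

print(capped_b)
all(capped_a == capped_b)  # the two approaches should agree
```

Either approach works. I find pmin() a little easier to read when the cap is the only change being made.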
Customizing the Word Cloud
At this point we could make a regular word cloud, but I’d like to do something special. The wordcloud2 package allows you to choose a custom shape for your word cloud. All you need to do is provide it with a black and white image to act as a mask. For our word cloud let’s use this image…
Now we feed the data frame into the wordcloud2() function and pass the path to our saved silhouette to the figPath argument. The function also gives us the option to control the color and size of the words.
library(wordcloud2)
figPath <- "path/to/your/mask/image"
wordcloud2(textCorpus1, figPath = figPath, size = 0.5,
           color = "#00338D",
           backgroundColor = "white")
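The color argument does not have to be a single value. As far as I know, wordcloud2() also accepts a vector with one color per word, so here is a minimal sketch that alternates two colors. The toy_cloud data, the two_tone name, and the second hex value are my own assumptions for illustration. I also use the built-in shape argument instead of a mask image so the sketch runs without any image file.

```r
library(wordcloud2)

# Hypothetical toy data in the same word/freq layout
toy_cloud <- data.frame(
  word = c("bills", "buffalo", "playoffs", "snow", "ravens"),
  freq = c(150, 120, 80, 60, 40)
)

# Alternate two colors down the rows (the hex values are assumptions)
two_tone <- rep_len(c("#00338D", "#C60C30"), nrow(toy_cloud))

# shape = "circle" is a built-in shape, used here in place of figPath
wc <- wordcloud2(toy_cloud, size = 0.5, color = two_tone,
                 backgroundColor = "white", shape = "circle")
wc  # renders in the RStudio viewer or a browser
```

With only five words the cloud will look sparse, but it is a quick way to try out color choices before running the full data set.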
Let’s see the result.
Conclusion
Wow! That filled out nicely. Looks like there is a lot of buzz about the upcoming playoff game against the Ravens that will be played at Orchard Park. I wonder what the results would be if this were done with other sports teams? I guess the word snow would probably come up a lot less. 😂
Thanks for the read, and if you decide to create a word cloud for your favorite team, I’d love to see it.
Bonus: If you are wondering why the word cloud looks so similar to the actual Bills logo despite not having an outline, check out Gestalt psychology (more specifically, the law of closure).
Great post, circle the wagons, go bills and good data cleaning skills
Thanks Austin! Credit goes to Michael Harper’s blog for that.