This is the second part of my analysis of tweets containing #RugbyWorldCup based on a dataset collected from Twitter using rtweet between 24th September and 12th October. In this part I will be analysing the tweets themselves, doing some data exploration, sentiment analysis and topic modelling.
As I am analysing the tweets themselves, I have discarded any retweets from the dataset, leaving just organic tweets and replies for analysis. The first thing to do is clean the text. I converted all text to lowercase; removed hashtags, mentions, punctuation, hyperlinks, control and special characters; and took out any extra whitespace. I also converted all text to Latin (ASCII) characters. I’m not sure how Twitter’s algorithms label a tweet’s language, but there were plenty of Chinese logograms mixed in with my tweet words. Presumably some tweets were a mixture of English and Chinese and got labelled as English, hence turning up in my search. Converting to Latin text removes these.
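The cleaning steps above were done in R; purely as an illustration of the same idea, here is a minimal Python sketch (the regexes and the example tweet are my own, not from the original pipeline):

```python
import re
import unicodedata

def clean_tweet(text):
    """Minimal sketch of the cleaning steps: lowercase, strip URLs,
    hashtags/mentions, punctuation and non-Latin characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # hyperlinks
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    # Transliterate to Latin/ASCII; logograms and accents that cannot be
    # decomposed to ASCII are simply dropped
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z\s]", " ", text)       # punctuation, control/special chars
    return re.sub(r"\s+", " ", text).strip()    # extra whitespace

print(clean_tweet("Great try by #RugbyWorldCup @England! 台風 https://t.co/abc café"))
# → great try by cafe
```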
Once cleaned, the tweet text is tokenised (split into individual words). The most frequently occurring words in nearly any corpus will be short words: “I”, “the”, “to”, “was” and so on. These are called stop words and typically offer little informational content. They are removed, and the frequencies of the remaining words are calculated.
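In R this is a couple of tidytext calls; the equivalent logic is simple enough to sketch in Python (the stop word list here is a tiny stand-in for a real lexicon):

```python
from collections import Counter

# A tiny illustrative stop word list; real analyses use a full lexicon
STOP_WORDS = {"i", "the", "to", "was", "a", "and", "of", "in", "for", "were"}

def word_frequencies(tweets):
    """Tokenise cleaned tweets and count the remaining (non-stop) words."""
    counts = Counter()
    for tweet in tweets:
        counts.update(w for w in tweet.split() if w not in STOP_WORDS)
    return counts

freqs = word_frequencies(["wales beat france", "the wales anthem was great"])
print(freqs["wales"], "the" in freqs)  # → 2 False
```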
The word cloud above shows the 100 most common words in the dataset. Nothing too exciting here: lots of mentions of the participating countries. We can also check the 20 most common two-word sequences (bigrams). In this sample, nearly half of the most common bigrams seem to relate to betting.
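Counting bigrams is the same frequency count applied to adjacent word pairs; a hedged Python sketch of the idea (example tweets invented for illustration):

```python
from collections import Counter

def bigram_counts(tweets):
    """Count two-word sequences (bigrams) within each tweet."""
    counts = Counter()
    for tweet in tweets:
        words = tweet.split()
        # zip a word list against itself shifted by one to get adjacent pairs
        counts.update(zip(words, words[1:]))
    return counts

bigrams = bigram_counts(["typhoon hagibis hits japan",
                         "typhoon hagibis cancels matches"])
print(bigrams[("typhoon", "hagibis")])  # → 2
```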
Rather than looking at bigrams individually as in the diagram above, we can visualise the network of commonly occurring bigrams.
The graph above shows which commonly occurring bigrams (n > 30) occur together. The direction of the arrows shows the order of the words, and the darker the link, the more common the bigram. For example, you can see that “typhoon” commonly occurs before “season” and also before “hagibis”; “quarter” occurs before “finals” and also before “final”, which in turn occurs before “score”. The cluster of words in the bottom left of the graph appears to relate mainly to betting.
One thing I didn’t do in preprocessing this dataset is stem the words, i.e. convert them to their stem form, e.g. “finals” would become “final” and “anthems” would become “anthem”. This reduces and consolidates the number of distinct words in the corpus and can make patterns easier to spot, though the text may be less legible as a result.
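To make the idea concrete, here is a deliberately crude plural-stripping sketch in Python; a real analysis would use a proper Porter/Snowball stemmer rather than anything this naive:

```python
def crude_stem(word):
    """Deliberately crude suffix stripping, for illustration only:
    drop a trailing 's' unless the word ends in 'ss'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print([crude_stem(w) for w in ["finals", "anthems", "pass"]])
# → ['final', 'anthem', 'pass']
```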
The graph above shows word correlations, with correlation calculated using the phi coefficient on pairs of words occurring together in the same tweet. Only words with a frequency greater than 50 were used, and only correlations greater than 0.3 were plotted.
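The phi coefficient is just the Pearson correlation applied to the presence/absence of two words across tweets (in R this is typically computed with widyr). A self-contained Python sketch, with invented example tweets:

```python
from math import sqrt

def phi_coefficient(tweets, w1, w2):
    """Phi coefficient between two words' presence across tweets."""
    n11 = n10 = n01 = n00 = 0   # counts of the 2x2 presence/absence table
    for t in tweets:
        words = set(t.split())
        a, b = w1 in words, w2 in words
        if a and b:   n11 += 1
        elif a:       n10 += 1
        elif b:       n01 += 1
        else:         n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

tweets = ["typhoon hagibis", "typhoon hagibis japan", "wales france", "wales japan"]
print(phi_coefficient(tweets, "typhoon", "hagibis"))  # → 1.0 (always together)
print(phi_coefficient(tweets, "typhoon", "wales"))    # → -1.0 (never together)
```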
I did a basic sentiment analysis on the tweets using the bing lexicon and a simple dictionary lookup to calculate the numbers of positive and negative words in tweets per day. As you can see, there was only one day with more negative than positive words in the tweets: the 10th October (possibly typhoon related?). The day with the most positive tweet words was 28th September. Overall, sentiment gets more negative over the time frame of this sample.
One thing that is very apparent is that the tweets in this dataset contain more positive words than negative ones, so days that had more tweets tended to have more positive words overall.
The graph above shows the 12 words contributing most to positive and negative sentiment in this dataset as a whole.
I used tidytext to do the sentiment analysis by day. This treats each day’s tweets as a bag of words, matches positive and negative words from the dictionary (bing in this case) and calculates the overall number of each per day. This has the major limitation that it does not consider word context, so, for example, the tweets “Ireland were not good” and “Ireland were good” would get the same score. There is currently only one sentiment analysis package in R that can handle negators such as “not” and other valence shifters (words that change a sentiment word’s polarity – more details here). That package is sentimentr. Using it should give us a more accurate picture of sentiment in the dataset.
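The bag-of-words approach, and its blindness to negation, can be sketched in a few lines of Python. The positive/negative sets below are tiny stand-ins for the bing lexicon:

```python
POSITIVE = {"good", "great", "win"}        # stand-ins for the bing lexicon
NEGATIVE = {"bad", "cancelled", "poor"}

def daily_sentiment(tweets_by_day):
    """Bag-of-words sentiment: (positive word count) - (negative word count) per day."""
    scores = {}
    for day, tweets in tweets_by_day.items():
        words = [w for t in tweets for w in t.split()]
        scores[day] = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return scores

scores = daily_sentiment({
    "2019-09-28": ["great win for japan", "good game"],
    "2019-10-10": ["match cancelled", "ireland were not good"],  # "not good" still counts +1!
})
print(scores)  # → {'2019-09-28': 3, '2019-10-10': 0}
```

Note how “not good” on the second day still scores “good” as positive; that is exactly the context problem sentimentr is designed to address.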
In the graph above, sentimentr has calculated an average sentiment score for each tweet in the dataset and I then took the daily average of these and plotted them. The picture sentimentr gives us actually isn’t too different to the more basic sentiment analysis performed with tidytext. Sentiment is positive in the early stages but starts to drop as the tournament progresses with a precipitous drop on the 10th October after which it seems to be starting to climb again.
The graphs above, which show common words and bigrams from tweets sent on 10th October, indicate that the typhoon, and annoyance at its effects (i.e. matches being cancelled), are indeed what caused the drop in the sentiment scores on that day. There’s even some profanity in there! In particular, the cancellation of the England-France match seems to have drawn a lot of focus, the cancellation of New Zealand-Italy less so.
To finish the sentiment analysis, I decided to divide the dataset into groups. As the word cloud above shows, many of the most common words in this dataset are country names. I identified which tweets contained country names (and common rugby team nicknames, e.g. Springboks), calculated the average daily sentiment in tweets about each country’s rugby team with sentimentr, and plotted the results on the line graph below. I only included teams that reached the quarter-finals. The graph is interactive: if you double-click on the legend it will clear all the lines, and you can then click on individual countries to compare their sentiment.
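The grouping step amounts to matching each tweet against a table of team names and nicknames. A Python sketch of the idea, with a deliberately incomplete alias table (the real mapping would cover all eight quarter-finalists):

```python
# Illustrative (incomplete) alias table; the real mapping covers all quarter-finalists
TEAM_ALIASES = {
    "south africa": {"south africa", "springboks"},
    "new zealand":  {"new zealand", "all blacks"},
    "japan":        {"japan", "brave blossoms"},
}

def tweets_per_team(tweets):
    """Group (cleaned, lowercased) tweets by the team(s) they mention.
    A tweet mentioning two teams lands in both groups."""
    groups = {team: [] for team in TEAM_ALIASES}
    for t in tweets:
        for team, aliases in TEAM_ALIASES.items():
            if any(alias in t for alias in aliases):
                groups[team].append(t)
    return groups

groups = tweets_per_team(["the springboks looked strong", "japan beat ireland"])
print(len(groups["south africa"]), len(groups["japan"]))  # → 1 1
```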
So could we use sentiment in fans’ tweets about their teams during the pool stage of the tournament to predict their country’s performance in the quarter-finals? Going on the results of the matches earlier today, possibly. If you compare sentiment in tweets containing England with tweets containing Australia in the graph above, tweets about England were much more positive during the pool stages. Similarly, tweets about New Zealand were more positive than those about Ireland.
For the matches tomorrow, tweets about Wales had more positive sentiment than those about France. It should be noted, though, that France’s game against England being called off might have contributed to some of the negative sentiment in their tweets. Also, with many of their fans tweeting in French, there is less data on them in this dataset: France are mentioned the fewest times in #RugbyWorldCup tweets of any of the quarter-final countries.
In the second match, both teams have relatively high sentiment scores, although Japan’s is slightly higher overall. Given that we have already seen a lot of negative sentiment about matches being called off, and that as hosts this is likely to have impacted Japan negatively, the fact that tweets about Japan still show higher sentiment scores than those about South Africa might indicate that rugby fans are favouring them. However, in the last few days of the dataset, tweets about South Africa had more positive sentiment than tweets about Japan and appeared to be trending upward. It would be nice to have data after the 12th October to see if that trend continued.
If predicting the results of tomorrow’s matches based on the sentiment in tweets about the teams, one would probably have to go for Wales and, at a push, Japan, but it would be nice to have more data over a longer date range.
I wanted to try topic modelling with this dataset. Topic modelling is similar in idea to clustering except the goal is to find groups occurring in textual data rather than groups in a set of objects defined numerically. It might be interesting to see what topics are occurring in this dataset of tweets. The exploratory analysis has already identified some possible ones.
Topic modelling assumes that documents (tweets in this case) are constructed from a set of topics and each topic is constructed from a set of words. The goal is to find the topics or themes underlying a set of documents. There is an easy to follow explanation here, which goes into more detail about how the particular algorithm I’m using, Latent Dirichlet Allocation (LDA- not to be confused with linear discriminant analysis) works.
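LDA’s generative assumption can be sketched in a few lines: each word in a document is produced by first drawing a topic from the document’s topic mixture, then drawing a word from that topic’s word distribution. The topics and weights below are made up purely for illustration; fitting LDA means running this process in reverse to infer the distributions from observed documents:

```python
import random

# Made-up topic-word lists and document-topic weights, purely for illustration
TOPICS = {
    "typhoon": ["typhoon", "hagibis", "cancelled", "matches"],
    "betting": ["odds", "bet", "tips", "winner"],
}
DOC_TOPIC_WEIGHTS = [0.7, 0.3]  # this hypothetical document is mostly about the typhoon

def generate_document(n_words, seed=1):
    """Draw each word by first sampling a topic, then a word from that topic."""
    rng = random.Random(seed)
    names = list(TOPICS)
    return [rng.choice(TOPICS[rng.choices(names, DOC_TOPIC_WEIGHTS)[0]])
            for _ in range(n_words)]

print(generate_document(6))
```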
LDA is an unsupervised learning method, and it requires the analyst to specify in advance how many topics (k) to look for in the document corpus. There is no ‘best’ way of choosing k. Some suggest simply trying different values and choosing the one that produces the most meaningful results. Others suggest building topic models for different values of k using cross-validation, choosing the best value by comparing the perplexity of the different models on the validation set (i.e. seeing which of the generated probability models best fits it). In this case I used the ldatuning package, which calculates four metrics to help choose the best value of k for LDA.
I’ve used ldatuning to calculate metrics for k values from 5 to 70 (in steps of 5) and plotted the results above. The graph appears to indicate an optimum value of 10 for k. Running LDA on this dataset for 10 topics doesn’t give easily interpretable results though.
The graph above shows the eight most common words within each topic, and it’s apparent that there is a lot of overlap between topics.
The difficulty with applying a topic modelling method to a corpus of tweets is that individual tweets are quite short to use as documents for building the models. Alvarez-Melis &amp; Saveski (2016) suggest that the best approach involves pooling the tweets to create larger documents and then analysing these. They pooled tweets by hashtag, by username and by conversation, and trained topic models using LDA and Author Topic Models (ATM). They found the best results came from ATM with tweets pooled by conversation. It’s not clear how they were able to build their tweet-conversation documents, though: they indicate they used the reply_to_status_id field, but, in my corpus at least, most of the original status ids to which users were replying are not in my dataset, which necessitates a different approach.
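As an illustration of the pooling idea, here is a Python sketch of the simplest scheme, hashtag pooling (the conversation pooling they favoured isn’t feasible here, as noted above); the example tweets are invented:

```python
import re
from collections import defaultdict

def pool_by_hashtag(raw_tweets):
    """Concatenate all tweets sharing a hashtag into one larger 'document'.
    A tweet with several hashtags contributes to several pools."""
    pools = defaultdict(list)
    for tweet in raw_tweets:
        for tag in re.findall(r"#(\w+)", tweet.lower()):
            pools[tag].append(tweet)
    return {tag: " ".join(ts) for tag, ts in pools.items()}

docs = pool_by_hashtag([
    "Huge win! #RugbyWorldCup",
    "Typhoon update #RugbyWorldCup #Hagibis",
])
print(sorted(docs))  # → ['hagibis', 'rugbyworldcup']
```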