We’re about halfway through the 2019 Rugby World Cup: the pool stages finished yesterday, and I thought it might be interesting to look at what the Twitterverse has made of the tournament so far.
Since just after the beginning of the tournament (24th September, to be precise) up until 12th October, I have been connecting to the Twitter API every few days using rtweet.

Each time, I collected around 18,000 tweets in English containing the hashtag #RugbyWorldCup, giving over 210,000 tweets in total via the standard (free) API. As indicated in my last post, this API won’t return all tweets matching a query; instead it returns a sample of the tweets that Twitter judges most relevant to the query. As a result, many tweets were collected more than once, and the sample dropped to just 52,429 observations once tweets with duplicate status ids were removed. A tip if you want to avoid collecting lots of duplicates like this with rtweet: use the max_id argument, setting it to the id of the last (oldest) tweet you collected in the previous batch. This restricts the results to tweets no newer than that id, so each batch picks up where the previous one left off.
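To make the de-duplication and the max_id tip concrete, here is a minimal sketch in Python rather than the rtweet/R code used for this post. It assumes each tweet is a dictionary with a `status_id` key (mirroring the status ids mentioned above); the function names and toy data are made up for illustration. Twitter's search API treats max_id as inclusive, which is why the sketch subtracts one from the oldest id.

```python
def dedupe_by_status_id(batches):
    """Merge batches of tweets, keeping the first copy of each status id."""
    seen = set()
    unique = []
    for batch in batches:
        for tweet in batch:
            status_id = tweet["status_id"]
            if status_id not in seen:
                seen.add(status_id)
                unique.append(tweet)
    return unique


def next_max_id(batch):
    """Return the id to pass as max_id when requesting the next, older batch.

    The API treats max_id as inclusive, so subtract 1 from the oldest id
    to avoid fetching that tweet a second time.
    """
    oldest = min(int(tweet["status_id"]) for tweet in batch)
    return str(oldest - 1)


# Toy data (assumption): two batches that overlap on status id "1002".
batch_1 = [{"status_id": "1003"}, {"status_id": "1002"}]
batch_2 = [{"status_id": "1002"}, {"status_id": "1001"}]

tweets = dedupe_by_status_id([batch_1, batch_2])
print(len(tweets))           # 3
print(next_max_id(batch_1))  # 1001
```

Status ids are compared as integers because they arrive as strings of digits, and comparing those as text gives the wrong order once the lengths differ.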
The time period covered is the 20th September to the 12th October inclusive (23 days), so that’s an average of about 2,280 tweets per day. Given that 13 of the participating countries are predominantly English-speaking, we can be fairly confident that this is just a small sample of all the tweets sent during this time that contained #RugbyWorldCup.

The days with the most tweets in this dataset are the Saturday and Sunday of the opening weekend of the tournament. There are smaller spikes on the 28th September, the date of Japan’s impressive victory over Ireland, and the 5th October, a day on which the hosts also played. The lowest numbers of tweets occur on the 1st and 7th October, days on which there were no Rugby World Cup matches.
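Daily counts like these can be produced by grouping tweets on the date part of their timestamp. The sketch below is in Python, not the R code behind this post, and assumes the timestamps are available as ISO 8601 strings; the values are toy data, not taken from the real dataset.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(timestamps):
    """Count tweets per calendar day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date().isoformat() for ts in timestamps)


# Toy timestamps (assumption), not real observations.
timestamps = [
    "2019-09-28T07:15:00",
    "2019-09-28T09:30:00",
    "2019-10-01T12:00:00",
]
per_day = tweets_per_day(timestamps)
busiest_day, busiest_count = per_day.most_common(1)[0]
print(busiest_day, busiest_count)  # 2019-09-28 2
```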
The graph below shows the percentages of the data set made up of organic tweets, retweets and replies.

Organic tweets made up just over half the observations, with retweets making up under half and replies accounting for a relatively small share.
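One plausible way to arrive at a split like this is to label each tweet as a retweet, a reply or organic, then take percentages. The Python sketch below assumes two fields per tweet, `is_retweet` and `reply_to_status_id`; these names follow the kind of columns rtweet returns but should be treated as assumptions, and the sample is toy data.

```python
from collections import Counter


def tweet_type(tweet):
    """Label a tweet as 'retweet', 'reply' or 'organic'."""
    if tweet.get("is_retweet"):
        return "retweet"
    if tweet.get("reply_to_status_id"):
        return "reply"
    return "organic"


def type_percentages(tweets):
    """Return the percentage of tweets falling into each type."""
    counts = Counter(tweet_type(t) for t in tweets)
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


# Toy sample (assumption): two organic tweets, one retweet, one reply.
sample = [
    {"is_retweet": False, "reply_to_status_id": None},
    {"is_retweet": False, "reply_to_status_id": None},
    {"is_retweet": True, "reply_to_status_id": None},
    {"is_retweet": False, "reply_to_status_id": "1001"},
]
shares = type_percentages(sample)
print(shares)  # {'organic': 50.0, 'retweet': 25.0, 'reply': 25.0}
```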

Looking at the sources of tweets containing #RugbyWorldCup in this dataset, it’s apparent that most of them are being sent from mobile devices. Slightly more tweets were sent from iOS devices such as iPhones than from Android devices. There are 330 different apps in the Other Apps group, most of them used to send only a few tweets, with the exception of Instagram (n=635) and Twitter Web Client (n=601).
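A grouping like Other Apps can be built by keeping the few most common sources and lumping everything else together. How the grouping was actually done for this post isn't stated, so the Python sketch below is only one possible approach; the function name, the `keep` cut-off and the toy counts are all assumptions.

```python
from collections import Counter


def lump_sources(sources, keep=3, other_label="Other Apps"):
    """Keep the `keep` most common sources and lump the rest together."""
    counts = Counter(sources)
    top = dict(counts.most_common(keep))
    other = sum(n for name, n in counts.items() if name not in top)
    if other:
        top[other_label] = other
    return top


# Toy data (assumption): source labels with made-up frequencies.
sources = (
    ["Twitter for iPhone"] * 5
    + ["Twitter for Android"] * 4
    + ["Twitter Web App"] * 3
    + ["Instagram"] * 2
    + ["Twitter Web Client"] * 1
)
grouped = lump_sources(sources)
print(grouped)
```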
Twitter provides data about user location in a few ways. The most accurate is tweet location, where a user has opted to attach location information to their tweets; most people don’t use this feature, though. The second is location mention, where any location names are parsed from the tweet text; however, a mentioned location may not be where the user actually is. The third is the profile location. Lots of users have set this in their account information, but again it may not always be accurate.
Using the lat_lng function in rtweet will add latitude and longitude information to tweets in the dataset that have location information attached. This is only about 5% (n=2579) of the tweets in this dataset. I’ve plotted these on the map below. You can zoom in and drag the map for more detail.
Most of the tweets are concentrated in Ireland and the UK, but nearly all countries participating in the World Cup are represented. It’s interesting to see so many of these geotagged tweets in the USA, particularly on the East Coast. Maybe rugby is growing in popularity stateside!
Another way of determining location is to use the account location information returned by rtweet. 78% of user accounts in this sample have non-empty location fields, so this could provide a better approximation of where users are than the small number of geotagged tweets above. However, the input field for Location is a free-text box, meaning users can customise their location to whatever they like. There are 12,092 unique values for account location in this sample (some of which are towns and cities, some are countries, and some are neither). The first 20 are below.
Location                          n
(empty)                       11400
United Kingdom                 1173
London, England                 914
London                          835
South Africa                    801
Ireland                         682
UK                              461
England, United Kingdom         423
Johannesburg, South Africa      416
Cape Town, South Africa         320
Dublin City, Ireland            286
Dublin                          272
Belfast                         265
Pretoria, South Africa          234
Wales, United Kingdom           228
Dublin, Ireland                 223
Australia                       195
Auckland, New Zealand           185
Scotland                        173
North East, England             170
There are a few ways of approaching the aggregation of these. One involves using regex and lots of data wrangling. The second, which is doubtless faster, is to use ggmap’s geocode function, either with the Google Maps API (this requires an authentication key) or with the geocoder at Data Science Toolkit.
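To give a flavour of the regex route, here is a minimal Python sketch (not the R code used for this post) that maps free-text locations onto countries and sums the counts. The patterns are hypothetical and far from complete, the function names are made up, and the input is simply the non-empty values from the table above.

```python
import re
from collections import Counter

# Hypothetical patterns; the first match wins, so order matters.
# Australia is checked before the UK so that a value such as
# "New South Wales" would not be caught by the "wales" pattern.
COUNTRY_PATTERNS = [
    ("South Africa", re.compile(r"south africa|johannesburg|cape town|pretoria", re.I)),
    ("Australia", re.compile(r"australia", re.I)),
    ("New Zealand", re.compile(r"new zealand|auckland", re.I)),
    ("United Kingdom", re.compile(
        r"united kingdom|\buk\b|england|scotland|wales|london|belfast|northern ireland",
        re.I)),
    ("Ireland", re.compile(r"ireland|dublin", re.I)),
]


def to_country(location):
    """Return the first country whose pattern matches, else 'Unknown'."""
    for country, pattern in COUNTRY_PATTERNS:
        if pattern.search(location):
            return country
    return "Unknown"


def aggregate(location_counts):
    """Sum tweet counts per country from a {location: n} mapping."""
    totals = Counter()
    for location, n in location_counts.items():
        totals[to_country(location)] += n
    return totals


# Non-empty locations and counts copied from the table above.
top_locations = {
    "United Kingdom": 1173, "London, England": 914, "London": 835,
    "South Africa": 801, "Ireland": 682, "UK": 461,
    "England, United Kingdom": 423, "Johannesburg, South Africa": 416,
    "Cape Town, South Africa": 320, "Dublin City, Ireland": 286,
    "Dublin": 272, "Belfast": 265, "Pretoria, South Africa": 234,
    "Wales, United Kingdom": 228, "Dublin, Ireland": 223,
    "Australia": 195, "Auckland, New Zealand": 185, "Scotland": 173,
    "North East, England": 170,
}
by_country = aggregate(top_locations)
print(by_country.most_common())
```

Even on these 19 values the approach needs care over ordering and ambiguous names, which hints at how much wrangling the full 12,092 values would take.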
The graph below shows the 20 most frequent tweeters of #RugbyWorldCup in this sample.

These are the accounts tweeting the most, but not necessarily generating the most engagement. One way of measuring engagement is to look at the number of replies, likes and retweets a user’s tweets generate. The graph below shows the 20 most replied-to users in the dataset.

Perhaps unsurprisingly, the account receiving the most replies in this sample is @rugbyworldcup, the official Twitter channel for the Rugby World Cup. Look at number 2 though: it’s a company that manufactures steel lintels! If you look at their timeline you’ll see they ran a competition on October 11 using #RugbyWorldCup, so that’s where their replies are coming from. Personal accounts that are generating engagement are the referee Nigel Owens (@Nigelrefowens) and Tarama Taruhei (@menghu_nankuru), who seems to be tweeting mostly in Japanese. The official national rugby accounts with the most replies are England, South Africa, New Zealand, Ireland and Scotland, in that order. Again, this is a small dataset, so we can’t infer too much from this.
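A most-replied-to ranking of this kind can be approximated by counting how often each account appears as the target of a reply in the sample. The Python sketch below assumes a `reply_to_screen_name` field on each tweet (the name follows the kind of column rtweet returns, but is an assumption here) and uses made-up handles.

```python
from collections import Counter


def most_replied_to(tweets, n=20):
    """Rank accounts by the number of replies addressed to them."""
    counts = Counter(
        t["reply_to_screen_name"] for t in tweets if t.get("reply_to_screen_name")
    )
    return counts.most_common(n)


# Toy sample (assumption): hypothetical handles, one tweet that is not a reply.
sample = [
    {"reply_to_screen_name": "account_a"},
    {"reply_to_screen_name": "account_b"},
    {"reply_to_screen_name": "account_a"},
    {"reply_to_screen_name": None},
]
ranking = most_replied_to(sample, n=2)
print(ranking)  # [('account_a', 2), ('account_b', 1)]
```

Note that this only sees replies that themselves made it into the sample, which is another reason not to read too much into the ranking.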
Many tweets contain more than one hashtag. The word cloud below shows the 100 most popular hashtags in tweets also containing #RugbyWorldCup.

We can see the fruits of some hashtag campaigns here: #AsOne (Scotland), and #ShoulderToShoulder and #TeamOfUs (Ireland). #TyphoonHagibis is getting plenty of mentions too.
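The counts behind a word cloud like this come from tallying the other hashtags in each tweet that contains #RugbyWorldCup. Here is a minimal Python sketch of that tally; it parses hashtags from the tweet text with a simple regex, which is an assumption (a ready-made hashtags field could be used instead), and the example tweets are invented.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def co_occurring_hashtags(texts, anchor="rugbyworldcup", top=100):
    """Count hashtags appearing alongside the anchor hashtag.

    Hashtags are lower-cased so #AsOne and #asone count as the same tag,
    and each tag is counted at most once per tweet.
    """
    counts = Counter()
    for text in texts:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if anchor in tags:
            counts.update(tags - {anchor})
    return counts.most_common(top)


# Invented example tweets (assumption), using hashtags named in the post.
texts = [
    "Come on Scotland! #AsOne #RugbyWorldCup",
    "#ShoulderToShoulder #RugbyWorldCup",
    "Stay safe everyone #TyphoonHagibis #RugbyWorldCup #AsOne",
    "No World Cup tag here #AsOne",
]
hashtag_counts = dict(co_occurring_hashtags(texts))
print(hashtag_counts)
```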
Some of the hashtags above relate to specific matches. Gathering all the hashtags relating to matches, we can see which ones were most popular. It’s no surprise that the top five involve only Tier One nations, with the exception of the hosts, Japan, who for me have been the most entertaining team at the World Cup so far. They will probably be considered a Tier One nation after this tournament anyway!
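Match hashtags can be picked out with a pattern. The Python sketch below assumes they take the form of two three-letter team codes joined by a lower-case "v", for example #JPNvIRE; that format, the function name and the example tweets are assumptions for illustration. It treats #JPNvIRE and #IREvJPN as the same fixture.

```python
import re
from collections import Counter

# Assumed format: three capitals, a lower-case "v", three capitals.
MATCH_TAG = re.compile(r"#([A-Z]{3})v([A-Z]{3})\b")


def match_counts(texts):
    """Count tweets per fixture, ignoring which team is written first."""
    counts = Counter()
    for text in texts:
        # A set ensures each fixture is counted once per tweet.
        fixtures = {
            " v ".join(sorted((first, second)))
            for first, second in MATCH_TAG.findall(text)
        }
        counts.update(fixtures)
    return counts


# Invented example tweets (assumption).
texts = [
    "What a game #JPNvIRE #RugbyWorldCup",
    "#IREvJPN unbelievable #RugbyWorldCup",
    "#NZLvRSA #RugbyWorldCup",
    "#RugbyWorldCup #RWC2019",
]
fixtures = match_counts(texts)
print(fixtures.most_common())
```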

That’s all for Part 1 of the analysis of this dataset. Part 2 will focus on the text of the tweets themselves.