The #RugbyWorldCup so far on the Twitterverse

We’re about halfway through the 2019 Rugby World Cup. The pool stages finished yesterday, and I thought it might be interesting to look at what the Twitterverse has made of the tournament so far.

From just after the beginning of the tournament (24th September, to be precise) until 12th October, I connected to the Twitter API every few days using rtweet.

Each time I collected around 18,000 tweets in English containing the hashtag #RugbyWorldCup, for a total of over 210,000 tweets, using the standard (free) API. As indicated in my last post, this API won’t return all tweets matching a query; instead it returns a sample of the tweets Twitter judges most relevant to the query. Many of the same tweets came back in more than one batch, so the sample dropped to just 52,429 observations once tweets with duplicate status ids were removed. A tip if you want to avoid getting lots of duplicate tweets like this when using rtweet: use the max_id argument and set it to the id of the last tweet you collected in the previous batch. This marks the oldest tweet already collected, so only tweets with ids up to that value (that is, tweets no newer than it) are returned.
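The de-duplication step can be sketched roughly as follows. This is an illustrative Python sketch, not the rtweet (R) code used for this post; the `status_id` field name follows rtweet's column naming, and the helper names and sample batches are made up for the example.

```python
def dedupe_by_status_id(tweets):
    """Keep the first tweet seen for each status id, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        sid = tweet["status_id"]
        if sid not in seen:
            seen.add(sid)
            unique.append(tweet)
    return unique


def oldest_status_id(tweets):
    """Tweet ids increase over time, so the numerically smallest id is the oldest."""
    return min((t["status_id"] for t in tweets), key=int)


# Two made-up batches that overlap on one tweet.
batch_1 = [{"status_id": "1180", "text": "a"}, {"status_id": "1175", "text": "b"}]
batch_2 = [{"status_id": "1175", "text": "b"}, {"status_id": "1160", "text": "c"}]

combined = dedupe_by_status_id(batch_1 + batch_2)
print(len(combined))              # number of unique tweets across both batches
print(oldest_status_id(batch_1))  # candidate value for a max_id-style cutoff
```

Ids are kept as strings and only converted for comparison, since tweet ids are too large for some tools to handle safely as numbers.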

The time period covered is the 20th September to the 12th October inclusive, which is 23 days, so that’s an average of about 2,280 tweets per day. Given that 13 of the countries participating are predominantly English speaking, we can be fairly confident that this is just a small sample of all the tweets sent during this time that contained #RugbyWorldCup.

The days with the most tweets in this dataset are the Saturday and Sunday of the opening weekend of the tournament. There are smaller spikes on the 28th September, which was the date of Japan’s impressive victory over Ireland, and the 5th October, a day on which the hosts also played. The lowest numbers of tweets occurred on October 1st and 7th, days on which there were no Rugby World Cup matches.

The graph below shows the percentages of the dataset made up of organic tweets, retweets and replies.

Organic tweets made up just over half the observations with retweets making up under half and a relatively small number of replies.
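For anyone wanting to reproduce a breakdown like this, here is a minimal Python sketch (not the R code behind the graph). It assumes each tweet record carries an `is_retweet` flag and a `reply_to_status_id` field, as in rtweet's output; the sample records are invented.

```python
from collections import Counter


def tweet_type(tweet):
    """Classify a tweet as a retweet, a reply, or an organic tweet."""
    if tweet.get("is_retweet"):
        return "retweet"
    if tweet.get("reply_to_status_id"):
        return "reply"
    return "organic"


def type_percentages(tweets):
    """Return the percentage of tweets falling into each type."""
    counts = Counter(tweet_type(t) for t in tweets)
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


# Invented sample: two organic tweets, one retweet, one reply.
sample = [
    {"is_retweet": False, "reply_to_status_id": None},
    {"is_retweet": False, "reply_to_status_id": None},
    {"is_retweet": True, "reply_to_status_id": None},
    {"is_retweet": False, "reply_to_status_id": "1175"},
]
shares = type_percentages(sample)
print(shares)
```

Retweets are checked first here, so a retweet of a reply counts as a retweet; that ordering is a choice made for the sketch.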

Looking at the sources of tweets containing #RugbyWorldCup in this dataset, it’s apparent that most of them were sent from mobile devices. Slightly more tweets were sent from iOS devices such as iPhones than from Android devices. There are 330 different apps in the Other Apps group, most of them used to send only a few tweets, with the exception of Instagram (n=635) and Twitter Web Client (n=601).

Twitter provides data about user location in a few ways. The most accurate is tweet location, where a user has opted to attach location information to their tweets. Most people don’t use this feature though. The second is location mention, where any location names are parsed from the tweet text; however, a mentioned location may not be where the user actually is. The third is the profile location. Lots of users have set this in their account information, but again it may not always be accurate.

The lat_lng function in rtweet adds latitude and longitude information to those tweets in the dataset that have location information attached. That is only about 5% (n=2,579) of the tweets in this dataset. I’ve plotted these on the map below. You can zoom in and drag the map for more detail.

Most of the tweets are concentrated in Ireland and the UK, but nearly all countries participating in the World Cup are represented. It’s interesting to see so many of these geotagged tweets in the USA, particularly on the East Coast. Maybe rugby is growing in popularity stateside!

Another way of determining location is to use the account location information returned by rtweet. 78% of user accounts in this sample have non-empty location fields, so this could provide a better approximation of where users are than the small number of geotagged tweets above. However, the input field for Location is a free-text box, meaning users can set their location to whatever they like. There are 12,092 unique values for account location in this sample (some are towns and cities, some are countries, and some are neither). The 20 most frequent are below.

                       Location     n
1                                11400
11425             United Kingdom  1173
6860             London, England   914
6743                      London   835
10055               South Africa   801
5632                     Ireland   682
11344                         UK   461
4258     England, United Kingdom   423
5866  Johannesburg, South Africa   416
2724     Cape Town, South Africa   320
3836        Dublin City, Ireland   286
3804                      Dublin   272
1909                     Belfast   265
9112      Pretoria, South Africa   234
11751      Wales, United Kingdom   228
3865             Dublin, Ireland   223
1634                   Australia   195
1614       Auckland, New Zealand   185
9700                    Scotland   173
8171         North East, England   170

There are a few ways of approaching the aggregation of these. One involves using regex and lots of data wrangling. The second way, which is doubtless faster, is to use ggmap’s geocode function, either with the Google Maps API (this requires an authentication key) or with the geocoder at Data Science Toolkit.
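To give a flavour of the regex route, here is a rough Python sketch (the post itself works in R). The pattern list is hypothetical and deliberately incomplete; it only covers a handful of the values in the table above, and the sample counts are taken from that table.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration. Order matters: the first match wins,
# so "Belfast" and "Northern Ireland" hit the UK entry before the Ireland one.
COUNTRY_PATTERNS = [
    ("United Kingdom", re.compile(
        r"\b(united kingdom|uk|england|scotland|wales|northern ireland|london|belfast)\b", re.I)),
    ("Ireland", re.compile(r"\b(ireland|dublin)\b", re.I)),
    ("South Africa", re.compile(r"\b(south africa|johannesburg|cape town|pretoria)\b", re.I)),
    ("New Zealand", re.compile(r"\b(new zealand|auckland)\b", re.I)),
    ("Australia", re.compile(r"\baustralia\b", re.I)),
]


def to_country(location):
    """Map a free-text profile location to a country, or 'Unknown'."""
    for country, pattern in COUNTRY_PATTERNS:
        if pattern.search(location):
            return country
    return "Unknown"


def aggregate(location_counts):
    """Sum (location, n) pairs by matched country."""
    totals = Counter()
    for location, n in location_counts:
        totals[to_country(location)] += n
    return totals


# A few (location, n) rows from the table above.
sample = [
    ("", 11400), ("United Kingdom", 1173), ("London, England", 914),
    ("South Africa", 801), ("Ireland", 682), ("Dublin", 272), ("Belfast", 265),
]
totals = aggregate(sample)
print(totals.most_common())
```

This shows why the regex route means lots of wrangling: a location such as "New South Wales" would be wrongly matched to the UK by the pattern above, and every edge case like that needs its own rule.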

The graph below shows the 20 most frequent tweeters of #RugbyWorldCup in this sample.

These are the accounts tweeting the most, but not necessarily the ones generating the most engagement. One way of measuring engagement is to look at the number of replies, likes and retweets a user’s tweets generate. The graph below shows the 20 most replied-to users in the dataset.

Perhaps unsurprisingly, the account receiving the most replies in this sample is @rugbyworldcup, the official Twitter channel for the Rugby World Cup. Look at number 2 though: it’s a company that manufactures steel lintels! If you look at their timeline you’ll see they ran a competition on October 11 using #RugbyWorldCup, so that’s where their replies are coming from. Personal accounts that are generating engagement are the referee Nigel Owens (@Nigelrefowens) and Tarama Taruhei (@menghu_nankuru), who seems to be tweeting mostly in Japanese. The official national rugby accounts with the most replies are England, South Africa, New Zealand, Ireland and Scotland, in that order. Again, this is a small dataset though, so we can’t infer too much from this.
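One plausible way to build a "most replied-to" ranking is to count how often each account appears as the target of a reply. The Python sketch below assumes a `reply_to_screen_name` field, as in rtweet's output, and is not necessarily how the graph above was produced; the sample data is invented.

```python
from collections import Counter


def most_replied_to(tweets, top_n=20):
    """Rank accounts by how many tweets in the sample are replies to them."""
    counts = Counter(
        t["reply_to_screen_name"]
        for t in tweets
        if t.get("reply_to_screen_name")  # skip tweets that are not replies
    )
    return counts.most_common(top_n)


# Invented sample: three replies to one account, one to another, one non-reply.
sample = [
    {"reply_to_screen_name": "rugbyworldcup"},
    {"reply_to_screen_name": "rugbyworldcup"},
    {"reply_to_screen_name": "Nigelrefowens"},
    {"reply_to_screen_name": "rugbyworldcup"},
    {"reply_to_screen_name": None},
]
ranking = most_replied_to(sample, top_n=2)
print(ranking)
```

Counting this way only sees replies that themselves contain #RugbyWorldCup and made it into the sample, which is one more reason not to read too much into the ranking.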

Many tweets contain more than one hashtag. The word cloud below shows the 100 most popular hashtags in tweets also containing #RugbyWorldCup.

We can see the fruits of some hashtag campaigns here, #AsOne (Scotland) and #ShoulderToShoulder and #TeamOfUs (Ireland). #TyphoonHagibis is getting plenty of mentions too.
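The counts behind a word cloud like this can be sketched in a few lines. This is illustrative Python rather than the R used for the post, and it pulls hashtags out of the tweet text with a simple regex, which is an assumption about the approach; the sample tweets are invented.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def co_occurring_hashtags(texts, exclude="rugbyworldcup"):
    """Count hashtags (case-insensitively) that appear alongside the excluded one."""
    counts = Counter()
    for text in texts:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        tags.discard(exclude)
        counts.update(tags)  # each hashtag counted once per tweet
    return counts


# Invented sample tweets.
texts = [
    "Come on! #RugbyWorldCup #AsOne",
    "#rugbyworldcup #TyphoonHagibis stay safe everyone",
    "What a game #RugbyWorldCup #asone #JPNvSCO",
]
counts = co_occurring_hashtags(texts)
print(counts.most_common(100))
```

Lower-casing before counting merges variants such as #AsOne and #asone, which would otherwise split across several entries in the cloud.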

Some of the hashtags above relate to specific matches. Gathering all the hashtags relating to matches we can see which ones were most popular. No surprises that the first 5 all involve only Tier One nations, except for the hosts Japan, who for me have been the most entertaining team at the World Cup so far. They will probably be considered a Tier One nation after this tournament anyway!

That’s all for Part 1 of the analysis of this dataset. Part 2 will focus on the text of the tweets themselves.
