Gathering Twitter Data
Analysis of Twitter data has become a popular topic in recent years. Tweets are mined by companies for things like brand monitoring and trends research, by academics for research purposes and, of course, by data science students doing coursework projects! But what’s the best way of building a corpus of tweets for your data work? Well, it depends. Below are some of the ways of putting together a collection of tweets.
The most obvious way to collect data from Twitter is via one of their APIs. The standard Twitter APIs come in two flavours: REST APIs and Streaming APIs. As you can probably guess from the name, the Streaming APIs let you request data from the Twitter live stream, i.e. they send you a continuous stream of tweets matching your request as they are posted. The REST APIs let you access historical data based on the parameters of your request.
To access the standard API you need to register a developer account with Twitter and create an app, which allows you to call an endpoint for the specific type of information or request you require. The standard API provides endpoints for searching, posting and engaging with tweets, pulling public account information, sending and receiving direct messages, creating and managing lists, following users and retrieving trends.
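To make the endpoint idea concrete, here is a minimal sketch of how a GET request to the standard Search API endpoint is assembled. It builds the request with the requests library but never sends it, and the bearer token is a placeholder (in practice you get real credentials when you register your app; depending on the endpoint, OAuth 1.0a user auth may be required instead):

```python
import requests

# Standard v1.1 Search API endpoint (REST, not Streaming).
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

# Placeholder token -- substitute the credentials from your registered app.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

# Build (but don't send) the GET request, so you can inspect how the
# query string and auth header are put together.
req = requests.Request(
    "GET",
    SEARCH_URL,
    params={"q": "#datascience", "count": 100, "result_type": "recent"},
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
).prepare()

print(req.url)  # full URL with the encoded query string
```

Sending the prepared request (e.g. via `requests.Session().send(req)`) would return a JSON payload of matching tweets, subject to the rate limits discussed below.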
While you can access the Search API in Twitter for free, it’s important to know that its endpoints are rate limited. Rate limits are applied in 15-minute windows: with a free account you can submit 180 requests per window, and each request can return up to 100 tweets, which means you can access a maximum of 18,000 tweets every 15 minutes, or 72,000 in an hour. Furthermore, results are limited to tweets from roughly the last 7 days, and the tweets returned are a sample selected for relevance rather than completeness. While this might be fine for a small project or proof of concept, these limitations mean that researchers, whether commercial or academic, need to consider something else.
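The throughput ceiling is just arithmetic on the published limits, which a few lines make explicit:

```python
# Back-of-envelope throughput for the free Search API tier.
REQUESTS_PER_WINDOW = 180   # requests allowed per 15-minute window
TWEETS_PER_REQUEST = 100    # maximum tweets returned per request
WINDOW_MINUTES = 15

tweets_per_window = REQUESTS_PER_WINDOW * TWEETS_PER_REQUEST
tweets_per_hour = tweets_per_window * (60 // WINDOW_MINUTES)

print(tweets_per_window)  # 18000 tweets per 15 minutes
print(tweets_per_hour)    # 72000 tweets per hour
```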
In addition to the standard free version there are other tiers of the Search API which you can pay to access: Enterprise, which is aimed at enterprise-scale applications and likely to be outside the budget of most individuals, and Premium, which is a pay-as-you-go service. Through the Premium tier you can access the full archive of tweets and, as of the date of this post, it costs $99 to send up to 100 requests. Requests through this tier return full-fidelity tweet data at up to 500 tweets per request, so this too could get pricey if you are looking to collect large volumes of tweets.
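To see how quickly Premium costs add up, here is a rough estimate using the figures above ($99 per block of 100 requests, 500 tweets per request); the one-million-tweet target is just an illustrative volume:

```python
import math

# Premium tier figures quoted above.
PRICE_PER_BLOCK_USD = 99     # cost of a block of requests
REQUESTS_PER_BLOCK = 100     # requests per $99 block
TWEETS_PER_REQUEST = 500     # maximum tweets per request

max_tweets_per_block = REQUESTS_PER_BLOCK * TWEETS_PER_REQUEST

# Illustrative target: how much would one million tweets cost?
TARGET_TWEETS = 1_000_000
requests_needed = math.ceil(TARGET_TWEETS / TWEETS_PER_REQUEST)
blocks_needed = math.ceil(requests_needed / REQUESTS_PER_BLOCK)
cost_usd = blocks_needed * PRICE_PER_BLOCK_USD

print(max_tweets_per_block)  # 50000 tweets per $99 block
print(cost_usd)              # 1980 -- cost of one million tweets
```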
Providers other than Twitter, for example Discovertext, also sell Twitter data, and it might be worth checking the pricing of some of these. Depending on your requirements they might be more economical than Twitter.
Most data analysis applications provide packages or modules with which you can access the Twitter API directly and analyse the data obtained.
If you are using R, there are a few packages to choose from for interacting with the Twitter APIs. I tend to use rtweet, though streamR might also be worth a look if you need to connect to the Streaming API in particular. Once you have registered your Twitter app and obtained your authentication tokens, you can call the Search API endpoint directly using rtweet, which returns a dataframe of tweets and related data.
Rapidminer Studio has a Twitter connector which, once you have your authentication details, allows you to create a Rapidminer process that connects to Twitter and pulls down the data you require.
The most popular Python library for interacting with Twitter seems to be tweepy. There are others; Python seems to have more Twitter libraries than most languages.
Note that all of the above call the Search API through the standard tier, so unless you have paid for Premium or Enterprise access your requests will be limited to the rates indicated above.
An alternative to downloading tweets through the API is to scrape them directly from the web. For example, you can run searches through Twitter Advanced Search and then scrape the tweets on the returned pages. I’ve used the ScrapeHero Chrome extension to do this and it worked fine, but if you need to scrape lots of tweets it will take a long time, as you will have to run many searches manually. If you are able to use Twitter’s advanced search operators and don’t need large volumes, it’s a viable option and bypasses the 7-day restriction of the free Search API.
A useful option if you do require volume is Twitterscraper, which essentially automates the above process. It is a Python package that uses the requests library to make multiple parallel requests for Twitter data and Beautiful Soup to parse the returned content. Twitterscraper has a number of advantages:
- It’s free. You can run it from inside Python or from the command line if you have Python installed.
- It doesn’t submit GET requests via the Search API, so there is no rate limiting, meaning you can build your dataset much more quickly.
- You can scrape from the full archive of publicly available tweets: no 7- or 30-day limits.
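For those running it from inside Python rather than the command line, the package exposes a query_tweets function. The sketch below wraps it in a helper; the import is deferred inside the function so the snippet doesn't require Twitterscraper (or network access) until it is actually called, and the hashtag, dates and limit in the commented-out example are purely illustrative:

```python
from datetime import date

def scrape_hashtag(hashtag, begin, end, limit=1000):
    """Scrape public tweets for a hashtag between two dates.

    Deferring the import means this sketch only needs the
    twitterscraper package installed when the function is called.
    """
    from twitterscraper import query_tweets
    return query_tweets(
        f"#{hashtag}",
        limit=limit,
        begindate=begin,
        enddate=end,
    )

# Example call (commented out -- it scrapes Twitter's front-end search):
# tweets = scrape_hashtag("datascience", date(2019, 1, 1), date(2019, 2, 1))
```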
There are some other things to note if using Twitterscraper:
- If scraping for keywords or hashtags, Twitterscraper won’t return any retweets. Scraping for usernames will return retweets.
- You won’t get the same quantity of associated data as you do through the API. For example, scraping for a hashtag with Twitterscraper returns a .json or .csv file (whichever you specify in the --output argument) with 16 fields, whereas rtweet returns a dataframe of 90 columns.
- When using the --enddate argument, I found that Twitterscraper returned data up to the day before the end date specified in the query, so I needed to specify an end date one day after the last day I wanted data for.
- If you do a lot of scraping there will sometimes be gaps in the returned data. The easiest way to check is to plot a time series of tweet frequency by day, note any gaps, and rescrape those periods.
- Other similar solutions for scraping Twitter have been proposed. See this paper, for example, which outlines a methodology for bypassing Twitter API restrictions using the scrapy framework. You should know that, in general, Twitter are not fans of people scraping their site without prior permission. Their terms of service actually forbid it; unsurprisingly, they would prefer people to pay for a higher tier of API access. There is some discussion of this issue in relation to Twitterscraper here. For what it’s worth, I agree with the package creator Ahmet Taspinar when he states that scraping public information from a publicly available website does not require acceptance of that website’s terms of service, but this is something potential users of the package should be aware of. Twitterscraper does not require you to log in to Twitter and only scrapes publicly available information through front-end search.
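The gap check mentioned above can also be done without plotting. Here is a small standard-library sketch that lists the days in a scrape's date range with no tweets; the tweet dates are hypothetical stand-ins for dates parsed from Twitterscraper's output:

```python
from datetime import date, timedelta

# Hypothetical tweet dates -- in practice you'd parse these out of
# Twitterscraper's .json or .csv output.
tweet_dates = [
    date(2019, 3, 1), date(2019, 3, 2), date(2019, 3, 3),
    date(2019, 3, 6), date(2019, 3, 7),   # 4th and 5th are missing
]

def find_gaps(dates):
    """Return the days between min(dates) and max(dates) with no tweets."""
    have = set(dates)
    day, end = min(dates), max(dates)
    gaps = []
    while day <= end:
        if day not in have:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

print(find_gaps(tweet_dates))  # days worth rescraping
```

Any dates this returns are candidates for a targeted rescrape with --begindate and --enddate set around the missing period.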
Finally, if you don’t have the budget to purchase the data you need and Twitterscraper or a similar scraper isn’t suitable, there might be another solution. A number of individuals and organisations have made Twitter datasets available for free online, and there is a useful catalogue of these datasets here. The datasets are provided as lists of tweet ids, so you will need a tool like Hydrator to turn the tweet ids back into the original .json files for analysis. This is a great option if you can find a dataset applicable to your research, as all the collection work is already done.
Having a clear specification of your data requirements and the question you are trying to answer is obviously very important when considering where to go to get your data. This is as true for Twitter as it is for any other data source. Hopefully the above is of some use and good luck in your Twitter mining efforts!