How many tweets would your code crunch if it could crunch Twitter, or why holidays are bad for using Twitter’s streaming API

Twitter has emerged as a convenient source of data for those who want to explore social media. The company provides several access points through its APIs: a REST API for collecting past tweets and a streaming API for collecting tweets in real time. R has libraries for working with both. As is usual in data collection, the catchphrase is “more” – we want more tweets, ideally all that are relevant to our research question. While the REST API is rate-limited (a user can submit 180 requests per 15-minute window, with each request returning up to 100 tweets), the streaming API holds the promise of delivering much more. The nagging question, though, is “how much?”
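
For a sense of scale, these REST limits translate into a hard ceiling that is easy to compute. Here is a back-of-the-envelope calculation in R, assuming (optimistically) that you saturate every 15-minute window around the clock:

### theoretical ceiling of the REST API, using the limits quoted above
requests_per_window <- 180        # requests allowed per 15-minute window
tweets_per_request  <- 100        # at most 100 tweets per request
windows_per_day     <- 24 * 4     # 15-minute windows in a day

requests_per_window * tweets_per_request * windows_per_day
# [1] 1728000   (about 1.7 million tweets per day, at best)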

Originally, Twitter provided three levels of access to streaming data: the firehose with 100% of tweets, the garden hose with 10%, and the “spritzer” with 1%. Firehose access is reserved for Twitter’s partner companies, the garden hose has been discontinued, and a regular user is limited to the “spritzer”, officially known as the “streaming API”. (API stands for application programming interface: a format for submitting data requests over the web that each company defines and documents for its clients. The code snippet at the end of this post shows how to use it in R.) Although the official documentation page (link) no longer explicitly mentions the one percent, its memory lingers in the user community. A crucial question that comes up again and again is “one percent of what?” – is it one percent of the tweets relevant to the question (that is, matching the search parameters) or one percent of the global Twitter traffic?

As it happens, the answer existed but was hidden inside specialized publications. (If you Google the question and are lucky – that is, if the page rankings have not changed in the meantime – you may stumble upon some forum posts discussing the issue in the abstract.) A team of computer scientists (Fred Morstatter and Huan Liu from Arizona State University, Jurgen Pfeffer and Kathleen M. Carley from Carnegie Mellon University) obtained the answer empirically and published it in the paper Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose, published by the Association for the Advancement of Artificial Intelligence (link), with a free copy available from arXiv.org. Fred Morstatter kindly agreed to share the raw data, which is shown in the chart below.

December 2011 saw the onset of large anti-government protests in Syria, and Morstatter and his co-authors wanted to track the tweets coming out of the country. Their filter parameters included a number of hashtags (#syria, #assad, #aleppovolcano, #alawite, #homs, #hama, #tartous, #idlib, #damascus, #daraa, #aleppo, plus Syria spelled in Arabic), one user account (@SyrianRevo), and a geographic bounding box. They collected tweets from December 14th, 2011 to January 10th, 2012 using the streaming API. Afterwards, they purchased the full data for the same period from GNIP – the official data vendor for Twitter.
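
Translated into streamR terms (the package used in the code at the end of this post), a filter of this kind might look roughly like the sketch below. This is not the authors’ actual code: the hashtag list is abbreviated, the user ID is a placeholder (filterStream’s follow argument expects numeric IDs, not screen names), the bounding box coordinates only roughly cover Syria, and my_oauth is the authentication object created in the full example at the end of the post.

### a rough sketch of such a filter in streamR (NOT the authors' code);
### Twitter combines track, follow, and locations with a logical OR
filterStream(file.name = "syria_tweets.txt",
             track = c("#syria", "#assad", "#homs", "#hama",
                       "#damascus", "#aleppo"),      # abbreviated list
             follow = "123456789",                   # placeholder numeric ID
             locations = c(35.7, 32.3, 42.4, 37.3),  # SW lon/lat, NE lon/lat
             timeout = 0,                            # 0 = keep stream open
             oauth = my_oauth)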

The chart below shows the raw counts of tweets day by day. For convenience, you can also examine the number of tweets from the streaming API as a percentage of the firehose tweets – use the radio buttons above the chart.

[Interactive chart: daily tweet counts from the streaming API and the GNIP firehose, December 14, 2011 – January 10, 2012]

As you can see, the streaming API did pretty well until December 20th, 2011 – it was capturing about 90% of the total number of tweets. Later on, however, the share of captured tweets declined markedly: while the anti-government uprising in Syria was gaining force and the total number of tweets more than doubled, the number of captured tweets actually fell. Morstatter and his co-authors believe this is a direct illustration of the 1% cap that Twitter imposes on the streaming API, and of the fact that the cap is calculated from the global traffic in tweets. As the Christmas season began, users became preoccupied with holiday matters and tweeted less. This reduced the global traffic, and Twitter automatically lowered the cap on the number of retrievable tweets. If you are in the business of using the streaming API, the moral of this chart is “beware the holidays”.
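
The mechanism can be written down compactly: the streaming API delivers all matching tweets up to a cap of roughly 1% of global traffic, so coverage drops whenever the topic grows or global traffic shrinks. A toy illustration in R, with all traffic figures invented purely to show the arithmetic:

### toy model of the 1% cap described by Morstatter et al.
### (all numbers are made up for illustration)
global_traffic  <- c(250e6, 200e6)  # global tweets/day: normal vs. holiday
relevant_tweets <- c(2.0e6, 2.5e6)  # tweets matching the filter
cap      <- 0.01 * global_traffic         # cap = 1% of global traffic
captured <- pmin(relevant_tweets, cap)    # you get at most the cap
round(100 * captured / relevant_tweets)   # coverage in percent
# [1] 100  80   (the topic grew, global traffic fell, coverage dropped)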

This phenomenon raises a question about temporal patterns in the use of Twitter. An interesting paper in this regard is When-To-Post on Social Networks, by Nemanja Spasojevic, Zhisheng Li, Adithya Rao, and Prantik Bhattacharyya (arXiv.org). Spasojevic and his co-authors were interested in reactions to posts (retweets, shares) – a metric of some importance to marketers trying to promote their products online. The reactions can be taken as a measure of user activity on a social media platform. The authors found that Twitter use in the United States declines by 50% on Saturdays and Sundays (while Facebook use gets a bit stronger on Sundays). There are also noticeable cross-cultural differences in Twitter use between the US (represented by New York and San Francisco), France (Paris), the UK (London), and Japan (Tokyo). Whereas in the US the peak use hours are business hours (with a secondary peak at 7-8 pm), in Paris the peak falls in the afternoon, in London at the end of the business day, and in Japan during off-business hours. In theory, these variations may also affect the cap on the streaming API, but to a much lesser extent than a holiday celebrated around the world.

If the above discussion has not discouraged you and you would still like to try working with the streaming API, below is sample R code for collecting tweets that contain a specific hashtag (in this case, #NBA). In addition to a working installation of R with three packages (streamR, ROAuth, and RCurl), you will need to obtain a consumer key and consumer secret from Twitter. There are several YouTube videos (like the one here) showing the steps. Once you have the key and the secret, paste them into the code and run it. In the end you will obtain a CSV file with tweets that you can open in Excel or any other data analysis platform.

### load the required packages
library(streamR)
library(ROAuth)
library(RCurl)

### set the working directory to YOUR OWN folder
setwd("My_folder")

### Twitter's OAuth endpoints (these are fixed; no need to change them)
requestURL = "https://api.twitter.com/oauth/request_token"
accessURL = "https://api.twitter.com/oauth/access_token"
authURL = "https://api.twitter.com/oauth/authorize"

### Obtain consumer key and consumer secret from the 
### "Manage your apps" page: https://apps.twitter.com
### Don't forget to enclose the strings into quotation marks
consumerKey = "paste consumer key here"
consumerSecret = "paste consumer secret here"

### download a certificate file so the SSL connection can be verified
download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="cacert.pem")

### create the OAuth credentials object
my_oauth = OAuthFactory$new(consumerKey=consumerKey,
                            consumerSecret=consumerSecret,
                            requestURL=requestURL,
                            accessURL=accessURL, authURL=authURL)

### authorize the connection: R prints a URL to open in a browser;
### approve the app there and paste the resulting PIN back into R
my_oauth$handshake(cainfo = "cacert.pem")

### initiate tweet collection - tweets with the hashtag #NBA,
### with the connection closing after 300 seconds
filterStream(file.name="tw_demo.txt",
             track="#NBA", timeout=300,
             oauth=my_oauth)


### convert tweets from JSON format into a dataframe
df = parseTweets("tw_demo.txt")

### write the dataframe to disk for future use
write.csv(df, "tw_demo.csv", row.names=FALSE)
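
Once the file is written, a quick sanity check is worth the few extra lines. For instance (text and lang are among the columns parseTweets produces):

### quick sanity checks on the collected tweets
nrow(df)          # how many tweets were captured in 300 seconds
head(df$text, 3)  # peek at the first few tweets
table(df$lang)    # language mix of the captured tweets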