To analyze Twitter data, I first needed a reliable (and inexpensive) way to gather the necessary data. As part of the Twitter Developer Agreement and Policy, Twitter requires that individuals scraping Twitter go through one of its sanctioned search APIs (Application Programming Interfaces). Twitter offers three tiers of search APIs for its site: Standard, Premium, and Enterprise. The Standard API, which is available for free, only searches “against a sampling of recent Tweets published in the past 7 days,” meaning that the data is severely limited in both time (to the past seven days) and completeness (only a portion of tweets published in the past seven days will be collected). The Premium and Enterprise APIs allow for more robust searches going back to 2006 (when Twitter was founded) and promise full data fidelity. Because the Premium and Enterprise APIs are designed to help marketing companies collect data to better understand and market to their user base, however, these APIs can be pricey to use and offer features irrelevant to the kind of academic research undertaken in this project. Alternative, free Twitter scraping tools, such as Twarc by Documenting the Now (DocNow), are subject to the same collection rates and limits as Twitter’s Standard API.
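To make the Standard tier's constraint concrete, the minimal Python sketch below computes the window of tweets reachable through that API; the helper name is hypothetical, and the commented tweepy usage is an assumption about how such a search is typically issued, not part of this project's tooling.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: the Standard search API only reaches tweets
# published within roughly the past seven days.
def standard_search_window(now=None):
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=7), now

# A typical Standard-tier search via tweepy (assumed usage; requires
# developer credentials, so the call is shown only as a comment):
# import tweepy
# api = tweepy.API(tweepy.OAuth1UserHandler(KEY, SECRET, TOKEN, TOKEN_SECRET))
# for tweet in tweepy.Cursor(api.search_tweets, q="#amreading").items(100):
#     print(tweet.id, tweet.text)

earliest, latest = standard_search_window()
print(f"Standard search reaches back only to {earliest:%Y-%m-%d}")
```

Anything published before the computed cutoff is simply unreachable through the free tier, which is what motivated the search for an alternative described next.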
Given these limitations, I decided to search for an alternative, and I was fortunately granted a Prototyping Fellowship through the University of Virginia to assist with the development of this script. Alyssa Collins, my co-recipient on the grant, and I found a pre-existing Twitter scraping script, developed by Tom K. Dickinson, that was freely available on GitHub. Although this script relied on one of Twitter's search APIs, it could collect data going back to Twitter's inception in 2006, and it was not restricted by search sampling but promised full data fidelity. While extremely useful, this script did not collect all of the data that I needed for Project TwitLit, so Alyssa, Shane Lin (a developer at the University of Virginia Scholars’ Lab), and I set out to modify it. Like Dickinson's original script, our modified Python scraping script can search for terms from 2006 to the present while maintaining full data fidelity. We also modified the script to automatically rename files according to the search term used. Additionally, we expanded the kinds of metadata that the script gathers about each tweet. Whereas the original script collected eight fields of metadata about each tweet (the Tweet ID, full text, user ID, user screen name, user name, date and time when the tweet was created, number of retweets, and number of favorites), we modified the script to collect additional metadata: user description, user language, user location, user time-zone, and tweet translation status. The full list of data that we can collect through this Python script appears below: