

The Python Scraper

To analyze Twitter data, the Project TwitLit team first needed a reliable (and inexpensive) way to gather it. Under the Twitter Developer Agreement and Policy, Twitter requires that individuals scraping Twitter go through one of its sanctioned search APIs (Application Programming Interfaces). Twitter offers three tiers of search APIs for its site: Standard, Premium, and Enterprise. The Standard API, which is free to use, only searches “against a sampling of recent Tweets published in the past 7 days,” meaning that the data is severely limited in both time (to the past seven days) and completeness (only a portion of the tweets published in those seven days will be collected). The Premium and Enterprise APIs allow for more robust searches going back to 2006 (when Twitter was founded) and promise full data fidelity. However, because the Premium and Enterprise APIs are designed to help marketing companies collect data in order to better understand and market to their user base, these APIs can be pricey and offer features irrelevant to the kind of academic research undertaken in this project. Free alternatives, such as the Twarc search tool by Documenting the Now (DocNow), are subject to collection rates and limits similar to those of Twitter’s Standard API.

Given these limitations, Christian Howard-Sukhil decided to search for an alternative, and she was awarded a Prototyping Fellowship through the University of Virginia to assist with the development of a scraping script. Howard-Sukhil and Alyssa Collins, who was co-recipient of the fellowship, found a pre-existing Twitter scraping script, developed by Tom K. Dickinson, which was freely available on GitHub. This script latched onto one of Twitter’s search APIs, but it was able to collect data going back to Twitter’s inception in 2006, and it was not restricted by search sampling but promised full data fidelity.

While extremely useful, this script did not collect all of the data needed for Project TwitLit, so Collins, Howard-Sukhil, and Shane Lin (a developer at the University of Virginia Scholars’ Lab) set out to modify it. Like Dickinson’s original script, the modified Python scraping script can search for terms from 2006 to the present while maintaining full data fidelity. The script was also modified to automatically rename its output files according to the search term used. Additionally, team members expanded the kinds of metadata that the Python script gathers about each tweet. Whereas the original script collected eight fields of metadata about each tweet (the tweet ID, full text, user ID, user screen name, user name, date and time when the tweet was created, number of retweets, and number of favorites), the modified script also collects metadata on user description, user language, user location, user time zone, and tweet translation status. The full list of data that is collected through this Python script is listed below:


JSONL Counter

The JSONL Counter is a simple Python script written by Varundev Sukhil, currently a PhD candidate in Computer Engineering at the University of Virginia, to count the number of lines in a JSONL file. Once the results from the Twitter scraper employed in the project (see above) are collected, they can easily be converted into a JSONL file using tools developed by Documenting the Now (DocNow). In a JSONL file, each Twitter result (including the full metadata associated with each tweet) is rendered as a single line. Using the JSONL Counter, the Project TwitLit team could easily count the number of lines (and hence the number of tweets) in each JSONL file.
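
Because each tweet occupies exactly one line, counting tweets reduces to counting lines. The following is only a minimal sketch of that idea, not Varundev Sukhil's actual script: the function name count_jsonl_lines and the three made-up sample records are assumptions introduced for illustration.

```python
import json
import os
import tempfile


def count_jsonl_lines(path):
    """Return the number of non-blank lines in a JSONL file."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, e.g. a trailing newline
                count += 1
    return count


# Demo: write three made-up records to a throwaway file, then count them.
sample = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")

print(count_jsonl_lines(tmp.name))  # prints 3
os.remove(tmp.name)
```

Reading the file line by line, rather than loading it whole, keeps memory use flat even for files containing millions of tweets.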


CSV Scraper

Nick Caravias created the CSV Scraper for Project TwitLit to assist with textual analysis of tweets pulled from Twitter’s API. More specifically, this Python script is an easy-to-use tool for applying computational textual analysis to collected Twitter data, and it allows the user to save pertinent findings all in one file. This script pulls four pieces of information:

     1. Word frequency

     2. Collocation (words that occur together)

     3. Sentiment analysis

     4. Language detection and frequency

The word frequency and collocation features of this code create a dictionary of the words used.
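
The first two features can be illustrated with a short sketch. This is not Nick Caravias's code: the function names, the "text" column name, and the sample rows are all hypothetical, and treating words that occur together as adjacent word pairs is a simplifying assumption.

```python
import csv
import io
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_and_pair_counts(rows, text_column="text"):
    """Count single words and adjacent word pairs across CSV rows."""
    words = Counter()
    pairs = Counter()
    for row in rows:
        tokens = tokenize(row[text_column])
        words.update(tokens)
        # Pair each token with its right-hand neighbour.
        pairs.update(zip(tokens, tokens[1:]))
    return words, pairs


# Demo on an in-memory CSV with a single, assumed "text" column.
sample_csv = io.StringIO(
    "text\n"
    "I love reading poetry\n"
    "reading poetry is fun\n"
)
words, pairs = word_and_pair_counts(csv.DictReader(sample_csv))
print(words.most_common(2))  # [('reading', 2), ('poetry', 2)]
print(pairs.most_common(1))  # [(('reading', 'poetry'), 2)]
```

A Counter is a dictionary subclass, so both results map each word (or word pair) to the number of times it appears.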

It should be noted that this script can be used to analyze any CSV file containing natural-language text, not just files with data collected via the Twitter scraping process employed by Project TwitLit.


Visit GitHub to Download the Scripts



Download the Instructions for Using the Scripts


© Copyright 2020 Christian Howard-Sukhil - All Rights Reserved