
THE PYTHON SCRIPTS

The Python Scraper

In order to analyze Twitter data, I first needed a reliable (and inexpensive) way to gather the necessary data. As part of the Twitter Developer Agreement and Policy, Twitter requires that individuals scraping Twitter go through one of its sanctioned search APIs (Application Programming Interfaces). Twitter offers three tiers of search APIs for its site: Standard, Premium, and Enterprise. The Standard API, which is available for free, only searches “against a sampling of recent Tweets published in the past 7 days,” meaning that the data is severely limited in both time (to the past seven days) and completeness (only a portion of tweets published in the past seven days will be collected). The Premium and Enterprise APIs allow for more robust searches going back to 2006 (when Twitter was founded) and promise full data fidelity. Given, however, that the Premium and Enterprise APIs are designed to help marketing companies collect data to better understand and market to their user base, these APIs can be pricey to use and offer features irrelevant to the kind of academic research undertaken in this project. Alternative, free Twitter scraping services, such as the Twarc API search by Documenting the Now (DocNow), are similarly limited in their collection rates and limits, much like Twitter’s Standard API.

Given these limitations, I decided to search for an alternative, and I was fortunately granted a Prototyping Fellowship through the University of Virginia to assist with the development of this script. Alyssa Collins, my co-recipient on the grant, and I found a pre-existing Twitter scraping script, developed by Tom K. Dickinson, which was freely available on GitHub. This script latched onto one of Twitter's search APIs, yet it was able to collect data going back to Twitter's inception in 2006, and it was not restricted by search sampling; instead, it promised full data fidelity. While extremely useful, this script did not collect all of the data that I needed for Project TwitLit, so Alyssa, Shane Lin (a developer at the University of Virginia Scholars’ Lab), and I set out to modify it. Like Dickinson's original script, our modified Python scraping script can search for terms from 2006 to the present while maintaining full data fidelity. We also modified the script to automatically rename files according to the search term used. Additionally, we expanded the kinds of metadata that the Python script gathers about each tweet. Whereas the original script collected eight fields of metadata about each tweet (the tweet ID, full text, user ID, user screen name, user name, date and time when the tweet was created, number of retweets, and number of favorites), we modified the script to collect additional metadata related to user description, user language, user location, user time zone, and tweet translation status. The full list of data that we can collect through this Python script is given below:

created_at

id

text

in_reply_to_user_id

user__id

user__name

user__screen_name

user__location

user__description

user__followers_count

user__friends_count

user__created_at

user__favourites_count

user__utc_offset

user__time_zone

user__geo_enabled

user__verified

user__statuses_count

user__lang

user__is_translator

user__is_translation_enabled

user__translator_type

geo

coordinates

place

retweet_count

favorite_count

lang
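The double underscore in field names such as user__screen_name suggests that nested attributes of the tweet's user object are flattened into single-level keys. This is not the project's actual script, but a minimal sketch of how such flattening might work, using a hypothetical flatten_tweet helper and invented sample data:

```python
# Hypothetical sketch: flatten a nested tweet dict into single-level
# keys, joining a parent object name ("user") to its attributes with
# a double underscore, as in the field list above.

def flatten_tweet(tweet, parent_key="", sep="__"):
    """Recursively flatten a nested dict into single-level keys."""
    flat = {}
    for key, value in tweet.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects (e.g. the "user" object).
            flat.update(flatten_tweet(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Invented sample data for illustration only.
sample = {
    "id": 123,
    "text": "hello",
    "user": {"id": 42, "screen_name": "example", "lang": "en"},
}
print(flatten_tweet(sample))
# → {'id': 123, 'text': 'hello', 'user__id': 42,
#    'user__screen_name': 'example', 'user__lang': 'en'}
```

Flattening of this kind makes each tweet a single flat record, which is convenient for writing to tabular formats or, as below, to one line per tweet.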

__________

JSONL Counter

The JSONL Counter is a simple Python script written by Varundev Sukhil, currently a PhD candidate in Computer Engineering at the University of Virginia, to count the number of lines in a JSONL file. Once the results from the Twitter Scraper employed in the project (see above) are collected, they can easily be converted into a JSONL file using tools developed by Documenting the Now (DocNow). In a JSONL file, each Twitter result (including the full metadata associated with each tweet) is rendered as a single line. Using the JSONL Counter Python script, I can easily count the number of lines (and hence the number of tweets) in each JSONL file.
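Since a JSONL file stores one tweet per line, counting tweets reduces to counting lines. The following is a minimal sketch of that idea, not Sukhil's actual script; the function name is my own:

```python
# Minimal sketch: count tweets in a JSONL file by counting its lines,
# assuming the JSONL convention of one JSON object (tweet) per line.

def count_jsonl_lines(path):
    """Return the number of non-empty lines (i.e. tweets) in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines (e.g. a trailing newline) so the count
        # reflects actual records.
        return sum(1 for line in f if line.strip())
```

Iterating over the file object line by line keeps memory use constant, so the count works even for very large collections of tweets.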

__________

Visit GitHub to Download the Scripts

PYTHON SCRIPTS

__________

Download the Instructions for Using the Scripts

INSTRUCTIONS

© Copyright 2020 Christian Howard-Sukhil - All Rights Reserved