BLOG

Analyzing Tweets with the Natural Language Toolkit

By Nick Caravias
December 17, 2020

The TwitLit research project has been a deeply rewarding and enriching experience. I entered this position hoping to improve my computational skills, but I was shown much more. My eyes have been opened to the intersection between the humanities and technology, and I have been able to see how powerful the tools we use to process information can be. As a computer science student, I feel that it is very common to study the subject in a vacuum, spending many hours working on complicated programs without knowing the full scope of what they can accomplish. As part of Project TwitLit at Bucknell University, I was able to leverage my technical background to do something foreign to me: study the humanities through the sciences.

I began the project at the start of the Fall 2020 semester, working as a research assistant for Dr. Christian Howard-Sukhil, the Digital Humanities Postdoctoral Fellow at Bucknell University. Initially, I was tasked with using Python scripts to scrape data from the Twitter API and upload the mined data to a shared directory. However, a wrench was thrown into the project almost immediately. The developer’s license we had been using relied on a backdoor to scrape tweets from the Twitter API, and this backdoor was suddenly and unexpectedly closed. As a result, my task came to a halt. After talking with Dr. Howard-Sukhil, we decided it would be best to shift my efforts to applying computational techniques to already-mined data, since any further work with the Twitter API would have to be delayed. The data I was given was collected through previous scraping efforts (undertaken by Dr. Howard-Sukhil and previous research assistants); it consisted of tweets that employed popular hashtags within the literary community on Twitter (e.g., #haiku, #writing, etc.). There were also a few collections of non-English tweets that employed the equivalents of #writing, but this volume of files was much smaller.

After exploring the general practices of natural language processing and textual analysis, I decided there were three main ways I could gain a deeper understanding of the Twitter data that we had already gathered.

First, I wanted to see the level of activity between users on Twitter when discussing literary topics. I did this by searching each tweet’s text for an “@” character, which is used to tag another individual on Twitter, and recording whether each tweet contained any mentions. Ultimately, I was able to record the average number of mentions per tweet and the ratio of tweets with mentions to those without.
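For the curious, here is a minimal sketch of this kind of mention counting. It assumes the scraped tweets are stored as a JSON-lines file with a “full_text” field; the file name and field name are illustrative, not the project’s actual layout.

    import json
    import re

    MENTION = re.compile(r"@\w+")

    total_tweets = 0
    total_mentions = 0
    tweets_with_mentions = 0

    # Each line of the (hypothetical) file holds one tweet's JSON metadata.
    with open("haiku_tweets.jsonl") as f:
        for line in f:
            tweet = json.loads(line)
            mentions = MENTION.findall(tweet["full_text"])
            total_tweets += 1
            total_mentions += len(mentions)
            if mentions:
                tweets_with_mentions += 1

    print("Average mentions per tweet:", total_mentions / total_tweets)
    print("Share of tweets with mentions:", tweets_with_mentions / total_tweets)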

My second objective was to find additional hashtags that were cross-referenced among literary categories. I used a similar tactic, searching each tweet for a “#” character and tallying each hashtag found in order to identify the most popular hashtags per topic.
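A comparable sketch of the hashtag tally, under the same illustrative file layout as above:

    import json
    import re
    from collections import Counter

    HASHTAG = re.compile(r"#\w+")
    counts = Counter()

    with open("writing_tweets.jsonl") as f:
        for line in f:
            tweet = json.loads(line)
            # Lowercase the tags so #Haiku and #haiku count as the same hashtag.
            counts.update(tag.lower() for tag in HASHTAG.findall(tweet["full_text"]))

    # The ten most popular hashtags in this category.
    for tag, n in counts.most_common(10):
        print(tag, n)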

For the third, I used the Natural Language Toolkit (NLTK), a powerful Python library, to solve the computational challenges I wanted to overcome. In particular, I wanted to see which phrases were most commonly used in discussion. To do this, I needed to reorganize the tweet texts into tokens (fragments of language that NLTK can interpret). After tokenizing the tweets, I was able to use the library to extract information about the language, including common two-word and three-word phrases.
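Here is a minimal sketch of how such phrase counting can be done with NLTK’s tweet-aware tokenizer and its n-gram utilities; the sample text and variable names are illustrative rather than the project’s exact code.

    from collections import Counter
    from nltk.tokenize import TweetTokenizer
    from nltk.util import ngrams

    tokenizer = TweetTokenizer(preserve_case=False)

    # In practice, `texts` would hold every tweet text in a scraped category.
    texts = ["Just shared a new #haiku over on Instagram!"]

    bigrams = Counter()
    trigrams = Counter()
    for text in texts:
        tokens = tokenizer.tokenize(text)
        bigrams.update(ngrams(tokens, 2))   # two-word phrases
        trigrams.update(ngrams(tokens, 3))  # three-word phrases

    print(bigrams.most_common(10))
    print(trigrams.most_common(10))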

After completing the script and beginning to pass scraped Twitter files through it, a few trends in the data became apparent. One interesting thing I noted was that the number of mentions per tweet (the number of Twitter users tagged in a tweet) was consistent across most literary topics, averaging around 0.1 to 0.2 mentions per tweet. This meant that most tweets did not mention anyone at all, but instead broadcast their authors’ views or discoveries to their followers broadly. It was also interesting to see how common interaction between Instagram and Twitter was. Looking at the common two-word and three-word phrases, almost every category had a link to Instagram in its top ten phrases. This was particularly common for categories related to poetry, including #haiku and #micropoem. Additionally, poetic hashtags commonly incorporated images and were cross-tagged with #pic or #picture.

I hope to continue my research on Twitter textual analytics and find even more trends in the tweets! The Python scripts I used for natural language processing on the Twitter files can be found at https://github.com/TwitLit.

_______________

TwitLit Project: Spring Semester Work and Looking Towards the Summer

By Jimmy Pronchick
May 27, 2020

Spending the spring semester working on the TwitLit project was, for me, an engaging and hands-on first experience with the Digital Humanities (DH). As a research assistant, I worked with another student assistant, Meg Coyle, to document and record data on tweets from 2019 related to the writing community. Christian Howard-Sukhil, the head of the project and the DH Postdoctoral Fellow at the university, trained us to use Python scripts developed for scraping Twitter as well as Twarc tools developed through Documenting the Now (DocNow) in order to collect tweets (and accompanying metadata) that contained different writing-related hashtags. Using these scripts, we can record the number of tweets that contained a particular hashtag within a given time period, as well as further information on each individual tweet, such as the timestamp or the number of likes and retweets.
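As a rough illustration of what this kind of collection looks like, here is a minimal sketch using the twarc Python library from DocNow; the hashtag and credential placeholders are illustrative, and the project’s actual scripts may differ.

    from twarc import Twarc

    # Twitter API credentials (placeholders, obtained via a developer account).
    t = Twarc(consumer_key="...", consumer_secret="...",
              access_token="...", access_token_secret="...")

    count = 0
    for tweet in t.search("#writing"):
        # Each `tweet` is a dict of metadata: text, timestamp, likes, retweets, etc.
        count += 1

    print("Tweets collected:", count)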

From here, we are looking to expand the interpretation of this data into new avenues and to find ways to shed more light on the sizable writing community on Twitter. For example, there are currently line graphs on the TwitLit website that display the growth of some of these hashtags, with analysis of what this data could mean. We have also speculated on ideas such as displaying viral tweets from the Twitter writing community on the website, in order to show what is drawing the most attention from inside and outside the community. One particularly exciting idea, which we unfortunately are unable to undertake without physically being at the university, is the geographic mapping of these tweets. It is possible to record the “geo-tag” of individual tweets, and through this we would be able to map where in the world the writing community on Twitter comes from, then interpret this data further and ask why tweets are concentrated in one place or another. Throughout the summer we plan to continue thinking of interesting ways to display the data we’ve collected and to keep the DH community at Bucknell updated through these blogs.

*This blog has been cross-posted on the DH@Bucknell website.

__________

My TwitLit Adventure

By Meg Coyle
May 26, 2020

Since the beginning of 2020, it has been an awesome experience working on Project Twitter Literature (“TwitLit”), an effort to break down Twitter literature over the course of the past couple of years. I was a stranger to the technique of “scraping” or “scrubbing” tweets, but I was immediately engaged with the idea when I heard about the opportunity. I have always had a love for writing, and in this new age where social media is everyone’s outlet for self-expression, Dr. Christian Howard-Sukhil, who heads the project, helped me understand the shift in literature in this new media era.

In particular, I have worked to scrape over 30 hashtags, some taking hours to process while others took only a matter of minutes. Once COVID-19 became a factor and our campus had to turn remote, our team continued to meet once a week in an effort to finish the job. Despite technical difficulties and the distance between us, it was awesome to see how much work we accomplished. I scraped all of the hashtags and uploaded each to its own file on Google Drive, while Jimmy Pronchick, the other student research assistant on the team, hydrated and counted each tweet, uploading the finished product to the Drive. It was a long process because if at any point my laptop shut down or lost WiFi for a second, I would have to re-scrape the term to ensure the data was accurate. We followed the scraping process as outlined on the project website; the scraping script is freely available for download on GitHub.
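For context on the “hydrating” step: hydration turns a saved list of tweet IDs back into full tweet metadata through the Twitter API. Here is a minimal sketch with the twarc library, with placeholder credentials and an illustrative file name rather than the project’s actual setup.

    from twarc import Twarc

    t = Twarc(consumer_key="...", consumer_secret="...",
              access_token="...", access_token_secret="...")

    # Turn each saved tweet ID back into a full tweet record and count them.
    count = 0
    with open("writing_ids.txt") as ids:
        for tweet in t.hydrate(line.strip() for line in ids):
            count += 1

    print("Hydrated tweets:", count)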

In the future, we will begin to interpret the data. On the TwitLit website, Christian has used line graphs to illustrate the growth of literature hashtags. She breaks the data down into two categories, “Writing Community” and “Fiction and Poetry.” This allows us to see differences in what individuals are using as a platform to share their work with a greater audience. We will continue to do this for new data and try to think of creative ways to share it.

For more information on the project, visit the TwitLit website at www.twit-lit.com.

*This blog has been cross-posted on the DH@Bucknell website.

© Copyright 2020 Christian Howard-Sukhil - All Rights Reserved