In total secrecy, from June to November 9th, 2012, a huge amount of data was harvested from Twitter: in total, over one and a half million Twitter profiles collected from one hundred thousand tweets. Every bit of data was gathered within the rules and limits of Twitter’s API, without violating any of the platform’s built-in protections, but simply by exploiting the openness Twitter offers. A single script, running through a single Twitter account, harvested data from Twitter every hour for five months.
The script was instructed to harvest all tweets containing selected hashtags, which were updated after each new development in the presidential election campaign, such as TV debates, popular memes, or scandals involving the candidates. Using those hashtags was strongly indicative of political affiliation.
For every tweet with those hashtags that the script intercepted and stored, the author’s profile data was harvested along with that of some of their followers, so each tweet yielded several Twitter users. These relations already provided candidates for the later sorting process.
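The collection loop described above can be sketched as follows. This is a minimal illustration, not the original script: `search_tweets` and `get_followers` are hypothetical stand-ins for the Twitter API calls of 2012 (search and follower lookups, run hourly within the documented rate limits), returning canned data here so the structure of the harvest is visible.

```python
# Hypothetical stand-ins for the Twitter API calls; the real script queried
# Twitter's search and follower endpoints once per hour within its rate limits.
def search_tweets(hashtag):
    # Placeholder returning canned data for illustration only.
    return [{"id": 1, "user": "alice", "text": f"voting! {hashtag}"}]

def get_followers(user, limit=5):
    # Placeholder: a small sample of the author's followers.
    return [f"{user}_follower_{i}" for i in range(limit)]

def harvest(hashtags):
    """Collect each matching tweet, its author, and a sample of followers."""
    profiles = {}
    for tag in hashtags:
        for tweet in search_tweets(tag):
            author = tweet["user"]
            profiles.setdefault(author, {"tweets": [], "followers": []})
            profiles[author]["tweets"].append(tweet["text"])
            profiles[author]["followers"] = get_followers(author)
    return profiles
```

Each tweet thus expands into several stored profiles (the author plus a follower sample), which explains how one hundred thousand tweets could yield over a million and a half profiles.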
The first filter excluded Twitter users that were media outlets or public figures, so that the collection focused only on ordinary people in the general audience. Other filters were applied during the data harvest to avoid noise and spam on Twitter.
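A pre-filter of this kind can be sketched as a simple keyword blacklist over the profile name and bio. The keyword set below is illustrative only; the actual lists used are in the downloadable files mentioned later.

```python
# Illustrative blacklist; the real keyword lists are in the project's txt files.
MEDIA_KEYWORDS = {"news", "official", "press", "radio", "campaign"}

def is_general_audience(profile):
    """True if the profile looks like an ordinary user rather than an outlet
    or public figure, judged by blacklisted keywords in name and bio."""
    name = profile.get("name", "").lower()
    bio = profile.get("description", "").lower()
    return not any(kw in name or kw in bio for kw in MEDIA_KEYWORDS)
```

Substring matching like this is crude but cheap, which matters when scoring hundreds of thousands of profiles.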
After the harvest was completed, the database was filtered and sorted using keywords and rules to score the profiles. The method used to sort people is described in the following script, which lists the instructions and keywords that generated the rating of political affiliation. It indicates the steps needed to recreate this conceptual work with any other pool of data.
- Browse and search the complete raw database of tweets and all the Twitter users harvested.
- Download the script and the lists of keywords and hashtags used to sort the data (zipped txt files).
The sorting script:
* Excluding all the users with blacklisted keywords:
1) Users with keywords indicating they are brands or similar, official media, campaign spokespeople, or users with unclear names
2) Users not living in the U.S., identified through location and language
3) Users with default images / icons / logos, identified through the filename of their profile photo
* Rating hashtags and keywords:
1) Separate hashtags into a four-class hierarchy
2) Separate keywords into a four-class hierarchy
* Filtering tweets:
1) Rate tweets by hashtag - % based on the class of hashtags
2) Rate tweets by keyword - % based on the class of keywords
3) Rate tweets whose authors' names contain hashtags, candidate names, or keywords
4) Rate all users/authors of each tweet
5) Rate all the followers of the rated users - % based on the class of hashtags
* Filtering Users:
1) Rate user descriptions by hashtag - % based on the class of hashtags
2) Rate user descriptions by keyword - % based on the class of keywords
3) Rate user name by hashtag - % based on the class of hashtags
4) Rate user name by keyword - % based on the class of keywords
* Repeating some steps to correct false sorting:
1) Reverse the process to exclude keywords when they occur in both right- and left-wing results
2) Rate users with a U.S. location
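The class-weighted rating at the core of the steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions: the hashtag classes and weights are invented placeholders (the real four-class lists are in the downloadable txt files), with the sign of the score marking the wing and the class setting the weight.

```python
# Hypothetical four-class hierarchy; positive classes mark one wing,
# negative the other. Class 1 is the strongest signal, class 4 the weakest.
# The actual hashtags and classes are in the project's downloadable lists.
HASHTAG_CLASSES = {
    "#tcot": 1, "#romney": 2,       # right-leaning (illustrative)
    "#p2": -1, "#obama2012": -2,    # left-leaning (illustrative)
}
CLASS_WEIGHT = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.2}

def rate_text(text, classes):
    """Score a tweet, user name, or description: the sign of each match
    gives the wing, the class hierarchy gives the weight."""
    score = 0.0
    for token in text.lower().split():
        cls = classes.get(token)
        if cls is not None:
            score += (1 if cls > 0 else -1) * CLASS_WEIGHT[abs(cls)]
    return score

def rate_user(user):
    """Combine tweet, description, and name scores into one affiliation rating."""
    score = sum(rate_text(t, HASHTAG_CLASSES) for t in user.get("tweets", []))
    score += rate_text(user.get("description", ""), HASHTAG_CLASSES)
    score += rate_text(user.get("name", ""), HASHTAG_CLASSES)
    return score
```

A user's rating then aggregates their tweets, bio, and name; followers of rated users would inherit a discounted share of that score, per the steps above.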