• Twitter mentions: Social Network Analytics

    Using igraph in R, along with Python and the Unix terminal

  • DATASET INFO

    The dataset was the biggest challenge of this project.

    We worked with a dataset of 476 million tweets, obtained through a Stanford resource.

     

    The format of the downloaded file was the following:
     

    T 2009-07-01 00:04:20
    U http://twitter.com/greeneyed_panda
    W @Dprinzessin jajajajajajaja.......... no.....
     

    In the tweet above, T is the time the tweet was posted (2009-07-01 00:04:20), U is the user who posted it (greeneyed_panda), and W is the text of the tweet (@Dprinzessin jajajajajajaja.......... no.....).
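
    The record layout above can be read with a small parser. The following is a minimal sketch, not the code used in the project: the names `Tweet` and `parse_tweets` are hypothetical, and it assumes each record is a T line, then a U line, then a W line, with the tag separated from its value by whitespace.

```python
from typing import Iterable, Iterator, NamedTuple


class Tweet(NamedTuple):
    day: str   # date part of the T line, e.g. "2009-07-01"
    user: str  # handle taken from the end of the U line's URL
    text: str  # raw tweet text from the W line


def parse_tweets(lines: Iterable[str]) -> Iterator[Tweet]:
    """Yield one Tweet per complete T/U/W record; incomplete records are skipped."""
    day = user = None
    for raw in lines:
        parts = raw.strip().split(None, 1)  # tag, then the rest of the line
        if len(parts) != 2:
            continue  # blank line or a tag with no value
        tag, value = parts
        if tag == "T":
            day, user = value.split()[0], None  # keep the date, drop the time
        elif tag == "U":
            user = value.rstrip("/").rsplit("/", 1)[-1]  # last URL segment
        elif tag == "W" and day and user:
            yield Tweet(day, user, value)
            day = user = None


sample = [
    "T 2009-07-01 00:04:20",
    "U http://twitter.com/greeneyed_panda",
    "W @Dprinzessin jajajajajajaja.......... no.....",
]
tweets = list(parse_tweets(sample))
print(tweets[0].day, tweets[0].user)
```

    Because the parser is a generator, it can stream over a file handle one line at a time, which matters for a file of this size.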

     

    Twitter users can mention another user with the ‘@’ character; in the tweet above, user greeneyed_panda mentions user Dprinzessin. To analyse the network, we first had to process the raw data, with a programming language of our choice, into five .csv files, one for each of the first five days of July 2009, in the following format:


    from,to,weight
    user1,user2,5
    user2,user1,1
    user1,user3,2
    ...


    Each .csv file describes the weighted directed mention graph for the respective day. In the example above, user1 mentioned user2 5 times, user2 mentioned user1 once, and user1 mentioned user3 twice.
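
    As an illustration of how such a file could be produced, the sketch below counts mentions per (from, to) pair and writes them in the format above. It is an assumption-laden example rather than the project's code: `mention_edges` and `write_edge_csv` are hypothetical names, and the pattern `@(\w+)` is a simplification that treats any `@` followed by word characters as a mention and keeps handles exactly as written.

```python
import csv
import io
import re
from collections import Counter

# Simplified mention pattern (assumption): "@" followed by letters, digits or "_".
MENTION_RE = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Count mentions per (from, to) pair from an iterable of (user, text) tuples."""
    counts = Counter()
    for user, text in tweets:
        for target in MENTION_RE.findall(text):
            counts[(user, target)] += 1
    return counts


def write_edge_csv(counts, handle):
    """Write the counts as from,to,weight rows, sorted for reproducible output."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["from", "to", "weight"])
    for (src, dst), weight in sorted(counts.items()):
        writer.writerow([src, dst, weight])


# Made-up tweets for one day, for illustration only.
day_tweets = [
    ("user1", "@user2 hi"),
    ("user1", "hello @user2 and @user3"),
    ("user2", "@user1 ok"),
]
counts = mention_edges(day_tweets)
buffer = io.StringIO()
write_edge_csv(counts, buffer)
print(buffer.getvalue())
```

    Running the same two functions once per day, on the tweets whose date matches that day, would give the five files.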

    THEMES

    R, Python, Unix Terminal, igraph, Network Analysis, PageRank

    PROJECT DESCRIPTION

    In this project, we studied the network of Twitter users and the mentions between them. Starting from a very large and poorly structured dataset, we used the Unix terminal (sed) and regular expressions to filter and transform it efficiently into a lighter dataset. Then, using Python, we converted the data from a linear (line-by-line) format to a tabular one (columns) so that it could be loaded into igraph. Using igraph, we created a weighted directed graph and performed various tasks to explore the network:

    • Identifying basic properties of the network, such as the number of vertices, number of edges, diameter of the graph, average in-degree and average out-degree.
    • Visualising the 5-day evolution of these metrics and commenting on observed fluctuations.
    • Identifying the important nodes of the graph, based on in-degree, out-degree and PageRank.
    • Performing community detection on the mention graphs, by applying fast greedy, Infomap, and Louvain clustering to the undirected versions of the 5 mention graphs.
    • Visualising the different communities in the mention graph.
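
    To make the first and third tasks above concrete, here is a dependency-free Python sketch of the basic counts and a weighted PageRank. The analysis itself was done with igraph in R, so this is only a stand-in: `basic_metrics` and `pagerank` are hypothetical names, the damping factor of 0.85 is the conventional default and an assumption here, and scores may differ slightly from igraph's implementation.

```python
from collections import defaultdict


def basic_metrics(edges):
    """Vertex count, edge count and average degree for a (src, dst) -> weight map."""
    nodes = {node for pair in edges for node in pair}
    return {
        "vertices": len(nodes),
        "edges": len(edges),
        # In a directed graph, average in-degree and out-degree are both E / V.
        "avg_degree": len(edges) / len(nodes),
    }


def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a (src, dst) -> weight map."""
    nodes = sorted({node for pair in edges for node in pair})
    out_weight = defaultdict(float)
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# The three example rows from the .csv format shown earlier.
edges = {("user1", "user2"): 5, ("user2", "user1"): 1, ("user1", "user3"): 2}
metrics = basic_metrics(edges)
ranks = pagerank(edges)
print(metrics)
print(max(ranks, key=ranks.get))
```

    On this tiny example user1 ranks highest because it receives all of user2's outgoing weight, while user1 splits its own weight between user2 and user3.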

    CODE

    Code in R is available upon request.

  • REPORT