Companion repository to "The big picture of public discourse on Twitter by clustering metadata", where we analyze Swedish Twitter data from 2015.
To illustrate what we did, three Python scripts are provided. The first two require the networkx library, and the third requires the gensim library.
You also need to have a working installation of Infomap (http://www.mapequation.org/code.html, the standalone version). We use the standalone version because we have been unable to make the Python Infomap library generate the same results as the standalone.
The first script, 0_make_mentiongraph.py, is not meant to be run unless necessary. It is used to process a large number of tweets to generate a networkx graph, which is stored as a pickle file. Example call:
python 0_make_mentiongraph.py tweet_dir my_graph
... where "tweet_dir" (in this example) is the name of a directory with one file per account, named .txt, containing the tweets for that user, and "my_graph" is the prefix of the pickle file that will store the generated graph.
We will provide "ready-made" pickle files so that you should hopefully be able to skip this step.
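As a rough illustration of what building a mention graph involves, here is a minimal sketch that counts mention edges between accounts. It is not the code from 0_make_mentiongraph.py: the function name mention_edges, the in-memory input format and the mention regex are all assumptions made for this example, and it uses only the standard library.

```python
import re
from collections import Counter

# Assumption: a mention is "@" followed by letters, digits or underscores.
MENTION_RE = re.compile(r"@(\w+)")


def mention_edges(tweets_by_user):
    """Count undirected mention edges.

    tweets_by_user is a hypothetical {username: [tweet text, ...]} mapping;
    the real script reads tweets from files instead.
    """
    edges = Counter()
    for user, tweets in tweets_by_user.items():
        for text in tweets:
            for mentioned in MENTION_RE.findall(text):
                a, b = user.lower(), mentioned.lower()
                if a != b:  # skip self-mentions
                    # Sort the pair so (a, b) and (b, a) count as the same edge.
                    edges[tuple(sorted((a, b)))] += 1
    return edges


example = {
    "alice": ["@bob hej!", "tack @bob och @carol"],
    "bob": ["@alice hej hej"],
}
edges = mention_edges(example)
```

Weighted pairs like these could then be loaded into a networkx graph, for instance with Graph.add_weighted_edges_from, and saved with pickle.dump.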
The second script, 1_pickle_to_communities.py, performs the actual community detection using Infomap.
NOTE! For this script to work, you must change the path to the Infomap executable in the code! Example call:
python 1_pickle_to_communities.py my_graph.pickle
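If you would rather not hard-code the executable path, one option is to read it from an environment variable. The sketch below is only an assumption about how that could look; it is not how the script currently works. The variable name INFOMAP_PATH, the helper names and the file names are hypothetical, and any Infomap options are left for you to fill in from the Infomap documentation.

```python
import os
import subprocess


def build_infomap_command(network_file, out_dir, executable=None, extra_args=()):
    """Build the argument list for a standalone Infomap run.

    Assumption: Infomap is called as "Infomap [options] network_file out_dir".
    INFOMAP_PATH is a hypothetical environment variable, not something the
    repository defines.
    """
    exe = executable or os.environ.get("INFOMAP_PATH", "Infomap")
    return [exe, *extra_args, network_file, out_dir]


def run_infomap(network_file, out_dir, **kwargs):
    """Run Infomap and raise CalledProcessError if it exits with an error."""
    cmd = build_infomap_command(network_file, out_dir, **kwargs)
    subprocess.run(cmd, check=True)


# Hypothetical paths, shown only to illustrate the resulting command.
cmd = build_infomap_command("my_graph.net", "my_graph_trees",
                            executable="/opt/infomap/Infomap")
```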
If you download the pickle file linked from the blog post, you should be able to do:
python 1_pickle_to_communities.py undirected_g_2015.pickle
directly. Note, however, that Python 3 is probably needed (I think this pickle format is not supported by Python 2).
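To check which pickle protocol a file uses, you can inspect its first two bytes: pickles written with protocol 2 or higher start with the byte 0x80 followed by the protocol number, and Python 2 can only read protocols 0 to 2. The helper below is a small sketch for that check; the name pickle_protocol is made up for this example.

```python
import pickle


def pickle_protocol(data):
    """Return the protocol number declared at the start of a pickle.

    Returns None for protocols 0 and 1, which carry no protocol header.
    """
    if len(data) >= 2 and data[0] == 0x80:
        return data[1]
    return None


# Demonstration on in-memory pickles written with different protocols.
p0 = pickle_protocol(pickle.dumps({"a": 1}, protocol=0))
p2 = pickle_protocol(pickle.dumps({"a": 1}, protocol=2))
p4 = pickle_protocol(pickle.dumps({"a": 1}, protocol=4))
```

For a file on disk, passing open(path, "rb").read(2) to the helper is enough. A result of 3 or higher means that Python 3 is required to load it.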
The third script, 2_content_analysis.py, calculates the most distinctive words for each of the largest communities (using TF-IDF) and gives some information on each cluster. Example call:
python 2_content_analysis.py tweet_dir my_graph_trees
... where "tweet_dir" is again the path to the directory of user tweet files, and "my_graph_trees" is a directory that has been generated by Infomap in the previous step and which contains the community decomposition of the graph in two separate files.
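To give an idea of how TF-IDF picks out distinctive words, here is a minimal plain-Python sketch that treats all tweets of a community as one document. It is a simplified stand-in and does not use gensim, so its weighting and normalization may differ from what 2_content_analysis.py computes. The function name distinctive_words and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def distinctive_words(docs, top_n=3):
    """Rank words per community by TF-IDF.

    docs is a hypothetical {community_id: [token, ...]} mapping.
    Returns {community_id: [(word, score), ...]}, highest score first.
    """
    n_docs = len(docs)
    # Document frequency: in how many communities does each word occur?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    result = {}
    for cid, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[cid] = ranked[:top_n]
    return result


communities = {
    "1": "fotboll match fotboll mål".split(),
    "2": "politik val debatt val".split(),
    "3": "musik konsert match".split(),
}
top_words = distinctive_words(communities)
```

A word that occurs in every community gets a score of zero, which is why common words drop out of the per-community lists.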