This is a user-friendly Python package for interfacing with large collections of tweets. Developed at the SMaPP lab at New York University.
Supports Python 2.7
Simplest: using pip:
pip install smapp-toolkit
To update to the latest version, if you have an older one installed:
pip install -U smapp-toolkit
Or download the source code using git:
git clone https://github.com/SMAPPNYU/smapp-toolkit
cd smapp-toolkit
python setup.py install
Or download the tarball and install.
The smapp-toolkit depends on the following packages, which will be automatically installed when installing smapp-toolkit:
- pymongo, the Python MongoDB driver
- smappPy, a utility library from SMaPP
- networkx, a library for building and analyzing graphs
- pandas, a Python data analysis library
from smapp_toolkit.twitter import MongoTweetCollection
collection = MongoTweetCollection(address='mongodb-address',
                                  port='mongodb-port',
                                  username='mongodb-user',
                                  password='mongodb-password',
                                  dbname='database-name')
from smapp_toolkit.twitter import BSONTweetCollection
collection = BSONTweetCollection("path/to/file.bson")
collection.containing('#bieber').count()
texts = collection.containing('#bieber').texts()
Apply a filter that adds labels to a BSONTweetCollection or MongoTweetCollection and writes the result to a BSON file:
collection.apply_labels(
    list_of_labels,
    list_of_fields,
    list_for_values,
    bsonoutputpath
)
The method applies a set of named labels, attaching them to objects from a collection when certain fields in the collection meet certain criteria.
collection.apply_labels(
    [['religious_rank', 'religious_rank', 'political_rank'], ['imam', 'cleric', 'politician']],
    ['user.screen_name', 'user.id'],
    [['Obama', 'Hillary'], ['1234567', '7654321']],
    'outputfolder/bsonoutput.bson'
)
NOTE: ['1234567', '7654321'] are not the actual IDs of any Twitter users; they are just dummy numbers.
list_of_labels is a list with two lists inside it, where the first list contains names for the labels and the second list contains the label values themselves. For example, religious_rank and imam together specify a label named religious_rank with the value imam.
Each field in the list_of_fields array is a string in dot notation. user.screen_name refers to the screen_name entry inside the user entry of the collection object. You can nest these for as many levels as you have in the collection object.
list_for_values is a list that contains as many lists as there are fields to match. Each of these lists (inside list_for_values) holds the values you would like that field to match. So if you want user.screen_name to match "obama", "hillary", or "lessig", then you would use:
list_of_fields = ['user.screen_name']
list_for_values = [['obama', 'hillary', 'lessig']]
as inputs.
bsonoutputpath is the path, relative to where you run the script, of the output file that will contain the newly labeled tweets.
After you run this method, each tweet object in your output BSON will have a field called "labels", like so:
{
    ...
    "labels": {
        "1": {"name": "religious_rank", "type": "cleric"},
        "2": {"name": "religious_rank", "type": "imam"},
        "3": {"name": "eye_color", "type": "brown"}
    }
    ...
}
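To inspect the labeled output, you can stream the tweets back out of the BSON file with the bson module bundled with pymongo (already a dependency). A minimal sketch, assuming the output path from the example above:

# Stream labeled tweets back out of the output BSON file and print their labels.
from bson import decode_file_iter

with open('outputfolder/bsonoutput.bson', 'rb') as f:
    for tweet in decode_file_iter(f):
        if 'labels' in tweet:
            print(tweet['labels'])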
collection.containing('#bieber', '#sexy')
collection.term_counts(['justin', 'miley'], count_by='days', plot=False)
Out[]:
{'2015-04-01': {'justin': 1312, 'miley': 837},
'2015-04-02': {'justin': 3287, 'miley': 932}}
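Because the returned value is a nested dict (day → term → count), it can be loaded straight into a pandas DataFrame for easier handling. A minimal sketch (pandas is already a dependency):

import pandas as pd

# Nested dict of day -> term -> count, as returned above.
counts = collection.term_counts(['justin', 'miley'], count_by='days', plot=False)
df = pd.DataFrame(counts).T  # one row per day, one column per term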
collection.containing('#bieber').sample(0.33).texts()
from datetime import datetime, timedelta
collection.since(datetime(2014,1,30)).count()
collection.since(datetime(2014,2,16)).until(datetime(2014,2,19)).containing('obama').texts()
collection.language('en').texts()
collection.language('ru', 'uk')
collection.user_lang_contains('de', 'fr')
collection.excluding_retweets().count()
collection.user_location_containing('new york', 'nyc')
collection.field_containing('user.description', 'python', 'data', 'analysis')
collection.geo_enabled()
collection.non_geo_enabled()
collection.sort('timestamp',-1)
collection.sort('timestamp',-1).limit(10).texts()
##### Top 10 hashtags on a given day
counts = collection.since(datetime(2015,1,1)).until(datetime(2015,1,2)).top_hashtags(n=10)
##### Top bigrams in the last hour
counts = collection.since(datetime.utcnow()-timedelta(hours=1)).top_bigrams(n=5)
##### Top URLs
counts = collection.top_urls(n=10)
##### Other top_x methods
top_unigrams()
top_trigrams()
top_images()
top_mentions()
top_links()
top_user_locations()
top_geolocation_names()
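These work like the examples above. A minimal sketch, assuming each accepts the same n= argument as top_hashtags():

# Top terms, mentioned users, and self-reported user locations.
top_words = collection.top_unigrams(n=20)
top_mentioned = collection.top_mentions(n=10)
top_locations = collection.top_user_locations(n=10)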
##### Multiple top_x methods in one go
The function top_entities(...) returns a dictionary object with a pandas.Series object for each top-entity list.
In []: col = BSONTweetCollection('/home/yablee/Projects/SMAPP/tmp/arabevents_sample.bson')
In []: top_things = col.top_entities(ngrams=(1,2,3))
In []: top_things['2-grams']
Out[]:
فيديو قوات 350
الطوارى السعودية 330
قوات الطوارى 305
#السعودية #saudi 266
#ksa #السعودية 244
قوات الطوارئ 236
الطوارئ السعودية 236
#saudi #الرياض 226
يقبضون على 185
السعودية يقبضون 185
dtype: int64
##### Writing top_x() results to a CSV file
All top_x() methods return pandas.DataFrame objects, which may be easily exported to a CSV file as follows:
hashtags = collection.top_hashtags(n=5)
hashtags.to_csv('/path/to/my/output.csv', encoding='utf8')
##### Top retweets
To get the top retweets for a certain collection, use the top_retweets() method. Specify which columns (of the original tweet) to include in the result by passing the rt_columns argument, as follows:
top_rts = collection.since(datetime.utcnow()-timedelta(hours=1)).top_retweets(n=10, rt_columns=['user.screen_name', 'user.location', 'created_at', 'text'])
The default columns included are ['user.screen_name', 'created_at', 'text'].
Use the collection.group_by(time_unit) method to group tweets by time slices. Supported time slices are days, hours, minutes, and seconds. Here's a basic example:
for time, tweets in collection.group_by('hours'):
    print("{time}: {count}".format(time=time, count=len(list(tweets))))
which outputs:
2015-01-12 17:00:00: 13275
2015-01-12 18:00:00: 23590
In []: col.since(datetime(2015,6,18,12)).until(datetime(2015,6,18,15)).group_by('hours').count()
Out[]:
count
2015-06-18 12:00:00 164181
2015-06-18 13:00:00 167129
2015-06-18 14:00:00 165057
The framework also supports top_x methods with results grouped by time slice. Example:
collection.since(datetime(2015,6,1)).group_by('days').top_user_locations(n=5)
# London London, UK Manchester Scotland UK
# 2015-06-1 4 2 1 1 2
# 2015-06-2 11 4 9 3 3
# 2015-06-3 14 1 5 NaN 4
# 2015-06-4 17 1 5 1 6
# 2015-06-5 10 3 3 3 3
In []: col.group_by('hours').entities_counts()
Out[]:
_total url image mention hashtag geo_enabled retweet
2015-01-12 17:00:00 13275 881 1428 6612 2001 10628 15
2015-01-12 18:00:00 23590 1668 2509 12091 3575 19019 36
In []: col.since(datetime.utcnow()-timedelta(minutes=10)).until(datetime.utcnow()).group_by('minutes').language_counts(langs=['en', 'es', 'other'])
Out[]:
en es other
2015-06-18 21:23:00 821 75 113
2015-06-18 21:24:00 2312 228 339
2015-06-18 21:25:00 2378 196 339
2015-06-18 21:26:00 2352 233 295
2015-06-18 21:27:00 2297 239 344
2015-06-18 21:28:00 1776 173 247
2015-06-18 21:29:00 1825 162 269
2015-06-18 21:30:00 2317 237 326
2015-06-18 21:31:00 2305 233 342
2015-06-18 21:32:00 2337 235 308
2015-06-18 21:33:00 1508 136 228
In []: from smapp_toolkit.twitter import BSONTweetCollection
In []: col = BSONTweetCollection('arabevents_sample.bson')
In []: unique_users = col.group_by('minutes').unique_users()
In []: tweets = col.group_by('minutes').count()
In []: unique_users['total tweets'] = tweets['count']
In []: unique_users
Out[]:
unique_users total tweets
2015-04-16 17:01:00 377 432
2015-04-16 17:02:00 432 582
2015-04-16 17:03:00 442 610
2015-04-16 17:04:00 393 531
2015-04-16 17:05:00 504 756
2015-04-16 17:06:00 264 365
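Since the grouped result is a pandas object (as the tabular output above suggests), it can be plotted directly with pandas' built-in plotting. A minimal sketch (requires matplotlib; see the plotting requirements below):

import matplotlib.pyplot as plt

# Line plot of unique users per minute.
unique_users['unique_users'].plot()
plt.show()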
The smapp_toolkit.plotting module has functions that make canned visualizations of the data generated by the functions above.
For more examples, see the examples folder and the gallery.
from smapp_toolkit.plotting import stacked_bar_plot
import matplotlib.pyplot as plt
data = col.since(datetime(2015,6,18,12)).until(datetime(2015,6,18,12,10)).group_by('minutes').entities_counts()
data['original tweet'] = data['_total'] - data['retweet']
plt.figure(figsize=(10,10))
stacked_bar_plot(data, ['retweet', 'original tweet'], x_tick_date_format='%H:%M', colors=['salmon', 'lightgrey'])
plt.title('Retweet proportion', fontsize=24)
plt.tight_layout()
data = col.since(datetime(2015,6,18,12)).until(datetime(2015,6,18,12,10)).group_by('minutes').top_user_locations()
stacked_bar_plot(data, ['London', 'New York'], x_tick_date_format='%H:%M')
plt.title('Tweets from London and New York users', fontsize=18)
plt.tight_layout()
See more examples in the gallery.
The following functions make plots by first getting data from the collection and then making the plots. Their use is discouraged, as getting the data can sometimes be slow. Always prefer to get the data and make the plots separately, saving the data first.
bins, counts = collection.containing('#sexy').tweets_over_time_figure(
    start_time,
    step_size=timedelta(minutes=1),
    num_steps=60,
    show=False)
plt.title('Tweets containing "#sexy"')
plt.show()
collection.term_counts(['justin', 'miley'], count_by='days', plot=True, plot_total=True)
plt.show()
collection.since(datetime(2015,6,1)).tweet_retweet_figure(group_by='days')
You may set group_by= to days, hours, minutes, or seconds.
collection.since(datetime(2015,6,1)).geocoded_tweets_figure()
collection.tweets_with_urls_figure()
collection.tweets_with_images_figure()
collection.tweets_with_mentions_figure()
collection.tweets_with_hashtags_figure()
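These draw with matplotlib under the hood (as the earlier plt.title/plt.show examples suggest), so the result can be saved or shown like any other plot. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Render one of the canned figures and save it to disk.
collection.tweets_with_hashtags_figure()
plt.savefig('hashtag_tweets.png')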
for tweet in collection.containing('#nyc'):
    print(tweet['text'])
Here are functions for exporting data from collections to different formats.
collection.dump_csv('my_tweets.csv')
This will dump a CSV with the following columns:
'id_str', 'user.screen_name', 'timestamp', 'text'
The desired columns may be specified in the columns= named argument:
collection.dump_csv('my_tweets.csv', columns=['id_str', 'user.screen_name', 'user.location', 'user.description', 'text'])
collection.dump_bson_topath('output.bson')
This will dump a BSON file of tweets. Once you have this BSON you can convert it to JSON format with the bsondump tool (if you have it), like so:
bsondump output.bson > output.json
The full list of available fields from a tweet may be found in the Twitter REST API documentation. In order to get nested fields (such as the user's location or the user's screen_name), use user.location, user.screen_name.
For geolocated tweets, in order to get the geolocation out in the CSV, add coordinates.coordinates to the columns list. This will put the coordinates, in GeoJSON (long, lat) order, in a single column.
Alternatively, add coordinates.coordinates.0 and coordinates.coordinates.1 to the columns list. This will add two columns containing the longitude and latitude, respectively.
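For example, a sketch combining this with the geo_enabled() filter from above (the output filename is illustrative):

# Dump geotagged tweets with separate longitude/latitude columns.
collection.geo_enabled().dump_csv(
    'geo_tweets.csv',
    columns=['id_str', 'user.screen_name', 'text',
             'coordinates.coordinates.0',   # longitude
             'coordinates.coordinates.1'])  # latitude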
If the filename specified ends with .gz, the output file will be gzipped. This typically takes about 1/3 as much space as an uncompressed file.
collection.dump_csv('my_tweets.csv.gz')
This will dump whole tweets in JSON format into a specified file, one tweet per line.
collection.dump_json("my_json.json")
Available options are:
- append=True, to append tweets in the collection to an existing file
- pretty=True, to write JSON into pretty, line-broken and properly indented format (this takes up much more space, so is not recommended for large collections)
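For example, to add the collection's tweets to the end of an existing file:

collection.dump_json("my_json.json", append=True)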
This will dump whole tweets in MongoDB's BSON format into a specified file. Note that BSON is a "binary" format (it will look a little funny if opened in a text editor). This is the native format for MongoDB's mongodump program. The file is NOT line-separated.
collection.dump_bson("my_bson.bson")
Available options are:
- append=True, to append BSON tweets to the given filename (if file already has tweets)
The toolkit supports exporting a retweet graph using the networkx library. In the exported graph, users are nodes and retweets are directed edges.
If the collection result includes non-retweets as well, users with no retweets will also appear in the graph as isolated nodes. Only retweets are edges in the resulting graph.
Exporting a retweet graph is done as follows:
import networkx as nx
digraph = collection.containing('#AnyoneButHillary').only_retweets().retweet_network()
nx.write_graphml(digraph, '/path/to/outputfile.graphml')
Nodes and edges have attributes attached to them, which are customizable using the user_metadata and tweet_metadata arguments. user_metadata is a list of fields from the User object that will be included as attributes of the nodes. tweet_metadata is a list of fields from the Tweet object that will be included as attributes of the edges.
The defaults are
user_metadata=['id_str', 'screen_name', 'location', 'description']
tweet_metadata=['id_str', 'retweeted_status.id_str', 'timestamp', 'text', 'lang']
For large graphs where the structure is interesting but the tweet text itself is not, it is advisable to omit most of the metadata. This will make the resulting file smaller, and is done as follows:
import networkx as nx
digraph = collection.containing('#AnyoneButHillary').only_retweets().retweet_network(user_metadata=['screen_name'], tweet_metadata=[''])
nx.write_graphml(digraph, '/path/to/outputfile.graphml')
The .graphml file may then be opened in graph analysis/visualization programs such as Gephi or Pajek. The networkx library also provides algorithms for visualization and analysis.
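For instance, a minimal sketch of using networkx to rank users by how often they are retweeted, assuming edges point from the retweeter to the retweeted user (the variable names are illustrative):

import networkx as nx

digraph = collection.only_retweets().retweet_network()
# Map each user node to its number of incoming retweet edges.
in_degrees = dict(digraph.in_degree())
# The ten most-retweeted users in the collection.
top_retweeted = sorted(in_degrees, key=in_degrees.get, reverse=True)[:10]
print(top_retweeted)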
Smapp-toolkit has some built-in plotting functionality. See the example scripts, and check out the gallery!
Currently implemented:
- barchart of tweets per time-unit (tweets_over_time_figure(...))
- barchart by language by day (languages_per_day_figure(...))
- line chart (tweets per day) with vertical event annotations (tweets_per_day_with_annotations_figure(...))
- geolocation names by time (geolocation_names_by_day_figure(...))
- user locations by time (user_locations_by_day_figure(...))
In order to get these to work, some extra packages (not automatically installed) need to be installed:
- matplotlib
- seaborn
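These can be installed with pip:
pip install matplotlib seaborn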
SMAPP stores tweets in MongoDB databases and splits the tweets across multiple MongoDB collections, because this gives better performance than a single large MongoDB collection. The MongoDB database needs to have a smapp_metadata collection with a single smapp-tweet-collection-metadata document in it, which specifies the names of the tweet collections.
The smapp-tweet-collection-metadata document has the following form:
{
    "document": "smapp-tweet-collection-metadata",
    "tweet_collections": [
        "tweets_1",
        "tweets_2",
        "tweets_3"
    ]
}
The MongoTweetCollection object may still be used if the metadata collection and document have different names:
collection = MongoTweetCollection(..., metadata_collection='smapp_metadata', metadata_document='smapp-tweet-collection-metadata')
All you need to do is insert the following collection and document into your MongoDB database:
(from the mongo shell)
db.smapp_metadata.save({
"document": "smapp-tweet-collection-metadata",
"tweet_collections": [ "tweets" ]
})
and the default behavior will work as advertised.
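The same document can also be inserted from Python with pymongo (a sketch mirroring the shell command above; the connection details are placeholders):

from pymongo import MongoClient

# Connect to the database that holds the tweet collections.
db = MongoClient('mongodb-address', 27017)['database-name']
# Insert the metadata document the toolkit expects by default.
db.smapp_metadata.save({
    "document": "smapp-tweet-collection-metadata",
    "tweet_collections": ["tweets"]
})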
Code and documentation © 2014 New York University. Released under the GPLv2 license.