Basic utilities for processing Tweets. Includes:
- TweetTokenizer for tokenizing the Tweet content
- TweetReader for easily iterating over Tweets
- TweetWriter for conveniently writing one or more Tweets to a file in JSONlines format
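There are two options for installing littlebird: clone the repository and install it in development mode, or install it directly from GitHub with pip.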
git clone https://github.com/AADeLucia/littlebird.git
cd littlebird
python setup.py develop
pip install git+https://github.com/AADeLucia/littlebird.git#egg=littlebird
The example below reads in a Tweet file, filters to Tweets that have a hashtag, and writes out to a new file.
TweetWriter can write a single Tweet or a list of Tweets to a file in JSONlines format. It will also automatically open a GZIP file if the provided filename has a .gz extension. If you are writing to a GZIP file, it is recommended to write all Tweets at once instead of writing incrementally; this provides better file compression. If you do need to write incrementally, I recommend writing to a normal file and GZIPping after.
from littlebird import TweetReader, TweetWriter

# File in JSONlines form. Automatically handles GZIP files.
tweet_file = "2014_01_02.json.gz"
reader = TweetReader(tweet_file)

# Iterate over Tweets
# Save Tweets that contain hashtags
filtered_tweets = []
for tweet in reader.read_tweets():
    if tweet.get("truncated", False):
        num_hashtags = len(tweet["extended_tweet"]["entities"]["hashtags"])
    else:
        num_hashtags = len(tweet["entities"]["hashtags"])
    if num_hashtags > 0:
        filtered_tweets.append(tweet)

# Write out filtered Tweets
writer = TweetWriter("filtered.json")
writer.write(filtered_tweets)
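For readers unfamiliar with the JSONlines format, here is a minimal standard-library sketch of what "write all Tweets at once" to a .gz file can look like. This is not littlebird's implementation; the helper names write_jsonlines and read_jsonlines are hypothetical and only illustrate the idea of one JSON object per line with the opener chosen from the file extension.

```python
import gzip
import json
import os
import tempfile


def write_jsonlines(tweets, path):
    """Write a list of Tweet dicts to `path`, one JSON object per line.

    Hypothetical helper: uses gzip when the filename ends in ".gz".
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def read_jsonlines(path):
    """Read a JSONlines file (optionally gzipped) back into a list of dicts."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "filtered.json.gz")
    sample = [{"id": 1, "text": "hello #world"}, {"id": 2, "text": "no tags"}]
    write_jsonlines(sample, out_path)  # single pass, one gzip stream
    round_trip = read_jsonlines(out_path)
```

Writing everything in one pass keeps the output as a single gzip stream, which is the situation the compression advice above is about.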
A basic example using the default Tokenizer settings is below.
from littlebird import TweetReader, TweetTokenizer

# File in JSONlines form. Automatically handles GZIP files.
tweet_file = "2014_01_02.json.gz"
reader = TweetReader(tweet_file)
tokenizer = TweetTokenizer()

# Iterate over Tweets
# Make sure to check for the "truncated" field otherwise you will only access the
# 140 character Tweet, not the full 280 character message
for tweet in reader.read_tweets():
    if tweet.get("truncated", False):
        text = tweet["extended_tweet"]["full_text"]
    else:
        text = tweet["text"]

    # Tokenize the Tweet's text
    tokens = tokenizer.tokenize(text)
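Both examples repeat the same "truncated" check, so it can be convenient to wrap it in small helpers. The sketch below is only an illustration on plain Tweet dicts; get_full_text and get_hashtags are hypothetical names, not part of littlebird, and they assume the same field layout used in the examples above.

```python
def get_full_text(tweet):
    """Return the full Tweet text, preferring the extended form when truncated."""
    if tweet.get("truncated", False):
        return tweet["extended_tweet"]["full_text"]
    return tweet["text"]


def get_hashtags(tweet):
    """Return the list of hashtag entities, using the extended form when truncated."""
    if tweet.get("truncated", False):
        return tweet["extended_tweet"]["entities"]["hashtags"]
    return tweet["entities"]["hashtags"]


# Hand-written sample Tweets, made up for illustration
short_tweet = {
    "text": "hello #world",
    "entities": {"hashtags": [{"text": "world"}]},
}
long_tweet = {
    "truncated": True,
    "text": "this text was cut off...",
    "entities": {"hashtags": []},
    "extended_tweet": {
        "full_text": "this text was cut off but the full version has a #hashtag",
        "entities": {"hashtags": [{"text": "hashtag"}]},
    },
}

texts = [get_full_text(t) for t in (short_tweet, long_tweet)]
hashtag_counts = [len(get_hashtags(t)) for t in (short_tweet, long_tweet)]
```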
Available TweetTokenizer settings:
- language: right now it only really supports English, but as long as you change the token_pattern accordingly, it should work with other languages. A future integration is using Moses for Arabic tokenization.
- token_pattern: Pattern to match for acceptable tokens. Default is r"\b\w+\b"
- stopwords: provide a list of stopwords to remove from the text. Default is None for no stopword removal.
- remove_hashtags: Default is False to keep hashtags in the text (only strips the "#" symbol)
- lowercase: Default is True to lowercase all of the text. Change this to False if you are doing case-sensitive tasks like Named Entity Recognition (NER)
The tokenizer works in the following steps:
- Remove hashtags (optional)
- Remove URLs, handles, and "RT"
- Lowercase the text (optional)
- Find all tokens that match the token_pattern with regex.findall(token_pattern, text)
- Remove stopwords (optional)
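The steps above can be sketched roughly as follows. This is my own simplified illustration, not the package's actual code: the function name sketch_tokenize and the URL, handle, and "RT" patterns are assumptions, and it uses the standard re module so it runs without extra dependencies (which works here because the default pattern does not need \p{...} classes).

```python
import re

# Hypothetical cleanup patterns, chosen for illustration only
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b")
HASHTAG_RE = re.compile(r"#\w+")


def sketch_tokenize(text, token_pattern=r"\b\w+\b", stopwords=None,
                    remove_hashtags=False, lowercase=True):
    # 1. Remove hashtags entirely, or only strip the "#" symbol
    if remove_hashtags:
        text = HASHTAG_RE.sub(" ", text)
    else:
        text = text.replace("#", "")
    # 2. Remove URLs, handles, and "RT"
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    # 3. Lowercase the text (optional)
    if lowercase:
        text = text.lower()
    # 4. Find all tokens that match the token pattern
    tokens = re.findall(token_pattern, text)
    # 5. Remove stopwords (optional)
    if stopwords:
        stop = set(stopwords)
        tokens = [t for t in tokens if t not in stop]
    return tokens


example = "RT @user: Check out #NLP at https://example.com now"
print(sketch_tokenize(example))
# ['check', 'out', 'nlp', 'at', 'now']
print(sketch_tokenize(example, stopwords=["at"], remove_hashtags=True))
# ['check', 'out', 'now']
```

Note that "RT" is removed before lowercasing, so the case-sensitive pattern still finds it.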
The token pattern is extremely important to set for your use case. Below are some sample token patterns, but I highly recommend refreshing on your regular expressions if you need something more advanced.
Note: the regex package is used (rather than the standard re module, which does not support them) to access Unicode character classes like \p{L}. These are basically Java regex patterns.
- r"\b\w+\b" matches any token with one or more letters, numbers, and underscores. This is roughly equivalent to "[\p{L}\_\p{N}]+"
- r"\b\p{L}+\b" matches any token with one or more "letters" (across all alphabets).
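To see how the two sample patterns differ in practice, here is a small comparison. It tries the regex package first; if that is not installed it falls back to the standard re module with r"\b[^\W\d_]+\b", which is my own approximation of "letters only" and not something the package uses. The sample text is made up.

```python
try:
    import regex as re_mod  # third-party package; supports \p{L}
    LETTERS_ONLY = r"\b\p{L}+\b"
except ImportError:
    import re as re_mod  # standard library; has no \p{L}
    LETTERS_ONLY = r"\b[^\W\d_]+\b"  # approximation: word chars minus digits and "_"

WORD_CHARS = r"\b\w+\b"

text = "hello привет user_1 2020"

word_tokens = re_mod.findall(WORD_CHARS, text)
letter_tokens = re_mod.findall(LETTERS_ONLY, text)

print(word_tokens)    # ['hello', 'привет', 'user_1', '2020']
print(letter_tokens)  # ['hello', 'привет']
```

The letters-only pattern drops "user_1" entirely instead of returning "user", because there is no word boundary between "r" and "_". If you want the letter runs inside such tokens, remove the \b anchors from the pattern.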
This package is a work in progress. Feel free to open an issue for any problems you run into or to recommend features. I started this package as an in-between: something lighter than Twokenizer but more customizable than NLTK.
@misc{DeLucia2020,
author = {Alexandra DeLucia},
title = {Little Bird},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/aadelucia/littlebird}},
}
Copyright (c) 2020 Alexandra DeLucia