/twitterknife

Primary LanguagePythonMIT LicenseMIT

Twitterknife

Simple helper functions for tweet preprocessing.
We support processing the tweet jsonl files extracted from client libraries such as twarc.

Features

We currently support:

  • tweet parsing
  • cleaning the text (strip accents, user handles and urls normalization, etc.)
  • frequent word sets mining (through FPGrowth from mlxtend)
  • association rules mininig
  • topic detection (through CTM)

Installation

pip install git+https://github.com/g8a9/twitterknife.git

Getting Started

Preprocessing

import twitterknife.twitterknife as tkf

# 1. parse the raw jsonl file
tweets = tkf.parse_jsonl("tweets.json")

# 2. extract base information from the tweet structure
tweet_info = tkf.get_base_info(tweets)

# 3. remove tweets we don't have data for
tweet_info = [t for t in tweet_info if t["has_data"]]

# 4. clean texts
proc_texts = tkf.clean_texts((t["tweet_text"] for t in tweet_info))

Text Mining

Frequent Word Sets and Association Rules Mining.

with open("stopwords.txt") as fp:
    stopwords = [l.strip() for l in fp.readlines()]

# clean texts first
texts = tkf.clean_texts(raw_texts, strip_user_handles=False, strip_punctuation=True)

# FP Growth Mining
frequent_word_sets = tkf.frequent_word_sets(
    texts,
    stopwords=stopwords,
    fpgrowth_args={"min_support": 0.005, "max_len": 10}
)

# Association Rules Mining
ass_rules = tkf.association_rules_mining(
    frequent_word_sets, metric="confidence", min_threshold=0.4
)

Topic Discovery

kt = tkf.find_topics(texts, n_topics=5, stopwords=stopwords)

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.