Elegant and Easy Tweet Preprocessing in Python

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Preprocessor

Preprocessor is a preprocessing library for tweet data written in Python.

When building machine learning systems based on tweet data, a preprocessing step is required. This library makes it easy to clean, parse, or tokenize tweets.

Features

Currently supports cleaning, tokenizing and parsing:

  • URLs
  • Hashtags
  • Mentions
  • Reserved words (RT, FAV)
  • Emojis
  • Smileys

Supports Python 2.7 and 3.3+

Usage

Basic cleaning:
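
As a rough illustration of what cleaning does, the sketch below strips URLs, mentions and hashtags from a tweet using only Python's standard `re` module. It is not Preprocessor's implementation: the function name `clean_sketch` and the regular expressions are assumptions made for this example, and the library's own rules may differ.

```python
import re

# Illustrative patterns only (assumed for this sketch, not taken from the library).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_sketch(text):
    """Remove URLs, mentions and hashtags, then collapse leftover whitespace."""
    # URLs go first so that a '#' or '@' inside a URL is removed with it.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


print(clean_sketch("Preprocessor is #awesome https://example.com thanks @user"))
```

Only the plain words of the tweet remain after the call.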

Tokenizing:
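
Tokenizing can be pictured as replacing each entity with a placeholder instead of deleting it, so that a downstream model still sees that a URL or hashtag was present. The following is a minimal standard-library sketch of that idea; the name `tokenize_sketch`, the patterns, and the placeholder strings are assumptions for illustration rather than the library's actual behaviour.

```python
import re

# Hypothetical (pattern, placeholder) pairs, applied in order.
TOKEN_RULES = [
    (re.compile(r"https?://\S+"), "$URL$"),
    (re.compile(r"@\w+"), "$MENTION$"),
    (re.compile(r"#\w+"), "$HASHTAG$"),
]


def tokenize_sketch(text):
    """Replace each matched entity with its placeholder token."""
    for pattern, placeholder in TOKEN_RULES:
        # '$' has no special meaning in Python replacement strings.
        text = pattern.sub(placeholder, text)
    return text


print(tokenize_sketch("Preprocessor is #awesome https://example.com"))
```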

Parsing:
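
Parsing, as opposed to cleaning, keeps the tweet intact and reports where each entity occurs. A minimal sketch of that idea with the standard library is shown here; `parse_sketch`, the `Match` tuple, and the patterns are hypothetical names chosen for this example and do not describe the objects Preprocessor itself returns.

```python
import re
from collections import namedtuple

# Hypothetical result record: character span plus the matched text.
Match = namedtuple("Match", ["start", "end", "text"])

# Illustrative patterns only (assumed for this sketch).
PATTERNS = {
    "urls": re.compile(r"https?://\S+"),
    "mentions": re.compile(r"@\w+"),
    "hashtags": re.compile(r"#\w+"),
}


def parse_sketch(text):
    """Return a dict mapping each entity type to a list of Match records."""
    return {
        name: [Match(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
        for name, pattern in PATTERNS.items()
    }


parsed = parse_sketch("Preprocessor is #awesome https://example.com")
for name, matches in parsed.items():
    print(name, matches)
```

Each record carries the start and end offsets, so the original text can be sliced or annotated without being modified.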

Fully customizable:

By default, Preprocessor applies all of the available options. If you specify one or more options, only those are applied.

Available Options:

Option Name       Option Short Code
URL               p.OPT.URL
Mention           p.OPT.MENTION
Hashtag           p.OPT.HASHTAG
Reserved Words    p.OPT.RESERVED
Emoji             p.OPT.EMOJI
Smiley            p.OPT.SMILEY
Number            p.OPT.NUMBER
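
The "everything by default, only the chosen options otherwise" behaviour can be sketched as follows. This is a standard-library illustration under stated assumptions, not the library's code: the string keys merely stand in for the `p.OPT.*` codes listed above, only four of the seven options are modelled, and `clean_with_options` and its patterns are hypothetical.

```python
import re

# Hypothetical rules, kept in a list so they are always applied in this order.
RULES = [
    ("URL", re.compile(r"https?://\S+")),
    ("MENTION", re.compile(r"@\w+")),
    ("HASHTAG", re.compile(r"#\w+")),
    ("NUMBER", re.compile(r"\b\d+\b")),
]


def clean_with_options(text, *options):
    """Strip the selected entity types; with no options, strip all of them."""
    active = set(options) if options else {name for name, _ in RULES}
    for name, pattern in RULES:
        if name in active:
            text = pattern.sub("", text)
    return " ".join(text.split())


tweet = "Top 5 #tips https://example.com from @user"
print(clean_with_options(tweet))          # all rules applied
print(clean_with_options(tweet, "URL"))   # only URLs removed
```

Passing no options falls back to the full rule set, which mirrors the default described above.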

Installation

Using pip: