text_preprocessing


This is a project to practice some of the things we've learned about git, environment management, makefiles, templates and Python during the 2020 ML in Prod training.

We will build a library (package) to perform a different type of "embedding" (conversion from text to arrays) and integrate it into the capstone project. Along the way, we will practice with many of the tools we have been learning.

  1. Create a text_preprocessing package using an existing cookiecutter template or manually, with the following structure and empty files (one way to create it from the shell is sketched below):
text_preprocessing/
... text_preprocessing/
....... __init__.py
... tests/
....... .gitkeep
... setup.py
... README.md
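If you are not using a cookiecutter template, a minimal sketch of creating this skeleton from the shell (assuming a Unix-like environment):

mkdir -p text_preprocessing/text_preprocessing text_preprocessing/tests
touch text_preprocessing/text_preprocessing/__init__.py
touch text_preprocessing/tests/.gitkeep
touch text_preprocessing/setup.py text_preprocessing/README.md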
  2. Go to the new package folder and create a git repository:
cd text_preprocessing
git init .
git add .
git status
# check everything is there :)
git commit -m "Initial skeleton."
  3. Go to GitHub and create a repository named text_preprocessing (WARNING: do not create a README.md or .gitignore file, so the first push does not conflict)

  4. Add the remote to your local git repository and push:

git remote add origin https://github.com/atibaup/text_preprocessing.git
git push -u origin master

  5. Let's set up a conda environment for this project:

conda create -n text_preprocessing python=3.7
conda activate text_preprocessing

  6. We are now ready to start coding. We will create a bow_embed function in a text_preprocessing/embeddings.py module:
import numpy as np
import hashlib  # will be used by the HashEmbedding bonus exercise

OTHER_TOKEN = "__OTHER__"
EMPTY_TOKEN = "__EMPTY__"
VOCABULARY = ['python', 'java', 'sql', 'delphi', 'c++', OTHER_TOKEN, EMPTY_TOKEN]

def tokenize(doc):
    return doc.split()

def bow_embed(documents):
    """Embed each document as a binary bag-of-words vector over VOCABULARY."""
    embedded_docs = []
    for document in documents:
        tokenized_doc = tokenize(document)
        embedded_doc = np.zeros(len(VOCABULARY))
        for token in tokenized_doc:
            if token in VOCABULARY:
                embedded_doc[VOCABULARY.index(token)] = 1
            else:
                # tokens outside the vocabulary are mapped to the OTHER_TOKEN slot
                embedded_doc[VOCABULARY.index(OTHER_TOKEN)] = 1
        if np.all(embedded_doc == 0.0):
            # empty documents get their own EMPTY_TOKEN slot
            embedded_doc[VOCABULARY.index(EMPTY_TOKEN)] = 1
        embedded_docs.append(embedded_doc)
    return np.array(embedded_docs)
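For reference, a quick check of what bow_embed returns once the package is importable (see the setup.py step below); the columns follow the order of VOCABULARY:

>>> from text_preprocessing.embeddings import bow_embed
>>> bow_embed(["python and sql", ""])
array([[1., 0., 1., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1.]])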
  7. We will now add unit tests (e.g. in tests/test_embeddings.py) to make sure this function works as expected:
import unittest

from text_preprocessing.embeddings import bow_embed

class TestEmbedding(unittest.TestCase):
    def test_bow_embed_on_empty_texts(self):
        pass

    def test_bow_embed_on_single_words(self):
        pass

if __name__ == '__main__':
    unittest.main()

Advanced 1: use the parameterized package so that a single test method can cover many inputs.
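A minimal sketch of what a parameterized version of the single-word test could look like, assuming the parameterized package is installed (pip install parameterized); the expected indices follow the order of VOCABULARY:

import unittest

from parameterized import parameterized

from text_preprocessing.embeddings import bow_embed

class TestBowEmbedParameterized(unittest.TestCase):
    @parameterized.expand([
        ("known_word", "python", 0),    # 'python' sits at index 0 of VOCABULARY
        ("unknown_word", "haskell", 5), # out-of-vocabulary words map to OTHER_TOKEN (index 5)
    ])
    def test_bow_embed_on_single_words(self, name, word, expected_index):
        embedded = bow_embed([word])
        self.assertEqual(embedded[0, expected_index], 1)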

  8. Fill in setup.py so the package can be installed with pip:

# contents of setup.py
from setuptools import setup

setup(
    name='text_preprocessing',
    version='0.0.1',
    packages=['text_preprocessing'],
    install_requires=['numpy>1.16,<2.0']
)
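With setup.py in place, the package can be installed in editable mode into the active conda environment so that the tests can import it (assuming the tests live in tests/test_embeddings.py):

pip install -e .
python -m unittest discover tests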
  9. When we are happy and our tests pass, we will commit the changes
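For example (the commit message is only illustrative):

git add -A
git commit -m "Add bow_embed and unit tests."
git push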

  10. We will now add our package as a dependency of the train package in the capstone project

  11. We will make the changes in train.py so that it uses and tests our new bow_embed

From the CLI, a dependency can be installed directly from a git repository, for example installing Django at a specific commit with the argon2 extra:

pip install git+https://github.com/django/django.git@45dfb3641aa4d9828a7c5448d11aa67c7cbd7966#egg=django[argon2]
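For our package, the equivalent would look something like this (pointing at master rather than pinning a commit):

pip install git+https://github.com/atibaup/text_preprocessing.git#egg=text_preprocessing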

In a conda environment, the equivalent is to add the dependency under a pip section in the environment yml:
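A sketch of the relevant fragment of the environment yml (the rest of the file is omitted):

dependencies:
  - pip
  - pip:
    - git+https://github.com/atibaup/text_preprocessing.git#egg=text_preprocessing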

[Bonus]

  1. Our bow_embed model was a little silly, because it had a tiny vocabulary, hence our accuracy was not very good. We are now going to improve it by implementing a HashEmbedding, which this time will need to be a class because it holds state:

class HashEmbedding:
    def __init__(self, dim):
        self.dim = dim

    def embed(self, texts):
        # embed_one is left for you to implement; a sketch follows below
        return [self.embed_one(text) for text in texts]
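A minimal sketch of one possible implementation, assuming we hash each token into one of dim buckets with hashlib (this is just one way to fill in embed_one, not the required solution):

import hashlib

import numpy as np

class HashEmbedding:
    def __init__(self, dim):
        self.dim = dim

    def embed(self, texts):
        return np.array([self.embed_one(text) for text in texts])

    def embed_one(self, text):
        embedded = np.zeros(self.dim)
        for token in text.split():
            # md5 gives a hash that is stable across runs, unlike Python's built-in hash()
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % self.dim
            embedded[bucket] = 1
        return embedded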

  2. Let's add