Kardiasclean

Clean, Normalize, and Tokenize medical records data.

Install

pip install kardiasclean

Temporarily removed from PyPI; install from git instead.
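A minimal example, assuming <repository-url> stands in for this repo's clone URL (substitute the real one):

pip install git+<repository-url>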

Usage

import kardiasclean

# data is a pandas DataFrame with a raw free-text 'procedure' column.
data['procedure'] = kardiasclean.split_string(data['procedure'], delimiter="+")
spread_df = kardiasclean.spread_column(data['procedure'])

spread_df['procedure'] = kardiasclean.clean_accents(spread_df['procedure'])
spread_df['procedure'] = kardiasclean.clean_symbols(spread_df['procedure'])
spread_df['keywords'] = kardiasclean.clean_stopwords(spread_df['procedure'])
spread_df['token'] = kardiasclean.clean_tokenize(spread_df['keywords'])

list_df = kardiasclean.create_unique_list(spread_df, spread_df['token'])
list_df = list_df.drop(["patient_id", "index"], axis=1)

spread_df['procedure'] = kardiasclean.normalize_from_tokens(spread_df['token'], list_df['token'], list_df['procedure'])

   patient_id                 procedure               keywords      token
0           0  Reparacion de CIA parche  cia parche reparacion  SPRXRPRSN
1           1  Reparacion de CIA parche  cia parche reparacion  SPRXRPRSN
2           2  Reparacion de CIA parche  cia parche reparacion  SPRXRPRSN
3           3  Reparacion de CIA parche  cia parche reparacion  SPRXRPRSN
4           4  Reparacion de CIA parche  cia parche reparacion  SPRXRPRSN
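Note how every spelling variant of the same procedure collapses to one compact token (here SPRXRPRSN); that shared token is what normalize_from_tokens uses to map each row back to a single canonical procedure string.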

How does it work?

This package contains ETL functions for extracting all the unique natural-language medical terms from a pandas DataFrame. The steps are much like any other ETL process for building a bag of words/terms, but the package also includes methods for normalizing the original column via "fuzzy string matching", for preparing new DataFrames for loading into a SQL database, and for ML pre-processing such as binning low-frequency records and encoding categorical data.
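As a rough illustration of the fuzzy-matching idea (not kardiasclean's actual implementation), near-duplicate spellings can be collapsed onto a canonical term by similarity ratio. The normalize_term helper, the 0.85 threshold, and the example strings below are assumptions made for the sketch:

from difflib import SequenceMatcher

# Illustrative sketch only (not kardiasclean's implementation):
# collapse a messy term onto the closest canonical term whenever
# the similarity ratio clears the threshold; otherwise keep it as-is.
def normalize_term(term, canonical_terms, threshold=0.85):
    best, best_score = term, threshold
    for candidate in canonical_terms:
        score = SequenceMatcher(None, term.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best

print(normalize_term("reparacion cia parhce", ["reparacion de cia parche"]))
# reparacion de cia parche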

Development

poetry run pytest
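If you are working from a fresh clone (and assuming Poetry is already installed), set up the environment first:

poetry install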

Changelog

  • 0.3.3: Separated Tokenizer methods to allow steps in between clean and normalize.

  • 0.3.2: Created Tokenizer class for ETL.

  • 0.3.1: Updated dependencies.

  • 0.3.0: Removed SQLAlchemy dependency.

  • 0.2.1: Replaced psycopg2 dependency with psycopg2-binary.

  • 0.2.0: Fixed perform_binning implementations, new API for all functions.

  • 0.1.7: Added support for not appending column name to matrix encoding.

  • 0.1.6: Small fixes to stopwords, updated README.

  • 0.1.5: Fixed stopwords implementation, added lowercase conversion.

  • 0.1.3: Added documentation.

  • 0.1.2: Added SQL support and improved pre-processing functions.

  • 0.1.1: Small README fixes.

  • 0.1.0: Initial release.