This repository contains various tools for handling Natural Language Processing (NLP) tasks.

Natural Language Processing Toolbox

This repository contains various Python scripts for common Natural Language Processing (NLP) tasks, including classification, named entity recognition and text generation. Most of the scripts are run-of-the-mill and are included here simply because I needed them at my job. Some, however, contain original solutions, and you are most welcome to use them. I intend to extend this repository further as I continue tackling various NLP challenges.

  • DATA_AUGMENTATION - scripts for data augmentation, mostly using Selenium and in some cases generating data algorithmically, e.g. artificial CoLA

  • NER - scripts for training NER models and running them in inference mode, with an extra feature: visualization of the displaCy output in HTML format

  • TEXT_CLASSIFICATION - similarly, scripts for training binary text classifiers, including on unbalanced data and in the form of adapters

  • NLP_tools - miscellaneous custom NLP tools, including cosine similarity, embedding extraction, some tools for working with embeddings, and tools based on pymorphy2 and spaCy

  • MASKED_LANGUAGE_MODELLING - at the moment, only a script for quick MLM model training, nothing fancy

  • CAUSAL_LANGUAGE_MODELLING - same as above, but with some inference-mode code and instructions

  • TOPIC_MODELLING - in this folder you can find a class that generates a dataset in Vowpal Wabbit format, intended for use with the unsupervised BigARTM model
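
To give a rough idea of what two of the simpler pieces listed above do, here is a minimal standard-library sketch of cosine similarity (as in NLP_tools) and of writing a tokenized document as a Vowpal Wabbit style bag-of-words line (as in TOPIC_MODELLING). The function names `cosine_similarity` and `to_vowpal_wabbit_line`, the `text` modality label and the handling of zero vectors are assumptions made for illustration only; they are not the code or the API of the scripts in this repository.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat similarity with a zero vector as 0.0
        return 0.0
    return dot / (norm_a * norm_b)


def to_vowpal_wabbit_line(doc_id, tokens, modality="text"):
    """Format one document as 'doc_id |modality token token:count ...'."""
    # ':' and '|' are reserved characters in the format, so they are
    # replaced inside tokens before counting (hypothetical choice of '_').
    counts = Counter(t.replace(":", "_").replace("|", "_") for t in tokens)
    features = [
        token if count == 1 else f"{token}:{count}"
        for token, count in sorted(counts.items())
    ]
    return " ".join([str(doc_id), f"|{modality}", *features])


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
    print(to_vowpal_wabbit_line("doc1", ["cat", "dog", "cat"]))
```

Tokens that occur once are written without a count, and repeated tokens get a `:count` suffix, which keeps the lines compact for large corpora.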