Text Similarity with Contextualised Embeddings

This repository contains a library that I use for my natural language processing projects.

All the code in the library is based on PyTorch.

Most of the models in the library are built upon pretrained models from the sentence-transformers library, which offers a wide variety of high-performing sentence embedding models and is in turn based on the popular transformers library by Hugging Face.

✨ Contents

Scripts to train and test word-level and sentence-level embedding models on various NLP tasks
Wrappers around Hugging Face pretrained models to perform experiments on text similarity tasks
A semantic search pipeline built on top of high-performing sentence embedding models and approximate nearest neighbour algorithms
A model compression pipeline that includes functions to distill, prune, quantize and convert models to optimized formats such as ONNX, TensorFlow Lite and TorchScript for use on edge devices
Scripts to train models on a variety of text similarity and sequence classification tasks
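
As a rough illustration of the idea behind the semantic search pipeline, the sketch below ranks a small corpus by cosine similarity to a query vector. It is not this library's API: the names `cosine_similarity` and `semantic_search` and the three-dimensional toy vectors are made up for the example. In practice the vectors would come from a sentence embedding model, and an approximate nearest neighbour index would replace the exhaustive scan shown here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs for the top_k most similar vectors.

    This is an exact, brute-force scan over the whole corpus.
    """
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: hand-written vectors standing in for real embeddings.
corpus = [
    "a cat sits on the mat",
    "stock markets fell today",
    "a kitten rests on a rug",
]
corpus_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of a cat-related query

hits = semantic_search(query_vec, corpus_vecs, top_k=2)
for idx, score in hits:
    print(f"{score:.3f}  {corpus[idx]}")
```

The exhaustive scan costs one similarity computation per corpus entry, which is fine for small collections. The approximate nearest neighbour algorithms mentioned above trade a little recall for much faster lookups on large ones.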

Work in Progress

Creation of sense-aware embeddings that exploit WordNet relations and contextualised embeddings
PySpark integration for faster text preprocessing on larger datasets

Author

Mirco Cardinale Personal website

🔖 LICENCE

Apache-2.0
