The final project consists of a Streamlit UI, a FastAPI backend with PostgreSQL and Faiss indexes, and DistilUsev1 (trained with ContrastiveCE) served from a separate embedder module. The embedder module reaches 25 RPS at peak on a 13th Gen Intel(R) Core(TM) i9-13900HX CPU.
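A minimal sketch of what the embedder module could look like, assuming a single `/encode` route wrapping the SentenceTransformer checkpoint (the route name, payload schema, and weights path are illustrative, not the module's actual code):

```python
# Minimal embedder service sketch; route name, payload, and path are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("embedder/weights/ce_model")  # path is an assumption

class EncodeRequest(BaseModel):
    texts: list[str]

@app.post("/encode")
def encode(req: EncodeRequest):
    # encode() returns a (len(texts), dim) float array; convert for JSON
    embeddings = model.encode(req.texts)
    return {"embeddings": embeddings.tolist()}
```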
We tried four popular sentence transformers raw (without fine-tuning on our data):
- DistilUsev1
- DistilUsev2
- mpnet
- MiniLM

We concluded that DistilUsev1, even though it was not trained on our data, matched Doc2Vec in quality, so it was chosen as the base model.
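A rough sketch of how such a comparison can be run with the stock Hugging Face checkpoints; the resume/vacancy pair below is made up for illustration:

```python
# Compare raw (not fine-tuned) sentence transformers by the cosine similarity
# they assign to a matched resume/vacancy pair.
from sentence_transformers import SentenceTransformer, util

MODELS = [
    "distiluse-base-multilingual-cased-v1",   # DistilUsev1
    "distiluse-base-multilingual-cased-v2",   # DistilUsev2
    "paraphrase-multilingual-mpnet-base-v2",  # mpnet
    "paraphrase-multilingual-MiniLM-L12-v2",  # MiniLM
]

resume = "Python developer, 5 years of backend experience with FastAPI and PostgreSQL"
vacancy = "We are looking for a backend Python engineer (FastAPI, PostgreSQL)"

for name in MODELS:
    model = SentenceTransformer(name)
    emb = model.encode([resume, vacancy], convert_to_tensor=True)
    print(f"{name}: cos_sim = {util.cos_sim(emb[0], emb[1]).item():.3f}")
```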
The service API is also available for searching matching vacancies and resumes, using Faiss to store the essential embeddings and PostgreSQL for everything else.
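A sketch of that split, assuming an inner-product Faiss index over L2-normalized vectors and integer ids shared with the PostgreSQL rows (the index type and helper names here are assumptions):

```python
# Vectors live in a Faiss index keyed by the same integer id as the
# PostgreSQL row that holds the full record.
import faiss
import numpy as np

DIM = 512  # DistilUse output dimension
index = faiss.IndexIDMap(faiss.IndexFlatIP(DIM))  # inner product == cosine on normalized vectors

def add_vacancy(vac_id: int, embedding: np.ndarray) -> None:
    vec = embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(vec)
    index.add_with_ids(vec, np.array([vac_id], dtype="int64"))
    # the textual fields for vac_id are stored in PostgreSQL

def search_vacancies(query_emb: np.ndarray, k: int = 5) -> list[int]:
    vec = query_emb.astype("float32").reshape(1, -1)
    faiss.normalize_L2(vec)
    _, ids = index.search(vec, k)
    return ids[0].tolist()  # fetch these ids from PostgreSQL for full records
```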
We ran experiments with two Doc2Vec configurations:

| Model | vector_size | epochs | Positive similarity | Negative similarity | Difference | METEOR | ROUGE |
|-------|-------------|--------|---------------------|---------------------|------------|--------|-------|
| Doc2Vec v1 | 35 | 50 | 0.414 | 0.298 | 0.116 | 0.342 | 0.28 |
| Doc2Vec v2 | 15 | 50 | 0.158 | 0.106 | 0.0515 | 0.126 | 0.103 |
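For reference, a toy gensim sketch of the v1 configuration and the positive/negative similarity metric, with made-up texts standing in for the real corpus:

```python
# Doc2Vec v1 configuration (vector_size=35, epochs=50): positive similarity is
# for matched resume/vacancy pairs, negative for mismatched ones.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

resumes = ["python backend developer fastapi postgres", "frontend react engineer"]
vacancies = ["looking for python fastapi developer", "react developer wanted"]

docs = [TaggedDocument(text.split(), [i]) for i, text in enumerate(resumes + vacancies)]
model = Doc2Vec(docs, vector_size=35, epochs=50, min_count=1)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_resume = model.infer_vector(resumes[0].split())
pos = cos(v_resume, model.infer_vector(vacancies[0].split()))  # matched pair
neg = cos(v_resume, model.infer_vector(vacancies[1].split()))  # mismatched pair
print(f"positive = {pos:.3f}, negative = {neg:.3f}, difference = {pos - neg:.3f}")
```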
For more info, please visit our Notion page.
We had to build the data ourselves, as there were no ready-made datasets for our project. The vacancy dataset was matched with the resume data manually, using two different approaches:
- By calculating similarities between the full texts of resumes and vacancies using Word2Vec, Doc2Vec, and TF-IDF vectorization (file `resume_matching_data.ipynb`). The results here were unsatisfying.
- By matching on keywords and setting strict filters on the data. This approach turned out to be effective; see the sketch below.
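An illustrative sketch of the keyword idea; the column names, keyword list, and the two-keyword threshold are assumptions, not the notebook's actual logic:

```python
# Keyword matching with a strict overlap filter over toy dataframes.
import pandas as pd

resumes = pd.DataFrame({
    "id": [1, 2],
    "text": ["python fastapi postgres developer", "java spring engineer"],
})
vacancies = pd.DataFrame({
    "id": [10, 20],
    "text": ["senior python developer (fastapi)", "c++ systems programmer"],
})

KEYWORDS = ["python", "fastapi", "postgres", "java", "spring", "c++"]

def keywords_of(text: str) -> set[str]:
    return {kw for kw in KEYWORDS if kw in text.lower()}

pairs = []
for _, r in resumes.iterrows():
    r_kw = keywords_of(r["text"])
    for _, v in vacancies.iterrows():
        shared = r_kw & keywords_of(v["text"])
        if len(shared) >= 2:  # strict filter: require at least two shared keywords
            pairs.append((r["id"], v["id"], sorted(shared)))

print(pairs)  # [(1, 10, ['fastapi', 'python'])]
```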
In this project we provide a highly efficient and accurate service for matching CVs with available vacancies using the Distiluse sentence transformer. We use FastAPI with PostgreSQL and Faiss for storing, adding, and searching similar resumes and vacancies, Sentence_Transformers for training and inference, and Streamlit for a cool and minimalistic frontend.
- Clone the repo:

  ```
  git clone -b randv_main https://github.com/pavviaz/itmo_pdl.git
  ```

- Place the SentenceTransformer checkpoint folder into the `embedder/weights` directory, and the example resume and vacancy CSVs into `api/init_data` (our weights and data: `ce_model.zip` is the model folder, `resume_train_no_index.csv` and `vac_train_no_index.csv` contain the resume and vacancy data respectively). Change the model and data paths in the config files if needed.

- Create a `.env` file in the root directory with the following keys:

  ```
  DB_NAME=<EXAMPLE_DB_NAME>
  DB_USER=<EXAMPLE_DB_USER>
  DB_PASSWORD=<EXAMPLE_DB_PASSWD>
  DB_HOST=<EXAMPLE_DB_HOST>
  DB_PORT=5044
  EMBEDDER_URL=http://embedder:5043
  ```

- Build & run the containers:

  ```
  sudo docker-compose build
  sudo docker-compose up
  ```
Congratulations! Streamlit is now available at http://localhost:8501/ and the API endpoints are at http://localhost:5041/docs.
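As a quick smoke test you could hit the API from Python; the `/search` route and its payload below are hypothetical, so check the docs page for the real endpoints and schemas:

```python
# Hypothetical smoke test; route and payload are illustrative only.
import requests

resp = requests.post(
    "http://localhost:5041/search",
    json={"text": "python backend developer", "top_k": 5},
)
resp.raise_for_status()
print(resp.json())
```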
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)

  ```
  @inproceedings{reimers-2019-sentence-bert,
      title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
      author = "Reimers, Nils and Gurevych, Iryna",
      booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
      month = "11",
      year = "2019",
      publisher = "Association for Computational Linguistics",
      url = "https://arxiv.org/abs/1908.10084",
  }
  ```
Fyodorova Inessa
Kudryashov Georgy
Vyaznikov Pavel