The project focuses on the following key tasks:
- Data Preprocessing: Clean the job postings data, handle missing values, and remove HTML tags.
- Generating Embeddings: Use a pre-trained Sentence Transformers model to generate embeddings.
- Milvus for Duplicate Detection: Set up a Milvus instance, insert embeddings, and implement a method to search for potential duplicates.
- Docker/Docker Compose Integration: Containerize the project for easy reproducibility.
https://drive.google.com/file/d/15rrFDdftzcWTLXRy5gJbzYsBdPyZLvmA/view?usp=sharing
/MilvusProject
|-- job_postings.csv
|-- preprocessing.py
|-- detect_duplicates.py
|-- embeddings/
|   |-- job_description_embeddings.pt
|-- files/
|   |-- cleaned_file.py
|   |-- embedding_csv.py
|-- Dockerfile
|-- docker-compose.yml
|-- README.md
- Python 3.x
- PyTorch
- Sentence Transformers
- pymilvus
Install dependencies using:
pip install -r requirements.txt
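A requirements.txt matching the dependency list above might look like the following (unpinned; exact versions are not specified in this README):

```text
torch
sentence-transformers
pymilvus
```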
- Clone the repository:
  git clone https://github.com/DhruvMiyani/Duplicate-Detection-for-Job-Postings-using-Milvus
- Start Milvus:
  bash standalone_embed.sh start
- Build the image:
  docker build -t milvus2:latest .
- Run the container:
  docker run -p 80:80 milvus2
- Data Preprocessing: Explore and clean the data in the job_postings.csv file.
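The cleaning in this step can be sketched as follows; the "description" field name, the sample rows, and the exact cleaning rules are assumptions, not taken from the repository:

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, then collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

# Tiny inline sample standing in for rows of job_postings.csv.
rows = [
    {"title": "Backend Dev", "description": "<p>Build APIs   in <b>Python</b></p>"},
    {"title": "Data Engineer", "description": None},  # missing value -> dropped
]

# Drop rows with a missing description, clean the rest.
cleaned = [
    {**row, "description": clean_text(row["description"])}
    for row in rows
    if row["description"] is not None
]
```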
- Generating Embeddings: Run the following command to generate embeddings:
  python preprocessing.py
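The embedding step inside preprocessing.py might look like this minimal sketch; the model name all-MiniLM-L6-v2 and the sample texts are assumptions, while the output path matches the project tree above:

```python
import torch
from sentence_transformers import SentenceTransformer

# Any pre-trained Sentence Transformers model works; this choice is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "Senior Python developer, remote.",
    "Sr. Python dev (remote).",
]

# Encode all descriptions into dense vectors in one batch.
embeddings = model.encode(descriptions, convert_to_tensor=True)

# Persist for later insertion into Milvus.
torch.save(embeddings, "embeddings/job_description_embeddings.pt")
```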
- Milvus for Duplicate Detection: Set up the Milvus instance and run duplicate detection:
  python milvus/duplicate_detection.py
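The Milvus side can be sketched as follows, assuming a Milvus 2.x standalone instance on the default port and 384-dimensional MiniLM embeddings; the collection and field names are illustrative, not taken from the repository:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

# Connect to the standalone instance started above (default port assumed).
connections.connect(host="localhost", port="19530")

dim = 384  # embedding size of all-MiniLM-L6-v2 (an assumption)
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("job_postings", CollectionSchema(fields))

# vectors: a list of dim-length float lists produced by the embedding step.
# collection.insert([vectors])

collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "COSINE", "params": {"nlist": 128}},
)
collection.load()

# Search each posting's vector against the collection; any hit other than the
# posting itself with similarity above a threshold is a candidate duplicate.
# results = collection.search(
#     data=vectors, anns_field="embedding",
#     param={"metric_type": "COSINE", "params": {"nprobe": 16}},
#     limit=5,
# )
```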