The project focuses on the following key tasks:
- Data Preprocessing: Clean the job postings data, handle missing values, and remove HTML tags.
- Generating Embeddings: Use a pre-trained Sentence Transformers model to generate embeddings.
- Milvus for Duplicate Detection: Set up a Milvus instance, insert embeddings, and implement a method to search for potential duplicates.
- Docker/Docker Compose Integration: Containerize the project for easy reproducibility.
https://drive.google.com/file/d/15rrFDdftzcWTLXRy5gJbzYsBdPyZLvmA/view?usp=sharing
/MilvusProject
|-- job_postings.csv
|-- preprocessing.py
|-- detect_duplicates.py
|-- embeddings/
|   |-- job_description_embeddings.pt
|-- files/
|   |-- cleaned_file.py
|   |-- embedding_csv.py
|-- Dockerfile
|-- docker-compose.yml
|-- README.md
- Python 3.x
- PyTorch
- Sentence Transformers
- pymilvus
Install dependencies using:
pip install -r requirements.txt
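A requirements.txt matching the dependency list above might look like the following (unpinned; exact versions are not specified in this README):

```text
torch
sentence-transformers
pymilvus
```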
- Clone the repository:
  git clone https://github.com/DhruvMiyani/Duplicate-Detection-for-Job-Postings-using-Milvus
- Start Milvus:
  bash standalone_embed.sh start
- Build the image:
  docker build -t milvus2:latest .
- Run the container:
  docker run -p 80:80 milvus2
- Data Preprocessing: Explore and clean the data in the job_postings.csv file.
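The cleaning in this step can be sketched as follows; the "description" field name, the sample rows, and the exact cleaning rules are assumptions, not taken from the repository:

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, then collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

# Tiny inline sample standing in for rows of job_postings.csv.
rows = [
    {"title": "Backend Dev", "description": "<p>Build APIs   in <b>Python</b></p>"},
    {"title": "Data Engineer", "description": None},  # missing value -> dropped
]

# Drop rows with a missing description, clean the rest.
cleaned = [
    {**row, "description": clean_text(row["description"])}
    for row in rows
    if row["description"] is not None
]
```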
- Generating Embeddings: Run the following command to generate embeddings:
  python preprocessing.py
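The embedding step inside preprocessing.py might look like this minimal sketch; the model name all-MiniLM-L6-v2 and the sample texts are assumptions, while the output path matches the project tree above:

```python
import torch
from sentence_transformers import SentenceTransformer

# Any pre-trained Sentence Transformers model works; this choice is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "Senior Python developer, remote.",
    "Sr. Python dev (remote).",
]

# Encode all descriptions into dense vectors in one batch.
embeddings = model.encode(descriptions, convert_to_tensor=True)

# Persist for later insertion into Milvus.
torch.save(embeddings, "embeddings/job_description_embeddings.pt")
```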
- Milvus for Duplicate Detection: Set up the Milvus instance and run duplicate detection:
  python milvus/duplicate_detection.py
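The Milvus side can be sketched as follows, assuming a Milvus 2.x standalone instance on the default port and 384-dimensional MiniLM embeddings; the collection and field names are illustrative, not taken from the repository:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

# Connect to the standalone instance started above (default port assumed).
connections.connect(host="localhost", port="19530")

dim = 384  # embedding size of all-MiniLM-L6-v2 (an assumption)
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("job_postings", CollectionSchema(fields))

# vectors: a list of dim-length float lists produced by the embedding step.
# collection.insert([vectors])

collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "COSINE", "params": {"nlist": 128}},
)
collection.load()

# Search each posting's vector against the collection; any hit other than the
# posting itself with similarity above a threshold is a candidate duplicate.
# results = collection.search(
#     data=vectors, anns_field="embedding",
#     param={"metric_type": "COSINE", "params": {"nprobe": 16}},
#     limit=5,
# )
```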