This project is designed to create contextual document embeddings for neural retrieval using a combination of contrastive learning and contextual information from related documents. The goal is to train a model that can embed documents into a vector space where similar documents are closer together, thus improving information retrieval tasks such as document search and recommendation systems.
The project uses a contrastive learning approach to train an LSTM-based encoder to learn embeddings for documents. By incorporating positive and negative neighbors, the model creates context-rich embeddings that can be used for more effective semantic retrieval.
The project consists of the following main components:
- `config.py`: Contains configuration settings, such as model hyperparameters, data paths, and training settings.
- `train.py`: Script to train the model using contrastive learning, incorporating both positive and negative neighbors for context.
- `evaluate.py`: Evaluates the model's ability to retrieve similar documents, computing metrics such as precision, recall, and F1-score.
- `models/contrastive.py`: Defines the contrastive model that uses the encoder to compute similarity scores and the contrastive loss.
- `models/encoder.py`: Implements the `ContextualEncoder`, which uses an LSTM to generate document embeddings enriched by their neighbors (see the sketch after this list).
- `data/dataset.py`: Defines the `DocumentDataset` class, responsible for loading and providing documents for training and evaluation.
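As a rough illustration of how the encoder described above might be put together, the sketch below runs an LSTM over a document's neighbor embeddings and fuses the result with the document's own embedding. The dimensions, the forward signature, and the choice to summarize neighbors with the final LSTM hidden state are assumptions for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class ContextualEncoder(nn.Module):
    """Minimal sketch: enrich a document embedding with an LSTM summary
    of its neighbors' embeddings (hypothetical layout and dimensions)."""

    def __init__(self, embedding_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(embedding_dim + hidden_dim, embedding_dim)

    def forward(self, doc_emb: torch.Tensor, neighbor_embs: torch.Tensor) -> torch.Tensor:
        # doc_emb: (batch, embedding_dim)
        # neighbor_embs: (batch, num_neighbors, embedding_dim)
        _, (h_n, _) = self.lstm(neighbor_embs)          # h_n: (1, batch, hidden_dim)
        context = h_n.squeeze(0)                        # (batch, hidden_dim)
        fused = torch.cat([doc_emb, context], dim=-1)   # concatenate doc + neighbor context
        return self.proj(fused)                         # context-enriched document embedding
```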
- Python 3.7+
- PyTorch for model development (`torch`)
- A virtual environment is recommended for dependency management.
- CUDA (optional) for GPU support.
- Clone or Extract the Repository: extract the ZIP file or clone the repository to your desired directory.
- Create a Virtual Environment:
  `python3 -m venv venv`
- Activate the Virtual Environment:
  - On Linux/macOS: `source venv/bin/activate`
  - On Windows: `venv\Scripts\activate`
- Install Dependencies: navigate to the project directory and install the required dependencies:
  `pip install -r requirements.txt`
- Place the data in `data/raw/` and `data/processed/`.
- The processed data should contain JSON files with each document represented as follows (ensure all necessary fields are included for every document):

```json
{
  "id": "document_1",
  "embedding": [0.1, 0.2, 0.3, ...],
  "neighbor_ids": ["document_2", "document_3"],
  "pos_neighbor_ids": ["document_2"],
  "neg_neighbor_ids": ["document_4", "document_5"]
}
```
To train the model, run the `train.py` script:

```bash
python train.py
```

- Device: The script will automatically detect if a GPU is available and use it for training.
- Checkpoints: Model checkpoints are saved in the `checkpoints/` directory after every epoch.
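In outline, the training step looks roughly like the sketch below: pick the device, iterate over anchor/positive/negative triples, backpropagate the contrastive loss, and write one checkpoint per epoch. The class name `ContrastiveModel`, its forward signature, the optimizer, and the epoch count are assumptions; the real values live in `config.py` and the project's modules.

```python
from pathlib import Path

import torch
from torch.utils.data import DataLoader

from data.dataset import DocumentDataset          # class name per the README
from models.contrastive import ContrastiveModel   # assumed class name

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = DataLoader(DocumentDataset("data/processed/"), batch_size=32, shuffle=True)
model = ContrastiveModel().to(device)              # assumed constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Path("checkpoints").mkdir(exist_ok=True)
for epoch in range(10):                            # epoch count is illustrative
    for anchor, pos, neg in loader:
        anchor, pos, neg = anchor.to(device), pos.to(device), neg.to(device)
        loss = model(anchor, pos, neg)             # assumed to return the contrastive loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # One checkpoint per epoch, as described above.
    torch.save(model.state_dict(), f"checkpoints/model_epoch_{epoch}.pth")
```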
After training, evaluate the model using the `evaluate.py` script:

```bash
python evaluate.py
```

- Make sure the checkpoint (`model.pth`) is available for loading. You can specify a different checkpoint file if needed.
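Conceptually, retrieval evaluation of this kind encodes every document, retrieves its nearest neighbors by cosine similarity, and compares them against the annotated positives. The helper below is a sketch of how precision, recall, and F1 could be averaged over documents; it is not the project's `evaluate.py`, and the top-k cutoff is an assumption.

```python
import torch
import torch.nn.functional as F


def retrieval_metrics(embeddings, positives, k=5):
    """embeddings: (N, D) tensor; positives: for each document, a set of
    indices of its true positive neighbors. Returns mean P/R/F1 at k."""
    sims = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
    sims.fill_diagonal_(-float("inf"))        # never retrieve the query itself
    topk = sims.topk(k, dim=-1).indices       # (N, k) retrieved indices

    precisions, recalls, f1s = [], [], []
    for i, pos in enumerate(positives):
        retrieved = set(topk[i].tolist())
        hits = len(retrieved & pos)
        p = hits / k
        r = hits / len(pos) if pos else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(positives)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```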
- Training: The training script uses contrastive learning to learn embeddings by maximizing similarity with positive neighbors and minimizing similarity with negative neighbors.
- Evaluation: After training, the evaluation script runs to assess how well the embeddings capture the contextual relationships between documents.
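For concreteness, a triplet-style margin loss is one common way to express "maximize similarity with positive neighbors, minimize similarity with negative neighbors"; whether the project uses this exact formulation or another contrastive objective (e.g. InfoNCE) is an assumption.

```python
import torch.nn.functional as F


def contrastive_loss(anchor, positive, negative, margin=0.5):
    """Push the anchor's cosine similarity to its positive neighbor above
    its similarity to the negative neighbor by at least `margin`."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```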