Contextual Document Embedding for Neural Retrieval

Overview

This project builds contextual document embeddings for neural retrieval by combining contrastive learning with contextual information from related documents. The goal is to train a model that embeds documents into a vector space where similar documents lie closer together, improving information retrieval tasks such as document search and recommendation.

The project trains an LSTM-based encoder with a contrastive learning objective. By incorporating both positive and negative neighbors, the model produces context-rich embeddings that support more effective semantic retrieval.
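The idea of "maximize similarity to positives, minimize similarity to negatives" can be sketched as a margin-based contrastive loss over cosine similarities. This is an illustrative sketch, not the exact loss in models/contrastive.py; the function name and margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positives, negatives, margin=0.5):
    """Margin-based contrastive loss over cosine similarities (illustrative).

    anchor:    (d,)   embedding of the document
    positives: (P, d) embeddings of positive neighbors
    negatives: (N, d) embeddings of negative neighbors
    """
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positives)  # (P,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives)  # (N,)
    # Pull positives toward similarity 1; penalize negatives above the margin.
    return (1 - pos_sim).mean() + F.relu(neg_sim - margin).mean()
```

With an anchor identical to its positive and opposite to its negative, the loss is zero, which is the behavior the training objective rewards.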

Project Structure

The project consists of the following main components:

  • config.py: Contains configuration settings, such as model hyperparameters, data paths, and training settings.
  • train.py: Script to train the model using contrastive learning, incorporating both positive and negative neighbors for context.
  • evaluate.py: Evaluates the model's ability to retrieve similar documents, computing metrics such as precision, recall, and F1-score.
  • models/contrastive.py: Defines the contrastive model that uses the encoder to compute similarity scores and contrastive loss.
  • models/encoder.py: Implements the ContextualEncoder, which uses an LSTM to generate document embeddings enriched by neighbors.
  • data/dataset.py: Defines the DocumentDataset class, responsible for loading and providing documents for training and evaluation.
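The ContextualEncoder in models/encoder.py is described as an LSTM that enriches a document's embedding with its neighbors. A minimal sketch of that idea (the constructor arguments and forward signature here are assumptions, not the project's actual API):

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Sketch: run an LSTM over [document; neighbors] and keep the final state."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, doc_emb, neighbor_embs):
        # doc_emb: (B, d); neighbor_embs: (B, K, d)
        seq = torch.cat([doc_emb.unsqueeze(1), neighbor_embs], dim=1)  # (B, 1+K, d)
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]  # (B, hidden_dim): context-enriched document embedding
```

Treating the document and its neighbors as a short sequence lets the LSTM's final hidden state summarize the document in the context of its neighborhood.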

Getting Started

Prerequisites

  • Python 3.7+
  • PyTorch for model development (torch)
  • A virtual environment is recommended for dependency management.
  • CUDA (optional) for GPU support.

Setup Instructions

  1. Clone or Extract the Repository

    • Extract the ZIP file or clone the repository to your desired directory.
  2. Create a Virtual Environment

    python3 -m venv venv
  3. Activate the Virtual Environment

    • On Linux/macOS:
      source venv/bin/activate
    • On Windows:
      venv\Scripts\activate
  4. Install Dependencies

    • Navigate to the project directory and install the required dependencies:
      pip install -r requirements.txt

Prepare Data

  • Place raw input data in data/raw/ and processed data in data/processed/.
  • The processed data should contain JSON files with each document represented as follows:
    {
      "id": "document_1",
      "embedding": [0.1, 0.2, 0.3, ...],
      "neighbor_ids": ["document_2", "document_3"],
      "pos_neighbor_ids": ["document_2"],
      "neg_neighbor_ids": ["document_4", "document_5"]
    }
    Ensure all necessary fields are included for every document.
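A minimal loader that checks the fields shown above might look like the following. The field names come from the example document; the function name is illustrative, and the real DocumentDataset may load data differently.

```python
import json

# Fields taken from the example document format above.
REQUIRED_FIELDS = ("id", "embedding", "neighbor_ids",
                   "pos_neighbor_ids", "neg_neighbor_ids")

def load_documents(path):
    """Load a processed JSON file (a list of documents) and verify each one."""
    with open(path) as f:
        docs = json.load(f)
    for doc in docs:
        missing = [k for k in REQUIRED_FIELDS if k not in doc]
        if missing:
            raise ValueError(f"document {doc.get('id', '?')} is missing {missing}")
    return {doc["id"]: doc for doc in docs}
```

Validating the fields up front surfaces malformed documents at load time rather than mid-training.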

Running the Project

Training the Model

To train the model, run the train.py script:

  python train.py
  • Device: The script will automatically detect if a GPU is available and use it for training.
  • Checkpoints: Model checkpoints are saved in the checkpoints/ directory after every epoch.
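"Automatic GPU detection" in PyTorch typically follows this standard pattern; the model and batch below are placeholders, not the project's actual objects.

```python
import torch

# Select the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # illustrative model
batch = torch.randn(8, 4).to(device)      # inputs must live on the same device
out = model(batch)
```

Both the model and every input batch must be moved to the same device, or PyTorch raises a device-mismatch error.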

Evaluating the Model

After training, evaluate the model using the evaluate.py script:

python evaluate.py
  • Make sure the checkpoint (model.pth) is available for loading. You can specify the correct checkpoint file if needed.
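Loading a checkpoint saved as a state dict usually follows this pattern; the helper name is illustrative, and whether the project saves a full model or a state dict is an assumption.

```python
import torch

def load_checkpoint(model, path, device="cpu"):
    """Restore weights saved with torch.save(model.state_dict(), path)."""
    state = torch.load(path, map_location=device)
    model.load_state_dict(state)
    model.eval()  # switch off dropout/batch-norm updates for evaluation
    return model
```

Calling `model.eval()` before evaluation matters: modules such as dropout behave differently in training and inference modes.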

Project Workflow

  1. Training: The training script uses contrastive learning to learn embeddings by maximizing similarity with positive neighbors and minimizing similarity with negative neighbors.
  2. Evaluation: After training, the evaluation script assesses how well the embeddings capture the contextual relationships between documents.
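The precision, recall, and F1 metrics mentioned for evaluate.py can be computed per query from the retrieved and relevant document sets; this sketch uses standard definitions, though the project's exact aggregation is an assumption.

```python
def retrieval_f1(retrieved, relevant):
    """Precision, recall, and F1 for one query's retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # correctly retrieved documents
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Averaging these per-query scores over the evaluation set gives the aggregate metrics reported after a run.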