Contextual Document Embedding for Neural Retrieval

Overview

This project builds contextual document embeddings for neural retrieval by combining contrastive learning with contextual information from related documents. The goal is to train a model that embeds documents into a vector space where similar documents lie closer together, improving information retrieval tasks such as document search and recommendation.

The project trains an LSTM-based encoder with a contrastive learning objective. By incorporating both positive and negative neighbors, the model produces context-rich embeddings that support more effective semantic retrieval.
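The idea of "maximize similarity to positives, minimize similarity to negatives" can be sketched as a margin-based contrastive loss over cosine similarities. This is an illustrative sketch, not the exact loss in models/contrastive.py; the function name and margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positives, negatives, margin=0.5):
    """Margin-based contrastive loss over cosine similarities (illustrative).

    anchor:    (d,)   embedding of the document
    positives: (P, d) embeddings of positive neighbors
    negatives: (N, d) embeddings of negative neighbors
    """
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positives)  # (P,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives)  # (N,)
    # Pull positives toward similarity 1; penalize negatives above the margin.
    return (1 - pos_sim).mean() + F.relu(neg_sim - margin).mean()
```

With an anchor identical to its positive and opposite to its negative, the loss is zero, which is the behavior the training objective rewards.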

Project Structure

The project consists of the following main components:

  • config.py: Contains configuration settings, such as model hyperparameters, data paths, and training settings.
  • train.py: Script to train the model using contrastive learning, incorporating both positive and negative neighbors for context.
  • evaluate.py: Evaluates the model's ability to retrieve similar documents, computing metrics such as precision, recall, and F1-score.
  • models/contrastive.py: Defines the contrastive model that uses the encoder to compute similarity scores and contrastive loss.
  • models/encoder.py: Implements the ContextualEncoder, which uses an LSTM to generate document embeddings enriched by neighbors.
  • data/dataset.py: Defines the DocumentDataset class, responsible for loading and providing documents for training and evaluation.
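The ContextualEncoder in models/encoder.py is described as an LSTM that enriches a document's embedding with its neighbors. A minimal sketch of that idea (the constructor arguments and forward signature here are assumptions, not the project's actual API):

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Sketch: run an LSTM over [document; neighbors] and keep the final state."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, doc_emb, neighbor_embs):
        # doc_emb: (B, d); neighbor_embs: (B, K, d)
        seq = torch.cat([doc_emb.unsqueeze(1), neighbor_embs], dim=1)  # (B, 1+K, d)
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]  # (B, hidden_dim): context-enriched document embedding
```

Treating the document and its neighbors as a short sequence lets the LSTM's final hidden state summarize the document in the context of its neighborhood.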

Getting Started

Prerequisites

  • Python 3.7+
  • PyTorch for model development (torch)
  • A virtual environment is recommended for dependency management.
  • CUDA (optional) for GPU support.

Setup Instructions

  1. Clone or Extract the Repository

    • Extract the ZIP file or clone the repository to your desired directory.
  2. Create a Virtual Environment

    python3 -m venv venv
  3. Activate the Virtual Environment

    • On Linux/macOS:
      source venv/bin/activate
    • On Windows:
      venv\Scripts\activate
  4. Install Dependencies

    • Navigate to the project directory and install the required dependencies:
      pip install -r requirements.txt

Prepare Data

  • Place raw input data in data/raw/ and processed data in data/processed/.
  • The processed data should contain JSON files with each document represented as follows:
    {
      "id": "document_1",
      "embedding": [0.1, 0.2, 0.3, ...],
      "neighbor_ids": ["document_2", "document_3"],
      "pos_neighbor_ids": ["document_2"],
      "neg_neighbor_ids": ["document_4", "document_5"]
    }
    Ensure all necessary fields are included for every document.
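A minimal loader that checks the fields shown above might look like the following. The field names come from the example document; the function name is illustrative, and the real DocumentDataset may load data differently.

```python
import json

# Fields taken from the example document format above.
REQUIRED_FIELDS = ("id", "embedding", "neighbor_ids",
                   "pos_neighbor_ids", "neg_neighbor_ids")

def load_documents(path):
    """Load a processed JSON file (a list of documents) and verify each one."""
    with open(path) as f:
        docs = json.load(f)
    for doc in docs:
        missing = [k for k in REQUIRED_FIELDS if k not in doc]
        if missing:
            raise ValueError(f"document {doc.get('id', '?')} is missing {missing}")
    return {doc["id"]: doc for doc in docs}
```

Validating the fields up front surfaces malformed documents at load time rather than mid-training.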

Running the Project

Training the Model

To train the model, run the train.py script:

  python train.py
  • Device: The script will automatically detect if a GPU is available and use it for training.
  • Checkpoints: Model checkpoints are saved in the checkpoints/ directory after every epoch.
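"Automatic GPU detection" in PyTorch typically follows this standard pattern; the model and batch below are placeholders, not the project's actual objects.

```python
import torch

# Select the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # illustrative model
batch = torch.randn(8, 4).to(device)      # inputs must live on the same device
out = model(batch)
```

Both the model and every input batch must be moved to the same device, or PyTorch raises a device-mismatch error.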

Evaluating the Model

After training, evaluate the model using the evaluate.py script:

python evaluate.py
  • Make sure the checkpoint (model.pth) is available for loading. You can specify the correct checkpoint file if needed.
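Loading a checkpoint saved as a state dict usually follows this pattern; the helper name is illustrative, and whether the project saves a full model or a state dict is an assumption.

```python
import torch

def load_checkpoint(model, path, device="cpu"):
    """Restore weights saved with torch.save(model.state_dict(), path)."""
    state = torch.load(path, map_location=device)
    model.load_state_dict(state)
    model.eval()  # switch off dropout/batch-norm updates for evaluation
    return model
```

Calling `model.eval()` before evaluation matters: modules such as dropout behave differently in training and inference modes.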

Project Workflow

  1. Training: The training script uses contrastive learning to learn embeddings by maximizing similarity with positive neighbors and minimizing similarity with negative neighbors.
  2. Evaluation: After training, the evaluation script assesses how well the embeddings capture the contextual relationships between documents.
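The precision, recall, and F1 metrics mentioned for evaluate.py can be computed per query from the retrieved and relevant document sets; this sketch uses standard definitions, though the project's exact aggregation is an assumption.

```python
def retrieval_f1(retrieved, relevant):
    """Precision, recall, and F1 for one query's retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # correctly retrieved documents
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Averaging these per-query scores over the evaluation set gives the aggregate metrics reported after a run.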