/DocLing

DocLing: Multilingual Document Understanding

Primary LanguageJupyter NotebookApache License 2.0Apache-2.0

MultiLingual Document Understanding

This repository demonstrates the implementation of a language model pipeline using Cohere's AYA Model. The notebook walks through various steps, from data preprocessing to model deployment, focusing on using Cohere's capabilities for language tasks.

Table of Contents

Overview

Our approach showcases a pipeline designed to leverage the power of Cohere's NLP model for natural language processing tasks. This includes:

  • Preprocessing text data
  • Creating embeddings
  • Implementing a language understanding or generation task
  • Evaluating model outputs

Requirements

Before running the scripts in your local system, make sure you have the following installed:

  • Python 3.x
  • Jupyter Notebook or JupyterLab
  • Cohere Python SDK (cohere)
  • Other necessary libraries such as pandas, numpy, and scikit-learn

Installation

To set up the environment, follow these steps:

  1. Clone this repository:
    git clone https://github.com/hemhemoh/DocLing.git
  2. Install the necessary dependencies:
    pip install -r requirements.txt
  3. Obtain an API key from Cohere and set it up in the notebook.
  4. Run the below command:
python app.py

Approach Description

The approach follows these main steps:

  1. Data Preparation: Loads and preprocesses text data for NLP tasks.
  2. Embedding Generation: Utilizes Cohere’s model to create embeddings for text inputs.
  3. Model Usage: Demonstrates how to perform tasks like text classification, semantic search, or text generation using the Cohere API.
  4. Evaluation: Evaluates the model's performance based on the specific NLP task.

References