LLM-Latino

Collection of ETL scripts used to create a dataset of text in Spanish to train Large Language Models.

Primary Language: Python

Seneca Extractor

Description

seneca_extractor is a Python package for extracting files and metadata from the Seneca institutional repository at Universidad de los Andes. It is part of the LLM-Latino project and aims to make the data stored in the repository easier to access and process.

Authors

  • Juan Sebastian Urrea Lopez
  • David Santiago Ortiz Almanza

Contact

Installation

We recommend installing this package in a Python virtual environment to avoid dependency conflicts. The following steps set up the environment and install seneca_extractor:

  1. Create and activate a virtual environment (optional, but recommended):

    • On Windows:
      python -m venv venv
      .\venv\Scripts\activate
    • On Unix or macOS:
      python3 -m venv venv
      source venv/bin/activate
  2. Install the package:

    • Navigate to the directory where the source code is located and run:
      pip install -e .

    This installs seneca_extractor in editable mode, so changes to the package source code take effect immediately without reinstalling the package.
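
After following the steps above, it can be useful to confirm that the virtual environment is active and that the package can be found by the interpreter. The following is a minimal sketch using only the Python standard library. It assumes the import name matches the package name seneca_extractor; the helper names in_virtualenv and is_installed are hypothetical and not part of the package.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be located for import."""
    # find_spec returns None when no top-level module has this name.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the package is importable under the name "seneca_extractor".
    print("virtual environment active:", in_virtualenv())
    print("seneca_extractor importable:", is_installed("seneca_extractor"))
```

If the second line reports False, the most likely causes are that pip install -e . was run outside the activated environment or from a directory other than the one containing the source code.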