MIR-MU/pine

Use a custom corpus


Hey, I just read your paper and found it really interesting. However, I was wondering whether PInE is easily translatable to corpora other than Wiki and Common Crawl, and how I could go about implementing that?

@CompBioML Hello and thank you for your interest. PInE is not easily translatable as the focus is on reproducibility and robustness rather than ease of experimentation. There are two options:

  1. You can add your corpus into PInE:
  2. You can use the low-level witiko/gensim@pine library on top of which the high-level PInE library is built. Gensim will work with any kind of corpus, but using it will feel like programming.

@CompBioML In version 0.2.0, I added support for custom corpora. You can now pass a parameter corpus of type Iterable[Iterable[str]] to the LanguageModel constructor. For example, if you have a corpus stored in a text file named corpus.txt, here is how you could load it together with a progress bar:

from typing import Iterator, List

from pine import LanguageModel
from tqdm import tqdm

class MyCorpus:
    def __init__(self):
        # Count the lines once up front so the progress bar can show a total.
        with open('corpus.txt', 'rt') as f:
            self.number_of_lines = sum(1 for _ in tqdm(f, desc='Counting lines in corpus'))

    def __iter__(self) -> Iterator[List[str]]:
        # Reopen the file on every call so the corpus can be iterated repeatedly.
        with open('corpus.txt', 'rt') as f:
            sentences = tqdm(f, desc='Reading corpus', total=self.number_of_lines)
            for sentence in sentences:
                sentence = sentence.split()  # tokenize the sentence on whitespace
                yield sentence


corpus = MyCorpus()
language_model = LanguageModel(corpus)
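
A note on why the example uses a class with `__iter__` rather than a plain generator: training code built on Gensim generally reads the corpus more than once, and a generator is exhausted after a single pass. I am assuming here that PInE behaves the same way. The sketch below is independent of PInE and only illustrates the restartable-iterable pattern; the `TextFileCorpus` class, its `path` parameter, and the temporary file are hypothetical names made up for the illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List


class TextFileCorpus:
    """Hypothetical restartable corpus: every __iter__ call reopens the file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[List[str]]:
        with open(self.path, 'rt', encoding='utf-8') as f:
            for line in f:
                tokens = line.split()  # whitespace tokenization
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny throwaway corpus so the sketch runs on its own.
corpus_path = str(Path(tempfile.mkdtemp()) / 'corpus.txt')
Path(corpus_path).write_text(
    'the quick brown fox\n\njumps over the lazy dog\n', encoding='utf-8')

corpus = TextFileCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # yields the same sentences again

one_shot = (line.split() for line in open(corpus_path, encoding='utf-8'))
list(one_shot)                 # the first pass consumes the generator
generator_rerun = list(one_shot)  # the second pass yields nothing
```

An object like this satisfies `Iterable[Iterable[str]]`, so it should be usable as the `corpus` parameter in the same way as `MyCorpus` above.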

@Witiko Awesome, that's really useful! Your earlier recommendation was already really helpful, too.