Use a custom corpus

Question

Closed this issue 2 years ago · 3 comments

Hey, I just read your paper and found it really interesting. However, I was wondering whether PInE can easily be adapted to corpora other than Wikipedia and Common Crawl, and how I could go about implementing that?

Answer 1 · 2022-03-04T19:43:31.000Z

@CompBioML Hello, and thank you for your interest. PInE is not easy to adapt to other corpora, because its focus is on reproducibility and robustness rather than ease of experimentation. There are two options:

First, you can add your corpus to PInE:
- Write a function that creates a plain text file with the corpus, similar to the get_corpus_path() functions in the common_crawl and wikipedia modules.
- Register a name for your corpus in the get_corpus() function in the corpus module.
- Register the expected size of the corpus text file in the CORPUS_SIZES dict in the configuration module.
- Pass the name of your corpus as the corpus parameter of the LanguageModel constructor in the language_model module.

Second, you can use the low-level witiko/gensim@pine library, on top of which the high-level PInE library is built. Gensim will work with any kind of corpus, but using it will feel like programming.
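
For the option of adding your corpus into PInE, the file-creation step might look roughly like the sketch below. It is only an illustration under stated assumptions: write_corpus_file is a made-up helper and not part of PInE, the one-sentence-per-line layout with space-separated tokens is a guess at a reasonable plain text format, and returning the size in bytes assumes that this is the unit CORPUS_SIZES expects. Check the common_crawl and wikipedia modules for what the real get_corpus_path() functions do.

```python
import os
import tempfile
from typing import Iterable


def write_corpus_file(sentences: Iterable[Iterable[str]], path: str) -> int:
    """Hypothetical helper (not part of PInE).

    Writes one sentence per line with space-separated tokens and returns
    the size of the resulting file in bytes.
    """
    # newline='\n' keeps the byte count identical across platforms.
    with open(path, 'wt', encoding='utf-8', newline='\n') as f:
        for sentence in sentences:
            f.write(' '.join(sentence) + '\n')
    return os.path.getsize(path)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'my_corpus.txt')
    demo_size = write_corpus_file([['hello', 'world'], ['foo']], demo_path)
    with open(demo_path, 'rt', encoding='utf-8') as f:
        demo_lines = f.read().splitlines()
```

The returned size is the kind of number you would then register for your corpus, assuming the unit matches what the configuration module uses.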

Answer 2 · 2022-04-27T20:20:44.000Z

@CompBioML In version 0.2.0, I added support for custom corpora. You can now pass a parameter corpus of type Iterable[Iterable[str]] to the LanguageModel constructor. For example, if you have a corpus stored in a text file named corpus.txt, here is how you could load it together with a progress bar:

from typing import Iterator, List

from pine import LanguageModel
from tqdm import tqdm


class MyCorpus:
    def __init__(self):
        # Count the lines once up front so that the progress bar knows the total.
        with open('corpus.txt', 'rt') as f:
            self.number_of_lines = sum(1 for _ in tqdm(f, desc='Counting lines in corpus'))

    def __iter__(self) -> Iterator[List[str]]:
        # Reopen the file on every call so that the corpus can be iterated repeatedly.
        with open('corpus.txt', 'rt') as f:
            sentences = tqdm(f, desc='Reading corpus', total=self.number_of_lines)
            for sentence in sentences:
                yield sentence.split()  # tokenize the sentence on whitespace


corpus = MyCorpus()
language_model = LanguageModel(corpus)
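
If you would rather not hard-code the file name or depend on tqdm, a variation along the following lines may be enough. This is a minimal sketch, not PInE code: TextFileCorpus and its parameters are names invented for illustration. It is written as a class with __iter__ rather than a one-shot generator on the assumption that training may pass over the corpus more than once, as Gensim models generally do.

```python
import os
import tempfile
from typing import Iterator, List


class TextFileCorpus:
    """Hypothetical helper (not part of PInE): streams tokenized sentences from a text file."""

    def __init__(self, path: str, encoding: str = 'utf-8') -> None:
        self.path = path
        self.encoding = encoding

    def __iter__(self) -> Iterator[List[str]]:
        # Reopen the file on each call so that every pass starts from the first line.
        with open(self.path, 'rt', encoding=self.encoding) as f:
            for line in f:
                tokens = line.split()  # whitespace tokenization
                if tokens:  # skip blank lines
                    yield tokens


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile('wt', suffix='.txt', delete=False, encoding='utf-8') as f:
    f.write('the quick brown fox\n\njumps over the lazy dog\n')
    demo_path = f.name

demo_corpus = TextFileCorpus(demo_path)
first_pass = list(demo_corpus)
second_pass = list(demo_corpus)  # a second pass yields the same sentences
os.remove(demo_path)
```

A TextFileCorpus pointing at your real corpus file would then be passed to the LanguageModel constructor in the same way as the corpus object in the example above.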

Answer 3 · 2022-04-28T01:03:24.000Z

@Witiko Awesome, that's really useful! Your earlier recommendation was already really helpful too.