Use a custom corpus
Closed this issue · 3 comments
CompBioML commented
Hey, I just read your paper and found it really interesting. However, I was wondering if PINE is easily translatable to other corpuses beyond Wiki and Common Crawl and how I could go about implementing that?
Witiko commented
@CompBioML Hello and thank you for your interest. PInE is not easily translatable as the focus is on reproducibility and robustness rather than ease of experimentation. There are two options:
- You can add your corpus into PInE:
  - Write a function that creates a plain text file with the corpus similarly to the `get_corpus_path()` functions from the `common_crawl` and `wikipedia` modules.
  - Register a name for your corpus in the `get_corpus()` function from the `corpus` module.
  - Register the expected size of the text file of the corpus in the `CORPUS_SIZES` dict from the `configuration` module.
  - Use the name of your corpus in the `corpus` parameter of the `LanguageModel` constructor from the `language_model` module.
- You can use the low-level witiko/gensim@pine library on top of which the high-level PInE library is built. Gensim will work with any kind of corpus, but using it will feel like programming.
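
The first three steps of the first option could look roughly like the sketch below. It is a standalone illustration and does not import PInE: the names `get_corpus()` and `CORPUS_SIZES` come from the list above, but their signatures and bodies here are assumptions, and `load_my_sentences()`, `get_my_corpus_path()`, `CORPUS_PATH_FUNCTIONS`, the `workdir` parameter, and the choice of bytes as the size unit are all hypothetical. Check the actual `corpus` and `configuration` modules before adapting it.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def load_my_sentences() -> Iterable[List[str]]:
    """Hypothetical data source; replace it with a reader for your own data."""
    return [['hello', 'world'], ['custom', 'corpus']]


def get_my_corpus_path(workdir: Path) -> Path:
    """Step 1: write the corpus as plain text, one tokenized sentence per line."""
    path = workdir / 'my_corpus.txt'
    if not path.exists():  # reuse the file if an earlier run already created it
        with path.open('wt', encoding='utf-8', newline='\n') as f:
            for sentence in load_my_sentences():
                f.write(' '.join(sentence) + '\n')
    return path


# Step 2 (assumed shape): map each registered corpus name to its path function.
CORPUS_PATH_FUNCTIONS: Dict[str, Callable[[Path], Path]] = {
    'my_corpus': get_my_corpus_path,
}

# Step 3 (assumed shape): expected size of each corpus text file, here in bytes.
CORPUS_SIZES: Dict[str, int] = {
    'my_corpus': 26,  # 'hello world\n' is 12 bytes, 'custom corpus\n' is 14 bytes
}


def get_corpus(name: str, workdir: Path) -> Path:
    """Resolve a registered corpus name to a text file of the expected size."""
    if name not in CORPUS_PATH_FUNCTIONS:
        raise ValueError(f'Unknown corpus: {name}')
    path = CORPUS_PATH_FUNCTIONS[name](workdir)
    actual_size = path.stat().st_size
    if actual_size != CORPUS_SIZES[name]:
        raise ValueError(
            f'Corpus {name} has {actual_size} bytes, expected {CORPUS_SIZES[name]}'
        )
    return path
```

The fourth step would then amount to passing the registered name (`'my_corpus'` in this sketch) as the `corpus` parameter of the `LanguageModel` constructor.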
Witiko commented
@CompBioML In version 0.2.0, I added support for custom corpora. You can now pass a parameter `corpus` of type `Iterable[Iterable[str]]` to the `LanguageModel` constructor. For example, if you have a corpus stored in a text file named `corpus.txt`, here is how you could load it together with a progress bar:
from typing import Iterator, List

from pine import LanguageModel
from tqdm import tqdm


class MyCorpus:
    def __init__(self):
        # Count the lines once so that the progress bar in __iter__ knows the total.
        with open('corpus.txt', 'rt') as f:
            self.number_of_lines = sum(1 for _ in tqdm(f, desc='Counting lines in corpus'))

    def __iter__(self) -> Iterator[List[str]]:
        # Reopen the file on every call so that the corpus can be read more than once.
        with open('corpus.txt', 'rt') as f:
            sentences = tqdm(f, desc='Reading corpus', total=self.number_of_lines)
            for sentence in sentences:
                sentence = sentence.split()  # tokenize the sentence
                yield sentence


corpus = MyCorpus()
language_model = LanguageModel(corpus)
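
One detail of the example above that may be worth spelling out is why the corpus is a class with `__iter__` rather than a plain generator. Gensim-style training reads the corpus more than once (a vocabulary pass followed by one pass per epoch), so, assuming PInE hands the corpus to gensim unchanged, the object needs to be restartable. The standard-library sketch below contrasts the two; `LineCorpus`, `read_once`, and `count_passes` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List


class LineCorpus:
    """Restartable corpus: every call to __iter__ reopens the file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def __iter__(self) -> Iterator[List[str]]:
        with open(self.path, 'rt', encoding='utf-8') as f:
            for line in f:
                yield line.split()  # whitespace tokenization


def read_once(path: str) -> Iterator[List[str]]:
    """One-shot generator: it is exhausted after the first full pass."""
    with open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield line.split()


def count_passes(corpus: Iterable[List[str]], passes: int = 2) -> List[int]:
    """Return the number of sentences seen in each successive pass."""
    return [sum(1 for _ in corpus) for _ in range(passes)]
```

For a file with two lines, `count_passes(LineCorpus(path))` gives `[2, 2]`, whereas `count_passes(read_once(path))` gives `[2, 0]`: the generator yields nothing on the second pass, which would silently leave later training passes with no data.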