MIND-Lab/OCTIS

vocabulary: a .txt for custom dataset

SaraAmd opened this issue · 1 comment

How can we generate a vocabulary file from our CSV/TSV dataset?

Hi, you can load the TSV file, split each document into words on whitespace, and save only the unique words. Like this:

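import pandas as pd

# dataset_path is the folder that contains your corpus.tsv
df = pd.read_csv(dataset_path + "/corpus.tsv", sep='\t', header=None)

# Collect the unique whitespace-separated words from the first column (the documents)
vocabulary = set()
for document in df[0].tolist():
    for word in document.split():
        vocabulary.add(word)

# Write one word per line
with open(dataset_path + "/vocabulary.txt", 'w') as fw:
    for word in vocabulary:
        fw.write(word + "\n")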

Best,

Silvia
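
For anyone who would rather not depend on pandas, the same idea can be sketched with only the standard library. This is a minimal sketch, not part of OCTIS: the function name `build_vocabulary` and its parameters (`delimiter`, `text_column`) are hypothetical, and sorting the words is just a choice to make the output reproducible. It does no preprocessing (no lowercasing or punctuation removal), so it assumes the documents are already tokenized by spaces.

```python
import csv
from pathlib import Path


def build_vocabulary(corpus_path, vocab_path, delimiter="\t", text_column=0):
    """Write the unique whitespace-separated words of one column, one per line.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    vocabulary = set()
    with open(corpus_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Skip empty or short rows that have no text column
            if len(row) > text_column:
                vocabulary.update(row[text_column].split())

    # Sort so that repeated runs produce the same file
    words = sorted(vocabulary)
    Path(vocab_path).write_text("".join(w + "\n" for w in words), encoding="utf-8")
    return words
```

For a comma-separated file, pass `delimiter=","`; if the text is not in the first column, set `text_column` to its zero-based index.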