MIND-Lab/OCTIS

Error when loading a custom preprocessed dataset

srashtchi opened this issue · 4 comments

  • OCTIS version: 1.10.4
  • Python version: 3.9
  • Operating System: MacOS

Description

I am trying to use the evaluation metrics from the OCTIS package on my own dataset.
I followed the guide on the main README page on how to load a custom preprocessed dataset. I even used your sample .tsv file, but I got the following error:
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder('/Users/../scratches/corpus.tsv')

Traceback (most recent call last):
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/octis/dataset/dataset.py", line 327, in load_custom_dataset_from_folder
df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
self._engine = self._make_engine(f, self.engine)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
self.handles = get_handle( # type: ignore[call-overload]
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/common.py", line 786, in get_handle
handle = open(
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'

Hi!
The function load_custom_dataset_from_folder() requires the path to the folder, not the path to the corpus file. As the traceback shows, it appends "/corpus.tsv" to whatever path it receives.

Can you check if it works in this way?

dataset.load_custom_dataset_from_folder('/Users/../scratches/')
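
To guard against passing a file path by mistake, a small check can be run before calling the loader. This is only a sketch: resolve_dataset_folder is a hypothetical helper, not part of OCTIS, and it assumes the corpus file must be named corpus.tsv, as the traceback above suggests.

```python
from pathlib import Path


def resolve_dataset_folder(path_str: str) -> Path:
    """Return the folder to hand to the loader.

    Hypothetical helper (not an OCTIS function): if the given path points
    at a file such as .../corpus.tsv, use its parent folder instead.
    """
    path = Path(path_str)
    folder = path.parent if path.is_file() else path
    # Assumption taken from the traceback: the loader reads <folder>/corpus.tsv.
    if not (folder / "corpus.tsv").is_file():
        raise FileNotFoundError(f"No corpus.tsv found in {folder}")
    return folder


# Possible usage (not executed here, requires OCTIS):
# folder = resolve_dataset_folder('/Users/../scratches/corpus.tsv')
# dataset.load_custom_dataset_from_folder(str(folder))
```

With this check, a wrong path fails early with a message naming the folder, rather than inside pandas.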

Silvia

Thanks Silvia for getting back to me.
It worked. I didn't know that I needed to name my own .tsv file "corpus.tsv", the same as your sample, and that the path should not include the corpus.tsv file name.
Cheers

Perfect. I'll fix the readme to make it clear.
Thanks,

Silvia

Hi Silvia

I managed to get my code running fine, thanks for your response.

I have another question. I am trying to make the code smoother. Right now, in order to create a dataset object, I have to save my variable to a .tsv file first and then use the load_custom_dataset_from_folder method to load the data from the .tsv file into an empty dataset object. Without this object, the get_corpus() method obviously can't do its magic. See the sample code below.

So basically the question is: is there a way to directly pass my variable to a dataset object without saving and loading?

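from pathlib import Path
from octis.dataset.dataset import Dataset

# df is my existing pandas DataFrame with a 'document' column
f = Path('/myFolderPath/corpus.tsv')
df.to_csv(f, sep="\t", index=False, header=False, columns=['document'])

# load the folder that contains corpus.tsv, not the file itself
dataset = Dataset()
dataset.load_custom_dataset_from_folder('/myFolderPath/')

texts = dataset.get_corpus()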
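
Until a direct route is confirmed, one way to keep the calling code tidy is to hide the save step in a helper. The sketch below uses only the standard library; write_corpus_folder is a hypothetical name, and it assumes the one-document-per-line corpus.tsv layout used above. It does not remove the round trip through disk, and whether Dataset can take the documents directly is not answered in this thread.

```python
import csv
from pathlib import Path
from typing import Iterable


def write_corpus_folder(documents: Iterable[str], folder: str) -> Path:
    """Write documents to <folder>/corpus.tsv, one document per row.

    Hypothetical helper (not an OCTIS function).
    """
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    corpus_file = target / "corpus.tsv"
    with corpus_file.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for doc in documents:
            # Collapse tabs and newlines so each document stays in one cell.
            writer.writerow([" ".join(doc.split())])
    return target


# Possible usage (not executed here, requires OCTIS):
# folder = write_corpus_folder(df['document'], '/myFolderPath/')
# dataset = Dataset()
# dataset.load_custom_dataset_from_folder(str(folder))
```

Passing a temporary directory as the folder would keep the intermediate file out of the project tree.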