load a custom preprocessed dataset Error
srashtchi opened this issue · 4 comments
- OCTIS version: 1.10.4
- Python version: 3.9
- Operating System: MacOS
Description
I am trying to use evaluation metrics from the OCTIS package on my own dataset.
I followed the guide on the main readme page on how to load a custom preprocessed dataset. I even used your sample .tsv file, but I got the following error:
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'
What I Did
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder('/Users/../scratches/corpus.tsv')
Traceback (most recent call last):
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/octis/dataset/dataset.py", line 327, in load_custom_dataset_from_folder
df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
self._engine = self._make_engine(f, self.engine)
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1218, in _make_engine
self.handles = get_handle( # type: ignore[call-overload]
File "/opt/anaconda3/envs/BERTOPIC/lib/python3.9/site-packages/pandas/io/common.py", line 786, in get_handle
handle = open(
NotADirectoryError: [Errno 20] Not a directory: '/Users/shabnam.rashtchi/DEB/topicModeling_project_folder/scratches/metadata.json/corpus.tsv'
Hi!
The function load_custom_dataset_from_folder() requires the folder path, not the path to the corpus file.
Can you check whether it works this way?
dataset.load_custom_dataset_from_folder('/Users/../scratches/')
Silvia
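A small guard can catch this mistake before OCTIS tries to open the file. The sketch below is a hypothetical helper (dataset_folder is not part of OCTIS) and assumes only what the traceback shows: the loader appends "/corpus.tsv" to whatever path it is given.

```python
from pathlib import Path


def dataset_folder(path):
    """Return the folder to pass to load_custom_dataset_from_folder.

    Hypothetical helper, not part of OCTIS. Accepts either the folder itself
    or a path ending in corpus.tsv, and checks that corpus.tsv is present.
    """
    p = Path(path)
    # If the caller passed the file, step up to the folder that contains it.
    folder = p.parent if p.name == "corpus.tsv" else p
    if not (folder / "corpus.tsv").is_file():
        raise FileNotFoundError(f"expected a corpus.tsv inside {folder}")
    return str(folder)
```

The returned string could then be passed to dataset.load_custom_dataset_from_folder(), so that both '/Users/../scratches/' and '/Users/../scratches/corpus.tsv' end up pointing at the same folder.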
Thanks Silvia for getting back to me.
It worked. I didn't know I needed to give my own .tsv file the same name as your sample, "corpus.tsv", and leave the corpus.tsv file name out of the path.
Cheers
Perfect. I'll fix the readme to make it clear.
Thanks,
Silvia
Hi Silvia
I managed to get my code running fine, thanks for your response.
I have another question. I am trying to streamline my code: right now, to create a dataset object I have to save my variable to a .tsv file first and then use the load_custom_dataset_from_folder method to load the data from the .tsv file into an empty dataset object. Without this object, the get_corpus() method obviously cannot be used. See the sample code below.
So the question is: is there a way to pass my variable directly to a dataset object, without saving and loading?
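from pathlib import Path
from octis.dataset.dataset import Dataset
# df is an existing pandas DataFrame with a 'document' column
f = Path('/myFolderPath/corpus.tsv')
# Write one document per line, without index or header
df.to_csv(f, sep="\t", index=False, header=False, columns=['document'])
dataset = Dataset()
# Pass the folder that contains corpus.tsv, not the file itself
dataset.load_custom_dataset_from_folder('/myFolderPath/')
texts = dataset.get_corpus()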
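Whether or not a direct route exists, the save-and-load step can at least be hidden behind a small helper. This is a minimal sketch, assuming only that load_custom_dataset_from_folder reads a tab-separated corpus.tsv from the folder it is given, as the traceback earlier in this thread shows. The name write_corpus_folder is hypothetical and not part of OCTIS.

```python
import csv
import tempfile
from pathlib import Path


def write_corpus_folder(documents, folder=None):
    """Write one document per row to <folder>/corpus.tsv and return the folder.

    Hypothetical helper, not part of OCTIS. If folder is None, a fresh
    temporary directory is created; the caller is responsible for removing it.
    """
    folder = Path(folder) if folder is not None else Path(tempfile.mkdtemp())
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for doc in documents:
            # Collapse tabs and newlines so each document stays on one row.
            writer.writerow([" ".join(str(doc).split())])
    return folder


# Intended use with the code above (requires OCTIS, so left as comments):
# folder = write_corpus_folder(df["document"])
# dataset = Dataset()
# dataset.load_custom_dataset_from_folder(str(folder))
# texts = dataset.get_corpus()
```

The file is still written to disk, but the calling code only deals with a list or column of documents and the folder path the loader expects.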