thomsonreuters/classy_bench

Add function that verifies if the input dataset is valid

As described in the README, we have some requirements for the input dataset if the user decides to use the built-in pipelines:

If you are planning to use any of the included pipelines, you must have a dataset split into 3 files (train.csv, dev.csv and test.csv) that contain train, validation and test sets respectively. Each file must have the following columns:
id: an identifier for each sample, e.g. a document id
text: the input text
labels: the labels list as a string (e.g. "[LabelA, OtherLabel, LabelB]")

It would be great to have a function that helps users verify that their datasets fulfill these requirements.
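As a starting point for discussion, here is a minimal sketch of what such a check could look like. The function name `validate_dataset`, its return value (a list of problem descriptions) and the use of the standard-library `csv` module are assumptions for illustration, not part of the existing classy_bench API. The labels check is deliberately loose: it only confirms that the value is a string wrapped in square brackets, as in the README example.

```python
import csv
from pathlib import Path

# Requirements taken from the README excerpt quoted above.
REQUIRED_FILES = ("train.csv", "dev.csv", "test.csv")
REQUIRED_COLUMNS = ("id", "text", "labels")


def validate_dataset(dataset_dir):
    """Hypothetical helper: return a list of problems found in dataset_dir.

    An empty list means none of the checks below failed.
    """
    errors = []
    for name in REQUIRED_FILES:
        path = Path(dataset_dir) / name
        if not path.is_file():
            errors.append(f"{name}: file not found")
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            header = reader.fieldnames or []
            missing = [col for col in REQUIRED_COLUMNS if col not in header]
            if missing:
                errors.append(f"{name}: missing columns {missing}")
                continue
            # Data rows are counted from 1, excluding the header row.
            for row_number, row in enumerate(reader, start=1):
                labels = (row["labels"] or "").strip()
                if not (labels.startswith("[") and labels.endswith("]")):
                    errors.append(
                        f"{name}, data row {row_number}: "
                        "labels is not a bracketed list string"
                    )
    return errors
```

A caller could run `problems = validate_dataset("path/to/dataset")` before starting a pipeline and stop with a readable message if `problems` is not empty. Returning every problem at once, rather than raising on the first one, lets the user fix all files in a single pass; stricter checks (unique ids, non-empty text, known label names) could be added later if the maintainers want them.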