Define the format of input for all metrics
faneshion opened this issue · 1 comments
Abstract the dataset by defining a new data structure, DataPack. Intuitively, a DataPack consists of five parts: question, answer, gt_answer, contexts, and gt_contexts. For now, we leave search_query and gt_search_query as future work.
Examples:
>>> import pandas as pd
>>> question = [
... ['qid1', 'question 1'],
... ['qid2', 'question 2']
... ]
>>> answer = [
... ['aid1', 'answer 1'],
... ['aid2', 'answer 2']
... ]
>>> question = pd.DataFrame(question)
>>> answer = pd.DataFrame(answer)
>>> # gt_answer, contexts, and gt_contexts are assumed to be built the same way (omitted here)
>>> dp = DataPack(
... question=question,
... answer=answer,
... gt_answer=gt_answer,
... contexts=contexts,
... gt_contexts=gt_contexts,
... )
>>> len(dp)
2
In this way, we can add many in-place functions to process a DataPack. Some basic usage is as follows:
>>> import rageval as rl
>>> data_pack = rl.datasets.toy.load_data()
>>> data_pack.apply_on_question(preprocess_func)
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
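To make the proposal concrete, here is a minimal sketch of how such a DataPack could look. It is not rageval's actual implementation: the class body, the assumption that column 0 holds the id and column 1 the text, and the reading of has_label as "any ground-truth part is present" are all illustrative guesses based on the examples above.

```python
from typing import Callable, Optional

import pandas as pd


class DataPack:
    """Hypothetical sketch of the proposed container (not rageval's real class)."""

    def __init__(
        self,
        question: pd.DataFrame,
        answer: pd.DataFrame,
        gt_answer: Optional[pd.DataFrame] = None,
        contexts: Optional[pd.DataFrame] = None,
        gt_contexts: Optional[pd.DataFrame] = None,
    ) -> None:
        self.question = question
        self.answer = answer
        self.gt_answer = gt_answer
        self.contexts = contexts
        self.gt_contexts = gt_contexts

    def __len__(self) -> int:
        # Assumption: the length of a pack is its number of questions.
        return len(self.question)

    @property
    def has_label(self) -> bool:
        # Assumption: "label" means that some ground-truth part is present.
        return self.gt_answer is not None or self.gt_contexts is not None

    def apply_on_question(self, func: Callable[[str], str]) -> None:
        # Assumption: column 0 is the id and column 1 is the question text.
        self.question[1] = self.question[1].map(func)

    def drop_label(self, inplace: bool = False) -> Optional["DataPack"]:
        if inplace:
            self.gt_answer = None
            self.gt_contexts = None
            return None
        # Otherwise return a new pack without the ground-truth parts.
        return DataPack(
            self.question.copy(), self.answer.copy(), contexts=self.contexts
        )


question = pd.DataFrame([["qid1", "Question 1"], ["qid2", "Question 2"]])
answer = pd.DataFrame([["aid1", "answer 1"], ["aid2", "answer 2"]])
gt_answer = pd.DataFrame([["aid1", "gold 1"], ["aid2", "gold 2"]])

dp = DataPack(question=question, answer=answer, gt_answer=gt_answer)
dp.apply_on_question(str.lower)  # in-place preprocessing of the question text
dp.drop_label(inplace=True)      # remove the ground-truth parts in place
```

Keeping each part as its own DataFrame is what makes the in-place helpers cheap to write, at the cost of maintaining a custom class.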
The Dataset class from the Hugging Face datasets library is a good choice to replace the DataPack.
URL: https://github.com/huggingface/datasets
Currently, we require each metric to implement the API compute(self, dataset: Dataset) -> (score, Dataset). The column names of the dataset object should be in ["questions", "contexts", "gt_contexts", "answers", "gt_answers"]. An example dataset is as follows:
>>> from datasets import Dataset
>>> data = {
...     "questions": ["what is snoopy", "where is beijing"],
...     "contexts": ["snoopy one", "snoopy two"],
...     "gt_contexts": [{"id": 1, "text": "snoopy 1", "label": 1}, {"id": 2, "text": "snoopy 2", "label": 2}],
...     "answers": ["a1", "a2"],
...     "gt_answers": ["a11", "aa2"]
... }
>>> dataset = Dataset.from_dict(data)
>>> len(dataset)
2
>>> dataset
Dataset({
    features: ['questions', 'contexts', 'gt_contexts', 'answers', 'gt_answers'],
    num_rows: 2
})
It is worth noting that each column can be extended to more complicated data structures.