ChatGPT-RetrievalQA

A dataset for training and evaluating Question Answering (QA) retrieval models on ChatGPT responses, with the option of training and evaluating on real human responses.

ChatGPT-RetrievalQA: Can ChatGPT's responses act as training data for Q&A retrieval models?

This repository accompanies the papers "Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts" and "A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts". It provides a dataset for training and evaluating Question Answering (QA) retrieval models on ChatGPT responses, with the option of training and evaluating on real human responses.

If you use this dataset, please cite it using the following BibTeX entries:

@InProceedings{askari2023chatgptcikm2023,
  author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
  title = {A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts},
  year = 2023,
  booktitle = {The 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)},
}

@InProceedings{askari2023genirsigir2023,
  author = {Askari, Arian and Aliannejadi, Mohammad and Kanoulas, Evangelos and Verberne, Suzan},
  title = {Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts},
  year = 2023,
  booktitle = {Generative Information Retrieval workshop at ACM SIGIR 2023},
}

This work was done under the supervision of Prof. Mohammad Aliannejadi, Evangelos Kanoulas, and Suzan Verberne during my research visit to the Information Retrieval Lab at the University of Amsterdam (IRLab@UvA).

Summary of what we did

Given a set of questions and the corresponding ChatGPT and human responses, we build two separate collections: one from ChatGPT responses and one from human responses. This enables several analyses, from an information retrieval perspective, of how useful ChatGPT responses are for training retrieval models. We provide the dataset for both an end-to-end retrieval setup and a re-ranking setup. To keep other analyses flexible, all files are organized separately for ChatGPT and human responses.

Why rely on retrieval when ChatGPT can generate answers?

While ChatGPT is a powerful language model that can produce impressive answers, it is not immune to mistakes or hallucinations. Furthermore, the source of the information generated by ChatGPT is not transparent: usually no source is given, even when the information is correct. This is a bigger concern in domains such as law, medicine, science, and other professional fields where trustworthiness and accountability are critical. Retrieval models, as opposed to generative models, return existing content from actual sources, and search engines show the source of each retrieved item. This is why information retrieval -- even when ChatGPT is available -- remains an important application, especially in situations where reliability is vital.

Answer ranking dataset

This dataset is based on the public HC3 dataset, although our experimental setup and evaluation are different. We split the data into train, validation, and test sets in order to train and evaluate answer retrieval models on ChatGPT or human answers. We store the actual response by a human or ChatGPT as the relevant answer. For training, a set of random responses can be used as non-relevant answers. In our main experiments, we train on ChatGPT responses and evaluate on human responses. We release the ChatGPT-RetrievalQA dataset in a format similar to that of the MS MARCO dataset, a popular dataset for training retrieval models, so existing MS MARCO scripts can be reused on our data.

| Description | Filename | File size | Num Records | Format |
|---|---|---|---|---|
| Collection-H (H: Human Responses) | collection_h.tsv | 38.6 MB | 58,546 | tsv: pid, passage |
| Collection-C (C: ChatGPT Responses) | collection_c.tsv | 26.1 MB | 26,882 | tsv: pid, passage |
| Queries | queries.tsv | 4 MB | 24,322 | tsv: qid, query |
| Qrels-H Train (Train set Qrels for Human Responses) | qrels_h_train.tsv | 724 KB | 40,406 | TREC qrels format |
| Qrels-H Validation (Validation set Qrels for Human Responses) | qrels_h_valid.tsv | 29 KB | 1,460 | TREC qrels format |
| Qrels-H Test (Test set Qrels for Human Responses) | qrels_h_test.tsv | 326 KB | 16,680 | TREC qrels format |
| Qrels-C Train (Train set Qrels for ChatGPT Responses) | qrels_c_train.tsv | 339 KB | 18,452 | TREC qrels format |
| Qrels-C Validation (Validation set Qrels for ChatGPT Responses) | qrels_c_valid.tsv | 13 KB | 672 | TREC qrels format |
| Qrels-C Test (Test set Qrels for ChatGPT Responses) | qrels_c_test.tsv | 152 KB | 7,756 | TREC qrels format |
| Queries, Answers, and Relevance Labels | collectionandqueries.zip | 23.9 MB | 866,504 | |
| Train-H Triples | train_h_triples.tsv | 58.68 GB | 40,641,772 | tsv: query, positive passage, negative passage |
| Validation-H Triple | valid_h_triples.tsv | 2.02 GB | 1,468,526 | tsv: query, positive passage, negative passage |
| Train-H Triples QID PID Format | train_h_qidpidtriples.tsv | 921.7 MB | 40,641,772 | tsv: qid, positive pid, negative pid |
| Validation-H Triples QID PID Format | valid_h_qidpidtriples.tsv | 35.6 MB | 1,468,526 | tsv: qid, positive pid, negative pid |
| Train-C Triples | train_c_triples.tsv | 37.4 GB | 18,473,122 | tsv: query, positive passage, negative passage |
| Validation-C Triple | valid_c_triples.tsv | 1.32 GB | 672,659 | tsv: query, positive passage, negative passage |
| Train-C Triples QID PID Format | train_c_qidpidtriples.tsv | 429.6 MB | 18,473,122 | tsv: qid, positive pid, negative pid |
| Validation-C Triples QID PID Format | valid_c_qidpidtriples.tsv | 16.4 MB | 672,659 | tsv: qid, positive pid, negative pid |
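
As a minimal sketch of how the collection and qrels files could be loaded, the snippet below parses a two-column TSV (pid, passage or qid, query) and a qrels file. It assumes the standard TREC qrels layout of four whitespace-separated fields (qid, iteration, pid, relevance), as used by MS MARCO; the function names `read_id_text_tsv` and `read_qrels` and the toy data are illustrative and not part of this repository.

```python
import io
from collections import defaultdict


def read_id_text_tsv(handle):
    """Read a two-column TSV (id, text) into a dict keyed by id."""
    pairs = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, in case the text itself contains tabs.
        key, text = line.split("\t", 1)
        pairs[key] = text
    return pairs


def read_qrels(handle):
    """Read qrels lines, assumed to be: qid, iteration, pid, relevance."""
    qrels = defaultdict(dict)
    for line in handle:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, pid, relevance = parts
        qrels[qid][pid] = int(relevance)
    return dict(qrels)


# Toy in-memory data standing in for collection_c.tsv and qrels_c_train.tsv.
collection_text = "0\tFirst answer passage.\n1\tSecond answer passage.\n"
qrels_text = "10 0 0 1\n11 0 1 1\n"

passages = read_id_text_tsv(io.StringIO(collection_text))
qrels = read_qrels(io.StringIO(qrels_text))
print(passages["0"], qrels["10"])
```

With the real files, the same functions could be called on handles opened with `open("collection_c.tsv", encoding="utf-8")` and `open("qrels_c_train.tsv", encoding="utf-8")`.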

We release the training and validation data in Triples format to facilitate training. The Triples files for training on ChatGPT responses are "train_c_triples.tsv" and "valid_c_triples.tsv". We also release the triples based on human responses ("train_h_triples.tsv" and "valid_h_triples.tsv") so that training on ChatGPT responses can be compared with training on human responses. For each query and positive answer, 1,000 negative answers were sampled randomly.
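
The random negative sampling described above can be sketched roughly as follows. This is an illustration only, not the script used to build the released files: the function name `sample_triples`, the fixed seed, sampling without replacement, and the choice to exclude a query's own positive answers from the candidate pool are all assumptions.

```python
import random


def sample_triples(qrels, all_pids, num_negatives, seed=0):
    """Build (qid, positive pid, negative pid) triples by random sampling.

    qrels maps qid -> {pid: relevance}. Negatives are drawn without
    replacement from every pid that is not a positive for that query
    (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    triples = []
    for qid, judged in qrels.items():
        positives = [pid for pid, rel in judged.items() if rel > 0]
        positive_set = set(positives)
        candidates = [pid for pid in all_pids if pid not in positive_set]
        # Small collections may hold fewer candidates than requested.
        k = min(num_negatives, len(candidates))
        for positive in positives:
            for negative in rng.sample(candidates, k):
                triples.append((qid, positive, negative))
    return triples


# Toy data; the released triples use 1,000 negatives per positive.
toy_qrels = {"q1": {"p1": 1}, "q2": {"p2": 1, "p3": 1}}
toy_pids = ["p0", "p1", "p2", "p3", "p4", "p5"]
triples = sample_triples(toy_qrels, toy_pids, num_negatives=2)
for qid, positive, negative in triples:
    print(qid, positive, negative, sep="\t")
```

The tab-separated output mirrors the qid, positive pid, negative pid layout of the `*_qidpidtriples.tsv` files.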

Answer re-ranking dataset

| Description | Filename | File size | Num Records |
|---|---|---|---|
| Top-H 1000 Train | top_1000_h_train.run | 646.6 MB | 16,774,122 |
| Top-H 1000 Validation | top_1000_h_valid.run | 23.7 MB | 605,956 |
| Top-H 1000 Test | top_1000_h_test.run | 270.6 MB | 6,920,845 |
| Top-C 1000 Train | top_1000_c_train.run | 646.6 MB | 16,768,032 |
| Top-C 1000 Validation | top_1000_c_valid.run | 23.7 MB | 605,793 |
| Top-C 1000 Test | top_1000_c_test.run | 271.1 MB | 6,917,616 |

The run files of the answer re-ranking dataset are in TREC run format.

Note: We use BM25 in Elasticsearch as the first-stage ranker to retrieve the top 1,000 documents for each question (i.e., query). For some queries, fewer than 1,000 documents are retrieved, which means that fewer than 1,000 documents in the collection share at least one word with the query.
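
A run file in this format could be read along the following lines. The sketch assumes the usual six whitespace-separated TREC run fields (qid, Q0, pid, rank, score, run tag); the function name `read_run`, the toy data, and the `depth` variable are illustrative only. It also shows how to list the queries that received fewer candidates than the requested depth.

```python
import io
from collections import defaultdict


def read_run(handle):
    """Read TREC run lines, assumed to be: qid, Q0, pid, rank, score, tag."""
    run = defaultdict(list)
    for line in handle:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, pid, rank, score, _tag = parts
        run[qid].append((int(rank), pid, float(score)))
    for hits in run.values():
        hits.sort()  # order each query's candidates by rank
    return dict(run)


# Toy in-memory data standing in for a file such as top_1000_h_test.run.
run_text = (
    "q1 Q0 p3 2 7.5 bm25\n"
    "q1 Q0 p1 1 9.0 bm25\n"
    "q2 Q0 p2 1 4.2 bm25\n"
)
run = read_run(io.StringIO(run_text))

depth = 2  # would be 1000 for the released run files
short_queries = [qid for qid, hits in run.items() if len(hits) < depth]
print(short_queries)
```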

Analyzing the effectiveness of BM25 on human/ChatGPT responses

Coming soon.

BERT re-ranking effectiveness on the Qrels-H Test

We train BERT on the responses produced by ChatGPT (using the queries.tsv, collection_c.tsv, train_c_triples.tsv, valid_c_triples.tsv, qrels_c_train.tsv, and qrels_c_valid.tsv files). Next, we evaluate the effectiveness of BERT as an answer re-ranker on human responses (using queries.tsv, collection_h.tsv, top_1000_c_test.run, and qrels_h_test.tsv). By doing so, we answer the following question: "What is the effectiveness of an answer retrieval model that is trained on ChatGPT responses, when we evaluate it on human responses?"
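
To illustrate how a re-ranked list could be scored against the human qrels, here is a small sketch of MRR@k. The choice of metric is an assumption: this README does not state which metrics are reported, and MRR@10 is used here only because it is common for MS MARCO-style data. The function name `mrr_at_k` and the toy data are hypothetical.

```python
def mrr_at_k(ranking_by_qid, qrels, k=10):
    """Mean reciprocal rank of the first relevant pid within the top k.

    ranking_by_qid maps qid -> list of pids ordered best first.
    qrels maps qid -> {pid: relevance}; relevance > 0 counts as relevant.
    """
    if not ranking_by_qid:
        return 0.0
    total = 0.0
    for qid, ranked_pids in ranking_by_qid.items():
        relevant = {pid for pid, rel in qrels.get(qid, {}).items() if rel > 0}
        for position, pid in enumerate(ranked_pids[:k], start=1):
            if pid in relevant:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(ranking_by_qid)


# Toy example: q1 has its relevant answer at rank 2, q2 at rank 1.
toy_ranking = {"q1": ["p2", "p1"], "q2": ["p3", "p4"]}
toy_human_qrels = {"q1": {"p1": 1}, "q2": {"p3": 1}}
score = mrr_at_k(toy_ranking, toy_human_qrels, k=10)
print(score)  # (1/2 + 1/1) / 2 = 0.75
```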

Coming soon.

Collection of responses produced by other Large Language Models (LLMs)

Coming soon

Code for creating the dataset

ChatGPT-RetrievalQA-Dataset-Creator

Dataset source and copyright

Special thanks to the HC3 team for releasing the Human ChatGPT Comparison Corpus (HC3). Our data is built on their dataset and follows its license.