/causal-quest

CausalQuest: Collecting Natural Causal Questions for AI Agents

Primary LanguagePythonApache License 2.0Apache-2.0

CausalQuest: Collecting Natural Causal Questions for AI Agents

Causality Types

Fig 1: Examples of three main types of causal questions in CausalQuest: (a) querying the effect given a cause, (b) querying the cause given the effect, and (c) querying the causal relation of the given variables.

Paper: https://arxiv.org/abs/2405.20318

This repository accompanies our research paper titled "CausalQuest: Collecting Natural Causal Questions for AI Agents" by Roberto Ceraolo*, Dmitrii Kharlapenko*, Amélie Reymond, Rada Mihalcea, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin. The repo contains the code that we used to generate the dataset, the results and the plots that we used in the paper. It also contains the labels for the CausalQuest dataset used in the paper and the code to reproduce the full dataset, which is not directly shared here because one of the sources has an unknown license.

Get the dataset

  • jsonl file size: 9.6MB
  • Version: v1
  • Date: 2024-05-29

This repo contains the labels for the CausalQuest dataset used in the paper "CausalQuest: Collecting Natural Causal Questions for AI Agents", together with the code to reproduce the full dataset. The dataset contains natural questions from different sources, labeled as causal or non-causal. Causal questions are then further labeled following the taxonomy proposed in the paper.

Follow the instructions in the Code Setup section to generate the full dataset.

Licenses

The following are the licences for each of the sources used in the dataset:

We invite all users of CausalQuest to carefully read and respect the licenses of the sources.

Dataset

Data Usage

The data is stored in .jsonl format, with each line representing a single question. Each question has the following fields:

  • query_id: a unique (for the file) identifier for the question
  • source: the source of the question, one of nq, quora, msmarco, wildchat, sharegpt
  • causal_reasoning: a detailed reasoning for why the question is causal, given by Chain-of-thought reasoning
  • is_causal: a boolean indicating whether the question is causal
  • domain_class, action_class, is_subjective: class labels for the question's domain, action, and subjectivity, as proposed in the paper
  • sg_id, msmarco_id, nq_id, quora_id, wc_id: identifiers for the question in the original sources

To obtain the text of the questions, follow the instructions in the Code Setup section.

Dataset Statistics

Here are some basic statistics for the dataset.

Number of questions: 13500, out of which 42% are causal.

Query Types:

Causality Type Percentage Example (Summarized When Needed)
Classification based on the action required
Cause-seeking 36.3% Why did the Roman Empire fall?
Steps-seeking 27.8% How do I start a tech channel on YouTube?
Recommendation-seeking 12.2% What are the best ways to improve my writing skills?
Algorithm-seeking 11.4% How to prepare a Chettinad chicken gravy?
Effect-seeking 8.5% What would happen if Middleware is removed?
Relation-seeking 3.8% Is drinking packed juice safe and hygienic?
Classification based on the domain of knowledge requested to answer (main classes)
Computer Science 25.8% What are some solutions if I forgot my iCloud password?
Society, Economy & Business 18.8% How can I start a side hustle using ChatGPT effectively?
Everyday Life & Personal Choices 16.0% How do I score well in competitive exams?
Health and Medicine 10.9% How do I lose weight?
Classification based on whether the answer requires a subjective judgement
Objective 62.9% Why is perpetual motion not possible?
Subjective 37.1% What is the best way to live a good life?

Repo structure

The following is the structure of the repo:

  • data

    • causalquest_labels.jsonl: JSON Lines file containing labels for CausalQuest
  • src

    • dataset_generation

      • build_causalquest.py: Script to generate the full dataset
      • dataloaders.py: Scripts to load all sources
      • utils.py: Utility functions
    • fine_tuning

      • train_flan_lora.py: Script to fine-tune FLAN model with LoRA
      • train_phi_lora.py: Script to fine-tune PHI model with LoRA
    • labeling_scripts

      • compute_annotator_agreement.py: Script to compute annotator agreement
      • dataset_generation.py: Script used to generate the dataset
      • graph_generation.py: Script to generate the graphs of the paper -nq_sampling.py: Script to sample from Natural Questions
      • run_batchAPI.py: Script to run the OpenAI batch API
      • run_classification.py: Script to run the normal OpenAI API
      • utils.py: Utility functions
    • linguistic_baseline

      • linguistic_baseline.py: Baseline linguistic models and scripts

Code Setup

To use the codes in this repo, first clone this repo:

git clone https://github.com/roberto-ceraolo/causal-quest
cd causal-quest

Then, install the dependencies:

pip install -r requirements.txt

Code Usage

The following are the steps needed to generate the full dataset:

  • Download the Natural Question data ("Simplified train set") from the Natural Questions website. This is done to avoid downloading the full Natural Questions from Huggingface. All the other sources are downloaded on-the-fly.
  • Unzip the file and place it in a directory, which we will refer to as <nq_path>.
  • <causalquest_path> is the path of the labels' file, which is provided in the data directory of this repository.
  • <output_path> is the path where the full dataset will be saved.

The following command will then generate the full dataset:

python src/dataset_generation/build_causalquest.py <nq_path> <causalquest_path> <output_path>