This repository contains samples from the NLR dataset.
The NLR dataset contains natural language representations (NLRs) of database results for questions from multiple domains. It provides a new resource for work on natural language representation and enables users to test components of database (DB) interaction systems end to end.
The data can be found at dataset/NLR_labels.json.
Each sample in dataset/NLR_labels.json contains the following fields:
question_id: ID of the sample question.
db_id: Domain of the database.
NLR: The natural language representation of the db_result.
result_size_complexity: Row count + column count of the db_result.
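As an illustration of the result_size_complexity definition above, here is a minimal sketch of the arithmetic. It assumes a db_result is available as a list of row tuples; the function name and the num_columns parameter are hypothetical and are not part of the dataset or its tooling.

```python
def result_size_complexity(rows, num_columns=None):
    """Row count + column count of a query result.

    Assumption: `rows` is a list of equal-length tuples. For an empty
    result the column count cannot be inferred from the rows, so it can
    be passed explicitly via the hypothetical `num_columns` argument.
    """
    if num_columns is None:
        num_columns = len(rows[0]) if rows else 0
    return len(rows) + num_columns


# A result with one row and one column scores 1 + 1 = 2.
print(result_size_complexity([(13,)]))  # prints 2
```

Under this reading, a result with three rows and two columns would score 5.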
Example
{
  "question_id": 0,
  "db_id": "financial",
  "NLR": "There are 13 accounts who choose issuance after transaction staying in the East Bohemia region.",
  "result_size_complexity": 2
}
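The following sketch shows one way to load the file and check that each sample carries the four fields listed above. It makes two assumptions that this README does not confirm: that dataset/NLR_labels.json holds a JSON array of sample objects, and that the field types match those in the example (integers for question_id and result_size_complexity, strings for db_id and NLR). The helper names load_samples and validate_sample are hypothetical.

```python
import json
from pathlib import Path

# Field names come from the field list above; the types are inferred
# from the single example record and are an assumption.
EXPECTED_FIELDS = {
    "question_id": int,
    "db_id": str,
    "NLR": str,
    "result_size_complexity": int,
}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in sample:
            problems.append(f"missing field: {name}")
        elif not isinstance(sample[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


def load_samples(path="dataset/NLR_labels.json"):
    # Assumption: the file is a JSON array of sample objects.
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Check the example record shown above.
example = {
    "question_id": 0,
    "db_id": "financial",
    "NLR": "There are 13 accounts who choose issuance after transaction "
           "staying in the East Bohemia region.",
    "result_size_complexity": 2,
}
print(validate_sample(example))  # prints []
```

From the repository root, calling load_samples() and passing each returned item to validate_sample would apply the same check to the full file, provided the array layout assumed here holds.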
The data was created between October 2024 and May 2025 through a combination of synthetic generation and manual curation. The research is being published by Oracle, and this data is part of the research being released to the community.
NLR is being shared with the research community to facilitate reproduction of our results and foster further research in this area.
NLR is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.
Please consult the security guide for our responsible security vulnerability disclosure process.
Copyright (c) 2025 Oracle and/or its affiliates.
Released under the Universal Permissive License v1.0 as shown at https://oss.oracle.com/licenses/upl/.