CBench: a benchmarking system for question answering over knowledge graphs.

The Carleton Benchmark Suite (CBench)

CBench is an extensible and more informative benchmarking framework for evaluating question answering systems over knowledge graphs. CBench facilitates this evaluation using a set of popular benchmarks that can be augmented with user-provided benchmarks. CBench not only evaluates a question answering system using popular single-number metrics, but also gives a detailed analysis of the syntactic and linguistic properties of answered and unanswered questions, helping the developers of question answering systems better understand where their system excels and where it struggles.

Features

  • Fine-grained Benchmark Analysis: CBench studies several syntactic and linguistic features of a predefined benchmark or a new benchmark added by the user.
  • Predefined Benchmarks: The current version of CBench supports 17 benchmarks, 12 of which have their corresponding SPARQL queries.
  • Benchmark Analysis: CBench enables you to analyze one of the predefined benchmarks or your own benchmark. The analysis includes shallow and shape analysis of the SPARQL queries as well as natural language analysis.
  • Detailed QA System Evaluation: CBench does not report only a single evaluation number; it computes Macro, Micro, and Global F1 scores (the latter with different thresholds). These scores are defined in the paper; a sketch of the standard macro and micro variants appears after this list.
  • QA System Evaluation Debugging Mode: In this mode, the user can control CBench's output questions based on any of the linguistic, syntactic, or structural features of all the questions and queries in CBench.
  • Fine-grained Evaluation Analysis: CBench identifies the query properties of the questions that are answered correctly and incorrectly.
  • Qualitative Evaluation of Linguistic Features: CBench can find the k linguistically closest questions to a chosen question q (a generic similarity sketch also follows this list).
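
The exact definitions of these scores, including the Global variant and its thresholds, are given in the paper. Purely as an illustration, the following Java sketch computes the standard macro and micro F1 over per-question gold and system answer sets; the class and method names are ours for this example and are not part of CBench.

import java.util.*;

// Illustrative sketch only: CBench's exact metric definitions
// (including the Global score and its thresholds) are in the paper.
public class F1Sketch {

    static double f1(double p, double r) {
        return (p + r == 0) ? 0 : 2 * p * r / (p + r);
    }

    // Macro F1: compute F1 per question, then average over questions.
    static double macroF1(List<Set<String>> gold, List<Set<String>> sys) {
        double sum = 0;
        for (int i = 0; i < gold.size(); i++) {
            Set<String> tp = new HashSet<>(sys.get(i));
            tp.retainAll(gold.get(i)); // true positives for question i
            double p = sys.get(i).isEmpty() ? 0 : (double) tp.size() / sys.get(i).size();
            double r = gold.get(i).isEmpty() ? 0 : (double) tp.size() / gold.get(i).size();
            sum += f1(p, r);
        }
        return sum / gold.size();
    }

    // Micro F1: pool true positives and answer counts across all
    // questions, then compute a single precision/recall pair.
    static double microF1(List<Set<String>> gold, List<Set<String>> sys) {
        long tp = 0, sysTotal = 0, goldTotal = 0;
        for (int i = 0; i < gold.size(); i++) {
            Set<String> inter = new HashSet<>(sys.get(i));
            inter.retainAll(gold.get(i));
            tp += inter.size();
            sysTotal += sys.get(i).size();
            goldTotal += gold.get(i).size();
        }
        double p = sysTotal == 0 ? 0 : (double) tp / sysTotal;
        double r = goldTotal == 0 ? 0 : (double) tp / goldTotal;
        return f1(p, r);
    }
}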

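A similar caveat applies to the qualitative linguistic evaluation: the similarity measure CBench actually uses is defined in the paper. As a generic illustration, finding the k closest questions by cosine similarity over precomputed question vectors could look like the following; the vectors map and all names here are hypothetical.

import java.util.*;
import java.util.stream.Collectors;

// Generic k-nearest lookup by cosine similarity over question vectors.
// Not CBench's linguistic similarity measure; see the paper for that.
public class ClosestQuestions {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the k questions whose vectors are most similar to q's.
    static List<String> kClosest(String q, Map<String, double[]> vectors, int k) {
        double[] qv = vectors.get(q);
        return vectors.keySet().stream()
                .filter(other -> !other.equals(q))
                .sorted(Comparator.comparingDouble(
                        (String other) -> -cosine(qv, vectors.get(other))))
                .limit(k)
                .collect(Collectors.toList());
    }
}
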
Paper: Abdelghny Orogat, Isabelle Liu, and Ahmed El-Roby. CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs. PVLDB, 14(8): 1325-1337, 2021.

Citation

@article{Orogat2021,
  title   = {{CBench}: Towards Better Evaluation of Question Answering Over Knowledge Graphs},
  author  = {Orogat, Abdelghny and Liu, Isabelle and El-Roby, Ahmed},
  journal = {Proceedings of the VLDB Endowment (PVLDB)},
  year    = {2021},
  volume  = {14},
  number  = {8},
  pages   = {1325--1337}
}

We also encourage you to read the extended version of the paper, available on arXiv.

Getting Started

Prerequisites

CBench requires the following development kits and libraries. You can download the libraries together with the system.

  • For Java [JDK 8; the required libraries are provided in the lib folder]
  • For Python [Python 3, NumPy, Pandas, Matplotlib, spaCy, SciPy, Statistics]
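
If the Python libraries are not already installed, they can typically be obtained with pip, for example: pip install numpy pandas matplotlib scipy spacy. Note that spaCy's linguistic analysis also requires a language model; python -m spacy download en_core_web_sm installs a common English model, although the exact model CBench expects may differ.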

Deploy CBench via jar

  • Download CBench.jar: Download the CBench.jar file and the other folders. The project structure must be as follows:
projectFolder
│
├─── CBench.jar
│
├─── data
│    ├─── userDefinied.json
│    ├─── DBpedia
│    │    ├─── No_SPARQL
│    │    └─── SPARQL
│    └─── Freebase
│         ├─── No_SPARQL
│         └─── SPARQL
│
├─── lib
│    └─── ... all .jar files
│
└─── evaluate.py

  • Run CBench.jar: Run the project using the command java -jar "PATH/TO/projectFolder/CBench.jar".
  • Configure the System: While the system is running, it prompts you for a set of parameters; these depend on the selected mode and are described below.
  • Start the System: After you have configured the system, it starts and does the following. (It is preferable to unwrap the text.)
    • Questions Preprocessing: CBench reads the questions from the raw files and removes duplicates. All the questions will be printed.
    • Benchmark Statistics: CBench prints question statistics.

For Benchmark Analysis Mode,

  • Benchmark: The desired benchmark for the analysis must be selected. It can be a predefined benchmark or a user-defined benchmark [see the section "Add a New Benchmark" for how to add your own benchmark].
  • Print Analysis Results: CBench prints the shallow, shape, and linguistic analysis results.

For QA Evaluation Mode [for how to evaluate a QA system, see the section "Evaluate a QA system via http request"],

  • Benchmark: The desired benchmark for the QA evaluation must be selected. It can be a predefined benchmark, a user-defined benchmark, or a Properties Defined benchmark. The Properties Defined option is used for the QA Evaluation Debugging Mode.
    • Properties Defined Benchmark: For this option, CBench asks the user for the required properties and, based on the targeted KG, collects the questions from the benchmarks that target that KG and satisfy the user-defined properties. After the benchmark preparation, CBench does the following.
  • Collecting Correct Answers and System Answers: CBench collects the gold answers, feeds the questions to the QA system, and collects the system's answers. CBench prints the results per question while it is running. (A sketch of what a QA system endpoint might look like follows this list.)
  • Final Report: CBench prints the evaluated questions, then prints them again categorized by query shape. The values are tab-separated so that you can easily paste them into a spreadsheet for your own analysis. After that, it prints the performance scores defined in the paper.
  • Data Visualization: A Python script generates the fine-grained analysis visualization. To use it, you have to set up Python on your machine with these libraries: NumPy, Pandas, Matplotlib, spaCy, SciPy, and Statistics. CBench asks you for the Python 3 setup location; for example, if you use Anaconda on an Ubuntu machine, the path is usually /home/username/anaconda3/bin.
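
As noted above, CBench feeds the questions to the QA system over HTTP; the exact request and response format it expects is described in the section "Evaluate a QA system via http request". Purely to illustrate the idea, the following minimal Java stub exposes a toy QA endpoint; the /answer path, the question parameter name, and the JSON response shape are assumptions made for this sketch, not CBench's actual interface.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Toy stand-in for a QA system endpoint. Path, parameter name, and
// response shape are assumptions; see "Evaluate a QA system via http
// request" for the interface CBench actually expects.
public class ToyQaEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/answer", exchange -> {
            String query = exchange.getRequestURI().getRawQuery(); // e.g. question=Who+wrote...
            String question = "";
            if (query != null && query.startsWith("question=")) {
                question = URLDecoder.decode(query.substring("question=".length()), "UTF-8");
            }
            // A real QA system would run its pipeline here; we return a dummy answer.
            byte[] body = ("{\"answers\": [\"dummy answer for: " + question + "\"]}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Toy QA endpoint listening on http://localhost:8080/answer");
    }
}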

Support

Please report potential bugs on GitHub. If you have a research-related question, please send it to abdelghny.orogat@carleton.ca.