Data, results, benchmarking scripts, and source code for the paper "BANDAR: Benchmarking Snippet Generation Algorithms for Dataset Search".
The large volume of open data on the Web is expected to be reused and create value. Finding the right data to reuse is a non-trivial task addressed by recent dataset search systems, which retrieve datasets that are relevant to a keyword query. An important component of such a system is snippet generation. It extracts data content that best represents a retrieved dataset and explains its relevance to the query. Snippet generation algorithms have emerged but were mainly evaluated by user studies. More efficient and reproducible evaluation methods are needed. To meet the challenge, in this article, we present a set of quality metrics and aggregate them into quality profiles for assessing the usefulness of a snippet from different perspectives in different stages of a typical dataset search process. Furthermore, we create a benchmark from thousands of collected real-world data needs and datasets, on which we apply the presented metrics and profiles to evaluate snippets generated by two existing algorithms and three adapted algorithms. The results, which are reproducible as they are automatically computed without human interaction, show the pros and cons of the tested algorithms and highlight directions for future research. The benchmark data is publicly available.
All queries, datasets, and snippet results generated by the different methods are provided in `data`.
- `dataset_id_dump` contains URL links to the dump files of each dataset. The first column is a local id for the dataset (also used in `query-dataset-pairs`); the second and following columns are links to its dumps. All columns are separated by `'\t'`. Note that one dataset can have more than one dump file.
- `query-dataset-pairs` contains all pairs used in the snippet generation experiments (see the parsing sketch after this list). It has 4 columns separated by `'\t'`: the first column is the local id of the query-dataset pair, corresponding to the files in `result`; the second column is the dataset id of the pair; the third column is the original query text; and the fourth column is the content keywords of the query (which were actually used in snippet generation).
- `result` contains all snippet results. Folder `result/1k/` contains the snippet results of the 1st to 1,000th query-dataset pairs (and so on). `result/ik/x/20/` and `result/ik/x/40/` contain the results for snippet sizes `k = 20` and `k = 40`, respectively. Each snippet is presented as a `.nt` file.
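As a convenience, here is a minimal Java sketch (not part of the released code) that reads `query-dataset-pairs` and prints its four columns; the relative path `data/query-dataset-pairs` is an assumption based on the layout described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadPairs {
    public static void main(String[] args) throws IOException {
        // Assumed relative path to the pair file inside the released data folder
        try (BufferedReader in = Files.newBufferedReader(Paths.get("data/query-dataset-pairs"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // 4 tab-separated columns: pair id, dataset id, query text, content keywords
                String[] cols = line.split("\t", 4);
                if (cols.length < 4) continue; // skip malformed lines
                System.out.printf("pair %s -> dataset %s | query: %s | keywords: %s%n",
                        cols[0], cols[1], cols[2], cols[3]);
            }
        }
    }
}
```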
Requirements:
- JDK 8+
We provide a shell script and an example RDF dataset for generating and evaluating snippets for RDF datasets in `bin`.
To run the benchmark, move `process.sh` and `bandar.jar` to your local directory and execute the following script:

```
./process.sh
```
Then you can follow the instructions in the script to generate or evaluate snippets.
`dataset.nt` is an example RDF dataset in N-Triples format. For example, to generate an IlluSnip snippet for this example dataset, execute the following (you can change the dataset file to any local path):
```
# Preprocess first
./process.sh -p ./dataset.nt
# Generate IlluSnip
./process.sh -g ./dataset.nt -illusnip
```
Note that if you want to generate more than one snippet, the preprocessing step only needs to be done once.
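If you script several generation runs, a hypothetical Java driver like the following could wrap `process.sh`; only the `-illusnip` flag is documented in this README, so any other method flag would need to be checked against the script.

```java
import java.io.IOException;

// Hypothetical driver wrapping process.sh: preprocess once, then generate snippets.
public class GenerateAll {

    static void run(String... cmd) throws IOException, InterruptedException {
        // Forward the script's output to this terminal and wait for it to finish
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        run("./process.sh", "-p", "./dataset.nt");               // preprocess, only once
        run("./process.sh", "-g", "./dataset.nt", "-illusnip");  // generate IlluSnip
    }
}
```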
To evaluate a snippet result using our benchmark, execute the following script (you can also change the file paths):
```
# Evaluate IlluSnip
./process.sh -e ./dataset.nt ./illusnip-record.txt keyword1,keyword2 queryword1,queryword2
```
All source code of the implementation is provided in `code/src`.
Requirements:
- JDK 8+
- MySQL 5.6+
- Apache Lucene 7.5.0
- JGraphT 1.3.0
All required packages (jar files) are provided in `code/lib`.
To reproduce our experiments or adjust detailed settings, open `src` as a Java project.
All source code of the benchmark script is provided in the package `scriptProcess`; the parameters of each snippet generation method can be edited in the corresponding class file in `scriptProcess/snippetAlgorithmFile`.
We also provide implementations of the snippet generation and evaluation methods based on a MySQL database. `code/example.sql` provides an example dataset with database settings.
If you want to run experiments on this example, please follow these steps:
- Import `example.sql` into your local MySQL database; it contains 5 tables.
- Configure the connection information in `util/DBUtil.java` according to your local database settings, namely `uri`, `name`, `user`, and `password` (a hedged connection sketch is shown after this list).
- Run `snippetGenerationTest/PreprocessOfExample.java`. It inserts records into the tables `dataset_info` and `mrrs` and creates the indexes used by all snippet generation methods. By default the indexes are written to the same path as `src`; you can change this (if needed) in `snippetAlgorithm/Parameter.java`. Note that if you need to rerun the preprocessing step, you must first TRUNCATE the tables `dataset_info` and `mrrs` and delete the index folder (`dataset1`).
- To generate snippets with the different methods, all 5 methods are provided as `snippetGenerationTest/xxxTest.java`, namely KSD, IlluSnip, TA+C, Dual-CES, and PrunedDP++. Directly run the corresponding class; the resulting snippet will be printed to the terminal as triples. You can also change the keywords in each `main()` method.
- To get evaluation scores for snippets, run `snippetEvaluation/xxxEvaluation.java` (the snippets need to be generated first, as in the previous step); the corresponding snippet and its metric scores will be printed to the terminal. The evaluation metrics include SkmRep, EntRep, DescRep, LinkRep, KwRel, and QryRel.
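For orientation, here is a minimal sketch of the kind of JDBC settings that the database configuration step above refers to; the constant names and values are assumptions (the actual settings live in `util/DBUtil.java`), and a MySQL JDBC driver (e.g., Connector/J) must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch of the kind of connection settings configured in util/DBUtil.java.
// The constant names and values below are assumptions, not the actual code.
public class DBConfigExample {

    static final String URI = "jdbc:mysql://localhost:3306/";
    static final String NAME = "bandar";   // assumed database (schema) name
    static final String USER = "root";
    static final String PASSWORD = "password";

    public static Connection connect() throws SQLException {
        // Requires a MySQL JDBC driver (e.g., Connector/J) on the classpath
        return DriverManager.getConnection(URI + NAME, USER, PASSWORD);
    }

    public static void main(String[] args) throws SQLException {
        try (Connection c = connect()) {
            System.out.println("Connected to " + NAME);
        }
    }
}
```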
If you have any difficulty or question in using the benchmark or running the code, please email xxwang@smail.nju.edu.cn.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use this code or these results, please kindly cite:
```
@article{BANDAR,
  author  = {Xiaxia Wang and Gong Cheng and Jeff Z. Pan and Evgeny Kharlamov and Yuzhong Qu},
  title   = {BANDAR: Benchmarking Snippet Generation Algorithms for Dataset Search},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year    = {2021},
  doi     = {10.1109/TKDE.2021.3095309}
}
```