"Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?"

SemiBench

SemiBench is a benchmark dataset designed to directly evaluate knowledge extraction quality and to facilitate analysis of how extracted knowledge influences question answering (QA) performance on semi-structured webpages, including both cleaned and original whole webpages.

Whole Webpages

  • URLs: url_whole.json
    • This file contains the mapping from webpage ID to URL on the Internet Archive (https://archive.org/).
  • QA Set: qa_whole.json
    • This file contains QA pairs for the whole webpages. The QA pairs were generated by Llama 3.3 and have been human-audited.
  • Triple annotations: triple_whole.json
    • This file contains manually annotated (subject, predicate, object) triples for the whole webpages.
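
The three files above can be loaded with Python's standard `json` module. The following is a minimal sketch, not an official loader: this README does not document the internal schema of the files, so the code only reports the top-level type and entry count of each one. The helper names (`load_json`, `describe`, `load_whole`) and the assumption that the files sit together in a single directory are illustrative.

```python
import json
from pathlib import Path

# File names as listed in this README; keeping them in one directory is an assumption.
WHOLE_FILES = {
    "urls": "url_whole.json",
    "qa": "qa_whole.json",
    "triples": "triple_whole.json",
}


def load_json(path):
    """Parse one JSON file and return the decoded object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def describe(obj):
    """Return (top-level type name, entry count) without assuming a schema."""
    size = len(obj) if isinstance(obj, (dict, list)) else 1
    return type(obj).__name__, size


def load_whole(data_dir="."):
    """Load whichever of the three whole-webpage files exist under data_dir."""
    loaded = {}
    for name, filename in WHOLE_FILES.items():
        path = Path(data_dir) / filename
        if path.exists():  # skip files that are not present
            loaded[name] = load_json(path)
    return loaded


if __name__ == "__main__":
    for name, obj in load_whole().items():
        kind, size = describe(obj)
        print(f"{name}: {kind} with {size} top-level entries")
```

Inspecting the top-level shape first is a cautious way to confirm how webpage IDs are keyed before joining the URL mapping, QA pairs, and triples.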

Cleaned Webpages

Coming soon.

Citations

@article{sun2025knowledge,
  title={Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?},
  author={Sun, Kai and Huang, Yin and Mehra, Srishti and Kachuee, Mohammad and Chen, Xilun and Tao, Renjie and Lin, Zhaojiang and Jessee, Andrea and Shah, Nirav and Betty, Alex and Liu, Yue and Kumar, Anuj and Yih, Wen-tau and Dong, Xin Luna},
  year={2025}
}

License

The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) and is intended for benchmarking purposes only. The QA Set is made up of outputs of Llama 3.3 and is subject to the Llama 3.3 license (https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE). If you use the QA Set to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name. Third-party content pulled from other locations is subject to its own licenses, and you may have other legal obligations or restrictions that govern your use of that content.

TODO

  • Release triple annotations for cleaned webpages.
  • Release 100 additional whole webpages, bringing the dataset total to 351 whole webpages.