/pdf-text-extraction-benchmark


Primary language: TeX. License: MIT.

A Benchmark & Evaluation for Text Extraction from PDF

This project benchmarks and evaluates existing PDF extraction tools on their ability to semantically extract the body text from PDF documents, especially scientific articles. It provides (1) a benchmark generator, (2) a ready-to-use benchmark, and (3) an extensive evaluation with meaningful evaluation criteria.

The Benchmark Generator

  • constructs high-quality benchmarks from TeX source files.
  • identifies the following 16 logical text blocks: title, author(s), affiliation(s), date, abstract, headings, paragraphs of the body text, formulas, figures, tables, captions, listing-items, footnotes, acknowledgements, references, appendices.
  • serializes desired logical text blocks to plain text, XML or JSON format.
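
To illustrate the last point, here is a minimal Python sketch of selecting some logical text blocks and serializing them to plain text, JSON, or XML. It is not the generator's actual code or output schema: the `Block` class, the role names, and the `<document>`/`<block>` element names are assumptions made for this example only. See benchmark-generator/ for the real formats.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Block:
    """A logical text block (hypothetical structure, for illustration only)."""
    role: str   # e.g. "title", "heading", "paragraph", "formula"
    text: str


def select(blocks, roles):
    """Keep only the blocks whose role is in the desired set, in document order."""
    return [b for b in blocks if b.role in roles]


def to_plain_text(blocks):
    """Serialize blocks as plain text, separated by blank lines."""
    return "\n\n".join(b.text for b in blocks)


def to_json(blocks):
    """Serialize blocks as a JSON array of {role, text} objects (assumed schema)."""
    return json.dumps([{"role": b.role, "text": b.text} for b in blocks], indent=2)


def to_xml(blocks):
    """Serialize blocks as XML with one <block> element per block (assumed schema)."""
    root = ET.Element("document")
    for b in blocks:
        el = ET.SubElement(root, "block", role=b.role)
        el.text = b.text
    return ET.tostring(root, encoding="unicode")


blocks = [
    Block("title", "A Title"),
    Block("formula", "E = mc^2"),
    Block("paragraph", "Body text."),
]
# Keep the title and body paragraphs; drop the formula.
wanted = select(blocks, {"title", "paragraph"})
print(to_plain_text(wanted))
print(to_json(wanted))
print(to_xml(wanted))
```

Filtering by role before serializing keeps the three output formats consistent with each other, since they all receive the same list of blocks.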

For more details and usage, see benchmark-generator/.

The Benchmark

  • consists of 12,099 ground truth files and 12,099 PDF files of scientific articles, randomly selected from arXiv.org. Each ground truth file contains the title, the headings and the body text paragraphs of a particular scientific article.
  • was generated using the benchmark generator described above.
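
Since the benchmark pairs each PDF file with one ground truth file, a quick consistency check before running an evaluation can be useful. The sketch below assumes that a PDF and its ground truth file share the same file name stem and that ground truth files end in `.txt`. Both are assumptions for illustration, not documented facts about the layout of benchmark/, and `pair_files` is a hypothetical helper.

```python
from pathlib import Path


def pair_files(root, pdf_suffix=".pdf", gt_suffix=".txt"):
    """Pair PDF and ground truth files under `root` by file name stem.

    Assumes (hypothetically) that matching files share a stem,
    e.g. "paper.pdf" and "paper.txt". Returns the matched pairs
    and the sorted stems that lack a counterpart.
    """
    root = Path(root)
    pdfs = {p.stem: p for p in root.rglob(f"*{pdf_suffix}")}
    gts = {p.stem: p for p in root.rglob(f"*{gt_suffix}")}
    paired = {s: (pdfs[s], gts[s]) for s in sorted(pdfs.keys() & gts.keys())}
    unpaired = sorted(pdfs.keys() ^ gts.keys())
    return paired, unpaired
```

Under these assumptions, calling `pair_files("benchmark")` on a complete copy should return as many pairs as there are PDF files and an empty list of unpaired stems.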

For more details, see benchmark/.

The Evaluation

For more details, see evaluation/.