scrapinghub/article-extraction-benchmark

Adding more tools to the benchmark?

Opened this issue · 7 comments

adbar commented

Hi,

Thanks for your contribution, it's really useful to see evaluations on real-world data! There are further extraction tools for Python which this repository doesn't feature yet and which could be more efficient than some of the ones you mention. You might have a look at (a short usage sketch follows the list):

  • goose3
  • jusText (especially with a custom configuration)
  • inscriptis (html-to-txt conversion)
  • trafilatura (disclaimer: I'm the author).
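For reference, here is roughly how a couple of these are called; a minimal sketch based on their current public APIs, with the jusText threshold values being purely illustrative rather than recommended settings:

```python
import justext
import trafilatura

# Any article HTML read as text
html = open("page.html", encoding="utf-8").read()

# trafilatura: single call, returns the main text or None
text_trafilatura = trafilatura.extract(html)

# jusText with a custom configuration (threshold values illustrative only)
paragraphs = justext.justext(
    html,
    justext.get_stoplist("English"),
    length_low=70,         # minimum characters before a paragraph counts as "short"
    stopwords_low=0.30,     # minimum stopword density for the "low" class
    max_link_density=0.2,   # discard link-heavy blocks
)
text_justext = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```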

Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.

For more details please refer to the evaluation I've performed. The code including baselines is available here.

lopuhin commented

Hi @adbar, thanks for the pointers to the tools and the evaluation. Another tool which was referenced elsewhere by @saippuakauppias is https://github.com/go-shiori/go-readability. It would be great to add them; we only need to write a script which outputs results in JSON. PRs are welcome, and I hope to have time to add more tools soon as well, as it would be great to have more tools evaluated.
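For anyone wanting to contribute such a script, a rough sketch of what this could look like, assuming the harness expects a JSON file mapping each item's id to its extracted text. The file names, directory layout, and output schema below are hypothetical; please check the existing run scripts in the repository for the real format:

```python
import json
from pathlib import Path

import trafilatura  # or any other extractor with a comparable call

# Hypothetical layout: one HTML file per benchmark item, named <item_id>.html
html_dir = Path("html")
results = {}

for html_path in sorted(html_dir.glob("*.html")):
    html = html_path.read_text(encoding="utf-8")
    text = trafilatura.extract(html) or ""
    # Hypothetical schema: adjust the key names to whatever the benchmark expects
    results[html_path.stem] = {"articleBody": text}

Path("output").mkdir(exist_ok=True)
Path("output/trafilatura.json").write_text(
    json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
)
```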

adbar commented

Thanks for your answer, I've added JSON to trafilatura and will check if I can write a straightforward PR.
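For context, the JSON output can be requested via trafilatura's output_format option, along these lines (the exact fields in the resulting JSON may differ between versions):

```python
import trafilatura

html = trafilatura.fetch_url("https://example.com/some-article")
# Returns a JSON string with the extracted text plus metadata (title, author, date, ...)
result = trafilatura.extract(html, output_format="json")
print(result)
```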

adbar commented

Hi @lopuhin, here is another tool that could be added: Mercury Parser.
(source: adbar/trafilatura#114)

adbar commented

Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue adbar/trafilatura#156.

Another tool to consider is Azure Immersive Reader, used in Microsoft Edge.

Seconded this, but would also like to see:

  1. which ones are better (F1/precision/accuracy/recall) relative to speed, in the same vein as the Squash Benchmark or Matt Mahoney's compression benchmarks, since there will always be a tradeoff between accuracy and speed (see the sketch after this list)
  2. bigger datasets for re-evaluating the benchmark, since a larger diversity of articles from blogs may make for a stronger use case
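On point 1, a very rough sketch of what I have in mind: time each extractor and compute a token-level F1 against a gold text. This is not the benchmark's own metric, and `extract`, `html`, and `gold` here are placeholders:

```python
import time

def token_f1(predicted: str, gold: str) -> float:
    """Crude token-overlap F1 between extracted and gold text, for illustration only."""
    pred, ref = set(predicted.split()), set(gold.split())
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def benchmark(extract, html: str, gold: str) -> tuple[float, float]:
    """Return (F1, seconds) for one extractor on one document."""
    start = time.perf_counter()
    predicted = extract(html) or ""
    elapsed = time.perf_counter() - start
    return token_f1(predicted, gold), elapsed
```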

With the current advances in RAG with LLMs, I think these benchmarks would be paramount for helping to gather information, and the benchmark is really due for an update.
P.S. DragNet has a new fork now: https://github.com/currentslab/extractnet