scrapinghub/article-extraction-benchmark

Adding more tools to the benchmark?

Opened this issue · 7 comments

adbar commented

Hi,

Thanks for your contribution, it's really useful to see evaluations on real-world data! There are further extraction tools for Python which this repository doesn't feature yet and which could be more efficient than some of the ones you mention. You might have a look at (a short usage sketch follows the list):

  • goose3
  • jusText (especially with a custom configuration)
  • inscriptis (html-to-txt conversion)
  • trafilatura (disclaimer: I'm the author).
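For reference, here is roughly how a couple of these are called; a minimal sketch based on their current public APIs, with the jusText threshold values being purely illustrative rather than recommended settings:

```python
import justext
import trafilatura

# Any article HTML read as text
html = open("page.html", encoding="utf-8").read()

# trafilatura: single call, returns the main text or None
text_trafilatura = trafilatura.extract(html)

# jusText with a custom configuration (threshold values illustrative only)
paragraphs = justext.justext(
    html,
    justext.get_stoplist("English"),
    length_low=70,         # minimum characters before a paragraph counts as "short"
    stopwords_low=0.30,     # minimum stopword density for the "low" class
    max_link_density=0.2,   # discard link-heavy blocks
)
text_justext = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```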

Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.

For more details please refer to the evaluation I've performed. The code including baselines is available here.

lopuhin commented

Hi @adbar, thanks for the pointers to the tools and the evaluation. Another tool which was referenced elsewhere by @saippuakauppias is https://github.com/go-shiori/go-readability. It would be great to add them; we only need to write a script which outputs results in JSON. PRs are welcome, and I hope to have time to add more tools soon as well, as it would be great to have more tools evaluated.
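For anyone wanting to contribute such a script, a rough sketch of what this could look like, assuming the harness expects a JSON file mapping each item's id to its extracted text. The file names, directory layout, and output schema below are hypothetical; please check the existing run scripts in the repository for the real format:

```python
import json
from pathlib import Path

import trafilatura  # or any other extractor with a comparable call

# Hypothetical layout: one HTML file per benchmark item, named <item_id>.html
html_dir = Path("html")
results = {}

for html_path in sorted(html_dir.glob("*.html")):
    html = html_path.read_text(encoding="utf-8")
    text = trafilatura.extract(html) or ""
    # Hypothetical schema: adjust the key names to whatever the benchmark expects
    results[html_path.stem] = {"articleBody": text}

Path("output").mkdir(exist_ok=True)
Path("output/trafilatura.json").write_text(
    json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
)
```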

adbar commented

Thanks for your answer, I've added JSON to trafilatura and will check if I can write a straightforward PR.
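For context, the JSON output can be requested via trafilatura's output_format option, along these lines (the exact fields in the resulting JSON may differ between versions):

```python
import trafilatura

html = trafilatura.fetch_url("https://example.com/some-article")
# Returns a JSON string with the extracted text plus metadata (title, author, date, ...)
result = trafilatura.extract(html, output_format="json")
print(result)
```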

adbar commented

Hi @lopuhin, here is another tool that could be added: Mercury Parser.
(source: adbar/trafilatura#114)

adbar commented

Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue adbar/trafilatura#156.

Another tool to consider is Azure Immersive Reader, used in Microsoft Edge.

Seconded this, but would also like to see:

  1. which ones are better (F1/precision/accuracy/recall) relative to speed, in the same vein as the Squash Benchmark or Matt Mahoney's compression benchmarks, since there will always be a tradeoff between accuracy and speed (see the sketch after this list)
  2. bigger datasets for re-evaluating the benchmark, since a larger diversity of articles from blogs may make for a stronger use case
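On point 1, a very rough sketch of what I have in mind: time each extractor and compute a token-level F1 against a gold text. This is not the benchmark's own metric, and `extract`, `html`, and `gold` here are placeholders:

```python
import time

def token_f1(predicted: str, gold: str) -> float:
    """Crude token-overlap F1 between extracted and gold text, for illustration only."""
    pred, ref = set(predicted.split()), set(gold.split())
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def benchmark(extract, html: str, gold: str) -> tuple[float, float]:
    """Return (F1, seconds) for one extractor on one document."""
    start = time.perf_counter()
    predicted = extract(html) or ""
    elapsed = time.perf_counter() - start
    return token_f1(predicted, gold), elapsed
```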

With the current advances in RAG with LLMs, I think these benchmarks would be paramount for helping to gather information, and the benchmark is really due for an update.
P.S. DragNet has a new fork now: https://github.com/currentslab/extractnet