How do various readable website extractor libraries (i.e. libraries that provide a feature like Reader View in Safari) perform?
This repo provides a way to compare many libraries across many pages at once.
Currently the following libraries are implemented (entries marked TODO are not yet added):
- mozilla/readability
- cleanview
- metascraper
- @postlight/mercury-parser
- TODO - clean-mark (377 stars)
- TODO - ascrape-js (13 stars)
The latest output, from running the comparisons on a set of 16 random pages selected from Hacker News in June 2020, is available on the gh-pages branch (direct link to report).
Based on these comparisons, @awendland intends to use the mozilla/readability project.
Make sure to run yarn to ensure all dependencies are installed. Each command should include --help documentation and produce explanatory output during execution.
Create a newline-delimited list of URLs to fetch and store it in a text file such as test_urls.txt.
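For illustration, a newline-delimited URL list could be read along the following lines. This is a minimal sketch and not the code used by fetch-test-pages; the parseUrlList helper, its handling of blank lines, and the example URLs are all assumptions.

```typescript
// Hypothetical helper: not part of this repo's scripts.
// Splits a newline-delimited list into trimmed, non-empty URL strings.
function parseUrlList(contents: string): string[] {
  return contents
    .split(/\r?\n/) // accept both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // skip blank lines
}

// Example contents of a file such as test_urls.txt (placeholder URLs).
const exampleUrlList = "https://example.com/a\n\nhttps://example.org/b\n";
const exampleUrls = parseUrlList(exampleUrlList);
```

Whether the real script tolerates blank lines or surrounding whitespace is not stated here, so keeping the file to exactly one URL per line is the safest choice.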
Use the fetch-test-pages script to retrieve the pages and save them into a folder such as test_pages/ for report processing.
yarn scripts:run ./scripts/fetch-test-pages.ts --listOfUrls test_urls.txt --outDir test_pages/ --parallelism 30
The pages are saved as JSON files containing information such as the source URL and the HTML contents of the page.
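As a rough picture of what one saved file might contain, the sketch below models a page record and checks parsed JSON against it. The field names url and html are guesses made for illustration; the actual keys written by fetch-test-pages may differ, and the files may carry additional fields.

```typescript
// Assumed shape of a saved test page. The field names `url` and `html`
// are illustrative guesses; the real files may use different keys.
type SavedTestPage = {
  url: string;
  html: string;
};

// Narrowing check for parsed JSON of unknown shape.
function isSavedTestPage(value: unknown): value is SavedTestPage {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.url === "string" && typeof record.html === "string";
}

// Placeholder record standing in for the contents of one JSON file.
const parsedPage: unknown = JSON.parse(
  '{"url": "https://example.com/post", "html": "<html><body>Hi</body></html>"}'
);
const pageIsValid = isSavedTestPage(parsedPage);
```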
Once test pages have been retrieved, a report can be generated. The following command generates a report named report.html from test pages saved in test_pages/.
yarn scripts:run ./scripts/generate-report.ts --testPages 'test_pages/*.json' --reportFile report.html
Adding a new library to the comparison involves several steps:
- Add the library (and any associated @types/ package) as a project dependency:
    yarn add LIBRARY_NAME --exact
- Author an adapter for the library in scripts/lib/adapters/adapter-LIBRARY_NAME.ts which conforms to the following type (detailed in scripts/lib/types.ts):
    type Adapter = {
      metadata: AdapterMetadata
      extract(params: ExtractParams): Promise<ExtractedInfo | null>
    }
- Register the adapter in scripts/lib/adapters/index.ts
- Generate a report to make sure that it works
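To make the adapter step more concrete, here is a minimal sketch of an object that satisfies the Adapter type. The shapes given for AdapterMetadata, ExtractParams and ExtractedInfo are placeholders invented for this example; the real definitions live in scripts/lib/types.ts and may look quite different. The regex-based title lookup stands in for a call into an actual extraction library, and treating null as "nothing extracted" is likewise an assumption.

```typescript
// Placeholder shapes: the real definitions live in scripts/lib/types.ts
// and may differ. Every field below is an assumption for illustration.
type AdapterMetadata = { name: string };
type ExtractParams = { url: string; html: string };
type ExtractedInfo = { title: string | null };

// The Adapter type as described in the steps above.
type Adapter = {
  metadata: AdapterMetadata;
  extract(params: ExtractParams): Promise<ExtractedInfo | null>;
};

// Hypothetical adapter: reads the <title> tag with a regex instead of
// calling a real extraction library.
const titleOnlyAdapter: Adapter = {
  metadata: { name: "title-only-example" },
  async extract({ html }) {
    const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
    // Assumption: null signals that nothing could be extracted.
    if (!match) return null;
    return { title: (match[1] ?? "").trim() };
  },
};
```

A real adapter would replace the regex with the library's own parsing call and map its result onto whatever fields ExtractedInfo actually defines.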