Article scraping in Rust

Primary language: Rust. License: Apache License 2.0 (Apache-2.0).

extrablatt

Customizable article scraping & curation library and CLI. Also runs in Wasm.

Basic Wasm example with some CORS limitations: https://mattsse.github.io/extrablatt/

Inspired by newspaper.

HTML scraping is done via select.rs.

Features

  • News URL identification
  • Text extraction
  • Top image extraction
  • All image extraction
  • Keyword extraction
  • Author extraction
  • Publishing date
  • References

Customizable for specific news sites/layouts via the Extractor trait.
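The idea of overriding extraction per site can be sketched with a trait whose methods have default implementations. The sketch below is self-contained and uses only the standard library; `TitleExtractor`, `text_between`, `DefaultExtractor` and `HeadlineExtractor` are hypothetical names invented for illustration and are not extrablatt's actual `Extractor` signature, which is described at https://docs.rs/extrablatt.

```rust
/// Hypothetical stand-in for a customizable extractor.
/// NOT extrablatt's real `Extractor` trait; it only illustrates the pattern.
trait TitleExtractor {
    /// Default strategy: use the text of the first <title> element.
    fn title(&self, html: &str) -> Option<String> {
        text_between(html, "<title>", "</title>")
    }
}

/// Returns the trimmed text between the first `open` and the next `close`,
/// or `None` if either marker is missing or the text is empty.
fn text_between(html: &str, open: &str, close: &str) -> Option<String> {
    let start = html.find(open)? + open.len();
    let end = html[start..].find(close)? + start;
    let text = html[start..end].trim();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Keeps every default unchanged.
struct DefaultExtractor;
impl TitleExtractor for DefaultExtractor {}

/// Extractor for a hypothetical layout that puts the headline in <h1>.
struct HeadlineExtractor;
impl TitleExtractor for HeadlineExtractor {
    fn title(&self, html: &str) -> Option<String> {
        // Prefer the headline, fall back to the default <title> lookup.
        text_between(html, "<h1>", "</h1>")
            .or_else(|| text_between(html, "<title>", "</title>"))
    }
}

fn main() {
    let html = "<html><head><title>Some News | Home</title></head>\
                <body><h1>Real headline</h1></body></html>";
    println!("{:?}", DefaultExtractor.title(html));
    println!("{:?}", HeadlineExtractor.title(html));
}
```

A site-specific type overrides only the methods whose defaults do not suit its layout and inherits the rest, which is what makes per-site customization cheap.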

Documentation

Full documentation: https://docs.rs/extrablatt

Example

Extract all articles from a news outlet.

use extrablatt::Extrablatt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {

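    // Point the builder at the outlet's front page and build the site.
    let site = Extrablatt::builder("https://some-news.com/")?.build().await?;

    // Consume the site as a stream that yields one result per article.
    let mut stream = site.into_stream();

    while let Some(article) = stream.next().await {
        match article {
            Ok(article) => println!("article '{:?}'", article.content.title),
            // Report the failure and continue with the remaining articles.
            Err(err) => println!("{:?}", err),
        }
    }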

    Ok(())
}

Command Line

Install

cargo install extrablatt --features="cli"

Usage

USAGE:
    extrablatt <SUBCOMMAND>

SUBCOMMANDS:
    article     Extract a set of articles
    category    Extract all articles found on the page
    help        Prints this message or the help of the given subcommand(s)
    site        Extract all articles from a news source.

Extract a set of specific articles and store the result as JSON:

extrablatt article "https://www.example.com/article1.html", "https://www.example.com/article2.html" -o "articles.json"

License

Licensed under either of these: