ShawnAlisson/content_gen

ContentGen

A simple and modular scraper that extracts articles from specified websites, summarizes the content, and optionally publishes it to a blog.

Features

Finds article URLs on each configured website using general URL patterns.
Extracts article content using Newspaper3k and BeautifulSoup.
Summarizes content using the T5 model from Hugging Face.
Optionally translates content to a specified language.
Optionally publishes summarized articles to a blog via an API.
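
As a rough illustration of the first feature, the sketch below collects candidate article links from a page's HTML using a generic URL pattern. It is not the project's code: it uses the standard library's html.parser in place of BeautifulSoup so that it runs with no dependencies, and the names LinkCollector, ARTICLE_PATTERN, and extract_article_urls, as well as the pattern itself, are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical pattern: a dated path (/2024/05/...) or a path ending in a
# hyphenated slug such as /news/some-article-title.
ARTICLE_PATTERN = re.compile(r"/\d{4}/\d{2}/|/[a-z0-9]+(?:-[a-z0-9]+){2,}/?$")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_article_urls(html, base_url):
    """Return same-site links whose path looks like an article, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(base_url).netloc
    seen, urls = set(), []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(url)
        if parsed.netloc != site:
            continue  # skip links that leave the site
        if ARTICLE_PATTERN.search(parsed.path) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = """
<a href="/2024/05/new-model-released">Post</a>
<a href="/about">About</a>
<a href="https://other.example/2024/05/elsewhere">External</a>
<a href="/news/how-scrapers-find-articles">News</a>
"""
article_urls = extract_article_urls(sample, "https://blog.example")
```

A pattern this general will produce false positives and misses on some sites, so in practice it would likely need tuning per website.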

Requirements

Python 3.7+
Libraries: requests, beautifulsoup4, newspaper3k, transformers

Installation

Clone the repository:

git clone https://github.com/ShawnAlisson/content_gen.git
cd content_gen

Install the required packages:

pip install requests beautifulsoup4 newspaper3k transformers

Configuration

Edit the config.json file to specify:

The list of websites to scrape.
Whether to publish summaries to the blog (publish).
The blog API endpoint (post_url).
Your API token for authentication (api_token).
Target language for translation (target_language).
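
For orientation, a config.json with these options might be parsed and used roughly as follows. The key names publish, post_url, api_token, and target_language come from the list above; the websites key, the example values, the validation rules, the Bearer authorization scheme, and the JSON payload fields are all assumptions, so check the repository's config.json and your blog's API before relying on them.

```python
import json
import urllib.request

# Example values only. "websites" is an assumed key name for the site list.
EXAMPLE_CONFIG = """
{
  "websites": ["https://example.com/news"],
  "publish": false,
  "post_url": "https://blog.example/api/posts",
  "api_token": "YOUR_API_TOKEN",
  "target_language": "en"
}
"""


def parse_config(text):
    """Parse the config text and fail early if a needed key is missing."""
    config = json.loads(text)
    required = ["websites", "publish"]
    if config.get("publish"):
        # Publishing needs an endpoint and a token (assumed rule).
        required += ["post_url", "api_token"]
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError("missing config keys: " + ", ".join(missing))
    return config


def build_publish_request(config, title, summary):
    """Build (but do not send) a POST request for one summarized article."""
    # The payload field names are hypothetical; a real blog API defines its own.
    payload = json.dumps({"title": title, "content": summary}).encode("utf-8")
    return urllib.request.Request(
        config["post_url"],
        data=payload,
        headers={
            "Authorization": "Bearer " + config["api_token"],  # assumed scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


config = parse_config(EXAMPLE_CONFIG)
request = build_publish_request(config, "Sample title", "Sample summary.")
# Sending is left out of this sketch; urllib.request.urlopen(request) would
# perform the call, and should only run when config["publish"] is true.
```

Keeping the API token in config.json means the file should stay out of version control.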

Usage

Run the scraper using the command:

python scraper.py