A simple and modular scraper that extracts articles from specified websites, summarizes the content, and optionally publishes it to a blog.
- Extracts article URLs using general patterns.
- Extracts article content using Newspaper3k and BeautifulSoup.
- Summarizes content using the T5 model from Hugging Face.
- Optionally translates content to a specified language.
- Publishes summarized articles to a blog via an API.
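
As a rough illustration of the URL-extraction step listed above, here is a minimal sketch of pattern-based link filtering. The README does not show the project's actual patterns, so `ARTICLE_PATTERNS`, `LinkCollector`, and `extract_article_urls` are hypothetical names, and the two regular expressions are only plausible guesses at what "general patterns" might mean. To stay dependency-free, the sketch parses links with the standard library's `html.parser` instead of BeautifulSoup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical patterns: the project's real patterns are not shown in this README.
ARTICLE_PATTERNS = [
    re.compile(r"/\d{4}/\d{2}/"),                      # dated paths such as /2024/05/...
    re.compile(r"/(article|news|blog|post)s?/[^/]+"),  # a section name followed by a slug
]


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in the order it appears."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_article_urls(html, base_url):
    """Return same-site links whose path matches one of ARTICLE_PATTERNS."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.netloc != base_host:
            continue  # skip off-site links
        matches = any(p.search(parsed.path) for p in ARTICLE_PATTERNS)
        if matches and absolute not in urls:  # keep order, drop duplicates
            urls.append(absolute)
    return urls
```

Calling `extract_article_urls(page_html, "https://example.com/")` returns absolute URLs in page order. Restricting results to the base host is a design choice in this sketch, not something the README specifies.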
- Python 3.7+
- Libraries: `requests`, `beautifulsoup4`, `newspaper3k`, `transformers`
- Clone the repository:
    git clone https://github.com/ShawnAlisson/content_gen.git
    cd content_gen
- Install the required packages:
    pip install requests beautifulsoup4 newspaper3k transformers
Edit the `config.json` file to specify:
- The list of websites to scrape.
- Whether to publish (`publish` option).
- The blog API endpoint (`post_url`).
- Your API token for authentication (`api_token`).
- Target language for translation (`target_language`).
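
The options above could be loaded and sanity-checked along the following lines. This is only a sketch under stated assumptions: the README names `publish`, `post_url`, `api_token`, and `target_language`, but not the key that holds the website list, so `websites` is a hypothetical key name. The value types (a boolean for `publish`, a language code for `target_language`) and the helper names `load_config` and `validate_config` are likewise assumptions.

```python
import json

# "websites" is a hypothetical key name; the other keys are named in this README.
REQUIRED_KEYS = ("websites", "publish")
PUBLISH_KEYS = ("post_url", "api_token")  # only needed when publishing is enabled

# Example values are placeholders, not real endpoints or credentials.
EXAMPLE_CONFIG = {
    "websites": ["https://example.com"],
    "publish": False,
    "post_url": "https://blog.example.com/api/posts",
    "api_token": "YOUR_API_TOKEN",
    "target_language": "en",
}


def validate_config(config):
    """Raise ValueError if a needed key is missing; otherwise return config."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if config.get("publish"):
        # Publishing needs a non-empty endpoint and token.
        missing += [key for key in PUBLISH_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config.json is missing: " + ", ".join(missing))
    return config


def load_config(path="config.json"):
    """Read the JSON file at path and validate it."""
    with open(path, encoding="utf-8") as f:
        return validate_config(json.load(f))
```

Checking `post_url` and `api_token` only when `publish` is truthy lets a scrape-and-summarize run work without blog credentials.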
Run the scraper using the command:
python scraper.py
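
When `publish` is enabled, the run ends by posting each summary to `post_url`. The README does not document the blog API, so the sketch below is an assumption throughout: the `Bearer` authorization scheme, the `title` and `content` payload fields, the 30-second timeout, and the function names `build_publish_request` and `publish` are all hypothetical and would need to match whatever the target blog expects.

```python
def build_publish_request(config, title, summary):
    """Build keyword arguments for requests.post; nothing is sent here."""
    return {
        "url": config["post_url"],
        # Assumed auth scheme: the real API may expect a different header.
        "headers": {"Authorization": "Bearer " + config["api_token"]},
        # Assumed payload fields: adjust to the blog API's schema.
        "json": {"title": title, "content": summary},
        "timeout": 30,
    }


def publish(config, title, summary):
    """POST one summarized article and raise on an HTTP error status."""
    import requests  # imported lazily so the builder works without it installed

    response = requests.post(**build_publish_request(config, title, summary))
    response.raise_for_status()
    return response
```

Separating the request builder from the network call makes it possible to inspect exactly what would be sent before pointing the scraper at a live blog.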