This project is a Hacker News scraper that fetches articles, summarizes them, translates the summaries to Argentine Spanish, and generates video scripts based on the content. It uses OpenAI's GPT models for natural language processing tasks and provides a web-based dashboard for monitoring the scraping process.
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd <repository-name>
   ```

2. Install the required dependencies:

   ```bash
   pip install flask flask-socketio requests beautifulsoup4 fake_useragent openai
   ```

3. Set up your OpenAI API key as an environment variable:

   ```bash
   export OPENAI_API_KEY=your_api_key_here
   ```

4. Ensure you have the following project structure:

   ```
   project_root/
   ├── scrapper.py
   ├── templates/
   │   └── index.html
   └── hackernews_data.json  (will be created automatically)
   ```
To run the scraper and start the dashboard:

```bash
python scrapper.py [--dev]
```

Use the `--dev` flag to run with GPT-3.5-turbo instead of GPT-4 for development/testing purposes.

Once running, open a web browser and navigate to `http://localhost:5000` to view the dashboard.
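The model switch behind `--dev` can be sketched roughly as follows. This is illustrative: the helper name `pick_model` and the exact wiring are assumptions, not taken from `scrapper.py`.

```python
import argparse

def pick_model(dev: bool) -> str:
    """Return the chat model name based on the --dev flag.

    The mapping (GPT-4 for normal runs, GPT-3.5-turbo for --dev)
    follows the README; the helper itself is a sketch.
    """
    return "gpt-3.5-turbo" if dev else "gpt-4"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hacker News scraper")
    parser.add_argument("--dev", action="store_true",
                        help="use GPT-3.5-turbo instead of GPT-4")
    return parser.parse_args(argv)

model = pick_model(parse_args(["--dev"]).dev)  # → "gpt-3.5-turbo"
```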
- Hacker News Scraping: Fetches the top stories from Hacker News.
- Content Filtering: Filters stories based on relevance to technology topics.
- Article Summarization: Uses OpenAI's GPT models to summarize article content.
- Translation: Translates summaries to Argentine Spanish.
- Script Generation: Creates video scripts based on the translated summaries.
- Web Dashboard: Provides a real-time view of the scraping process and results.
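One simple way the content-filtering step could work is keyword matching against story titles. This is only a sketch under assumptions: the keyword list and the function name `is_relevant` are illustrative, and the actual script may use a different heuristic entirely.

```python
import re

# Illustrative topic list; the real filter in scrapper.py may differ.
TECH_KEYWORDS = {
    "ai", "llm", "python", "rust", "database",
    "security", "cloud", "gpu", "kernel",
}

def is_relevant(title: str) -> bool:
    """Return True if the story title mentions a technology topic.

    Tokenizes the title so 'trails' does not accidentally match 'ai'.
    """
    tokens = set(re.findall(r"[a-z0-9.+#]+", title.lower()))
    return bool(tokens & TECH_KEYWORDS)

stories = [
    "New LLM beats benchmarks",
    "My favorite hiking trails",
    "Rust 2.0 release notes",
]
relevant = [t for t in stories if is_relevant(t)]
```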
This is the main script: it handles the scraping process and the AI interactions, and runs the web server.
Key functions:

- `scrape_article_content(url)`: Fetches and extracts the content of individual articles.
- `summarize_article(content, title)`: Generates a summary of the article using GPT.
- `translate_to_argentine_spanish(text)`: Translates the summary to Argentine Spanish.
- `create_script(summary, title)`: Generates a video script based on the translated summary.
- `main()`: Orchestrates the entire scraping and processing workflow.
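A GPT-backed step such as `summarize_article` might look roughly like this. The prompt wording is an assumption, and the sketch uses the current `openai` Python client (`OpenAI().chat.completions.create`), which may differ from the API surface the script actually calls.

```python
def build_summary_prompt(content: str, title: str) -> list:
    """Build chat messages for summarization.

    The prompt text is illustrative; the real script's prompts may differ.
    """
    return [
        {"role": "system",
         "content": "You summarize tech articles in a few sentences."},
        {"role": "user",
         "content": f"Title: {title}\n\nArticle:\n{content}\n\nSummarize this article."},
    ]

def summarize_article(content: str, title: str, model: str = "gpt-4") -> str:
    """Send the built prompt to the OpenAI chat API and return the summary.

    Requires OPENAI_API_KEY in the environment.
    """
    # Imported lazily so prompt building stays testable offline.
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=build_summary_prompt(content, title),
    )
    return response.choices[0].message.content
```

The translation and script-generation steps would follow the same pattern with different prompts.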
This file contains the HTML template for the web dashboard.
- The script uses environment variables for configuration. Make sure to set the `OPENAI_API_KEY` environment variable with your OpenAI API key.
- The choice between GPT-4 and GPT-3.5-turbo is made with the `--dev` command-line argument.
- Scraped data is stored in `hackernews_data.json` in the project root directory.
- The script respects rate limits and includes random delays to avoid overloading the Hacker News website.
- CAPTCHA detection is implemented to handle potential anti-scraping measures.
- The script processes a maximum of 5 relevant articles per run to manage API usage and processing time.
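The delay and CAPTCHA-detection behavior described above might look roughly like this. The name `random_delay` matches the README, but the wait bounds and the detection heuristic are assumptions, not the script's actual values.

```python
import random
import time

# Default wait bounds (seconds) between requests; widen these if you
# hit CAPTCHAs. The actual values in scrapper.py may differ.
MIN_DELAY, MAX_DELAY = 2.0, 6.0

def pick_delay(min_s: float = MIN_DELAY, max_s: float = MAX_DELAY) -> float:
    """Choose a random wait duration within the configured bounds."""
    return random.uniform(min_s, max_s)

def random_delay() -> None:
    """Sleep for a random interval to avoid overloading Hacker News."""
    time.sleep(pick_delay())

def looks_like_captcha(html: str) -> bool:
    """Crude CAPTCHA check: look for telltale phrases in the page body."""
    lowered = html.lower()
    return "captcha" in lowered or "verify you are human" in lowered
```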
- If you encounter CAPTCHA or access issues, try adjusting the `random_delay()` function to increase wait times between requests.
- Ensure your OpenAI API key has sufficient credits and permissions for the models being used.
Contributions to improve the scraper, enhance the dashboard, or extend the AI capabilities are welcome. Please submit pull requests or open issues for any bugs or feature requests.
[Specify your license here]