Fetch last HN posts and save them in a database.
- hn_api.py: Hacker News API wrapper
- persistence.py: Redis database wrapper
- polling.py: Check for new posts and add them to the queue
- polling_embedding.py: A one-time script to push embeddings job to the queue
- data_export: Generate a CSV, PARQUET and DuckDB file from the database
- main.py: Run the scheduler
- embeddings.py: Fetch embeddings from OpenAI API and Diffbot API
The project employs RQ to schedule the scraping of HN posts.
Every five minutes, a watcher checks for new posts to fetch and adds jobs to the queue. Additionally, it fetches the first 100 posts every five minutes to ensure the database is up-to-date.
The scraper listens to the queue and fetches posts. It stores them in the database if they are stories, not jobs, polls, etc. It does not save posts without URLs, such as "Ask HN".
When fetched, the URL is scraped by Diffbot to get the article content. This content is then sent to the OpenAI API to get the article embedding. The embedding is then stored in the database, but only if the article isn't already in the database.
The database is a Redis database. It contains 2 sets:
- db0 contains the queue
- db1 contains the posts
Posts are prefixed by hn: