hacker-news-comments

This little project aims to:

  1. Get all comments from the Hacker News website;
  2. Update that base from time to time;
  3. Trigger a signal once some custom event occurs.
    • Currently, the only custom event is the occurrence of specific words.

How to use (assuming sudo access)

1. Install dependencies

Change the current directory to tutorial and install the requirements:

cd tutorial
pip install -r requirements.txt

2. Run the spider

From the same tutorial directory:

# Crawl as many pages as you prefer
scrapy crawl comments

# Download every comment not yet fetched since the last crawled one
# (CLOSESPIDER_PAGECOUNT=0 disables Scrapy's page limit)
scrapy crawl comments -s CLOSESPIDER_PAGECOUNT=0
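
If you prefer to drive the spider from Python instead of the scrapy CLI, a minimal sketch is below (run it from the tutorial directory so the project settings are found; "comments" is the same spider name the CLI uses):

# run_crawl.py -- programmatic alternative to `scrapy crawl comments`
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()          # reads the project's settings.py
settings.set("CLOSESPIDER_PAGECOUNT", 0)   # 0 disables the page limit
process = CrawlerProcess(settings)
process.crawl("comments")
process.start()                            # blocks until the crawl finishes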

3. [Optional] Run MongoDB on Docker

docker-compose -f local-compose.yml up

Alternatively, you can configure your own MongoDB server. Check tutorial/tutorial/settings.py for any environment variables that must be set.
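
For reference, settings.py typically reads this configuration from the environment with a pattern like the sketch below; the variable names used here (MONGO_URI, MONGO_DATABASE) are assumptions, so check the actual file:

# Sketch of reading MongoDB configuration from the environment in settings.py.
# Variable names are illustrative assumptions, not necessarily the repo's.
import os

MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "hacker_news")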

4. [Optional] Run tests

  1. Install dependencies
    pip install pytest==6.0.2
  2. Run tests
    pytest
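
The existing tests cover helper.py only. As an illustration of their shape, a pytest case might look like the sketch below; is_leap_year is a hypothetical function name (suggested only by the leap-year reference at the end of this README), so adapt it to helper.py's real API:

# test_helper.py -- illustrative sketch; is_leap_year is an assumed name.
import pytest
from helper import is_leap_year

@pytest.mark.parametrize("year, expected", [
    (2000, True),   # divisible by 400
    (1900, False),  # divisible by 100 but not by 400
    (2004, True),   # divisible by 4
    (2001, False),  # not divisible by 4
])
def test_is_leap_year(year, expected):
    assert is_leap_year(year) == expected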

How to run on Docker

  1. Install Docker
  2. Install docker-compose
  3. Run the application:
    docker-compose up -d

  NOTE: if the source is modified and you want to update the containerized application, the following steps are required:
        docker-compose down   # stop the application
        docker-compose build  # rebuild the image
        docker-compose up     # start the application again

Limitations (a.k.a. TODO)

  1. Missing tests: automated tests could not be implemented for the main spider.

    • The only automated tests are for the helper.py file.
  2. Cannot access hn_comments_crawler through localhost.

    • The network could not be configured correctly. At the moment it is not possible to access the hn_comments_crawler container directly from localhost.
  3. Currently the comments are crawled taking into account only the id field.

    • It would be very useful to be able to manually specify a lower bound for the IDs, or another criterion such as date. Date was not taken into account for performance reasons, since it would require sending one extra request per crawled comment. A sketch of what an ID lower bound could look like follows this list.
    • It would also be interesting to have an option to crawl only specific comments.
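
As for the lower bound mentioned in item 3, Scrapy spiders accept command-line arguments via -a, so one possible shape is the sketch below. It is illustrative only: it queries the official HN item endpoint directly and is not the project's actual spider.

# Sketch: scrapy crawl comments_range -a min_id=30000000 -a max_id=30000100
import json

import scrapy

class CommentsRangeSpider(scrapy.Spider):
    """Illustrative spider with a manual ID lower bound (hypothetical)."""
    name = "comments_range"

    def __init__(self, min_id=None, max_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.min_id = int(min_id) if min_id else 1
        self.max_id = int(max_id) if max_id else self.min_id + 100

    def start_requests(self):
        # One request per item id, against the official HN API.
        for item_id in range(self.min_id, self.max_id + 1):
            yield scrapy.Request(
                f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json",
                callback=self.parse_item,
            )

    def parse_item(self, response):
        item = json.loads(response.text)
        # The API returns null for missing ids; keep only comments.
        if item and item.get("type") == "comment":
            yield item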

Observations

  1. The "alarm" consists of dumping the ids of comments containing the linux substring in the linux_ids collection.
  2. The database name is defined on the settings.py file.
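
A minimal sketch of how such an alarm can be implemented as a Scrapy item pipeline is shown below. It follows the standard pymongo pattern and is not necessarily the repo's exact code; the class name and the MONGO_* setting names are assumptions:

# Sketch of an alarm pipeline: dump the ids of comments whose text
# contains the "linux" substring into the linux_ids collection.
import pymongo

class LinuxAlarmPipeline:
    def open_spider(self, spider):
        # Setting names are illustrative; see tutorial/tutorial/settings.py.
        uri = spider.settings.get("MONGO_URI", "mongodb://localhost:27017")
        self.client = pymongo.MongoClient(uri)
        self.db = self.client[spider.settings.get("MONGO_DATABASE", "hacker_news")]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Assumes items are dicts with "id" and "text" fields.
        if "linux" in (item.get("text") or ""):
            self.db["linux_ids"].insert_one({"comment_id": item["id"]})
        return item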

References

  1. Hacker News API (GitHub)
  2. Hacker News API docs
  3. DODFMiner (for structuring the tests)
  4. Project template
  5. Project structuring
  6. hackeRnews, an R package for getting data from HN
  7. Leap year
  8. XPath exact
  9. XPath and CSS equivalences cheat sheet
  10. XPath cheat sheet
  11. MongoDB vs SQL
  12. PyMongo docs
  13. Scrapy architecture overview
  14. Scrapy + Docker
  15. Docker docs
  16. docker-cron
  17. scrapyd
  18. scrapyd-client
  19. scrapyd-client installation