hacker-news-comments

This little project aims to:

  1. Get all comments from the Hacker News website;
  2. Update that base from time to time;
  3. Trigger a signal once some custom event occurs.
    • Currently, the only custom event is the occurrence of specific words.

How to use (assuming sudo access)

1. Install dependencies

Change the current directory to tutorial and install the requirements:

cd tutorial
pip install -r requirements.txt

2. Run the spider

From the same tutorial directory:

# Crawl as many pages as you prefer
scrapy crawl comments

# Download every comment not yet fetched since the last crawled one
# (CLOSESPIDER_PAGECOUNT=0 disables Scrapy's page limit)
scrapy crawl comments -s CLOSESPIDER_PAGECOUNT=0
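
If you prefer to drive the spider from Python instead of the scrapy CLI, a minimal sketch is below (run it from the tutorial directory so the project settings are found; "comments" is the same spider name the CLI uses):

# run_crawl.py -- programmatic alternative to `scrapy crawl comments`
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()          # reads the project's settings.py
settings.set("CLOSESPIDER_PAGECOUNT", 0)   # 0 disables the page limit
process = CrawlerProcess(settings)
process.crawl("comments")
process.start()                            # blocks until the crawl finishes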

3. [Optional] Run MongoDB on Docker

docker-compose -f local-compose.yml up

Alternatively, you can configure your own MongoDB server. Check tutorial/tutorial/settings.py for any environment variables that must be set.
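
For reference, settings.py typically reads this configuration from the environment with a pattern like the sketch below; the variable names used here (MONGO_URI, MONGO_DATABASE) are assumptions, so check the actual file:

# Sketch of reading MongoDB configuration from the environment in settings.py.
# Variable names are illustrative assumptions, not necessarily the repo's.
import os

MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "hacker_news")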

4. [Optional] Run tests

  1. Install dependencies
    pip install pytest==6.0.2
  2. Run tests
    pytest
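
The existing tests cover helper.py only. As an illustration of their shape, a pytest case might look like the sketch below; is_leap_year is a hypothetical function name (suggested only by the leap-year reference at the end of this README), so adapt it to helper.py's real API:

# test_helper.py -- illustrative sketch; is_leap_year is an assumed name.
import pytest
from helper import is_leap_year

@pytest.mark.parametrize("year, expected", [
    (2000, True),   # divisible by 400
    (1900, False),  # divisible by 100 but not by 400
    (2004, True),   # divisible by 4
    (2001, False),  # not divisible by 4
])
def test_is_leap_year(year, expected):
    assert is_leap_year(year) == expected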

How to run on Docker

  1. Install Docker
  2. Install docker-compose
  3. Run the application:
    docker-compose up -d

  NOTE: if the source is modified and you want to update the containerized application, the following steps are required:
        docker-compose down   # stop the application
        docker-compose build  # rebuild the image
        docker-compose up     # start the application again

Limitations (a.k.a. TODO)

  1. Missing tests: automated tests could not be implemented for the main spider.

    • The only automated tests are for the helper.py file.
  2. Cannot access hn_comments_crawler through localhost.

    • The network could not be configured correctly. At the moment it is not possible to access the hn_comments_crawler container directly from localhost.
  3. Currently the comments are crawled taking into account only the id field.

    • It would be very useful to be able to manually specify a lower bound for the IDs, or another criterion such as date. Date was not taken into account for performance reasons, since it would require sending one extra request per crawled comment. A sketch of what an ID lower bound could look like follows this list.
    • It would also be interesting to have an option to crawl only specific comments.
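
As for the lower bound mentioned in item 3, Scrapy spiders accept command-line arguments via -a, so one possible shape is the sketch below. It is illustrative only: it queries the official HN item endpoint directly and is not the project's actual spider.

# Sketch: scrapy crawl comments_range -a min_id=30000000 -a max_id=30000100
import json

import scrapy

class CommentsRangeSpider(scrapy.Spider):
    """Illustrative spider with a manual ID lower bound (hypothetical)."""
    name = "comments_range"

    def __init__(self, min_id=None, max_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.min_id = int(min_id) if min_id else 1
        self.max_id = int(max_id) if max_id else self.min_id + 100

    def start_requests(self):
        # One request per item id, against the official HN API.
        for item_id in range(self.min_id, self.max_id + 1):
            yield scrapy.Request(
                f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json",
                callback=self.parse_item,
            )

    def parse_item(self, response):
        item = json.loads(response.text)
        # The API returns null for missing ids; keep only comments.
        if item and item.get("type") == "comment":
            yield item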

Observations

  1. The "alarm" consists of dumping the ids of comments containing the linux substring in the linux_ids collection.
  2. The database name is defined on the settings.py file.
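
A minimal sketch of how such an alarm can be implemented as a Scrapy item pipeline is shown below. It follows the standard pymongo pattern and is not necessarily the repo's exact code; the class name and the MONGO_* setting names are assumptions:

# Sketch of an alarm pipeline: dump the ids of comments whose text
# contains the "linux" substring into the linux_ids collection.
import pymongo

class LinuxAlarmPipeline:
    def open_spider(self, spider):
        # Setting names are illustrative; see tutorial/tutorial/settings.py.
        uri = spider.settings.get("MONGO_URI", "mongodb://localhost:27017")
        self.client = pymongo.MongoClient(uri)
        self.db = self.client[spider.settings.get("MONGO_DATABASE", "hacker_news")]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Assumes items are dicts with "id" and "text" fields.
        if "linux" in (item.get("text") or ""):
            self.db["linux_ids"].insert_one({"comment_id": item["id"]})
        return item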

References

  1. Hacker News API (GitHub)
  2. Hacker News API docs
  3. DODFMiner (for structuring the tests)
  4. Project template
  5. Project structuring
  6. hackeRnews, an R package for getting data from HN
  7. Leap year
  8. XPath exact
  9. XPath and CSS equivalences cheat sheet
  10. XPath cheat sheet
  11. MongoDB vs SQL
  12. PyMongo docs
  13. Scrapy architecture overview
  14. Scrapy + Docker
  15. Docker docs
  16. docker-cron
  17. scrapyd
  18. scrapyd-client
  19. scrapyd-client installation