readme


pip install scrapy

cd webby_scraper/scraper

scrapy crawl craigslist_spider

You will then see the scraped data in webby_scraper/scraper/scraper.db
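
A quick way to confirm the crawl wrote something is to count rows from Python. This is a minimal sketch, not part of the project: it assumes the table is named scraper_craigslist (the name used in the "reset data" step), and count_rows is a hypothetical helper.

```python
import sqlite3


def count_rows(db_path, table="scraper_craigslist"):
    """Return the number of rows in `table` of the SQLite file at `db_path`."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"{}"'.format(table.replace('"', '""'))
        (n,) = conn.execute("SELECT COUNT(*) FROM " + quoted).fetchone()
        return n
    finally:
        conn.close()


# Example (run from webby_scraper/scraper after a crawl):
#   print(count_rows("scraper.db"))
```

If the table does not exist yet, the query raises sqlite3.OperationalError, which usually means the spider has not stored anything.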

setup new ubuntu server

sudo apt update && sudo apt install python3-pip -y && pip3 install scrapy feedparser

start spider

cd /home/ubuntu/webby_scraper/scraper

# start rss spider
nohup python3 start_spider.py > /dev/null &

# start normal spider
nohup python3 start_normal_spider.py > /dev/null &

stop spider

Kill the start_spider.py or start_normal_spider.py process, for example with: pkill -f start_spider.py

change url

~~edit /home/ubuntu/webby_scraper/url.ini~~

reset data

cd /usr/local/src/webby_scraper/scraper

sqlite3 scraper.db

delete from scraper_craigslist;

TODO

Web front-end & Web server

Add comments field ✅
Add filter on the left: all, per keyword, and a last filter, "spam", showing entries that got filtered out ✅
Add save tick-box ✅
Add delete tick-box (mark it for deletion and don't display it anymore) ✅
Add archive tick-box (it will save and hide the entry. compared to save which just saves it) ✅
Change the list to exclude entries marked for deletion ✅
Add notifications capability
Web server back-end should purge anything not marked for save at midnight CDT each day (to keep the database from getting too big) ✅
Add a 2nd page called "archive" ✅
a. This shows entries which are older than the current day
b. It has a search field to search old entries
c. Create a 2nd table or possibly even a 2nd database for this. This is a poor man's way to ensure that the current-day database stays fast.
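
The nightly purge item above could look roughly like this. It is a sketch under assumptions, not the project's code: the column name saved and both helper names are hypothetical, and CDT is modelled as a fixed UTC-5 offset, so it does not follow the winter switch to CST.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset for CDT; this deliberately ignores the change to CST.
CDT = timezone(timedelta(hours=-5), "CDT")


def seconds_until_midnight(now):
    """Seconds from `now` (a timezone-aware datetime) to the next midnight CDT."""
    local = now.astimezone(CDT)
    next_midnight = (local + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - local).total_seconds()


def purge_unsaved(conn, table="scraper_craigslist", saved_column="saved"):
    """Delete rows whose saved flag is 0 and return how many were removed.

    `saved_column` is an assumed column name, not taken from the real schema.
    """
    cur = conn.execute(
        'DELETE FROM "{}" WHERE "{}" = 0'.format(table, saved_column)
    )
    conn.commit()
    return cur.rowcount
```

A back-end loop would sleep for seconds_until_midnight(datetime.now(timezone.utc)) and then call purge_unsaved on an open connection.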

Back-end

Add support for pulling RSS field instead of scraping✅
Convert crawler from script to daemon ✅
Add start/stop scripts for init
Add round robin IP rotation
Add random user-agent rotation ✅
Add random referrer URL ✅
daemon should read 2 ini-format files: a. keywords file b. regions file ✅
keywords file has 2 properties: keyword and interval of how often to scrape. Note: we may want to add a filters field to remove negative keywords
regions file is just a list of regions ✅
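
As an illustration of the two ini files described above, here is one possible layout and a parser built on the standard configparser module. The layout is an assumption, not the project's actual format: one section per keyword with an interval option (taken here to be seconds), and a single [regions] section listing bare region names.

```python
import configparser

# Hypothetical file contents, for illustration only.
KEYWORDS_INI = """\
[corvette]
interval = 60

[z1]
interval = 300
"""

REGIONS_INI = """\
[regions]
stlouis
chicago
"""


def load_keywords(text):
    """Parse keywords ini text into {keyword: interval}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: parser.getint(name, "interval") for name in parser.sections()}


def load_regions(text):
    """Parse regions ini text into a list of region names."""
    # allow_no_value lets each region be a bare key with no "= value" part.
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string(text)
    return parser.options("regions")
```

A filters field for negative keywords, as mentioned in the note, would fit this layout as one more option in each keyword section.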

webby TODO(20190620)

Why is z1 test missing after many hours? https://stlouis.craigslist.org/search/sss?query=z1&excats=20-102-2-39-5-22&sort=rel&postedToday=1✅
We need to refine the search for each keyword, for example exclude cars for everything except corvette
Add notifications
Check why it doesn't display correctly on iOS / the left side navigation is missing ✅
Remove the word "keywords" from the left✅
Make links that have been visited purple✅
Set up 24 instances of the crawler so that it scans 2x per second
Add some ban detection and alert us if crawler is having a problem
Filtering combined with paging may not be working correctly. Check for bugs ✅
Bulk page operation: how to delete a full page of results
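
For the ban detection item above, one simple approach is to watch the share of blocked-looking responses over a sliding window. This is only a sketch: the BanDetector class, the window size, the threshold, and the choice of 403 and 429 as ban signals are all assumptions, not observed behaviour of the site.

```python
from collections import deque

# Assumed ban signals; adjust to what the site actually returns.
BAN_STATUSES = {403, 429}


class BanDetector:
    """Flag a likely ban when too many recent responses look blocked."""

    def __init__(self, window=20, threshold=0.5):
        self.recent = deque(maxlen=window)  # True for each blocked-looking response
        self.threshold = threshold

    def record(self, status):
        """Record one HTTP status code; return True if the crawler looks banned."""
        self.recent.append(status in BAN_STATUSES)
        # Wait for a full window so a single early failure does not raise an alert.
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) >= self.threshold
```

The crawler would call record() with each response status and send an alert the first time it returns True.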

hondajojo/webby_scraper
