pip install scrapy
cd webby_scraper/scraper
scrapy crawl craigslist_spider
the scraped data will then show up in webby_scraper/scraper/scraper.db
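To confirm the crawl worked, you can peek at a few rows in the database (this assumes the table name scraper_craigslist, the same table used further down when clearing the database):
# from inside webby_scraper/scraper
sqlite3 scraper.db
select * from scraper_craigslist order by rowid desc limit 5;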
sudo apt update && sudo apt install python3-pip -y && pip3 install scrapy feedparser
cd /home/ubuntu/webby_scraper/scraper
# start rss spider
nohup python3 start_spider.py > /dev/null &
# start normal spider
nohup python3 start_normal_spider.py > /dev/null &
to stop the crawlers, kill the start_spider.py daemon or the start_normal_spider.py daemon
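For example (assuming nothing else on the box matches these script names):
# find the daemon PIDs
pgrep -af start_spider.py
pgrep -af start_normal_spider.py
# stop them
pkill -f start_spider.py
pkill -f start_normal_spider.py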
edit /home/ubuntu/webby_scraper/url.ini
log in to 104.156.49.114
cd /usr/local/src/webby_scraper/scraper
sqlite3 scraper.db
delete from scraper_craigslist;
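Still inside the sqlite3 shell, you can confirm the table is now empty:
select count(*) from scraper_craigslist;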
- Add comments field ✅
- Add filters on the left: all, one per keyword, and then a last "spam" filter for entries which got filtered out ✅
- Add save tick-box ✅
- Add delete tick-box (mark it for deletion and don't display it anymore) ✅
- Add archive tick-box (it will save and hide the entry, compared to save, which just saves it) ✅
- Change the list to exclude entries marked for deletion ✅
- Add notifications capability
- Web server back-end should purge anything not marked for save at midnight CDT each day (to avoid the database getting too big) ✅
- Add a 2nd page called "archive" ✅
  a. It shows entries older than the current day
  b. It has a search field to search old entries
  c. Create a 2nd table, or possibly even a 2nd database, for this. This is a poor man's way to ensure that the current-day database stays fast.
- Add support for pulling the RSS feed instead of scraping ✅
- Convert crawler from script to daemon ✅
- Add start/stop scripts for init
- Add round robin IP rotation
- Add random user-agent rotation ✅
- Add random referrer URL ✅
- Daemon should read 2 ini-format files: a. keywords file, b. regions file ✅ (a rough sketch of both files appears after this list)
- Keywords file has 2 properties: keyword and interval (how often to scrape). Note: we may want to add a filters field to remove negative keywords
- Regions file is just a list of regions ✅
- Why is the z1 test missing after many hours? https://stlouis.craigslist.org/search/sss?query=z1&excats=20-102-2-39-5-22&sort=rel&postedToday=1 ✅
- We need to refine the search for each keyword, for example exclude cars for everything except corvette
- Add notifications
- Check why the page doesn't display correctly on iOS / the left-side navigation is missing ✅
- Remove the word "keywords" from the left ✅
- Make links that have been visited purple ✅
- Set up 24 instances of the crawler so that it scans 2x per second
- Add some ban detection and alert us if the crawler is having a problem
- Filtering + paging may not be working correctly. Check for bugs ✅
- Bulk page operation: how to delete a full page of results
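A rough sketch of what the two ini files could look like (the file names, section names, and keys here are assumptions for illustration only; the real files may use a different layout):

# keywords.ini (hypothetical layout): one section per keyword, interval in seconds
[corvette]
interval = 300
# a future negative-keyword field could go here, e.g. filters = cars

[z1]
interval = 600

# regions.ini (hypothetical layout): one region per line
[regions]
stlouis
chicago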