staticice-for-cameras

proof-of-concept web scraping for online camera shops


GOAL: be staticice, except for cameras.

Summary of proof-of-concept experiments:

  • msy.py - a fully working scraper for MSY's parts list, complete with example output, and a Functional Requirements document.

    This is like staticice and steamprices -- it remembers and charts price changes over time, so you can observe long-term trends! (Sketched below.)

    FIXME: haven't put the FR up yet.
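
    A minimal sketch of the price-history idea, using sqlite3 from the standard library (the table and column names are illustrative, not the actual schema msy.py uses):

    import sqlite3, time

    db = sqlite3.connect("msy.db")
    # One row per (sku, price, timestamp); charting a trend is then just a
    # SELECT ordered by time.
    db.execute("""CREATE TABLE IF NOT EXISTS price_history
                  (sku TEXT, price REAL, scraped_at INTEGER)""")

    def record_price(sku, price):
        db.execute("INSERT INTO price_history VALUES (?, ?, ?)",
                   (sku, price, int(time.time())))
        db.commit()

    # Long-term trend for one SKU, oldest first:
    trend = db.execute("""SELECT scraped_at, price FROM price_history
                          WHERE sku = ? ORDER BY scraped_at""",
                       ("SKU123",)).fetchall()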

  • test2 - follow the scrapy tutorial, basic poking around scrapy

    cd test2 && scrapy crawl quotes
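
    For reference, the tutorial's spider boils down to roughly this (quotes.toscrape.com is the tutorial's demo site):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            # Each quote block on the page yields one item...
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # ...then follow pagination until it runs out.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)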
    
  • test3 - work out how to make scrapy into a "normal" app. I couldn't quite reduce it to a single file, so

    python3 -m test3      # "run" the directory as-is
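
    The trick is a __main__.py that drives scrapy through its CrawlerProcess API. Something like this sketch (MySpider and the module layout are stand-ins for whatever test3 actually defines):

    # test3/__main__.py
    from scrapy.crawler import CrawlerProcess

    from .spider import MySpider  # hypothetical module layout

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes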
    

    Also worked out how to make scrapy save to a database (badly).
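
    The usual way to do that is an item pipeline, enabled via ITEM_PIPELINES in settings.py. A sketch, with made-up table/column names:

    import sqlite3

    class SqlitePipeline:
        def open_spider(self, spider):
            self.db = sqlite3.connect("test3.db")
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")

        def process_item(self, item, spider):
            self.db.execute("INSERT INTO items VALUES (?, ?)",
                            (item.get("name"), item.get("price")))
            return item

        def close_spider(self, spider):
            self.db.commit()  # committing only at the end is the "badly" part
            self.db.close()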

    Also worked out how to turn the database table into an Excel spreadsheet (xlsx), for showing to regular people.

  • test3-jb - because JB's sitemap has ALL products, not just cameras, I thought I'd ignore it and instead try to read from their user-facing pages like https://www.jbhifi.com.au/collections/cameras

    Big mistake - it's all generated by hairy javascript, so the only ways to do that would be to:

    1. run an entire GUI browser in "headless" / "remote control" mode - requires something like 2GB of RAM and 500MB of disk, and is just really bad.
    2. reverse-engineer shopify's (deliberately confusing) javascript
    3. pretend to be a shopify retailer and dig through their (paywalled?) retailer docs, hoping it gives away something.

    So for now, give up on that, and instead just read EVERY product and throw away the ~98% of them that aren't cameras.

  • test4.py - go back to scraping the "lo-fi" way, with no confusing OO middleware (see the sketch after this list). scrapy is 3 MEGABYTES of code; we should be able to do this in about 0.04 MEGABYTES.

    • Successfully scrape basic metadata from JB products.
    • Add a quick hack to discard all the DVDs and CDs.
    • Add a quick hack to NEVER re-scrape any product.
    • test4.db
    • test4.xlsx
    • test4.csv
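
    The lo-fi shape, sketched from the notes above - standard library only. The sitemap URL, the /products/ pattern, and the disc-filtering keywords are assumptions about JB's site, not verified here:

    import re, sqlite3, urllib.request

    db = sqlite3.connect("test4.db")
    db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    xml = urllib.request.urlopen(
        "https://www.jbhifi.com.au/sitemap.xml").read().decode()
    for url in re.findall(r"<loc>(.*?)</loc>", xml):
        if "/products/" not in url:
            continue
        if re.search(r"dvd|blu-ray|cd", url):   # quick hack: discard discs
            continue
        if db.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone():
            continue                            # quick hack: NEVER re-scrape
        html = urllib.request.urlopen(url).read().decode()
        m = re.search(r"<title>(.*?)</title>", html, re.S)
        print(m.group(1).strip() if m else url)
        db.execute("INSERT INTO seen VALUES (?)", (url,))
        db.commit()
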
  • test5 - have a go at using scrapy's helper code specifically designed to deal with sitemap.xml (SitemapSpider; sketched after this list).

    • Upstream CSV writer (instead of the database hack).
    • Upstream throttling options.
    • Basic scraper for digidirect.com.au.
    • Initial "don't rescrape the same URL repeatedly" code.

    Partial output: test5.csv (~4000 of ~10000 SKUs)
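
    A sketch of the test5 shape - SitemapSpider walks the sitemap for us, FEEDS is the upstream CSV writer, and AutoThrottle is the upstream throttling. The sitemap URL, URL pattern, and CSS selectors below are guesses, not digidirect's real markup:

    from scrapy.spiders import SitemapSpider

    class DigidirectSpider(SitemapSpider):
        name = "digidirect"
        sitemap_urls = ["https://www.digidirect.com.au/sitemap.xml"]
        # Only hand product pages to parse_product; ignore everything else.
        sitemap_rules = [(r"/product", "parse_product")]

        custom_settings = {
            "FEEDS": {"test5.csv": {"format": "csv"}},  # upstream CSV writer
            "AUTOTHROTTLE_ENABLED": True,               # upstream throttling
            "DOWNLOAD_DELAY": 1.0,
        }

        def parse_product(self, response):
            yield {
                "url": response.url,
                "name": response.css("h1::text").get(),
                "price": response.css(".price::text").get(),
            }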

  • sqlite2xlsx.py - since sqlitebrowser is a bit too simple and lobase + JDBC is really tedious, make a bare-bones report generator for non-IT stakeholders.

    python3 sqlite2xlsx.py test4.db -q 'SELECT * FROM SKUs WHERE type = "CAMERAS" ORDER BY make, price DESC'
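
    The core of it is small - roughly this sketch, assuming the openpyxl package (the real script's flags and output naming may differ):

    import argparse, sqlite3
    from openpyxl import Workbook

    parser = argparse.ArgumentParser()
    parser.add_argument("database")
    parser.add_argument("-q", "--query", default="SELECT * FROM SKUs")
    args = parser.parse_args()

    cursor = sqlite3.connect(args.database).execute(args.query)
    wb = Workbook()
    ws = wb.active
    ws.append([col[0] for col in cursor.description])  # header row from the query
    for row in cursor:
        ws.append(list(row))
    wb.save(args.database + ".xlsx")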