Reduce complexity, create simplicity
-
creates product groups with rich metadata from a raw mess of online products
-
this is a medium-size project with around 10k lines of Python 3 as of June 2020
around 200 files, with an average of 50 lines per file
cloud entry points are here;
scripts here are defined as services in setup.py,
and cron jobs are set for them in Scrapinghub
visitor.py visits all websites
refresher.py refreshes the matching and syncs the new matching
Sentry is also initialized in visitor, backup, and refresher. It is connected to my personal account; please replace the API keys, because I will revoke the present ones
autogenerated
frequently used dictionary keys are defined here
when a function needs to read or write some data, it just calls a service for it.
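That service pattern might look like this minimal sketch; the class, method, and SKU names here are made up for illustration, not taken from the real codebase:

```python
# Illustrative service wrapper: callers never touch the database client
# directly, they go through a service object like this one.
class SkuService:
    def __init__(self, store=None):
        # an in-memory dict stands in for the real backing store
        self._store = store if store is not None else {}

    def get(self, sku_id):
        """Read a SKU document by id."""
        return self._store.get(sku_id)

    def save(self, sku_id, doc):
        """Write a SKU document by id."""
        self._store[sku_id] = doc

service = SkuService()
service.save("sku-1", {"brand": "acme"})
```

Functions stay free of storage details, so swapping the backing store only touches the service.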
enriches SKUs with brand, category, color, etc.
experimental, not in production
helper scripts that provide some necessary data to mobile, usually one-off scripts
autogenerated on deploy
reusable generic services
defines data models and exceptions; not an exhaustive spec, though
all collection logic here,
parsers here
for a new module to be registered, it has to be defined in SPIDER_MODULES in spiders/settings.py
on start, Scrapy imports every module defined in SPIDER_MODULES, which executes its top-level code, so be careful: don't leave unguarded functions or statements in the open
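The registration and the guard can be sketched like this; the module path is illustrative, not the project's actual one:

```python
# spiders/settings.py registers spider modules; Scrapy imports every
# module listed here on startup (module path is illustrative):
SPIDER_MODULES = ["spiders.spiders"]

# Because of that import, any unguarded top-level statement in a spider
# file runs on every Scrapy start. Guard ad-hoc experiments like this:
if __name__ == "__main__":
    # only runs when the file is executed directly, never on import
    print("manual experiment")
```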
they are very important; all data goes through a pipeline
they clean the parsed data and sync it to Mongo, Elastic, and Firebase
this clean-and-sync is an involved operation, so be careful
in settings.py the default pipeline is set to MarketPipeline; almost all data goes through it
ITEM_PIPELINES = {"spiders.pipelines.market_pipeline.MarketPipeline": 300}
however, some parsers define a different pipeline in their custom_settings
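Such an override might look like the sketch below; the spider and pipeline names are made up, and the real class would subclass scrapy.Spider (the Scrapy import is left out so the override itself stays the focus):

```python
# Illustrative parser overriding the project-wide default pipeline.
class ShopSpider:  # in production: class ShopSpider(scrapy.Spider)
    name = "shop"
    # replaces the default MarketPipeline for this spider only
    custom_settings = {
        "ITEM_PIPELINES": {
            "spiders.pipelines.shop_pipeline.ShopPipeline": 300,
        }
    }
```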
spider template
calling all spiders one by one would be very tedious
visitor collects all of them and runs them concurrently via run_spiders_concurrently
test method, when you run it, pipelines are not activated
gets raw docs and turns them into groups of SKUs and products
writes the created groups to elasticsearch and firestore
tests and one-off scratch scripts
static analysis tool, connected to github
for Travis CI; it is not active
package management
Scrapinghub needs this instead of pip
congrats, that's all :)
in terminal, from the top level directory,
run shub deploy
deployment configs are in
- setup.py
- scrapy.cfg
- scrapinghub.yml
- and spiders/settings.py
system summary
-
crawlers collect raw docs and save them to Mongo
-
supermatch creates groups
- docs as nodes of a graph
- edges from barcodes, names, and promoted links
- connected components are groups of docs
- reduce a doc group to a single SKU
- group SKUs using links and names
- reduce SKUs into products
a key library here is networkx; it's used to create and update the graph
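The grouping steps above can be sketched with a toy connected-components pass; the real code uses networkx for this, and the doc ids and edge sources below are invented for illustration:

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into components; each component becomes one doc group."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # iterative DFS over the component
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        groups.append(comp)
    return groups

# docs as nodes; edges come from shared barcodes, names, promoted links
docs = ["doc1", "doc2", "doc3", "doc4"]
edges = [("doc1", "doc2"),  # e.g. same barcode
         ("doc2", "doc3")]  # e.g. matching name
groups = connected_components(docs, edges)
# two groups: {doc1, doc2, doc3} and {doc4}
```

Each resulting group is then reduced to a single SKU, and the same trick is repeated one level up to group SKUs into products.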
-
assigns a category, brand, type, size, color, and other identifiers to SKUs using names and info in the raw docs
-
saves SKUs to Elastic and Firestore
-
instant update -> the next time a raw doc is crawled, write the fresh price to the related Elastic SKU
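The instant-update idea can be sketched like this; the dict stands in for the Elasticsearch index, and all ids and field names are illustrative:

```python
# Toy stand-in for the Elasticsearch index, keyed by SKU id:
sku_index = {"sku-42": {"name": "red shoe", "price": 19.9}}

def instant_update(raw_doc, doc_to_sku):
    """On re-crawl, push the fresh price straight to the matched SKU."""
    sku_id = doc_to_sku.get(raw_doc["doc_id"])
    if sku_id is None:
        return  # doc not matched to a SKU yet; the full pipeline handles it
    sku_index[sku_id]["price"] = raw_doc["price"]

# a re-crawled doc carrying a fresh price for an already-matched SKU
instant_update({"doc_id": "doc-7", "price": 17.5}, {"doc-7": "sku-42"})
```

The point is that a price refresh skips the full group-and-reduce cycle and touches only the one SKU it maps to.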
result: a clean set of products with rich metadata
SKUs can be regenerated confidently at any time from the raw docs, thus we only back up the raw docs
use pylint, mypy, black regularly
write automated tests for critical parts
any of them in your code?
- long classes
- long functions
- hardcoded variables
- too many args
- flags
- errors not handled properly
- copy-paste
- mixed styles
- dead code
- commented out lines
- no tests
- functions are hard to test
- bad names
- unnecessary complexity
- functions doing more than one thing
you may take a similarity hash of images and compare them to match items
https://github.com/JohannesBuchner/imagehash
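imagehash provides phash and average_hash over PIL images; the core idea, hash each image and compare by Hamming distance, can be sketched in pure Python on tiny grayscale grids (the pixel values below are invented):

```python
def average_hash(pixels):
    """One bit per pixel: 1 if above mean brightness (imagehash's average_hash idea)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    """Number of differing bits; small distance means similar images."""
    return sum(a != b for a, b in zip(h1, h2))

img_a = [[10, 200], [220, 30]]
img_b = [[12, 190], [215, 35]]   # near-duplicate of img_a
img_c = [[200, 10], [20, 210]]   # different image

ha, hb, hc = map(average_hash, (img_a, img_b, img_c))
print(hamming(ha, hb))  # 0 -> likely the same product image
print(hamming(ha, hc))  # 4 -> different
```

A matcher would threshold the distance (e.g. treat distance 0-2 as a match); the library does the same comparison via subtracting two hash objects.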
gensim is great
https://radimrehurek.com/gensim/models/ldamodel.html
a few bits of advice :)
-
think about data structures and their relationships
-
think ahead
- set up backups
- have a plan B
- make it easy to recover
-
solve the right problem
-
If it is hard to explain, it's a bad idea.
-
define what is most important and focus on this
ride now, ride and fear no darkness..