Yinyo: A wonderfully simple API-driven service to reliably execute many long-running scrapers in a super scalable way
- Easily run as many scrapers as you like across a cluster of machines without having to sweat the details. Powered by Kubernetes.
- Use the language and libraries you love for writing scrapers. Supports Python, JavaScript, Ruby, PHP and Perl via Heroku Buildpacks.
- Supports many different use cases through a simple, yet flexible API that can operate synchronously or asynchronously.
- Made specifically for developers of scraper systems, be they open source or commercial. No chance of vendor lock-in because it's open source, Apache licensed.
This README is focused on getting developers of the core system up and running. It does not yet include a guide for people who are just interested in being users of the API.
- nb: Helm is still releasing updates to 2.x; be sure to install the latest 3.x, not just the latest release
- Ubuntu - use make ppa or read instructions
- macOS package installer
Yinyo's web interface needs to be accessible on http://localhost:8080/. If you have something already listening on this port, you won't get any errors, but you won't be able to connect to Yinyo to start a scraper. You'll need to clear that port.
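Because a port clash produces no error, it can help to check the port before starting anything. The following is a minimal sketch, not part of Yinyo itself: the function name port_in_use is hypothetical, and it only tests whether something accepts TCP connections on the given port.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


# Example: call port_in_use(8080) before starting skaffold.
# True at that point means another program is holding the port.
```

Run the check before starting skaffold. Once Yinyo is deployed, the same check is expected to return True because Yinyo itself is then listening on the port.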
First, follow the links to install the main dependencies
Start Minikube if you haven't already
make minikube
Let helm know where to find some of the development dependencies
helm repo add stable https://kubernetes-charts.storage.googleapis.com
helm repo add bitnami https://charts.bitnami.com/bitnami
Run skaffold. This will build all the bits and pieces and deploy them to your local Kubernetes for you. The first time it builds everything it takes a few minutes. After that, when you make any changes to the code, it rebuilds and redeploys much faster.
make skaffold
Leave skaffold running and open a new terminal window.
Now compile and install the binary that allows you to run a scraper into your GOPATH
make install
Now you're ready to run your first scraper. The first time you run this it will take a little while.
yinyo test/scrapers/test-python --output data.sqlite
Now, if you run the same scraper again it should run significantly faster.
yinyo test/scrapers/test-python --output data.sqlite
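To confirm that a run produced data, you can inspect the output file. The sketch below assumes that data.sqlite is an ordinary SQLite database, as its name suggests. It does not assume any table names, since those depend on the scraper. The function name table_row_counts is hypothetical.

```python
import sqlite3


def table_row_counts(path):
    """Return a dict mapping each user table in the database to its row count."""
    conn = sqlite3.connect(path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Table names come from the database itself, so quote them as identifiers
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


# Example: print(table_row_counts("data.sqlite"))
```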
There are some extra dependencies required for building the website and associated API documentation.
- Hugo v0.60.0 or later - a static website generator
- Shins - a Node.js Slate markdown renderer
- Widdershins - Converts OpenAPI definitions to Slate. Make sure you're using a version which includes a fix for rendering callbacks https://github.com/Mermade/widdershins/commit/5d7223f070e8d295e29a3390c3d42b4798748c55. As of December 2019 this is likely to be on master and not in one of the released versions.
Do this after you've installed the dependencies (above):
make website
Then point your web browser at http://localhost:1313.
The project currently depends on a custom version of the herokuish Docker image, mlandauer/herokuish:for-morph-ng, which is built from the GitHub repo mlandauer/herokuish and pushed to Docker Hub manually.
There is an open pull request to try to get the bug fix in our modified version merged upstream.
If this PR doesn't get merged we could use a workaround used by Dokku.
From the top level directory:
make test
Point your web browser at http://localhost:9000. Log in with the credentials in the file configs/secrets-minio.env.
make dashboard
You'll want to look in the "default" and "yinyo-runs" namespaces.
> kubectl exec -it redis-0 sh
/data # redis-cli
127.0.0.1:6379> auth changeme123
OK
127.0.0.1:6379> ping
PONG
Use webhook.site to see calls to a specific URL in real time. Very handy. You can run the test scraper and get the events directed to webhook.site. For example:
yinyo test/scrapers/test-python --output data.sqlite --callback https://webhook.site/#!/uuid-specific-to-you
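If you would rather not send events to a third-party site, a small receiver of your own can stand in for webhook.site. This is only a sketch under stated assumptions: it assumes events arrive as HTTP POST requests with a Content-Length header, and the names CallbackHandler, make_server and RECEIVED are hypothetical. The address you pass to --callback must also be reachable from inside the Minikube cluster, which this sketch does not arrange for you.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # body of every POST request seen so far


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the sender
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        RECEIVED.append(body)
        print(body)
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request access log


def make_server(port=8000, host="127.0.0.1"):
    """Create (but do not start) an HTTP server that records POST bodies."""
    return HTTPServer((host, port), CallbackHandler)
```

To start the receiver, call make_server().serve_forever() and stop it with Ctrl-C. Each event body is printed as it arrives.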
Sometimes, after a while of testing and debugging, the Minikube VM runs out of disk space. You'll see this either as Kubernetes refusing to run anything because the node is "tainted" or as Minio refusing to do anything because it doesn't have enough space. Luckily, there is an easy way to clear space.
minikube ssh
docker system prune
We're using GitHub Actions to run the tests (make test), do some linting, measure coverage and build binaries of the yinyo client automatically on every push. Release binaries are also built automatically whenever a release is made on GitHub.