Yinyo: A wonderfully simple API-driven service to reliably execute many long-running scrapers in a super scalable way

  • Easily run as many scrapers as you like across a cluster of machines without having to sweat the details. Powered by Kubernetes.
  • Use the language and libraries you love for writing scrapers. Supports Python, JavaScript, Ruby, PHP and Perl via Heroku Buildpacks.
  • Supports many different use cases through a simple, yet flexible API that can operate synchronously or asynchronously.
  • Made specifically for developers of scraper systems, whether open source or commercial. No chance of vendor lock-in because it's open source and Apache licensed.

Who is this README for?

This README is focused on getting developers of the core system up and running. It does not yet include a guide for people who are just interested in being users of the API.

Development: Guide to getting up and running quickly

Main dependencies

The main bit

First, follow the links to install the main dependencies
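One way to confirm the tools are installed before continuing is a small shell check like the sketch below. The tool list is an assumption drawn from the commands used elsewhere in this README (make, go, minikube, helm, skaffold, kubectl), not an authoritative dependency list, and the helper name missing_tools is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of yinyo): print the name of every listed
# tool that cannot be found on PATH, one per line. Prints nothing if all
# tools are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (tool names assumed from the commands in this README):
#   missing_tools make go minikube helm skaffold kubectl
```

An empty result suggests everything on the list is available; any name printed still needs installing.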

Start Minikube if you haven't already

make minikube

Let helm know where to find some of the development dependencies

helm repo add stable https://charts.helm.sh/stable
helm repo add bitnami https://charts.bitnami.com/bitnami

Run skaffold. This will build all the bits and pieces and deploy things to your local Kubernetes for you. The first time it builds everything, it takes a few minutes. After that, when you make any changes to the code, it rebuilds and redeploys much faster.

make skaffold

Leave skaffold running and open a new terminal window.

Now compile the yinyo client binary, which lets you run a scraper, and install it into your GOPATH

make install

Now you're ready to run your first scraper. The first time you run this it will take a little while.

yinyo test/scrapers/test-python --output data.sqlite

Now, if you run the same scraper again it should run significantly faster.

yinyo test/scrapers/test-python --output data.sqlite

Getting the website running locally

Dependencies

There are some extra dependencies required for building the website and associated API documentation.

Running a local development server for the website

Do this after you've installed the dependencies (above):

make website

Then point your web browser at http://localhost:1313.

The custom herokuish docker image

The project currently depends on a custom version of the herokuish Docker image, mlandauer/herokuish:for-morph-ng, which is built from the GitHub repo mlandauer/herokuish and pushed to Docker Hub manually.

There is an open pull request to try to get the bug fix in our modified version merged upstream.

If this PR doesn't get merged we could use a workaround used by Dokku.

Notes for debugging and testing

To run the tests

From the top level directory:

make test

To see what's on the blob storage (Minio)

Point your web browser at http://localhost:9000. Log in with the credentials in the file configs/secrets-minio.env.

To see what Kubernetes is doing

make dashboard

You'll want to look in the "default" and "yinyo-runs" namespaces.
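If you prefer the command line to the dashboard, the same two namespaces can be inspected with kubectl. This is a minimal sketch: the function name list_yinyo_pods is hypothetical, and it assumes kubectl is already pointed at your minikube cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of yinyo): list the pods in the two
# namespaces mentioned above, with a header line per namespace.
list_yinyo_pods() {
  for ns in default yinyo-runs; do
    echo "== namespace: $ns =="
    kubectl get pods --namespace "$ns"
  done
}

# Usage:
#   list_yinyo_pods
```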

Accessing Redis

> kubectl exec -it redis-0 -- sh
/data # redis-cli
127.0.0.1:6379> auth changeme123
OK
127.0.0.1:6379> ping
PONG
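For quick one-off checks, the interactive session above can be collapsed into a single non-interactive command. The sketch below assumes the same pod name (redis-0) and development password (changeme123) as the session above. The wrapper name redis_cmd is hypothetical. Note that redis-cli may print a warning on stderr when the password is passed with -a.

```shell
#!/bin/sh
# Hypothetical helper (not part of yinyo): run a single Redis command inside
# the redis-0 pod without opening an interactive shell.
# The pod name and password are the development values shown above.
redis_cmd() {
  kubectl exec redis-0 -- redis-cli -a changeme123 "$@"
}

# Usage:
#   redis_cmd ping
#   redis_cmd dbsize
```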

Testing callback URLs

Use webhook.site to see calls to a specific URL in real time. Very handy. You can run the test scraper and get the events directed to webhook.site. For example:

yinyo test/scrapers/test-python --output data.sqlite --callback https://webhook.site/uuid-specific-to-you

Reclaiming disk space in minikube

Sometimes, after a while of testing and debugging, the minikube VM runs out of disk space. You'll see this either as Kubernetes refusing to run anything because the node is "tainted" or as Minio refusing to do anything because it doesn't have enough space. Luckily there is an easy way to clear space.

minikube ssh
docker system prune
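The same cleanup can be run from the host without an interactive session by passing the command to minikube ssh. This is a sketch rather than a project-provided script: the function name minikube_reclaim is hypothetical, and it assumes, as the commands above do, that the docker CLI is available inside the minikube VM.

```shell
#!/bin/sh
# Hypothetical helper (not part of yinyo): report Docker disk usage inside
# the minikube VM, then prune unused data.
minikube_reclaim() {
  # Show what is currently using space (images, containers, volumes, cache).
  minikube ssh -- docker system df
  # -f skips the confirmation prompt that docker system prune normally shows.
  minikube ssh -- docker system prune -f
}

# Usage:
#   minikube_reclaim
```

Because -f removes the confirmation prompt, it is worth looking at the usage report on a first run before relying on the function.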

Continuous integration

We're using GitHub Actions to run the tests (make test), do some linting, measure coverage and build binaries of the yinyo client automatically on every push. Release binaries are also built automatically whenever a release is made in GitHub.