gh-action-data-scraping

This shows how to use GitHub Actions to do periodic data scraping.

Primary language: JavaScript. License: MIT.

This repo shows how to use GitHub Actions to do automated data scraping, with storage in git itself! Free git storage and scheduled updates!!!

2021 Update

You can read more in the Blog Writeup.

As of May 2021, Flat Data scraping is officially supported by GitHub. Check it out.

Basic Idea

The workflow file looks like:

# /.github/workflows/daily.yml
on:
  schedule:
    - cron: '0 8 * * *' # every day at 08:00 UTC (GitHub schedules run in UTC)
name: Pull Data and Build
jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@master
    - name: Build
      run: npm install
    - name: Scrape
      run: npm run action 
      # env:
      #   WHATEVER_TOKEN: ${{ secrets.YOU_WANT }}
    - uses: mikeal/publish-to-github-action@master
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub sets this for you
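
The workflow above only says `npm run action`; what that script does is up to you. As a rough sketch, assuming `package.json` maps the `action` script to `node action.js`, the scraper could fetch a page and write a dated JSON file into the repo so the publish step has something to commit. The file name `action.js`, the `data/` folder, the helper names, and the `SCRAPE_URL` environment variable are all made up for illustration and are not taken from this repo. It relies on the global `fetch` in Node 18 or later.

```javascript
// action.js: hypothetical entry point for `npm run action` (all names are illustrative)
const fs = require('fs');
const path = require('path');

// Build a dated file path such as data/2021-05-01.json.
function snapshotPath(dir, now) {
  const day = now.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
  return path.join(dir, `${day}.json`);
}

// Write the scraped payload to disk inside the checked-out repo.
function saveSnapshot(dir, now, payload) {
  fs.mkdirSync(dir, { recursive: true });
  const file = snapshotPath(dir, now);
  fs.writeFileSync(file, JSON.stringify(payload, null, 2) + '\n');
  return file;
}

// Fetch the target URL and store the raw response body with a timestamp.
async function main() {
  const url = process.env.SCRAPE_URL; // assumed env var, set it in the workflow step
  if (!url) {
    console.log('SCRAPE_URL is not set; nothing to scrape.');
    return;
  }
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  const now = new Date();
  const file = saveSnapshot('data', now, {
    fetchedAt: now.toISOString(),
    body: await res.text(),
  });
  console.log(`Wrote ${file}`);
}

main().catch((err) => {
  console.error(err);
  process.exitCode = 1; // make the workflow step fail visibly
});
```

One file per day keeps each run's diff small and lets git history double as the dataset's history. If you would rather track a single file over time, overwrite one fixed path instead.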

How it should look

For people new to GitHub Actions, here is how the Actions tab of this very repo looks, as a reference point:

[image: the Actions tab of this repo]

Limits

You can do whatever you like with this, including taking screenshots of sites!

The limits I can think of are the limits of GitHub and GitHub Actions. Quoting GitHub's own terms:

In addition to these limits, GitHub Actions should not be used for:

  • Content or activity that is illegal or otherwise prohibited by their Terms of Service or Community Guidelines.
  • Cryptomining
  • Serverless computing
  • Activity that compromises GitHub users or GitHub services.
  • Any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used. In other words, be cool, don’t use GitHub Actions in ways you know you shouldn’t.

Be a good citizen: don't abuse it and F this up for the rest of us!

This is heavily based on