First GitHub Action
Contents:
What are we doing today?
- Understanding GitHub Actions
- Getting familiar with the syntax used to build a GitHub Action
- Writing a workflow to download earthquake data published by USGS, which is updated every minute
- Going over a few use cases
What are GitHub Actions?
- a set of instructions (aka a workflow)
- that run on a virtual computer (aka a runner)
- triggered by an event, such as a schedule, a pull request or push, or manually by pushing a button
- that can commit and push files back to the repository
Prerequisites
- Let's create a new repository
- Let's change its settings so that GitHub Actions can commit and push files (Settings → Actions → General → Workflow permissions → "Read and write permissions"); alternatively, you can grant this in the workflow itself, as sketched below
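If you'd rather not change the repository settings, a workflow can request write access itself with a top-level `permissions` key. A minimal sketch, using standard GitHub Actions syntax (you can add this near the top of any workflow file in this walkthrough):

# Grant the workflow's GITHUB_TOKEN write access to repository contents,
# so that the `git push` steps below can succeed.
permissions:
  contents: write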
Let's get started!
1️⃣ Hello, data (Watch!)
To get familiar with committing data to the repository, let's write a simple curl script that fetches earthquake data provided by USGS. The file is updated every minute, so it's a good candidate for running a GitHub Action. The CSV is published at https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv.
Let's write a script that
- is triggered on a schedule (every 10 minutes) or by a manual button push
- runs on an Ubuntu virtual machine
- uses curl to download a CSV
- commits and pushes the CSV to our repo
For the purpose of this presentation, we will be working entirely in the browser. However, you can set up a local folder for version control with git.
- Create the .github/workflows/ directory, and a file called hello_data.yml inside it
- Next, copy and paste the following code into the file, and commit it
# First, a name for our Action. This will be displayed in the GitHub Actions tab.
name: Hello, data!

# Next, give it a trigger -- when do you want the action to run? On a push, pull, schedule?
on:
  # We want to schedule it
  schedule:
    # "cron" is the Unix-like operating systems' way of expressing time. It's an expression
    # made of five fields which represent the time to execute a command.
    # More information here: https://crontab.guru/
    - cron: "*/10 * * * *"
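    # A couple more cron patterns, for reference (illustrative only, not used in this walkthrough):
    #   "0 * * * *"   runs at minute 0 of every hour
    #   "30 6 * * 1"  runs at 06:30 UTC every Monday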
  # The keyword `workflow_dispatch` will let us manually run the Action on GitHub.
  # Without this keyword, we would have to wait 10 minutes every time
  # to test and see how the action turned out.
  workflow_dispatch:
# Finally, let's tell our file what commands to execute. A GitHub Action can be made up of several such jobs.
jobs:
  # Give this particular job a name, `get-data`.
  get-data:
    # Tell the GitHub Action what kind of virtual machine to run on. In our case, we'll be using the latest version of Ubuntu.
    runs-on: ubuntu-latest
    # Within this job, let's define our commands in steps
    steps:
      # First, we need to clone the repo on the Ubuntu virtual runner,
      # just like you would clone a repo on your computer before exploring the code within it.
      - name: check out the repo
        uses: actions/checkout@v3
      # Next, we will execute a simple command that uses curl to get the CSV
      - name: fetch data
        run: |-
          curl "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv" -o usgs_current.csv
      # Lastly, after the Action has run on the virtual machine, we need to commit and push the file back to
      # this repo, so we can access it there.
      - name: Commit and push
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
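          # `|| exit 0` below keeps the job green when there is nothing new to commit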
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
- Navigate to the Actions tab in the repo, and run the Action. We are able to manually run the Action because of the workflow_dispatch keyword we added to our YML file.
- Let's go back to the Code section of our repository, and see the new CSV created!
2️⃣ Hello, Python (Write!)
For our second workflow, we are going to hook up a Python script to our workflow. In a new workflow file, we will add a second task to what we did in the previous step.
Right now, the USGS data we are downloading every 10 minutes is being overwritten every time our Action runs. This Python script takes each new CSV and adds its new rows to a "main" CSV that we will create.
This time, our workflow will:
- be triggered on a schedule (every 10 minutes) or by a manual button push
- run on an Ubuntu virtual machine
- use curl to download a CSV
- set up a Python instance to run our get_all_data.py script
- commit and push the CSV from USGS, along with our main CSV, to our repo
Similar to how we created the YML file in the browser itself, we'll add this Python file to GitHub.
- Create a get_all_data.py file in the root of the repository, and paste the following code in it.
import pandas as pd  # pandas for data manipulation and analysis
from pathlib import Path  # pathlib to work with file paths

# read the CSV that the workflow just downloaded
df_current = pd.read_csv("usgs_current.csv")

path = Path("usgs_main.csv")
if not path.is_file():
    # no main file yet, so save the current data as the initial main file
    df_current.to_csv("usgs_main.csv", index=False)
else:
    # the file already exists: load it and append the new rows
    df_main_old = pd.read_csv("usgs_main.csv")
    df_main_new = pd.concat([df_main_old, df_current])
    # deduplicate based on the unique `id` column
    df_main_new_drop_dupes = df_main_new.drop_duplicates(subset="id", keep="first")
    # overwrite the old usgs_main file
    df_main_new_drop_dupes.to_csv("usgs_main.csv", index=False)
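If you want to check the result locally, here is a quick sketch, assuming you have fetched the data and run the script at least twice; the id column comes from the USGS feed:

import pandas as pd

df = pd.read_csv("usgs_main.csv")
print(len(df), "rows")
print("ids unique:", df["id"].is_unique)  # should print True after deduplication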
- Navigate back to the repo, and we will create another YML workflow file, just like we did in step 1. This time, we will call it hello_python.yml.
✏️ This section will require you to fill in some blanks! ✏️ Copy the next code chunk and paste it into the file:
name: Hello, Python!

on:
  schedule:
    - cron: ___ # ---TO FILL: every 🔟 minutes ---
  workflow_dispatch:

jobs:
  get-data:
    runs-on: ___ # ---TO FILL: let's use an Ubuntu 🖥️ (the latest one!)---
    steps:
      - name: check out the repo
        uses: actions/checkout@v3
      - name: fetch data
        run: |-
          curl "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv" -o usgs_current.csv
      - name: Install requirements
        run: python -m pip install ___ # ---TO FILL: which Python library did we use? 🐼 (pathlib ships with Python) ---
      - name: Run script to create main csv
        run: python ___ # ---TO FILL: what's the name of our file? 🐍---
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
Help!
name: Hello, Python!

on:
  schedule:
    - cron: "*/10 * * * *" # every 10 minutes
  workflow_dispatch:

jobs:
  get-data:
    runs-on: ubuntu-latest # the latest Ubuntu
    steps:
      - name: check out the repo
        uses: actions/checkout@v3
      - name: fetch data
        run: |-
          curl "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv" -o usgs_current.csv
      - name: Install requirements
        run: python -m pip install pandas # pathlib is part of the standard library, so only pandas needs installing
      - name: Run script to create main csv
        run: python get_all_data.py
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
Appendix
- Now that we are periodically downloading the data, the raw URL to the CSV can be used in a D3 chart or a Datawrapper graphic that updates automatically (see the pandas sketch at the end of this appendix).
- GitHub Actions documentation (https://docs.github.com/en/actions), including helpful resources for all the keywords the workflow uses.
- Are there examples of GitHub Actions?
  - Run a Twitter bot or a Slack bot
  - Run web scrapers
  - Validate and run tests on changes to our custom template
  - Build an app every time new code is pushed
  - Other fun things!
- Are there restrictions / limitations on GitHub Actions?
  - Yes, a few: scheduled runs can be delayed during periods of high load on GitHub, schedules are disabled on public repos after 60 days without repository activity, and usage limits apply (see the documentation linked above).
- If I have an API key or a password used in my scraper or workflow, how do I keep it hidden?
  - You can store your password as a repository secret, and access it much like the environment variables you would normally use (a minimal sketch follows).
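For instance, here is a sketch of a step using a secret; the secret name API_KEY and the file scraper.py are hypothetical, but the ${{ secrets.* }} syntax is standard GitHub Actions:

      - name: run scraper
        env:
          API_KEY: ${{ secrets.API_KEY }} # hypothetical secret; GitHub masks its value in the logs
        run: python scraper.py # the hypothetical script reads os.environ["API_KEY"]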
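And to pick up the raw CSV mentioned in the first appendix item, a minimal pandas sketch; the <user>/<repo> placeholder is hypothetical and should be replaced with your own repository path:

import pandas as pd

# Hypothetical raw URL; substitute your own GitHub user and repository name.
url = "https://raw.githubusercontent.com/<user>/<repo>/main/usgs_main.csv"
df = pd.read_csv(url)  # pandas can read a CSV straight from a URL
print(df.tail())       # the newest rows in the running file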