
Fetch Apache GitHub Actions Statistics


Table of Contents

  • Context and motivation
  • Statistics
  • Json files
  • CSV file
  • Processing existing json files to csv and pushing it to BigQuery
  • Determining ASF repositories which use GitHub Actions (matrix.json)
  • GitHub Actions Secrets

Context and motivation

For the Apache Software Foundation (ASF), the limit for concurrent jobs in GitHub Actions (GA) is 180 (see usage limits). GitHub does not provide statistics on GA usage, so this repository was created to collect the basic data needed to make such analysis possible.

Statistics

Statistics data is fetched by the scheduled action Fetch GitHub Action queue. This action makes a series of "snapshots" of GA workflow runs for every ASF repository which uses GA (the list of these repositories is stored in matrix.json, described here).

The statistics consist of:

  • json files - workflow runs for every repo in separate files (described here)
  • csv file - simple statistics in a single file (described here)

These files are uploaded as a workflow artifact.
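As an illustration of the snapshot step, here is a minimal Python sketch (not the repository's actual code) that lists queued and in-progress workflow runs for a single repository via the GitHub REST API. It assumes the requests library and a PERSONAL_TOKEN environment variable matching the secret described below:

# Illustrative sketch: list queued and in-progress workflow runs for one repo.
# Endpoint and parameters come from the GitHub REST API; the token env var
# name mirrors the PERSONAL_TOKEN secret described later in this README.
import os
import requests

HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"token {os.environ['PERSONAL_TOKEN']}",
}

def fetch_runs(owner: str, repo: str, status: str) -> list:
    """Return all workflow runs with the given status (queued / in_progress)."""
    runs, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/actions/runs",
            headers=HEADERS,
            params={"status": status, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()["workflow_runs"]
        runs.extend(batch)
        if len(batch) < 100:  # last page reached
            return runs
        page += 1

snapshot = {s: fetch_runs("apache", "airflow", s) for s in ("queued", "in_progress")}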

Json files

The json files contain the list of repository workflow runs in the queued and in_progress states. File names contain the timestamp at which fetching of the list started. The json schema is described in the GitHub API documentation here.

Files are uploaded to Google Cloud Storage for later processing.

Example structure of objects in the Google Cloud Storage bucket:

example-bucket-name
└── apache
    ├── airflow
    │   ├── 20201103_130148Z.json
    │   └── 20201103_131641Z.json
    └── beam
        ├── 20201103_130148Z.json
        └── 20201103_131641Z.json
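A minimal sketch of how one such object could be written is shown below. The object path layout and the timestamp format (e.g. 20201103_130148Z) are inferred from the example tree above; the client calls are from the standard google-cloud-storage library:

import json
from datetime import datetime, timezone

from google.cloud import storage

def upload_snapshot(bucket_name: str, owner: str, repo: str, runs: list) -> None:
    # Timestamp format inferred from the file names above, e.g. 20201103_130148Z.
    ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%SZ")
    blob = storage.Client().bucket(bucket_name).blob(f"{owner}/{repo}/{ts}.json")
    blob.upload_from_string(json.dumps(runs), content_type="application/json")

# e.g. upload_snapshot("example-bucket-name", "apache", "airflow", runs)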

CSV file

A single bq.csv file is created, containing simple statistics for all fetched repositories. This file is used in the Fetch GitHub Action queue workflow to efficiently upload data to a BigQuery table.

CSV file headers: workflow_id, status, created_at, timestamp.
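A hedged sketch of how such rows could be produced from fetched workflow runs; workflow_id, status, and created_at are fields of the GitHub workflow-run schema, and the snapshot timestamp is assumed to be recorded at fetch time:

# Illustrative sketch: map workflow-run objects to the bq.csv columns above.
import csv

def write_bq_csv(path: str, runs: list, snapshot_ts: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["workflow_id", "status", "created_at", "timestamp"])
        for run in runs:
            writer.writerow([run["workflow_id"], run["status"],
                             run["created_at"], snapshot_ts])

# e.g. write_bq_csv("bq.csv", snapshot["queued"], "20201103_130148Z")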

Processing existing json files to csv and pushing it to BigQuery

The helper script scripts/parse_existing_json_files.py can be used to process existing json files into a single csv file.

Example use:

gsutil -m cp -r gs://example-bucket-name/apache gcs

python scripts/parse_existing_json_files.py \
    --input-dir gcs \
    --output bq_csv.csv

bq load --autodetect \
    --source_format=CSV \
    dataset.table bq_csv.csv
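Conceptually, the helper script boils down to walking the downloaded owner/repo/<timestamp>.json tree and emitting one CSV row per workflow run. A hedged sketch, assuming each json file holds a list of run objects as described above:

# Illustrative sketch of the parsing step, not the script's actual code.
import csv
import json
from pathlib import Path

def parse_json_files(input_dir: str, output: str) -> None:
    with open(output, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["workflow_id", "status", "created_at", "timestamp"])
        for path in Path(input_dir).glob("*/*/*.json"):  # owner/repo/<ts>.json
            timestamp = path.stem  # e.g. 20201103_130148Z
            for run in json.loads(path.read_text()):
                writer.writerow([run["workflow_id"], run["status"],
                                 run["created_at"], timestamp])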

Determining ASF repositories which use GitHub Actions (matrix.json)

There is no single endpoint to obtain the list of ASF repositories which use GA, and since the ASF consists of 2000+ repositories, obtaining this list is not a trivial task.

The list of repositories which use GitHub Actions is stored in matrix.json and can be updated in three ways.

Note that running the Python script or the action causes many requests on behalf of the GitHub access token used, which may exceed its rate-limit quota.
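For illustration, one way to test whether a single repository uses GA is the "list repository workflows" endpoint of the GitHub REST API. This sketch is illustrative and not necessarily how matrix.json is actually built here; note that each call counts against the token's quota:

# Illustrative sketch: check whether a repository defines any GA workflows.
import os
import requests

def uses_github_actions(owner: str, repo: str) -> bool:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/workflows",
        headers={"Authorization": f"token {os.environ['PERSONAL_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["total_count"] > 0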

GitHub Actions Secrets:

Secret         | Required | Description
---------------|----------|------------
PERSONAL_TOKEN | True     | Personal GitHub access token used to authorize requests. It has a bigger quota than the GITHUB_TOKEN secret.
BQ_TABLE       | -        | BigQuery table reference to which simple statistics will be pushed (e.g. dataset.table).
GCP_BUCKET     | -        | Google Cloud Storage bucket to which json files with workflow payloads will be pushed (e.g. example-bucket-name).
GCP_PROJECT_ID | -        | Google Cloud Project ID.
GCP_SA_KEY     | -        | Google Cloud Service Account key (a Service Account with permissions to Google Cloud Storage and BigQuery).
GCP_SA_EMAIL   | -        | Google Cloud Service Account email (a Service Account with permissions to Google Cloud Storage and BigQuery).