Table of Contents
- Context and motivation
- Statistics
- Determining ASF repositories that use GitHub Actions (matrix.json)
- GitHub Actions Secrets
## Context and motivation

For The Apache Software Foundation (ASF), the limit for concurrent GitHub Actions (GA) jobs is 180 (see GitHub's usage limits). GitHub does not provide statistics related to GA, so this repo was created to collect basic data that makes such analysis possible.
## Statistics

Statistics data is fetched by the scheduled Fetch GitHub Action queue action. This action takes a series of "snapshots" of GA workflow runs for every ASF repository that uses GA (the list of them is stored in matrix.json, described here).
Statistics consist of:
- json files - workflow runs for every repo in separate files (described here)
- csv file - simple statistics in a single file (described here)

These files are uploaded as a workflow artifact.
The json files contain the list of repository workflow runs in the `queued` and `in_progress` states. File names contain the timestamp at which fetching the list started. The json schema is described in the GitHub API documentation here.
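For reference, filtering a workflow-runs payload down to the two states of interest can be sketched as below. The payload shape follows the GitHub API's list-workflow-runs response; the sample data is illustrative, not taken from a real snapshot.

```python
# Minimal sketch: keep only queued/in_progress runs from a GitHub
# "list workflow runs" payload (sample data below is illustrative).
ACTIVE_STATES = {"queued", "in_progress"}

def active_runs(payload: dict) -> list[dict]:
    """Return workflow runs that are still waiting or running."""
    return [run for run in payload.get("workflow_runs", [])
            if run.get("status") in ACTIVE_STATES]

sample = {
    "total_count": 3,
    "workflow_runs": [
        {"id": 1, "status": "queued", "created_at": "2020-11-03T13:01:48Z"},
        {"id": 2, "status": "in_progress", "created_at": "2020-11-03T13:02:10Z"},
        {"id": 3, "status": "completed", "created_at": "2020-11-03T12:55:00Z"},
    ],
}

print([run["id"] for run in active_runs(sample)])  # → [1, 2]
```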
Files are uploaded to Google Cloud Storage for later processing.
Example structure of objects in Google Cloud Storage bucket:
```
example-bucket-name
└── apache
    ├── airflow
    │   ├── 20201103_130148Z.json
    │   └── 20201103_131641Z.json
    └── beam
        ├── 20201103_130148Z.json
        └── 20201103_131641Z.json
```
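The timestamped object names above (e.g. 20201103_130148Z.json) can be produced with a strftime pattern. A minimal sketch follows; the naming scheme is inferred from the example tree, so treat the helper as an assumption rather than the repo's actual code:

```python
from datetime import datetime, timezone

def object_path(org: str, repo: str, fetched_at: datetime) -> str:
    """Build a bucket object path like apache/airflow/20201103_130148Z.json.
    (Assumed naming scheme, inferred from the example tree above.)"""
    stamp = fetched_at.strftime("%Y%m%d_%H%M%SZ")
    return f"{org}/{repo}/{stamp}.json"

ts = datetime(2020, 11, 3, 13, 1, 48, tzinfo=timezone.utc)
print(object_path("apache", "airflow", ts))  # → apache/airflow/20201103_130148Z.json
```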
A single bq.csv file is created, containing simple statistics for all fetched repositories. This file is used in the Fetch GitHub Action queue action to efficiently upload data to a BigQuery table.
CSV file headers: `workflow_id`, `status`, `created_at`, `timestamp`.
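Writing rows under those four headers can be sketched with the standard csv module. The column order matches the header list above; which payload fields feed each column is an assumption here, not taken from the repo's code:

```python
import csv
import io

HEADERS = ["workflow_id", "status", "created_at", "timestamp"]

def runs_to_csv(runs: list[dict], snapshot_ts: str) -> str:
    """Serialize workflow runs into the bq.csv layout described above."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=HEADERS)
    writer.writeheader()
    for run in runs:
        writer.writerow({
            "workflow_id": run["id"],   # assumed to come from the run's id
            "status": run["status"],
            "created_at": run["created_at"],
            "timestamp": snapshot_ts,   # time the snapshot was fetched
        })
    return buf.getvalue()

print(runs_to_csv(
    [{"id": 1, "status": "queued", "created_at": "2020-11-03T13:01:48Z"}],
    "20201103_130148Z",
))
```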
The helper script scripts/parse_existing_json_files.py can be used to process existing json files into a single csv.
Example use:
```shell
gsutil -m cp -r gs://example-bucket-name/apache gcs
python parse_existing_json_files.py \
  --input-dir gcs \
  --output bq_csv.csv
bq load --autodetect \
  --source_format=CSV \
  dataset.table bq_csv.csv
```
## Determining ASF repositories that use GitHub Actions (matrix.json)

There is no single endpoint that returns the list of ASF repositories that use GA, and since the ASF consists of 2000+ repositories, obtaining it is not a trivial task.
The list of repositories that use GitHub Actions is stored in matrix.json and can be updated in three ways:
- by manually editing matrix.json and committing the changes
- by running the fetch_apache_projects_with_ga.py Python script and committing the changes
- automatically, by the Fetch Apache Repositories with GA action (changes are committed automatically when they occur)

Running the Python script or the action makes many requests on behalf of the GitHub access token used, which may exceed its quota limits.
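Merging newly discovered repositories into the stored list can be sketched as below. The `{"repo": [...]}` shape is an assumption about matrix.json (chosen because GA matrix files commonly use it); the real file layout may differ:

```python
import json

def merge_repos(matrix: dict, new_repos: list[str]) -> dict:
    """Return a copy of the matrix with new repo names merged in, deduplicated
    and sorted. Assumes matrix.json holds an object like {"repo": [...]};
    the actual layout of the file may differ."""
    merged = dict(matrix)
    merged["repo"] = sorted(set(matrix.get("repo", [])) | set(new_repos))
    return merged

current = {"repo": ["beam", "airflow"]}
updated = merge_repos(current, ["camel", "beam"])
print(json.dumps(updated))  # → {"repo": ["airflow", "beam", "camel"]}
```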
## GitHub Actions Secrets

| Secret | Required | Description |
|---|---|---|
| `PERSONAL_TOKEN` | True | Personal GitHub access token used to authorize requests. It has a bigger quota than the `GITHUB_TOKEN` secret. |
| `BQ_TABLE` | - | BigQuery table reference to which simple statistics will be pushed (e.g. `dataset.table`). |
| `GCP_BUCKET` | - | Google Cloud Storage bucket to which json files with workflow payloads will be pushed (e.g. `example-bucket-name`). |
| `GCP_PROJECT_ID` | - | Google Cloud Project ID. |
| `GCP_SA_KEY` | - | Google Cloud Service Account key (Service Account with permissions to Google Cloud Storage and BigQuery). |
| `GCP_SA_EMAIL` | - | Google Cloud Service Account email (Service Account with permissions to Google Cloud Storage and BigQuery). |