GitHub Data Collection

Project Directory Structure

github-data-collector/
├── dags/
│   ├── __init__.py
│   ├── github_data_collection_dag.py
│   └── utils/
│       ├── __init__.py
│       ├── github_client.py
│       └── data_processor.py
├── plugins/
│   ├── __init__.py
│   └── operators/
│       ├── __init__.py
│       └── github_operator.py
├── config/
│   ├── airflow.cfg
│   └── github_orgs.yml
├── output/
│   └── .gitkeep
├── tests/
│   ├── __init__.py
│   └── test_github_dag.py
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md
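
A rough sketch of what dags/utils/github_client.py might contain is shown below. The function and parameter names here are illustrative, not the repo's actual API; the endpoint and pagination scheme are the standard GitHub REST API:

    import requests

    GITHUB_API = "https://api.github.com"

    def list_org_repos(org: str, token: str) -> list[dict]:
        """Fetch all public repositories for an organization, following pagination."""
        headers = {
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        }
        repos, page = [], 1
        while True:
            resp = requests.get(
                f"{GITHUB_API}/orgs/{org}/repos",
                headers=headers,
                params={"per_page": 100, "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break  # empty page means we've paged past the last repository
            repos.extend(batch)
            page += 1
        return repos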

Setup Instructions

  1. Clone the Repository:

    git clone https://github.com/gridatek/github-data-collector.git
    cd github-data-collector
  2. Environment Setup:

    cp .env.example .env
    # Edit .env file and add your GitHub personal access token
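
    For reference, a minimal .env.example might look like the following. The variable names here are assumptions (GITHUB_TOKEN for the token, AIRFLOW_UID as in the official Airflow docker-compose setup), so check the actual file shipped in the repository:

    # Hypothetical example -- verify variable names against .env.example
    GITHUB_TOKEN=ghp_your_personal_access_token
    AIRFLOW_UID=50000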
  3. Start Services:

    docker compose up -d
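
    To confirm the containers came up cleanly, you can check their status and tail the logs with the standard Docker Compose commands:

    docker compose ps
    docker compose logs -f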
  4. Configure Airflow Variables:

    • Access Airflow UI at http://localhost:8080
    • Log in with admin/admin
    • Go to Admin > Variables
    • Add a github_token variable containing your GitHub personal access token
    • Add github_organizations as a JSON array: ["apache","kubernetes","tensorflow"] (or set both from the CLI, as shown below)
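
    Alternatively, both variables can be set with Airflow's built-in variables CLI. The airflow-webserver service name below is an assumption based on common Airflow docker-compose layouts, so adjust it to match docker-compose.yml:

    docker compose exec airflow-webserver airflow variables set github_token <your-token>
    docker compose exec airflow-webserver airflow variables set github_organizations '["apache","kubernetes","tensorflow"]'

    Inside the DAG these are typically read with Variable.get("github_token") and Variable.get("github_organizations", deserialize_json=True).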
  5. Output Files:

    • Raw repository data: /output/repos_raw_YYYY-MM-DD.json
    • Contribution data: /output/contributions_YYYY-MM-DD.json
    • Summary report: /output/github_summary_YYYY-MM-DD.json
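
To work with the results, a short sketch for loading the day's summary report follows. It assumes the container's /output directory is mounted at ./output in the project root; the structure of the JSON itself depends on the DAG, so inspect a real file first:

    import json
    from datetime import date
    from pathlib import Path

    # Path layout taken from the output list above; the %Y-%m-%d format
    # matches the YYYY-MM-DD date suffix in the filenames.
    summary_path = Path("output") / f"github_summary_{date.today():%Y-%m-%d}.json"
    with summary_path.open() as f:
        summary = json.load(f)

    print(f"Loaded summary with keys: {list(summary)}")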

The DAG runs daily and collects repository and contribution data for each configured organization, saving it as the structured JSON files listed above for further analysis.
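
For orientation, a minimal sketch of a daily DAG like dags/github_data_collection_dag.py is shown below. The task names and callables are illustrative stand-ins, not the repo's actual implementation; the DAG and PythonOperator usage is standard Airflow 2.x:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Illustrative stand-ins -- in this repo the real logic lives under dags/utils/.
    def collect_repos(**context):
        print("fetch repository metadata for each configured org")

    def collect_contributions(**context):
        print("fetch contributor statistics for each repository")

    with DAG(
        dag_id="github_data_collection",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # daily cadence, as described above
        catchup=False,
    ) as dag:
        repos = PythonOperator(task_id="collect_repos", python_callable=collect_repos)
        contributions = PythonOperator(
            task_id="collect_contributions", python_callable=collect_contributions
        )
        repos >> contributions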