github-data-collector/
├── dags/
│   ├── __init__.py
│   ├── github_data_collection_dag.py
│   └── utils/
│       ├── __init__.py
│       ├── github_client.py
│       └── data_processor.py
├── plugins/
│   ├── __init__.py
│   └── operators/
│       ├── __init__.py
│       └── github_operator.py
├── config/
│   ├── airflow.cfg
│   └── github_orgs.yml
├── output/
│   └── .gitkeep
├── tests/
│   ├── __init__.py
│   └── test_github_dag.py
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md
Clone and Setup:
git clone https://github.com/gridatek/github-data-collector.git
cd github-data-collector
Environment Setup:
cp .env.example .env
# Edit the .env file and add your GitHub personal access token
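The exact contents of `.env.example` are not shown here; as a sketch, it would typically hold the token under a variable name such as `GITHUB_TOKEN` (the name is an assumption, not confirmed by this repository):

```ini
# Hypothetical .env layout — check .env.example in the repo for the actual variable names
GITHUB_TOKEN=your_personal_access_token_here
```

A fine-grained or classic personal access token with read access to public repositories is sufficient for collecting public organization data.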
Start Services:
docker compose up -d
Configure Airflow Variables:
- Access the Airflow UI at http://localhost:8080
- Log in with admin/admin
- Go to Admin > Variables
- Add `github_token` with your GitHub token
- Add `github_organizations` as a JSON array: `["apache","kubernetes","tensorflow"]`
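Because `github_organizations` is stored as a JSON string, the DAG has to deserialize it before iterating over the organizations. A minimal sketch of that deserialization, using only the standard library (inside an Airflow task the value would typically come from `Variable.get("github_organizations", deserialize_json=True)`, which performs the same JSON decoding):

```python
import json

# The raw string as stored in the Airflow Variable (Admin > Variables)
orgs_json = '["apache","kubernetes","tensorflow"]'

# Decode the JSON array into a Python list the DAG can loop over
orgs = json.loads(orgs_json)
print(orgs)  # ['apache', 'kubernetes', 'tensorflow']
```

Storing the list as a single JSON Variable keeps the organization set editable from the UI without redeploying the DAG.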
Output Files:
- Raw repository data: /output/repos_raw_YYYY-MM-DD.json
- Contribution data: /output/contributions_YYYY-MM-DD.json
- Summary report: /output/github_summary_YYYY-MM-DD.json
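The `YYYY-MM-DD` suffix is the run date, so downstream scripts can locate a given day's files by formatting the date into the path. A small sketch of that convention (the `output_path` helper is hypothetical, not part of the repository):

```python
import json
from datetime import date
from pathlib import Path

# Hypothetical helper: build the dated output path for one of the three file kinds
def output_path(kind: str, run_date: date, base: str = "output") -> Path:
    return Path(base) / f"{kind}_{run_date:%Y-%m-%d}.json"

path = output_path("github_summary", date(2024, 5, 1))
print(path)  # output/github_summary_2024-05-01.json

# Load the summary if the DAG has already produced it for that date
if path.exists():
    summary = json.loads(path.read_text())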
The DAG will run daily and collect comprehensive GitHub organization data, saving it as structured JSON files for further analysis.