/analysis-github-activity-using-Big-Query

An awesome training center for practicing big data processing with GitHub and BigQuery

Primary language: Jupyter Notebook

Introduction

Have you ever processed billions of records? Unless you work as a data engineer, you probably haven't had the chance. The GitHub activity dataset, which records developers' activity on GitHub since 2012, contains billions of log records.

To handle this volume, we will use Google BigQuery, a powerful and effective data warehouse tool. With BigQuery, you can analyze years of GitHub records without spending a lot of money.

Requirements

1. Install Python packages
pip install --upgrade google-cloud-bigquery
pip install --upgrade pandas-gbq
pip install --upgrade six
pip install --upgrade pyarrow
2. Get BigQuery credentials

To run the scripts in this repository, download the credential JSON file by following the linked guide and save it in the credentials/ folder.
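
As a rough illustration of how the saved key might be used, the sketch below looks for a JSON file in the credentials/ folder and builds a BigQuery client from it. The helper names `find_credential_file` and `make_client` are hypothetical and not part of this repository; the sketch also assumes the key is a service account key, which the linked guide may or may not produce.

```python
from pathlib import Path

# Folder named in this README; the key's file name is not specified there,
# so this sketch simply picks the first *.json file it finds.
CREDENTIALS_DIR = Path("credentials")


def find_credential_file(directory=CREDENTIALS_DIR):
    """Return the first *.json file in `directory`, or None if there is none."""
    candidates = sorted(Path(directory).glob("*.json"))
    return candidates[0] if candidates else None


def make_client(key_path):
    """Build a BigQuery client from a service account key file (assumed key type)."""
    # Imported here so that find_credential_file works even before the
    # google-cloud-bigquery package is installed.
    from google.cloud import bigquery

    return bigquery.Client.from_service_account_json(str(key_path))
```

In a notebook, something like `client = make_client(find_credential_file())` would then give a client tied to the project recorded in the key file.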

3. Get GitHub credentials

Read this article. You need to create a personal access token to use the GitHub API.
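
To show where the personal access token ends up, here is a minimal sketch that attaches it to a GitHub REST API request using only the standard library. The function names are hypothetical, and how you store the token (for example in an environment variable) is an assumption on our part, not something this repository prescribes.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(path, token):
    """Build an authenticated GET request for a GitHub REST API path."""
    headers = {
        "Accept": "application/vnd.github+json",
        # The personal access token is sent in the Authorization header.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(API_ROOT + path, headers=headers)


def fetch_json(path, token):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, token), timeout=30) as resp:
        return json.load(resp)
```

For example, `fetch_json("/user", token)` should return the profile of the account that owns the token, which is a quick way to confirm the token works.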

Reading List

Handling Google BigQuery in Jupyter is easy. Let's learn three ways to do it.
Before we start analyzing the data, let's see how the GitHub Archive is structured.
Grasp the overall shape of the GitHub activity log.
Understand how to obtain the data through the GitHub API.
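
As a taste of what the notebooks above work toward, the sketch below builds a small query that counts events by type for a single day and runs it through a BigQuery client. It assumes the public `githubarchive` dataset with one table per day named `githubarchive.day.YYYYMMDD` and a `type` column; check the dataset in the BigQuery console before relying on those names. The function names are hypothetical.

```python
def build_event_count_query(day):
    """Return SQL that counts events by type for one day ('YYYYMMDD')."""
    if not (len(day) == 8 and day.isdigit()):
        raise ValueError("day must look like '20200101'")
    # Table and column names are assumptions about the public dataset layout.
    return (
        "SELECT type, COUNT(*) AS events\n"
        f"FROM `githubarchive.day.{day}`\n"
        "GROUP BY type\n"
        "ORDER BY events DESC"
    )


def run_query(client, sql):
    """Run `sql` with a google.cloud.bigquery.Client and return a DataFrame."""
    return client.query(sql).to_dataframe()
```

Querying a single daily table rather than a whole year keeps the amount of scanned data small, which is a sensible habit while experimenting.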

Copyright: CC BY-SA 4.0

This repository is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
