GitHub Topics Scraper: Project Overview

You can find the explanation of this project here.

Introduction:

Web scraping is the process of parsing websites and extracting data from them. It is a useful technique when we need to collect data from websites for our work.

GitHub is a web-based version control and collaboration platform for software developers. Its page https://github.com/topics lists the different topics available on GitHub.

I have built a scraper, Github_topics_scraper, which scrapes the GitHub Topics webpage. It can return either a detailed or a non-detailed dataframe, depending on the user's preference.
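As a rough illustration of the parsing step, the sketch below runs BeautifulSoup over a small inline HTML string. The markup, the class names (topic, topic-link, topic-title, topic-desc), and the helper parse_topics are all made up for illustration; they do not reflect GitHub's real page structure or this project's actual code. A real run would first download https://github.com/topics with Requests and pass the response text to the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded topics page.
SAMPLE_HTML = """
<div class="topic">
  <a class="topic-link" href="/topics/3d">3D</a>
  <p class="topic-title">3D</p>
  <p class="topic-desc">3D modeling uses software to create objects.</p>
</div>
<div class="topic">
  <a class="topic-link" href="/topics/ajax">Ajax</a>
  <p class="topic-title">Ajax</p>
  <p class="topic-desc">Ajax is a technique for interactive web apps.</p>
</div>
"""


def parse_topics(html, base_url="https://github.com"):
    """Return one dict per (hypothetical) topic block found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for block in soup.find_all("div", class_="topic"):
        title = block.find("p", class_="topic-title").get_text(strip=True)
        desc = block.find("p", class_="topic-desc").get_text(strip=True)
        # The href is relative, so prefix it with the site root.
        url = base_url + block.find("a", class_="topic-link")["href"]
        topics.append({"title": title, "description": desc, "url": url})
    return topics


topics = parse_topics(SAMPLE_HTML)
```

Each dict in topics corresponds to one row of the resulting dataframe; the keys used here are placeholders, not the project's column names.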

Tools Used:

Programming Language: Python

Python Packages: BeautifulSoup, Requests, pandas, math (standard library) and tqdm.

The scraper function: github_topics_scraper()

github_topics_scraper(), the scraper function, takes two optional arguments, detailed and records.

  • The argument records sets the number of records the user wants. It accepts either a Boolean (True/False) or an integer. Set it to False to return all available records.

  • The argument detailed accepts a Boolean (True/False). The function returns a detailed dataframe if it is set to True, and a non-detailed dataframe otherwise.

By default, the function returns a non-detailed dataframe with a single record.
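The handling of the records argument described above can be sketched as follows. This is only an illustration under stated assumptions: the helpers resolve_record_count and pages_needed are hypothetical, the page size of 30 is an assumed value, and treating True as 1 is a guess that happens to match the single-record default; none of this is confirmed by the project's code.

```python
import math

TOPICS_PER_PAGE = 30  # assumed page size, not confirmed by the project


def resolve_record_count(records, available):
    """Translate the `records` argument into a concrete number of rows."""
    if records is False:
        return available  # False means "return every available record"
    # bool is a subclass of int, so True is treated as 1 here (an assumption).
    # The result is clamped to the range 0..available.
    return max(0, min(int(records), available))


def pages_needed(record_count, per_page=TOPICS_PER_PAGE):
    """Number of listing pages required to cover `record_count` rows."""
    return math.ceil(record_count / per_page)


all_rows = resolve_record_count(False, 180)   # every record
five_rows = resolve_record_count(5, 180)      # an explicit integer
capped = resolve_record_count(500, 180)       # more than exist
```

Rounding up with math.ceil is one plausible use of the math module in a paginated scraper: a request for 45 records at 30 per page needs two pages, not 1.5.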

Detailed v/s non-detailed dataframes:

As mentioned above, the scraper function can return either a detailed or a non-detailed dataframe.

When the detailed argument of the scraper function is set to True, we get a dataframe that looks something like this:

[Image: sample detailed dataframe]

The columns in the above dataframe represent:

[Image: description of each column]

When the detailed argument of the scraper function is set to False, we get a non-detailed dataframe that consists of only the first 3 columns. This dataframe contains the basic information about the topics on the GitHub Topics webpage.
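The relationship between the two dataframes can be illustrated with pandas. The column names and values below are placeholders invented for this sketch, since the project defines its own columns; only the idea of keeping the first three columns comes from the description above.

```python
import pandas as pd

# Placeholder columns and values; the real ones are defined by the project.
detailed_df = pd.DataFrame(
    {
        "title": ["3D", "Ajax"],
        "description": ["About 3D modeling", "About Ajax"],
        "url": [
            "https://github.com/topics/3d",
            "https://github.com/topics/ajax",
        ],
        "extra_a": [1, 2],
        "extra_b": [3, 4],
    }
)

# Keep every row, but only the first three columns.
basic_df = detailed_df.iloc[:, :3]
```

Selecting by position with iloc keeps the sketch independent of the actual column names.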

tqdm - Progress bar

Web scraping is a time-consuming process, and it is sometimes hard to tell whether the scraping is still in progress. So I have used the tqdm progress bar to show the progress of the scraping. It is especially helpful for detailed scraping.

The progress of web scraping looks like this:

[Images: tqdm progress bar during scraping]
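A minimal sketch of wrapping a scraping loop in tqdm is shown below. The functions scrape_all and fake_fetch are hypothetical stand-ins (fake_fetch simulates a slow request rather than touching the network), and the paginated URL pattern is an assumption used only to produce sample input. The import fallback lets the sketch run without a progress bar if tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no bar is shown.
    def tqdm(iterable, **kwargs):
        return iterable


def scrape_all(urls, fetch):
    """Call `fetch` on each URL, showing a progress bar while looping."""
    results = []
    for url in tqdm(urls, desc="Scraping topics"):
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP request; sleeps briefly, returns a dummy value."""
    time.sleep(0.01)
    return len(url)


# Assumed URL pattern, used here only as sample input.
pages = [f"https://github.com/topics?page={n}" for n in range(1, 6)]
lengths = scrape_all(pages, fake_fetch)
```

Because tqdm only wraps the iterable, the loop body stays unchanged, which makes it easy to add to an existing scraper.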