You can find the explanation of this project here.
Web scraping is the process of parsing websites and extracting data from them. It is a useful technique when we want to collect data from websites for our work.
GitHub is a web-based version control and collaboration platform for software developers. Its page https://github.com/topics lists the different topics on GitHub.
I have built a scraper, Github_topics_scraper, which scrapes the GitHub Topics webpage. It can return either a detailed or a non-detailed dataframe, depending on the user's preference.
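To make "parsing and extracting" concrete, here is a minimal sketch that pulls topic titles and descriptions out of a small HTML snippet. It is not the project's code: the markup, the class names (topic-title, topic-desc) and the helper parse_topics are made up for illustration, and the real GitHub Topics page uses different tags and classes. To stay dependency-free it uses the standard library's html.parser instead of BeautifulSoup, which would do the same job in fewer lines.

```python
from html.parser import HTMLParser

# Made-up markup for illustration only; not the real GitHub Topics HTML.
SAMPLE_HTML = """
<div class="topic"><p class="topic-title">3D</p><p class="topic-desc">3D modeling and rendering.</p></div>
<div class="topic"><p class="topic-title">Ajax</p><p class="topic-desc">Asynchronous web requests.</p></div>
"""


class TopicParser(HTMLParser):
    """Collects the text of <p> tags marked with the hypothetical classes."""

    def __init__(self):
        super().__init__()
        self.topics = []     # one dict per topic found so far
        self._field = None   # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "p" and css_class == "topic-title":
            # A title starts a new topic record.
            self.topics.append({"title": "", "description": ""})
            self._field = "title"
        elif tag == "p" and css_class == "topic-desc" and self.topics:
            self._field = "description"

    def handle_endtag(self, tag):
        if tag == "p":
            self._field = None

    def handle_data(self, data):
        # Only keep text that sits inside one of the tracked <p> tags.
        if self._field and self.topics:
            self.topics[-1][self._field] += data.strip()


def parse_topics(html):
    """Return a list of {'title', 'description'} dicts found in html."""
    parser = TopicParser()
    parser.feed(html)
    parser.close()
    return parser.topics


if __name__ == "__main__":
    for topic in parse_topics(SAMPLE_HTML):
        print(topic["title"], "|", topic["description"])
```

In a real scraper the HTML string would come from an HTTP response (for example, fetched with Requests) rather than a hard-coded sample.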
Programming Language: Python
Python Packages: BeautifulSoup, Requests, Pandas, Math and tqdm.
github_topics_scraper(), the scraper function, takes two optional arguments, detailed and records.
- The argument records represents the number of records the user wants. It accepts either a Boolean (True/False) or an integer. Set it to "False" to return all available records.
- The argument detailed takes a Boolean (True/False). The function returns a detailed dataframe if it is set to "True", and a non-detailed dataframe otherwise.
By default, the function returns a non-detailed dataframe with a single record.
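Because records accepts both Booleans and integers, the order of the checks matters: in Python, bool is a subclass of int, so False has to be handled before the integer case. The sketch below shows one way such an argument could be interpreted. The helper resolve_record_count and its total_available parameter are hypothetical, not part of the scraper, and treating "True" as a single record is an assumption, since the text only specifies what "False" means.

```python
def resolve_record_count(records=1, total_available=0):
    """Hypothetical helper: turn the `records` argument into a row count.

    records=False      -> all available records (as described above)
    records=True       -> one record (assumption; not stated in the README)
    records=<int >= 0> -> that many records, capped at what is available
    """
    # Check the Boolean cases first: isinstance(False, int) is True in Python.
    if records is False:
        return total_available
    if records is True:
        return min(1, total_available)
    if not isinstance(records, int) or records < 0:
        raise ValueError("records must be a Boolean or a non-negative integer")
    return min(records, total_available)


if __name__ == "__main__":
    print(resolve_record_count(False, total_available=120))  # all records
    print(resolve_record_count(5, total_available=120))      # first five
    print(resolve_record_count(total_available=120))         # default: one
```

With the real function, the corresponding calls would presumably look like github_topics_scraper(detailed=True, records=5) or github_topics_scraper(records=False).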
As mentioned above, the scraper function can return either a detailed or a non-detailed dataframe.
When the "detailed" argument in the scraper function is set to "True", we get a dataframe which looks something like this:
The columns in the above dataframe represent:
When the "detailed" argument in the scraper function is set to "False", we get a non-detailed dataframe which consists of only the first 3 columns. This dataframe contains the basic information about the topics on the GitHub Topics webpage.
Web scraping is a time-consuming process, and sometimes it is hard to tell whether the scraping is still in progress. So I have used a tqdm progress bar to show the progress of scraping. It is especially helpful for detailed scraping.
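A progress bar of this kind can be sketched by wrapping the scraping loop in tqdm. This is only an illustration: scrape_topic and scrape_all are hypothetical stand-ins (the former sleeps briefly instead of making a network request), and the try/except fallback is there only so the sketch still runs where tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch runs without tqdm; no bar is shown in that case.
    def tqdm(iterable, **kwargs):
        return iterable


def scrape_topic(title):
    """Hypothetical stand-in for fetching and parsing one topic page."""
    time.sleep(0.01)  # simulates the delay of a network request
    return {"title": title, "title_length": len(title)}


def scrape_all(titles):
    """Scrape every topic, advancing the progress bar once per topic."""
    return [scrape_topic(title) for title in tqdm(titles, desc="Scraping topics")]


if __name__ == "__main__":
    rows = scrape_all(["3D", "Ajax", "Algorithm"])
    print(len(rows), "topics scraped")
```

Wrapping the iterable is enough: tqdm advances the bar each time the loop moves to the next item, so a long detailed scrape shows steady progress instead of a silent wait.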
The progress of web scraping looks like this: