
Web scraping project that includes data cleaning and analysis to determine the most-cited published books on Google Scholar.

Case Study - Web Scraping - Published Books (Google Scholar)

Author: Clint Barnard-El
Email: barnard.clint@yahoo.com
LinkedIn: https://www.linkedin.com/in/clintbarnardel/
Dataset: Google Scholar Web Scraping


This repository provides the results of an analysis of the most-cited published books on African-American history. Black History Month is an annual observance in the United States, held during February, to commemorate the events and celebrate the contributions of people from the African diaspora.

This analysis aims to demonstrate skills in web scraping, data cleaning, exploratory data analysis (EDA), and data visualization.

Applications Used

  • RStudio
  • Octoparse
  • Google Chrome

Language Used

  • R

Skills Used

  • Web scraping
  • Data cleaning
  • Exploratory data analysis (EDA)
  • Data visualization


The dataset can be found on Kaggle. The web scrape, completed on 2/24/24, yielded over 500 results, including articles, books, and PDFs.
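
Because the scrape returned a mix of articles, books, and PDFs, the cleaning step has to isolate the books before ranking them by citations. The following is a minimal base R sketch of that idea, not the exact code used in this project. It assumes hypothetical column names (`title`, `cited_by`), assumes that book results carry a "[BOOK]" prefix in the title and that citation counts arrive as text such as "Cited by 1,250", and uses made-up placeholder rows rather than rows from the actual dataset.

```r
# Placeholder rows standing in for the scraped export.
# Column names and values are illustrative assumptions, not real data.
scraped <- data.frame(
  title = c("[BOOK] Example Title A", "[PDF] Example Title B",
            "Example Title C", "[BOOK] Example Title D"),
  cited_by = c("Cited by 1,250", "Cited by 40", "Cited by 310", NA),
  stringsAsFactors = FALSE
)

# Keep only rows whose title starts with the assumed "[BOOK]" tag.
books <- scraped[grepl("^\\[BOOK\\]", scraped$title), ]

# Remove the tag and any surrounding whitespace from the title.
books$title <- trimws(sub("^\\[BOOK\\]", "", books$title))

# Pull the digits out of "Cited by N", dropping thousands separators.
books$citations <- as.integer(gsub("[^0-9]", "", books$cited_by))

# Treat a missing citation count as zero citations.
books$citations[is.na(books$citations)] <- 0L

# Rank books from most to least cited.
top_books <- books[order(-books$citations), c("title", "citations")]
print(top_books)
```

With the real export, `scraped` would instead be loaded with `read.csv()` from the Kaggle file, and the column names adjusted to match it.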