kanahia 09 October, 2023
This package was created as a side project that started as a loose conversation at work. The objective was to extract data from a popular Polish website used for cataloguing read books. Unfortunately, the website does not provide a user-friendly option to export the library and its associated data, so I decided to experiment with web scraping to extract that information. It may be useful for anyone who wants to transfer their library to a third-party service, such as a more international one.
Depending on the size of the user's library, the script may take a while. On average, scraping 30 pages (~600 books) takes around 8-9 minutes.
At the moment the script extracts the following fields:
- Title
- Author
- Shelf on which the book is stored
- Date read
- Rating given
- User review
- Book ISBN number
As input, the user should provide an individual link to their own library, which can be generated after logging in, as in the example below:
The script completes with a final table summarizing all read books and their associated details.
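To give a feel for the shape of that final table, here is a minimal sketch built by hand. The column names and placeholder values below are assumptions for illustration, not the package's documented schema:

```r
# Hypothetical sketch of the final table returned by the scraper.
# Column names and values are illustrative assumptions only.
books <- data.frame(
  title     = c("Solaris", "Lalka"),
  author    = c("Stanislaw Lem", "Boleslaw Prus"),
  shelf     = c("Read", "Read"),
  date_read = as.Date(c("2023-05-01", "2023-07-15")),
  rating    = c(9, 8),
  review    = c("", "A classic."),
  isbn      = c("9780000000001", "9780000000002"),  # placeholder ISBNs
  stringsAsFactors = FALSE
)
str(books)  # one row per scraped book
```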
The package is written in R, and only minimal knowledge of R is required to run the script successfully.
To run the script, three steps must be undertaken. First of all, the package relies on the RSelenium package. The user must therefore check their Google Chrome version and download the matching chromedriver.
- Driver can be downloaded from the following page: https://googlechromelabs.github.io/chrome-for-testing/
- Make sure Java is installed.
This step initializes the required driver directory skeleton. After running this chunk, an error may occur, but wait until all necessary files are downloaded. At the end, inspect your local directory for the presence of the binman directories.
In particular, check whether the binman_chromedriver subdirectory was created:
- Windows: C:\Users\USER\AppData\Local\binman (alongside several other binman directories)
- Unix: ~/.local/share/binman_chromedriver/
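You can also check this from within R. The binman package (a dependency of RSelenium) can list the driver versions it has downloaded locally; this is a sketch, and an empty result simply means the skeleton has not been created yet:

```r
library(binman)

# List the chromedriver versions binman knows about on this machine.
# If the directory skeleton was created, the matching version(s)
# should appear here; otherwise the result is empty.
binman::list_versions("chromedriver")
```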
```r
library(RSelenium)
rD <- rsDriver() # runs a Chrome browser; wait for the necessary files to download
```
- Check your Google Chrome version by typing chrome://version/ in the URL field
- Download the matching chromedriver, e.g., if your version is 117.0.5938.132, then your chromedriver must match the major version 117.X.XXXX.XXX
- Driver can be downloaded from https://googlechromelabs.github.io/chrome-for-testing/
- Navigate to C:\Users\USER\AppData\Local\binman\binman_chromedriver\win32\ and make a new directory named after the chromedriver version (e.g., 117.0.5938.132), then save and unpack the chromedriver in this location.
- Note: it may happen that the chromedriver license file has to be deleted.
- Make sure Java is installed; if not, install it. Confirm on Windows via cmd.exe -> java -version
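The same check can be done without leaving R. This is a small sketch using base R's system2(), which prints the Java version string (or fails if java is not on the PATH):

```r
# Ask the JVM for its version; Java prints this to stderr,
# so redirect stderr to capture the output in R.
out <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
cat(out, sep = "\n")
```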
- Go back to R and run:
  - on Windows it should work without any arguments
  - on Arch Linux it works with rD <- RSelenium::rsDriver(chromever = NULL)

```r
library(RSelenium)
rD <- rsDriver()
```
If RSelenium successfully initiates the browser, close it and run the main.R script.
Load all the functions and run the script providing link to your personal library:
```r
# Example of how to run the entire workflow
URL <- "https://lubimyczytac.pl/ksiegozbior/XyeMyrJrGh"
res <- libroScrapeR::run_libroScrapeR(URL = URL)
```
Once it has finished, you should get a data frame comprising all the data describing your reading activity.
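Since the goal is migrating the library to a third-party service, a natural next step is exporting the result to CSV. A minimal sketch, where the dummy data frame stands in for the real result of run_libroScrapeR() so the snippet is self-contained:

```r
# `res` stands in for the data frame returned by run_libroScrapeR();
# a dummy with assumed column names is used here for illustration.
res <- data.frame(
  title  = "Solaris",
  author = "Stanislaw Lem",
  rating = 9,
  stringsAsFactors = FALSE
)

# Write the library to CSV; UTF-8 keeps Polish characters intact,
# which matters when importing into another service.
write.csv(res, "library_export.csv", row.names = FALSE, fileEncoding = "UTF-8")
```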
If the browser has not been initialized, remove license file from your
local binman directory, e.g.,
C:\UsersUSER\AppData\Local\binman\binman_chromedriver\win32\117.0.5938.92\
and run again.
After it is done, make sure to close the driver and/or kill the chromedriver and java processes. Before running the script again, terminate the chromedriver and java processes to release the port; otherwise you will not be able to initialize the script again.
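The cleanup above can be sketched as follows, assuming `rD` is the object returned by rsDriver(). The taskkill/pkill calls are a blunt fallback for orphaned processes; adjust the patterns to your system:

```r
# Close the browser session and stop the Selenium server cleanly.
rD$client$close()
rD$server$stop()

# Fallback: kill any orphaned driver/java processes so the port is freed.
if (.Platform$OS.type == "windows") {
  system("taskkill /im chromedriver.exe /f")
  system("taskkill /im java.exe /f")
} else {
  system("pkill -f chromedriver")
  system("pkill -f 'java.*selenium'")
}
```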