kanahia 09 October, 2023
This package was created as a side project that started as a loose conversation at work. The objective was to extract data from a popular Polish website used for cataloguing read books. Unfortunately, the website does not provide a user-friendly option to export the library and its associated data, so I decided to experiment with web scraping to extract that information. It may be useful for anyone who wants to transfer their library to a third-party service, such as a more international one.
Depending on the size of the user's library, the script may take a while. On average, scraping 30 pages (~600 books) takes around 8-9 minutes.
At the moment the script extracts the following fields:
- Title
- Author
- Shelf on which the book is stored
- Date read
- Rating given
- User review
- Book ISBN number
As input, the user should provide an individual link to their own library, which can be generated after logging in, as in the example below:
The script completes with a final table summarizing all read books and their associated details.
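To give a feel for the shape of that final table, here is a minimal sketch built by hand. The column names and placeholder values below are assumptions for illustration, not the package's documented schema:

```r
# Hypothetical sketch of the final table returned by the scraper.
# Column names and values are illustrative assumptions only.
books <- data.frame(
  title     = c("Solaris", "Lalka"),
  author    = c("Stanislaw Lem", "Boleslaw Prus"),
  shelf     = c("Read", "Read"),
  date_read = as.Date(c("2023-05-01", "2023-07-15")),
  rating    = c(9, 8),
  review    = c("", "A classic."),
  isbn      = c("9780000000001", "9780000000002"),  # placeholder ISBNs
  stringsAsFactors = FALSE
)
str(books)  # one row per scraped book
```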
The package is written in R, and only minimal knowledge of R is required to run the script successfully.
To run the script, three steps must be undertaken. First of all, the package relies on the RSelenium package. The user must therefore check their Google Chrome version and download the matching chromedriver.
- Driver can be downloaded from the following page: https://googlechromelabs.github.io/chrome-for-testing/
- Make sure Java is installed.
This step initializes the required driver directory skeleton. After running this chunk, an error may occur, but wait until all necessary files are downloaded. At the end, inspect your local directory for the presence of the binman directories.
In particular, check whether the binman_chromedriver subdirectory was created:
- Windows: C:\Users\USER\AppData\Local\binman (alongside several other binman directories)
- Unix: ~/.local/share/binman_chromedriver/
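You can also check this from within R. The binman package (a dependency of RSelenium) can list the driver versions it has downloaded locally; this is a sketch, and an empty result simply means the skeleton has not been created yet:

```r
library(binman)

# List the chromedriver versions binman knows about on this machine.
# If the directory skeleton was created, the matching version(s)
# should appear here; otherwise the result is empty.
binman::list_versions("chromedriver")
```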
```r
library(RSelenium)
rD <- rsDriver() # runs a Chrome browser; wait for the necessary files to download
```
- Check your Google Chrome version by typing chrome://version/ in the URL field
- Download the matching chromedriver, e.g., if your version is 117.0.5938.132, then your chromedriver must match the major version 117.X.XXXX.XXX
- Driver can be downloaded from https://googlechromelabs.github.io/chrome-for-testing/
- Navigate to C:\Users\USER\AppData\Local\binman\binman_chromedriver\win32\ and make a new directory named after the chromedriver version (e.g., 117.0.5938.132), then save and unpack the chromedriver in this location.
- Note: it may happen that the chromedriver license file has to be deleted.
- Make sure Java is installed; if not, install it. Confirm on Windows via cmd.exe -> java -version
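The same check can be done without leaving R. This is a small sketch using base R's system2(), which prints the Java version string (or fails if java is not on the PATH):

```r
# Ask the JVM for its version; Java prints this to stderr,
# so redirect stderr to capture the output in R.
out <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
cat(out, sep = "\n")
```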
- Go back to R and run:
  - on Windows it should work without any arguments
  - on Arch Linux it works with rD <- RSelenium::rsDriver(chromever = NULL)

```r
library(RSelenium)
rD <- rsDriver()
```
If RSelenium successfully initiates the browser, close it and run the main.R script.
Load all the functions and run the script providing link to your personal library:
```r
# Example of how to run the entire workflow
URL <- "https://lubimyczytac.pl/ksiegozbior/XyeMyrJrGh"
res <- libroScrapeR::run_libroScrapeR(URL = URL)
```
Once it has finished, you should get a data frame comprising all the data describing your reading activity.
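Since the goal is migrating the library to a third-party service, a natural next step is exporting the result to CSV. A minimal sketch, where the dummy data frame stands in for the real result of run_libroScrapeR() so the snippet is self-contained:

```r
# `res` stands in for the data frame returned by run_libroScrapeR();
# a dummy with assumed column names is used here for illustration.
res <- data.frame(
  title  = "Solaris",
  author = "Stanislaw Lem",
  rating = 9,
  stringsAsFactors = FALSE
)

# Write the library to CSV; UTF-8 keeps Polish characters intact,
# which matters when importing into another service.
write.csv(res, "library_export.csv", row.names = FALSE, fileEncoding = "UTF-8")
```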
If the browser has not been initialized, remove license file from your
local binman directory, e.g.,
C:\UsersUSER\AppData\Local\binman\binman_chromedriver\win32\117.0.5938.92\
and run again.
After it is done, make sure to close the driver and/or kill the chromedriver and java processes. Before running the script again, terminate the chromedriver and java processes to release the port; otherwise you will not be able to initialize the script again.
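The cleanup above can be sketched as follows, assuming `rD` is the object returned by rsDriver(). The taskkill/pkill calls are a blunt fallback for orphaned processes; adjust the patterns to your system:

```r
# Close the browser session and stop the Selenium server cleanly.
rD$client$close()
rD$server$stop()

# Fallback: kill any orphaned driver/java processes so the port is freed.
if (.Platform$OS.type == "windows") {
  system("taskkill /im chromedriver.exe /f")
  system("taskkill /im java.exe /f")
} else {
  system("pkill -f chromedriver")
  system("pkill -f 'java.*selenium'")
}
```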