Scraping book information from the Internet

Some time ago, I wanted to experiment a bit with book recommendation systems. I could not find any good Persian dataset of books with their information (title, authors, translators, ratings, etc.). There were some English ones, mainly scraped from Goodreads lists, but I didn't like them either. So I decided to create a Persian dataset of books and learn how to use Selenium for web scraping along the way, something I had wanted to learn for quite some time.

So, this is how this repository was born:)

So far I have written some code to extract an individual book's information from its webpage. In particular I extract:

- URL
- Title
- Authors and Translators
- ISBN
- Genres
- Publisher
- Publication date
- Publication count
- Ratings
- Number of pages
- Description
- Image URL
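
As a rough illustration of what one extracted record could look like, here is a minimal sketch that maps the fields listed above onto a Python dataclass. The class name `BookRecord`, the field names, and the field types are assumptions made for this example; they are not taken from the repository's code or from the actual layout of books.json.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional
import json


@dataclass
class BookRecord:
    """One scraped book. Names and types are illustrative assumptions."""
    url: str
    title: str
    authors: List[str] = field(default_factory=list)
    translators: List[str] = field(default_factory=list)
    isbn: Optional[str] = None
    genres: List[str] = field(default_factory=list)
    publisher: Optional[str] = None
    publication_date: Optional[str] = None
    publication_count: Optional[int] = None
    rating: Optional[float] = None
    num_pages: Optional[int] = None
    description: Optional[str] = None
    image_url: Optional[str] = None

    def to_json(self) -> str:
        # ensure_ascii=False keeps Persian text readable instead of \u escapes.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Fields that a page does not provide simply stay `None` (or an empty list), so every record serializes with the same set of keys.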

This code scrapes data from this website.

~~In the future, I should also add a crawler to automatically extract the URL of book pages and use it along with the current code to create a dataset.~~
UPDATE: Dataset creation is finished. Using 24 threads of execution, it took ~4 hours to scrape 26,202 records. The dataset is stored in books.json and can also be downloaded from the zip file in the data directory.
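
A multi-threaded run like the one described above can be sketched with the standard library's `concurrent.futures`. This is only an illustration under assumptions: `scrape_all` and `save_records` are hypothetical helpers, and `scrape_book` stands in for whatever function turns one book URL into a dictionary of fields; the repository's actual code may be organized differently.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def scrape_all(urls: Iterable[str],
               scrape_book: Callable[[str], Dict],
               max_workers: int = 24) -> List[Dict]:
    """Run scrape_book over all URLs in a thread pool and collect the results."""
    records: List[Dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported.
        futures = {pool.submit(scrape_book, url): url for url in urls}
        for future in as_completed(futures):
            try:
                records.append(future.result())
            except Exception as exc:
                # One failing page should not abort the whole run.
                print(f"failed: {futures[future]}: {exc}")
    return records


def save_records(records: List[Dict], path: str = "books.json") -> None:
    """Write all records to a single UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

Threads fit this workload because scraping is dominated by waiting on the network and the browser, not by Python computation. Records come back in completion order, not in the order of the input URLs.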

What remains is perhaps a little bit of cleaning, removing duplicates and such, and that will be it.
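
One possible way to remove duplicates is sketched below. It assumes each record is a dictionary with optional `"isbn"` and `"url"` keys, which is a guess about the data layout, and it treats two records as duplicates when their normalized ISBNs match, falling back to the URL when no ISBN is present. Whether that is the right notion of "duplicate" for this dataset is an open choice.

```python
from typing import Dict, Iterable, List


def deduplicate(records: Iterable[Dict]) -> List[Dict]:
    """Keep the first record seen for each key (ISBN if present, else URL)."""
    seen = set()
    unique: List[Dict] = []
    for record in records:
        # Normalize the ISBN by dropping hyphens and spaces.
        isbn = (record.get("isbn") or "").replace("-", "").replace(" ", "")
        url = record.get("url")
        if isbn:
            key = ("isbn", isbn)
        elif url:
            key = ("url", url)
        else:
            # Nothing to compare on, so keep the record as it is.
            unique.append(record)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

Keeping the first occurrence preserves the original order of the records; a stricter cleanup might instead merge duplicates and keep the most complete one.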

I have some vague ideas about creating a graph from this dataset and running some GNNs on it, but there is no time for that now, so maybe in some distant future!

conflictednerd/Persian-Book-Dataset
