Some time ago, I wanted to experiment a bit with book recommendation systems. I could not find a good Persian dataset of books that included their metadata (title, authors, translators, ratings, etc.). There were some English ones, mainly scraped from Goodreads lists, but I didn't like them either. So I decided to create a Persian dataset of books and learn how to use Selenium for web scraping along the way, which is something I had wanted to learn for quite some time.
So, this is how this repository was born:)
So far I have written some code to extract an individual book's information from its webpage. In particular I extract:
- URL
- Title
- Authors and Translators
- ISBN
- Genres
- Publisher
- Publication date
- Publication count
- Ratings
- Number of pages
- Description
- Image URL
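As a rough illustration, the fields listed above could be grouped into a single record type like the one below. This is only a sketch: the class name `BookRecord`, the field names, the types, and the example URL are all assumptions made for illustration, and they are not necessarily the keys used by the actual scraper or dataset.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class BookRecord:
    # Hypothetical field names mirroring the list above; the real keys may differ.
    url: str
    title: str
    authors: List[str] = field(default_factory=list)
    translators: List[str] = field(default_factory=list)
    isbn: Optional[str] = None
    genres: List[str] = field(default_factory=list)
    publisher: Optional[str] = None
    publication_date: Optional[str] = None
    publication_count: Optional[int] = None
    rating: Optional[float] = None
    num_pages: Optional[int] = None
    description: Optional[str] = None
    image_url: Optional[str] = None


def to_json_line(record: BookRecord) -> str:
    # ensure_ascii=False keeps Persian text readable instead of \uXXXX escapes.
    return json.dumps(asdict(record), ensure_ascii=False)


# Placeholder values, not real data from the dataset.
example = BookRecord(url="https://example.com/book/1", title="نمونه", authors=["نویسنده"])
line = to_json_line(example)
```

Giving the optional fields a default of `None` or an empty list lets a record be built even when a book page is missing some of the information.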
This code scrapes data from this website.
In the future, I should also add a crawler to automatically extract the URL of book pages and use it along with the current code to create a dataset.
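One way such a crawler could collect book page URLs is to parse a listing page and keep only the links that look like book pages. The sketch below uses only the Python standard library. The `/book/` path marker, the `example.com` domain, and the names `BookLinkCollector` and `extract_book_urls` are assumptions for illustration, since the real site's URL layout is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BookLinkCollector(HTMLParser):
    """Collects unique absolute URLs of anchors whose href contains a marker."""

    def __init__(self, base_url, marker="/book/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed pattern for book pages
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            absolute = urljoin(self.base_url, href)
            # Skip links already seen on this page.
            if absolute not in self.links:
                self.links.append(absolute)


def extract_book_urls(listing_html, base_url):
    collector = BookLinkCollector(base_url)
    collector.feed(listing_html)
    return collector.links


# Small inline sample standing in for a downloaded listing page.
sample_html = (
    '<a href="/book/1">A</a><a href="/about">About</a>'
    '<a href="/book/2">B</a><a href="/book/1">A again</a>'
)
book_urls = extract_book_urls(sample_html, "https://example.com")
```

The resulting list of URLs could then be handed to the existing per-book extraction code.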
UPDATE: Dataset creation is finished. Using 24 threads of execution, it took ~4 hours to scrape 26,202 records. The dataset is stored in books.json and can also be downloaded from the zip file in the data directory.
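For reference, 26,202 records in about 4 hours works out to roughly 1.8 records per second overall. A multi-threaded run of this kind could be organized with a thread pool, as in the minimal sketch below. It is not the repository's actual code: `scrape_book` is a hypothetical stand-in that returns a stub record instead of opening a browser, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_book(url):
    # Hypothetical placeholder for the real per-page scraper.
    # It returns a stub record so the sketch runs without network access.
    return {"url": url, "title": None}


def scrape_all(urls, max_workers=24):
    # Threads suit this workload because each task mostly waits on page loads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(scrape_book, urls))


# Placeholder URLs, not real book pages.
page_urls = [f"https://example.com/book/{i}" for i in range(100)]
scraped = scrape_all(page_urls)
```

In a real run, each worker would likely need its own browser instance, since sharing one driver across threads is generally not safe.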
Then perhaps a little bit of cleaning, removing duplicates and such, and that will be it.
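Removing duplicates could be as simple as keeping the first record seen for each key. The sketch below assumes each record is a dictionary with a `url` key, which is a guess about the dataset's layout; the function name `deduplicate` and the sample records are likewise illustrative.

```python
def deduplicate(records, key="url"):
    """Keep the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        # Records without the key are kept, since we cannot tell whether they repeat.
        if value is None:
            unique.append(record)
            continue
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Placeholder records, not real data from the dataset.
raw_records = [
    {"url": "https://example.com/book/1", "title": "A"},
    {"url": "https://example.com/book/2", "title": "B"},
    {"url": "https://example.com/book/1", "title": "A"},
]
cleaned = deduplicate(raw_records)
```

Passing `key="isbn"` instead would merge records that share an ISBN, which may or may not be desirable if different editions of a book reuse one.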
I have some vague ideas about creating a graph from this dataset and running some GNNs on it, but there is no time for that now, so maybe in some distant future!