Scribd changes some of its XPaths from time to time, which occasionally breaks the script. I will do my best to keep it updated. If you find errors, please report them.
Any script of this type needs a premium account to work. A free trial account also works.
This version takes a screenshot of every book page and then builds a PDF file from the screenshots. It can optionally generate searchable text (also known as OCR) too.
I also have a version of this script that isn't screenshot-based and is nearly finished; I just need to solve one big problem before releasing it.
- u/Ephemeral-Throwaway
- u/Ronald_Halbal
This project takes one or more Scribd links and generates a PDF with the book's content. It can optionally generate a searchable PDF too.
֎ (Will improve in next update!) ֎
- Downloading a 300-page book: ± 11.5 minutes on my setup.
- OCR-ing 300 pages: ± 8 minutes.
- Total elapsed time downloading and OCR-ing a 300-page book: ± 19.5 minutes
- Non-OCR-ed PDF: 11.5 MB
- OCR-ed PDF: 9.20 MB
This project needs the following Python libraries to work:
- selenium
- fpdf
- Pillow
- ocrmypdf (Optional. Needed for OCR)
- Google Chrome must also be installed. You can get it from here
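A quick way to confirm the libraries above are importable before running the script is sketched below. It is only an illustration: the `REQUIRED` and `OPTIONAL` dictionaries and the `missing_packages` helper are names made up for this example and are not part of the script. Note that Pillow is imported under the module name `PIL`.

```python
import importlib.util

# Display name -> import name. Pillow is imported as "PIL".
REQUIRED = {"selenium": "selenium", "fpdf": "fpdf", "Pillow": "PIL"}
OPTIONAL = {"ocrmypdf": "ocrmypdf"}  # only needed for OCR


def missing_packages(packages):
    """Return the display names of packages whose import name cannot be found."""
    return [
        name
        for name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    print("Missing required libraries:", missing_packages(REQUIRED))
    print("Missing optional libraries:", missing_packages(OPTIONAL))
```

An empty list for the required libraries means the Python side is ready; Google Chrome and the driver still need to be checked separately.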
If you intend to use the OCR function at the end, you will need to install the following programs, as they are needed for running the ocrmypdf module:
- Ended with _IMG.pdf → Version without OCR
- Ended with _OCR.pdf → Version with OCR
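The naming convention above can be sketched as follows. This is a minimal illustration, not the script's actual code: `output_paths` and `run_ocr` are hypothetical helper names, and using the book title as the file name stem is an assumption. `ocrmypdf.ocr(input_file, output_file)` is the Python entry point of the ocrmypdf library; it is imported lazily so the naming helper works even when OCR is not installed.

```python
from pathlib import Path


def output_paths(title, out_dir="."):
    """Return (image-only PDF path, OCR-ed PDF path) for a book title.

    Using the title as the stem is an assumption made for this sketch.
    """
    folder = Path(out_dir)
    return folder / f"{title}_IMG.pdf", folder / f"{title}_OCR.pdf"


def run_ocr(img_pdf, ocr_pdf):
    """Create the searchable PDF from the screenshot-based one."""
    import ocrmypdf  # optional dependency, imported only when OCR is requested

    ocrmypdf.ocr(img_pdf, ocr_pdf)


if __name__ == "__main__":
    img_pdf, ocr_pdf = output_paths("My Book", "output")
    print(img_pdf.name, ocr_pdf.name)
```

Keeping both files means the faster, image-only version is still available if the OCR step fails or is skipped.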
For full details on installation, see: https://tesseract-ocr.github.io/tessdoc/Home.html
- Windows: download the installer from https://github.com/UB-Mannheim/tesseract/wiki
- Linux (Debian/Ubuntu):
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
Ghostscript: just download and install it from: https://www.ghostscript.com/download/gsdnld.html
- Download ChromeDriver from here. Move the driver file to the script's root folder.
- Add your email
- Add your password
- "is_list": set to True if you want to load the links from the book_list.txt file
- "Do_OCR": set to True if you want to run the OCR process automatically after the file is downloaded
Beware! OCR can be time-consuming for big books. Test with small ones first so you know what the output looks like.
- Just paste one link per line.
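How the "is_list" option and the book_list.txt file fit together can be sketched as below. This is an assumption about the behavior based on the description above, not the script's actual code: `load_links` and its `single_link` parameter are hypothetical names. Blank lines and surrounding whitespace are ignored so a trailing newline in the file does no harm.

```python
from pathlib import Path


def load_links(is_list, single_link="", list_path="book_list.txt"):
    """Return the list of book links to process.

    When is_list is True, read one link per line from list_path;
    otherwise fall back to the single link passed in.
    """
    if not is_list:
        return [single_link] if single_link else []
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    # Skip empty lines and trim stray spaces around each link.
    return [line.strip() for line in lines if line.strip()]
```

For example, `load_links(True)` reads book_list.txt from the current folder, while `load_links(False, "<link>")` returns a one-element list.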
This project is for educational purposes only. It is not advised to use this for any type of copyright or Scribd TOS infringement and I am not responsible for any misuse of this piece of code.
- Fix the bug where the 'Save book' popup is not clicked in time for the screenshot and ends up in the final book
- Optimize speed (change every time.sleep() to the element_to_be_clickable() Selenium function)
- Finish and release the text-based version of this script
- Be able to resume already started downloads by checking which pages were already downloaded (already working on this)
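One possible way to resume a download is sketched below, purely as an illustration of the idea and not as the planned implementation. It assumes each page screenshot is saved as page_<number>.png in a single folder; that naming scheme, `PAGE_PATTERN`, and `pages_to_download` are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical naming scheme for saved screenshots: page_0001.png, page_0002.png, ...
PAGE_PATTERN = re.compile(r"page_(\d+)\.png")


def pages_to_download(pages_dir, total_pages):
    """Return the page numbers (1-based) that have no screenshot on disk yet."""
    done = set()
    folder = Path(pages_dir)
    if folder.is_dir():
        for entry in folder.iterdir():
            match = PAGE_PATTERN.fullmatch(entry.name)
            if match:
                done.add(int(match.group(1)))
    return [page for page in range(1, total_pages + 1) if page not in done]
```

The download loop would then visit only the returned page numbers instead of starting from page 1 again.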