A multi-threaded web scraper that downloads all tutorials from www.learncpp.com and converts them to PDF files concurrently.
Please support learncpp.com here: https://www.learncpp.com/about/
Get the image
docker pull amalrajan/learncpp-download:latest
And run the container
docker run --rm --name=learncpp-download --mount type=bind,destination=/app/learncpp,source=/home/amalr/temp/downloads amalrajan/learncpp-download
Replace /home/amalr/temp/downloads with a local path on your system where you want the files downloaded.
You need Python 3.10 and wkhtmltopdf installed on your system.
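A quick way to verify both prerequisites is a small check script. This is a sketch, not part of the repository; the check_prereqs helper is hypothetical:

```python
import shutil
import sys


def check_prereqs() -> list[str]:
    """Return a list of missing prerequisites (hypothetical helper)."""
    missing = []
    if sys.version_info < (3, 10):
        missing.append("Python 3.10+")
    # shutil.which searches PATH for the wkhtmltopdf executable.
    if shutil.which("wkhtmltopdf") is None:
        missing.append("wkhtmltopdf")
    return missing


if __name__ == "__main__":
    problems = check_prereqs()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites found.")
```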
Clone the repository
git clone https://github.com/amalrajan/learncpp-download.git
Install Python dependencies
cd learncpp-download
pip install -r requirements.txt
Run the script
scrapy crawl learncpp
You'll find the downloaded files in the learncpp directory under the repository root.
Rate limit errors:
- Increase DOWNLOAD_DELAY in settings.py from its default of 0 to 0.2.
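For reference, DOWNLOAD_DELAY is a standard Scrapy setting; the change would look like this in settings.py (the 0.2 value follows the suggestion above):

```python
# settings.py (Scrapy project settings)
# Wait 0.2 seconds between consecutive requests to the same domain,
# which helps avoid tripping the site's rate limiting.
DOWNLOAD_DELAY = 0.2
```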
High CPU usage:
- Decrease max_workers in learncpp.py from the default 192 to reduce CPU load.

self.executor = ThreadPoolExecutor(max_workers=192)  # Limit to 192 concurrent PDF conversions
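The effect of max_workers can be seen in a standalone sketch: the pool bounds how many tasks run at once, so lowering it trades throughput for reduced CPU load. The convert function here is a stand-in for the project's PDF conversion, not its actual code:

```python
from concurrent.futures import ThreadPoolExecutor


def convert(page: int) -> str:
    # Stand-in for one PDF conversion task (illustrative only).
    return f"page-{page}.pdf"


# At most 8 conversions run concurrently; submitted tasks beyond
# that limit wait in the pool's internal queue.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(convert, range(4)))

print(results)  # → ['page-0.pdf', 'page-1.pdf', 'page-2.pdf', 'page-3.pdf']
```

executor.map preserves input order in its results even though the tasks themselves may finish out of order.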
Further issues:
- Report at https://github.com/amalrajan/learncpp-download/issues, attaching console logs.