Geeks for Geeks PDFs

Download the PDFs from the releases page.

I started in 2015 from @gnijuohz's repo, but now (in 2018) I've re-written pretty much every part of the process.

Dependencies

docopt
- Basic CLI in scripts
requests & requests_cache
- To download pages and cache the result locally
lxml
- Cleaning of the downloaded pages
pandoc & xelatex
- Convert the cleaned pages to PDF
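
Before running anything, it can help to confirm that the dependencies above are actually installed. The following is a minimal sketch, not part of this repo: the helper name `missing_dependencies` is hypothetical, and it assumes each Python package is importable under the name listed above and that `pandoc` and `xelatex` are on your `PATH`.

```python
import importlib.util
import shutil

# Names taken from the dependency list above.
PYTHON_MODULES = ["docopt", "requests", "requests_cache", "lxml"]
EXECUTABLES = ["pandoc", "xelatex"]


def missing_dependencies(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return the names of modules and executables that cannot be found.

    Hypothetical helper for illustration; it only checks presence,
    not versions.
    """
    # find_spec returns None when a top-level module is not importable.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "nothing")
```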

Running the code

First, find a "topic URL" for what you want to download, e.g.:
- https://www.geeksforgeeks.org/tag/samsung/
- https://www.geeksforgeeks.org/category/dynamic-programming/
Create a JSON file containing links to all posts on that topic
- python3.6 list_links.py https://www.geeksforgeeks.org/tag/samsung/
- This JSON can now be edited by hand, to remove some links, re-order them etc.
Now fetch the actual posts
- python3.6 download_html.py JSON/Samsung.json
Finally, convert the HTML to a PDF using Pandoc
- python3.6 html_to_pdf.py HTML/Samsung.html
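
The three steps above can be chained from a small driver script. This is only a sketch under stated assumptions: the functions `pipeline_commands` and `run_pipeline` are hypothetical, and the `JSON/<name>.json` and `HTML/<name>.html` paths are inferred from the Samsung example, so the name has to be supplied by hand. It defaults to a dry run that only prints the commands, since running everything in one go skips the chance to edit the JSON between steps.

```python
import subprocess


def pipeline_commands(topic_url, name, python="python3.6"):
    """Return the three commands from the steps above, in order.

    `name` is assumed to match the file names the scripts produce,
    e.g. "Samsung" for JSON/Samsung.json and HTML/Samsung.html.
    """
    return [
        [python, "list_links.py", topic_url],
        [python, "download_html.py", f"JSON/{name}.json"],
        [python, "html_to_pdf.py", f"HTML/{name}.html"],
    ]


def run_pipeline(topic_url, name, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in pipeline_commands(topic_url, name):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("https://www.geeksforgeeks.org/tag/samsung/", "Samsung")
```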

Things will work only if you're really lucky. This project has taught me how fragile my HTML to PDF pipeline really is. There are just too many things that can go wrong.

What could go wrong

The PDF engine that pandoc calls may fail!
- In that case, convert the HTML to TeX
- Then run pandoc on the TeX file in verbose mode
- and manually fix the TeX file
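
The debugging steps above could look roughly like the following. This is a sketch, not what `html_to_pdf.py` actually does: the function `debug_commands` is hypothetical, the output file names are simply the input name with a different extension, and it assumes pandoc 2.0 or later, where the engine flag is `--pdf-engine`.

```python
from pathlib import Path


def debug_commands(html_path):
    """Build two pandoc invocations: HTML to TeX, then TeX to PDF (verbose).

    Hypothetical helper for illustration. It only builds the commands;
    pass each one to subprocess.run to execute it.
    """
    html = Path(html_path)
    tex = html.with_suffix(".tex")
    pdf = html.with_suffix(".pdf")
    return [
        # --standalone writes a complete TeX document that can be edited by hand.
        ["pandoc", "--standalone", str(html), "-o", str(tex)],
        # --verbose prints diagnostic output, which helps locate the failing TeX.
        ["pandoc", "--verbose", "--pdf-engine=xelatex", str(tex), "-o", str(pdf)],
    ]


if __name__ == "__main__":
    for cmd in debug_commands("HTML/Samsung.html"):
        print(" ".join(cmd))
```

After fixing the TeX file by hand, rerun only the second command until the PDF builds.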

Topic URLs

List of topic URLs that I've fetched. You can download these from the releases page.

Algorithms

Data Structures

Companies
