
Topic wise PDFs of Geeks for Geeks articles. (Last updated in October 2018)


Geeks for Geeks PDFs

Table of Contents of the Dynamic Programming Book.

Download the PDFs from the releases page.

I started in 2015 from @gnijuohz's repo, but now (in 2018) I've re-written pretty much every part of the process.

Dependencies

  • docopt

    • Basic CLI in scripts
  • requests & requests_cache

    • To download pages and cache the result locally
  • lxml

    • Cleaning of the downloaded pages
  • pandoc & xelatex

    • Convert the cleaned pages to PDF
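
Before running anything, it may help to check that the dependencies above are actually installed. The following is a minimal sketch and not part of the repository: the function name `missing_dependencies` is hypothetical, and it assumes the Python packages import as `docopt`, `requests`, `requests_cache` and `lxml`, and that the external tools are on `PATH` as `pandoc` and `xelatex`.

```python
import importlib.util
import shutil

# Assumed import names for the Python packages listed above.
PYTHON_MODULES = ["docopt", "requests", "requests_cache", "lxml"]
# Assumed executable names for the external tools listed above.
EXTERNAL_TOOLS = ["pandoc", "xelatex"]


def missing_dependencies():
    """Return the names of modules and tools that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [m for m in PYTHON_MODULES if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing += [t for t in EXTERNAL_TOOLS if shutil.which(t) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

An empty result only means everything was found; it says nothing about whether the installed versions work together.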

Running the code

  1. First, find the "topic URL" for what you want to download, e.g.:

    • https://www.geeksforgeeks.org/tag/samsung/
    • https://www.geeksforgeeks.org/category/dynamic-programming/
  2. Create a JSON file containing links to all posts on that topic

    • python3.6 list_links.py https://www.geeksforgeeks.org/tag/samsung/

    • This JSON file can now be edited by hand to remove links, reorder them, etc.

  3. Now fetch the actual posts

    • python3.6 download_html.py JSON/Samsung.json
  4. Finally, convert the HTML to a PDF using Pandoc

    • python3.6 html_to_pdf.py HTML/Samsung.html
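
The steps above can be collected into one place. This is only a sketch under stated assumptions: the helper `pipeline_commands` is hypothetical, and the base name (`Samsung` here) is passed in explicitly because how the scripts derive file names from a topic URL is not documented here. The script names and the `JSON/` and `HTML/` paths are taken from the commands above.

```python
def pipeline_commands(topic_url, name, python="python3.6"):
    """Build the three commands from the steps above, in order.

    `name` is the base name of the generated files (e.g. "Samsung").
    It is an explicit argument because the naming rule is an unknown here.
    """
    return [
        # Step 2: collect the post links for the topic into a JSON file.
        [python, "list_links.py", topic_url],
        # Step 3: fetch the posts listed in that JSON file.
        [python, "download_html.py", f"JSON/{name}.json"],
        # Step 4: convert the downloaded HTML to a PDF.
        [python, "html_to_pdf.py", f"HTML/{name}.html"],
    ]


if __name__ == "__main__":
    url = "https://www.geeksforgeeks.org/tag/samsung/"
    for cmd in pipeline_commands(url, "Samsung"):
        print(" ".join(cmd))
```

Each command could be passed to `subprocess.run(cmd, check=True)`. Running them one at a time leaves room to edit the JSON file by hand after the first step.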

Things will work only if you're really lucky. This project has taught me how fragile my HTML to PDF pipeline really is. There are just too many things that can go wrong.

What could go wrong

  • The PDF engine that pandoc calls may fail!
    • In that case, convert the HTML to TeX,
    • then run pandoc on the TeX file in verbose mode,
    • and manually fix the TeX file.
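
The fallback described above might look like the following. Treat it as a sketch, not the project's own procedure: the helper `fallback_commands` is hypothetical, writing the `.tex` and `.pdf` files next to the HTML file is an assumption, and `xelatex` is chosen only because it is listed under the dependencies. `-o`, `--pdf-engine` and `--verbose` are pandoc options; pandoc releases before 2.0 call the engine option `--latex-engine` instead.

```python
from pathlib import Path


def fallback_commands(html_path, pdf_engine="xelatex"):
    """Build the two pandoc commands for the manual HTML -> TeX -> PDF route."""
    html = Path(html_path)
    # Assumption: intermediate and final files sit beside the HTML file.
    tex = html.with_suffix(".tex")
    pdf = html.with_suffix(".pdf")
    return [
        # 1. Convert the HTML to TeX so it can be inspected and edited.
        ["pandoc", html.as_posix(), "-o", tex.as_posix()],
        # 2. Build the PDF from the TeX file, with verbose engine output.
        ["pandoc", tex.as_posix(), "-o", pdf.as_posix(),
         f"--pdf-engine={pdf_engine}", "--verbose"],
    ]


if __name__ == "__main__":
    for cmd in fallback_commands("HTML/Samsung.html"):
        print(" ".join(cmd))
```

If the second command fails, the verbose output should point at the offending part of the TeX file, which can then be fixed by hand before rerunning that command.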

Topic URLs

List of topic URLs that I've fetched. You can download the resulting PDFs from the releases page.

Algorithms

Data Structures

Companies