
Simple GoIndex Downloader

Primary language: Jupyter Notebook · License: MIT

Recursive GoIndex Downloader by atlonxp

Features

  • Recursive crawler (atlonxp)
  • Download all folders and files in a given URL (atlonxp)
  • Download all folders and files in sub-folders (atlonxp)
  • Adaptive delay when fetching URLs (atlonxp)
  • Store folders/files directly to your Google Drive (pankaj260)
  • Folder and file exclusion filters (atlonxp)
  • Download queue supported (atlonxp)
  • Auto-domain URL detection (atlonxp)
  • API-based GoIndex crawler (atlonxp, ifvv)
  • Parallel/multiple-file downloader (atlonxp)
  • Auto-skip password-protected folders (cxu-fork)

Upcoming

  • Parallel crawlers

Version 2:

API-based crawler with parallel file downloader
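The sketch below shows roughly what such an API-based listing can look like, assuming the common GoIndex behaviour of answering a POST to a folder path with a small JSON body and returning the entries under a "files" key (sometimes nested under "data"). The function name and example URL are illustrative, not the notebook's actual code.

```python
import requests

def list_folder(base_url, path="/", password=""):
    """Hypothetical helper: list one folder of a GoIndex site.

    Assumes the common GoIndex behaviour of answering a POST to the folder
    path with JSON whose "files" list (sometimes nested under "data")
    describes the entries.
    """
    resp = requests.post(
        base_url.rstrip("/") + path,
        json={"password": password, "page_token": None, "page_index": 0},
        timeout=30,
    )
    resp.raise_for_status()
    payload = resp.json()
    # Some deployments nest the listing under "data", others return it flat.
    listing = payload.get("data", payload) if isinstance(payload, dict) else {}
    return listing.get("files", [])

# Example: print the top-level entries of a (hypothetical) GoIndex site.
for entry in list_folder("https://example-goindex.example.com"):
    print(entry.get("name"), entry.get("size"))
```

Version 2 builds its download queue by applying a listing like this recursively to every folder entry it finds.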

28 April 2020 (v2.4.0)
---------------------

+ added feature: curl download mode as default (we found that requests.get sometimes produced a corrupted file)
+ added feature: file size check; if the size does not match the metadata, we force a re-download (see the sketch after this list)
+ added feature: double file size check; once a file is downloaded, we re-check its size against the metadata
+ revised time delays while crawling and downloading
+ fixed major bugs when checking file size
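A minimal sketch of the size verification described above, assuming the expected size is already known from the crawl metadata. The curl flags are standard, but the helper name and retry count are illustrative rather than taken from the notebook.

```python
import os
import subprocess

def download_with_size_check(url, dest_path, expected_size, max_retries=3):
    """Illustrative: download a file via curl and force a re-download whenever
    the size on disk does not match the size reported by the crawl metadata."""
    for attempt in range(max_retries):
        # Skip the download entirely if a complete copy is already on disk.
        if os.path.exists(dest_path) and os.path.getsize(dest_path) == expected_size:
            return True
        # -L follows redirects, -s silences progress output, -o sets the output path.
        subprocess.run(["curl", "-L", "-s", "-o", dest_path, url], check=False)
        # Double check: compare the downloaded size with the metadata again.
        if os.path.exists(dest_path) and os.path.getsize(dest_path) == expected_size:
            return True
    return False
```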

26 April 2020 (v2.3.3)
---------------------

+ added downloaded size information

22 April 2020 (v2.3.2)
---------------------

+ added summary
+ added an exception when a file cannot be downloaded

21 April 2020 (v2.3.1)
---------------------
While crawling, a fetch may occasionally fail because requests are sent too quickly or the server is busy.
This breaks the JSON parsing, so we re-fetch the URL (up to MAX_RETRY_CRAWLING times)
or until the key "files" is found in the response. If the maximum number of retries is reached and
the key "files" is still not found, we ignore this link (return []); see the sketch below.

At the end, if you find any failures, just re-run the download section. Existing files are skipped,
unless you set OVERWRITE = TRUE, in which case all files will be re-downloaded.

+ added MAX_RETRY_CRAWLING (v2.3)
+ fixed FILE_EXISTING_CHECK (stupid) bug
+ added failure-links download task
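A rough sketch of that retry loop, under the same assumptions about the GoIndex response shape as the earlier sketch. MAX_RETRY_CRAWLING mirrors the constant named above; everything else is illustrative.

```python
import time

import requests

MAX_RETRY_CRAWLING = 5  # mirrors the constant named in this entry

def crawl_folder(url, payload=None):
    """Illustrative: re-fetch a folder listing until the response contains a
    "files" key, or give up after MAX_RETRY_CRAWLING attempts and return []."""
    payload = payload or {"password": "", "page_token": None, "page_index": 0}
    for attempt in range(MAX_RETRY_CRAWLING):
        try:
            data = requests.post(url, json=payload, timeout=30).json()
        except (requests.RequestException, ValueError):
            data = {}
        listing = data.get("data", data) if isinstance(data, dict) else {}
        if "files" in listing:
            return listing["files"]
        # Back off briefly before retrying; the server may simply be busy.
        time.sleep(1 + attempt)
    return []  # retries exhausted without a "files" key: ignore this link
```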

20 April 2020 (v2.2)
---------------------
Some sub-folders may be password-protected, which causes an error while crawling, so we skip such folders (see the sketch below).

+ added auto-skip password-protected folder
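One hedged way to recognise such folders, assuming a password-protected folder answers a listing request made without a password with an error payload, or at least without a "files" list; the exact error shape varies between deployments.

```python
def is_password_protected(listing_response: dict) -> bool:
    """Illustrative guess: treat a listing response that carries an "error"
    entry, or no "files" list at all, as a password-protected folder so the
    crawler can skip it."""
    if "error" in listing_response:
        return True
    data = listing_response.get("data", listing_response)
    return not isinstance(data, dict) or "files" not in data
```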

17 April 2020 (v2.1)
---------------------
+ fixed duplicated URLs when crawling
+ added a search for the 'files' key, since some websites do not return a properly structured file listing

16 April 2020 (v2.0)
---------------------
+ crawler_v2:
	* API-based GoIndex crawler
	* collects all URLs to be downloaded
+ parallel downloader (see the sketch below)
	* TQDM progress bar
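A sketch of what a parallel downloader with a TQDM progress bar can look like, assuming the crawler has already produced a list of (url, destination) pairs. The worker count and helper names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from tqdm import tqdm

def download_one(url, dest_path):
    """Illustrative single-file download used by the worker pool."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    return dest_path

def download_all(tasks, workers=4):
    """Download (url, dest_path) pairs in parallel with a TQDM progress bar."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download_one, url, dest) for url, dest in tasks]
        for future in tqdm(as_completed(futures), total=len(futures), desc="files"):
            future.result()  # re-raise any download error here
```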

Version 1:

Simple HTTP-based crawler and simple sequential downloader

Version 1 was created by adapting and improving the code from pankaj260: https://colab.research.google.com/drive/1tmsLGuswIZIZ_oM35EMW8TbJ6pQPt1rY#scrollTo=3bCnUMUg_SoT&forceEdit=true&sandboxMode=true

15 April 2020 (v1.1)
---------------------
-   Added auto-domain URL detection (see the sketch below)
-   Added simple download queue
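A sketch of what auto-domain URL detection can amount to, assuming it simply means deriving the site root from whatever folder URL is pasted in; the helper name is illustrative.

```python
from urllib.parse import urlsplit

def detect_domain(goindex_url: str) -> str:
    """Illustrative: reduce any pasted GoIndex folder URL to its scheme + host,
    so deeper links can be resolved against a single root."""
    parts = urlsplit(goindex_url)
    return f"{parts.scheme}://{parts.netloc}"

# detect_domain("https://example-goindex.example.com/some/folder/")
# -> "https://example-goindex.example.com"
```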

14 April 2020 (v1.0)
---------------------
-   Initial release