Project inspired by archive.org and other free archival and curation projects, intended to serve a broader public with a larger objective.
A Python project that mainly uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, keeping a personal record of documentation you have studied. Not meant to be used without the webmaster's permission; this is for learning purposes only. We do not encourage you to breach the terms of any website.
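As a rough sketch of the crawling approach (the function name and sample HTML below are illustrative, not the project's actual code), BeautifulSoup can be used to collect a page's resource URLs before downloading them:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Illustrative HTML; in the real crawler this would come from
# Selenium (driver.page_source) or an HTTP response body.
html = """
<html><head>
  <link rel="stylesheet" href="/static/style.css">
  <script src="https://cdn.example.com/app.js"></script>
</head><body>
  <img src="images/logo.png">
  <a href="/docs/page2.html">next</a>
</body></html>
"""

def resource_urls(html, base_url):
    """Collect absolute URLs for stylesheets, scripts, and images."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag, attr in (("link", "href"), ("script", "src"), ("img", "src")):
        for node in soup.find_all(tag):
            if node.get(attr):
                # Resolve relative paths against the page URL.
                urls.append(urljoin(base_url, node[attr]))
    return urls

print(resource_urls(html, "https://example.com/docs/"))
```

In the actual crawler, each collected URL would then be fetched and written to the per-domain archive.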
- The included scripts should be able to:
  - Follow the principles of deduplication-based filesystems, such as Duplicacy (cloud backup tool), Borg (deduplicating archiver), and SDFS (deduplicating filesystem)
  - Permit elastic mapping, leaving external scripts stored on their CDNs to use network bandwidth instead of local storage
  - Inline styles using Pynliner, a CSS-to-inline-styles conversion tool
  - Follow principles of mind mapping and memory techniques
  - Make decentralization possible by letting websites be browsed offline, saved per domain
  - Be installable as a pip package
  - Zip output to minimize manual operations by automating and streamlining them
  - Offer a silent mode
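The deduplication item above can be sketched as content-addressed storage: hash each downloaded resource and keep a given blob only once, the way Borg and similar tools do at the chunk level. The store layout here is a simplified assumption, not the project's actual format:

```python
import hashlib
import os
import tempfile

def store_blob(store_dir, data):
    """Store `data` under its SHA-256 digest; identical content is kept once."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):  # deduplication: skip already-stored content
        with open(path, "wb") as f:
            f.write(data)
    return digest

# Two pages embedding the same stylesheet produce a single stored blob.
store = tempfile.mkdtemp()
a = store_blob(store, b"body { margin: 0; }")
b = store_blob(store, b"body { margin: 0; }")
assert a == b and len(os.listdir(store)) == 1
```

A real deduplicating store would also chunk large files and keep reference counts, but the principle of "one hash, one copy" is the same.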
With issues like:

```
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
```

do the following: run `ps aux` and check whether multiple stale browser processes are running. On Linux, `killall -9 chromedriver` and `killall -9 chrome` will free up those processes so you can run the app again. On Windows, the command is `taskkill /F /IM chrome.exe`. This is usually the result of a crash mid-run, and is easily fixed.
If you see:

```
UnicodeEncodeError: 'charmap' codec can't encode characters in position XXXX-YYYY: character maps to <undefined>
```

this is a Windows encoding issue, and it may be possible to fix it by running the following commands before running the script:

```
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
```
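Alternatively, on Python 3.7+ the script itself can force UTF-8 on its standard streams instead of relying on environment variables. This is a general workaround, not something the project does out of the box:

```python
import sys

# Reconfigure stdout/stderr to UTF-8 so Windows' default 'charmap'
# codec is never used when printing scraped text.
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
sys.stderr.reconfigure(encoding="utf-8", errors="replace")

print("non-ASCII sample: café, 日本語")
```

`errors="replace"` substitutes any character the console still cannot display rather than raising an exception mid-crawl.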
Donate if you can spare a few bucks for pizza, coffee or just general sustenance. I appreciate it.