/archival-web-spider

For educational purposes only. Uses Python with BeautifulSoup and Selenium to get the work done.

Primary language: Python
License: Creative Commons Zero v1.0 Universal (CC0-1.0)

Archival Spider

An efficient means of documenting your project's information.


Inspired By

This project is inspired by archive.org and by other free archival and curation projects. It is intended to serve a broad audience.

About

A Python project that mainly uses BeautifulSoup and Selenium WebDriver to crawl websites and retrieve their resources, so that you can keep a personal record of the documentation you have studied. It is not meant to be used without the webmaster's permission; it is for learning purposes only. We do not encourage you to breach the terms of any website.
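One way to respect a site's wishes before crawling is to check its robots.txt. The sketch below is not taken from this project; it assumes Python's standard urllib.robotparser module, and the function name is_allowed and the user agent string "archival-spider" are hypothetical names chosen for illustration. A robots.txt check does not replace explicit permission from the webmaster.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    is_allowed is a hypothetical helper, not part of this project.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    # "archival-spider" is an assumed user agent name.
    print(is_allowed(rules, "archival-spider", "https://example.com/docs/"))
    print(is_allowed(rules, "archival-spider", "https://example.com/private/page"))
```

In a real crawl, the robots.txt text would first be downloaded from the target site, for example with RobotFileParser.set_url followed by read.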

To-do

Troubleshooting

Common Issues:

Chrome not running!

For errors such as selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash, do the following:

Run ps aux and check whether multiple Chrome or chromedriver processes are still running. On Linux, killall -9 chromedriver and killall -9 chrome free up those processes so the app can run again. On Windows, the command is taskkill /F /IM chrome.exe. This is usually the result of a crash mid-run and is easy to fix.
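The cleanup commands above can also be wrapped in a small Python helper. This is a minimal sketch, not part of the project: the function names cleanup_commands and kill_stale_browsers are hypothetical, and it assumes that any non-Windows system provides killall. Be aware that these commands also close any Chrome windows you have open for normal browsing.

```python
import platform
import subprocess


def cleanup_commands(system=None):
    """Return the kill commands for the given OS name (hypothetical helper)."""
    system = system or platform.system()
    if system == "Windows":
        return [["taskkill", "/F", "/IM", "chrome.exe"]]
    # Assumption: Linux and other Unix-like systems have killall available.
    return [["killall", "-9", "chromedriver"], ["killall", "-9", "chrome"]]


def kill_stale_browsers(system=None):
    """Run each cleanup command, ignoring 'no process found' failures."""
    for command in cleanup_commands(system):
        try:
            # check=False: a non-zero exit just means nothing was running.
            subprocess.run(command, check=False)
        except FileNotFoundError:
            # The kill tool itself is missing on this system.
            print("command not available:", command[0])


if __name__ == "__main__":
    kill_stale_browsers()
```

Calling kill_stale_browsers() before starting a new run clears leftovers from an earlier crash.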

..."encodings\cp1252.py", line 19, in encode...

UnicodeEncodeError: 'charmap' codec can't encode characters in position XXXX-YYYY: character maps to <undefined>

This is a Windows encoding issue. It may be possible to fix it by running the following commands before running the script:

set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
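As an alternative to setting environment variables, the output streams can be switched to UTF-8 from inside a script. The sketch below is an assumption about how this could be done, not the project's own code: it relies on the reconfigure method that text streams have offered since Python 3.7, and use_utf8 is a hypothetical helper name.

```python
import io
import sys


def use_utf8(stream) -> bool:
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the stream was reconfigured, False otherwise.
    use_utf8 is a hypothetical helper, not part of this project.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        # Some replaced or wrapped streams do not support reconfigure().
        return False
    # errors="replace" avoids a crash if a character still cannot be written.
    reconfigure(encoding="utf-8", errors="replace")
    return True


if __name__ == "__main__":
    use_utf8(sys.stdout)
    use_utf8(sys.stderr)
    print("Arrow: \u2192")
```

Calling use_utf8 at the top of the script, before anything is printed, keeps characters outside cp1252 from raising UnicodeEncodeError on write.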

Donate

Donate if you can spare a few bucks for pizza, coffee or just general sustenance. I appreciate it.
