/deskchan-python-plugins

DeskChan plugin: Web crawler for the imageboard 2ch

Primary LanguagePythonMIT LicenseMIT

PyPI - Python Version DeskChan Version GitHub TeamCity CodeBetter

Web-Crawler for the russian imageboard 2ch

Python plugin for the project DeskChan

Actually in progress

  • Using the html-parser libraries, get the URLs of all media files in a thread
  • Write all URLs with a timestamp in a structured file
  • Create subfolder with timestamp for each thread-URL
  • Adapt this script to the DeskChan project (proxy3.py)

For future projects

  • Structured text extraction
  • Crawling for 4ch
  • Universal crawling of the imageboards
  • Tagging and sorting algorithms

Features

  • You have a URL from a 2ch thread.
  • You put the URL from a 2ch thread into the script
  • The script uses a parser searching all media files in this thread
  • The script writes all direct URLs into a thread.txt file
  • The script writes all files of the thread into a subfolder with a timestamp

Needed Python libraries:

For Windows user: You will need to install Python 3. While the installation process you have to "add Python 3.x to PATH". After the reboot type 'python' in your command prompt. If you don't get an error, your installation was successful. Notice you can also use the pip for automatic updates of the (also missing) libraries.

Possibility Nr. 1

To update pip himself, type in your command prompt:

python -m pip install --upgrade pip

To download the Requests library:

pip install requests

To download the BeautifulSoup library:

pip install beautifulsoup4

Now you should be ready for using this plugin.

Possibility Nr. 2

The alternative possibility is to use the prepared requirements.txt file. Type in the plugin directory in your command prompt:

pip install -r requirements-to-freeze.txt --upgrade

And then:

pip freeze > requirements.txt