PyScrapAcademy

Disclaimer

INFORMATIONAL PURPOSES ONLY: I created this script for personal use only. You may not scrape the content of modules that you do not own.

I am not responsible if you leak any modules!

Information

This script scrapes content from HackTheBox Academy modules so you can keep clear track of your progress or build a nice cheat sheet. It extracts the content into multiple Markdown files and stores them in a newly created folder. The script is written in Python 3.

Dependencies

This web scraper relies on the following Python libraries:

Beautiful Soup 4: A library for pulling data out of HTML and XML files, making it easy to navigate and search the parse tree.
Requests: A library for making HTTP requests in Python, providing a simple and convenient way to interact with web services.
PyYAML: A YAML parser and emitter for Python, allowing you to easily read and write YAML files.

Features

Scrape links from a web page
Save the list of links to a text file
Extract content from each link
Save the content of each link in a separate Markdown file
Organize the scraped content into folders
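
The first two features (collecting links and saving them to a text file) can be sketched as follows. This is only an illustration, not the code in bot.py: it uses the standard library's html.parser as a dependency-free stand-in for Beautiful Soup, and the names LinkCollector, extract_links and save_links are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep the first occurrence only
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the absolute, de-duplicated links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def save_links(links, path):
    """Write one link per line to a text file."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")
```

In the real script the HTML would come from an authenticated request to the module page. Here any HTML string works, which makes the functions easy to try out offline.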

Installation

Clone the repository:

git clone https://github.com/Oxooi/PyScrapAcademy.git

Change the directory to the project folder:

cd PyScrapAcademy

Install the required dependencies:

pip install -r requirements.txt

Configuration

In the config folder, rename config.example.yaml to config.yaml.
Open the config.yaml file and set the following parameters:

url: The URL of the web page you want to scrape
file: The name of the text file where the list of links will be saved
cookies: The cookies to send with each request (the htb_academy_session cookie)
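
To catch a missing parameter before any request is made, the config could be checked at startup. The sketch below is an assumption about how that might look, not what bot.py does: load_config and validate_config are hypothetical names, the path config/config.yaml is inferred from the steps above, and the three keys are assumed to sit at the top level of the file.

```python
REQUIRED_KEYS = ("url", "file", "cookies")


def validate_config(config):
    """Raise ValueError if a required parameter is missing or empty."""
    if not isinstance(config, dict):
        raise ValueError("config.yaml must contain a mapping of parameters")
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("Missing config keys: " + ", ".join(missing))
    return config


def load_config(path="config/config.yaml"):
    """Read the YAML file with PyYAML and validate it (path is an assumption)."""
    import yaml  # PyYAML, imported here so validate_config works without it

    with open(path, encoding="utf-8") as handle:
        return validate_config(yaml.safe_load(handle))
```

yaml.safe_load is used rather than yaml.load because it only builds plain Python objects, which is all a config file needs.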

Usage

Run the script with the following command:

python bot.py

The script will create a results folder containing the scraped content organized in subfolders.
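
How the subfolders are named is not described here, so the following is just one possible layout, shown as a sketch: one folder per module and one Markdown file per section, with names derived from the titles. The helpers slugify and save_section are hypothetical and not part of the script.

```python
import re
from pathlib import Path


def slugify(title):
    """Turn a title into a lowercase, filesystem-safe name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return slug or "untitled"


def save_section(root, module, title, body):
    """Write one section as <root>/<module-slug>/<title-slug>.md and return the path."""
    folder = Path(root) / slugify(module)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (slugify(title) + ".md")
    target.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")
    return target
```

For example, save_section("results", "Intro to Web", "HTTP Basics", "notes") would write results/intro-to-web/http-basics.md.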