INFORMATIONAL PURPOSES ONLY: I created this script for personal use only. You may not scrape the content of modules that you do not own.
I am not responsible if you leak any modules!
This script scrapes content from HackTheBox Academy modules so you can keep clear track of your progress or build a nice cheat sheet :D. It extracts the contents into multiple Markdown files and stores them in a newly created folder. The script is written in Python 3 and relies on the following libraries:
- Beautiful Soup 4: A library for pulling data out of HTML and XML files, making it easy to navigate and search the parse tree.
- Requests: A library for making HTTP requests in Python, providing a simple and convenient way to interact with web services.
- PyYAML: A YAML parser and emitter for Python, allowing you to easily read and write YAML files.
The script can:
- Scrape links from a web page
- Save the list of links to a text file
- Extract content from each link
- Save the content of each link in a separate Markdown file
- Organize the scraped content into folders
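The first two features (collecting links and saving them to a text file) can be sketched as follows. This is a minimal illustration, not the code in bot.py: it swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra dependencies, and the names `LinkCollector`, `extract_links`, and `save_links` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            # Keep only the first occurrence of each link.
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html: str, base_url: str) -> list:
    """Return the de-duplicated absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def save_links(links: list, path: str) -> None:
    """Write one link per line to a text file."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")
```

In the real script the HTML would come from an authenticated request to the configured URL; here `extract_links` accepts any HTML string, which keeps the parsing step easy to test in isolation.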
To install and configure the script:
- Clone the repository:
git clone https://github.com/Oxooi/PyScrapAcademy.git
- Change the directory to the project folder:
cd PyScrapAcademy
- Install the required dependencies:
pip install -r requirements.txt
- Rename the config.example.yaml file to config.yaml in the config folder.
- Open the config.yaml file and set the following parameters:
  - url: The URL of the web page you want to scrape
  - file: The name of the text file where the list of links will be saved
  - cookies: The cookies to use for requests (htb_academy_session)
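Before running the scraper, it can help to confirm that all three parameters are set. The sketch below checks a configuration dictionary of the kind PyYAML's `yaml.safe_load` would return for config.yaml. The helper `validate_config`, the example values, and the assumption that `cookies` is a mapping keyed by `htb_academy_session` are all illustrative; the actual layout of the file may differ.

```python
REQUIRED_KEYS = ("url", "file", "cookies")


def validate_config(config: dict) -> dict:
    """Return the config unchanged, or raise ValueError listing what is missing."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config.yaml is missing: " + ", ".join(missing))
    return config


# Hypothetical values; in practice this dict would come from loading
# config/config.yaml with yaml.safe_load().
example = {
    "url": "https://example.com/module/1",
    "file": "links.txt",
    "cookies": {"htb_academy_session": "<your session cookie>"},
}
validate_config(example)
```

Failing early with a clear message is usually friendlier than letting a request fail later because a cookie or URL was left empty.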
Run the script with the following command:
python bot.py
The script will create a results folder containing the scraped content organized in subfolders.
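One way such a folder layout could be produced is sketched below. The naming scheme (a subfolder per module, numbered Markdown files per page) and the helpers `slugify` and `save_page` are assumptions made for illustration; the script's real output names may differ.

```python
import re
from pathlib import Path


def slugify(title: str) -> str:
    """Turn a title into a lowercase, filesystem-friendly name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()
    return slug or "untitled"


def save_page(root: str, module: str, index: int, title: str, body: str) -> Path:
    """Write one page to <root>/<module>/<NN>-<title>.md and return its path."""
    folder = Path(root) / slugify(module)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{index:02d}-{slugify(title)}.md"
    target.write_text(f"# {title}\n\n{body.strip()}\n", encoding="utf-8")
    return target
```

Numbering the files keeps the pages in their original order when the folder is listed alphabetically.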