Enhanced Telegram Channel Scraper using TOR and a Flask Dashboard for results
This software is designed solely for educational and research purposes and should be used with ethical considerations in mind. Users are responsible for ensuring their activities comply with local laws and regulations. The authors of this software bear no responsibility for any misuse or potential damages arising from its use. It's imperative to adhere to the terms of service of any platforms interacted with through this tool.
TeleScrape extracts content from public Telegram channels. It routes scraping traffic through Tor to protect user privacy and presents results in real time on a Flask dashboard. Instead of relying on Telegram's API, it uses Selenium for web scraping.
- Enhanced Privacy: Routes all scraping through the Tor network to protect user anonymity.
- Keyword-Driven Scraping: Fetches channel content based on user-defined keywords, focusing on relevant data extraction.
- Interactive Web Dashboard: Utilizes Flask to present scraping results dynamically, with real-time updates and insights.
- Efficient Parallel Processing: Employs concurrent scraping to expedite data collection from multiple channels simultaneously.
- User-Friendly Customization: Designed for easy adaptability to specific requirements, supporting straightforward modifications and extensions.
- Matched Files Download: Allows users to download matched result files directly from the dashboard.
- S3 Integration: Automatically uploads matched result files to an S3 bucket, with configurable settings via a .env file.
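The concurrent scraping mentioned in the feature list could look roughly like the sketch below. It is not taken from TeleScrape.py: `scrape_channel` is a hypothetical placeholder standing in for the real per-channel scraper, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_channel(channel):
    """Hypothetical placeholder: return the messages scraped from one channel."""
    return [f"sample message from {channel}"]


def scrape_all(channels, worker=scrape_channel, max_workers=4):
    """Run `worker` over every channel concurrently and collect results per channel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the channel it belongs to.
        futures = {pool.submit(worker, name): name for name in channels}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing channel should not stop the others.
                results[name] = []
                print(f"{name}: scrape failed ({exc})")
    return results
```

Threads suit this kind of work because each scrape spends most of its time waiting on the network rather than on the CPU.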
- Python 3.x
- Flask
- BeautifulSoup4 - bs4
- Selenium
- Requests
- Flask-SocketIO
- NLTK
- Tor
- Boto3 (for S3 integration)
- Python-Dotenv (for environment variable management)
- Python 3.x Installation: Verify Python 3.x is installed on your system.
python3 --version
- Dependencies: Install the required Python packages using pip.
pip install flask beautifulsoup4 selenium requests flask-socketio nltk boto3 python-dotenv
- Tor Configuration: Install Tor locally and ensure it's configured to run a SOCKS proxy on localhost:9050.
sudo apt install tor
sudo systemctl enable tor
sudo systemctl start tor
Edit the Tor configuration file to ensure the SOCKS proxy is running on port 9050:
sudo nano /etc/tor/torrc
Verify Tor is running a SOCKS proxy:
curl --socks5 localhost:9050 https://check.torproject.org
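The same check can be done from Python. The sketch below is an assumption-laden illustration rather than code from TeleScrape.py: it assumes Requests was installed with SOCKS support (`pip install "requests[socks]"`), that Tor listens on 127.0.0.1:9050, and that the Tor Project's check service exposes a JSON endpoint at `/api/ip` with an `IsTor` field. The helper names are hypothetical.

```python
# socks5h (rather than socks5) asks the proxy to resolve hostnames as well,
# so DNS lookups also go through Tor.
TOR_PROXY = "socks5h://127.0.0.1:9050"


def tor_proxies(proxy_url=TOR_PROXY):
    """Build the proxies mapping that Requests expects."""
    return {"http": proxy_url, "https": proxy_url}


def is_using_tor(timeout=30.0):
    """Return True if the check service reports that the request arrived via Tor."""
    import requests  # imported lazily; needs the optional SOCKS extra

    response = requests.get(
        "https://check.torproject.org/api/ip",  # assumed JSON endpoint
        proxies=tor_proxies(),
        timeout=timeout,
    )
    response.raise_for_status()
    return bool(response.json().get("IsTor"))
```

Calling `is_using_tor()` before a scraping run gives an early failure if the proxy is down, instead of silently sending traffic over the direct connection.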
- WebDriver Setup: Ensure the Chrome WebDriver is installed and properly configured in the script's path settings.
- Download Chrome WebDriver:
wget https://storage.googleapis.com/chrome-for-testing-public/126.0.6478.126/linux64/chromedriver-linux64.zip
- Unzip and Install:
unzip chromedriver-linux64.zip
sudo mv chromedriver-linux64/chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver
- Verify Installation:
chromedriver --version
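With Tor and the WebDriver in place, a Selenium session can be pointed at the SOCKS proxy. The following is a minimal sketch, not the script's actual setup: it assumes Selenium 4, the driver path used in the install step above, and headless mode, none of which is confirmed by the project. Both function names are hypothetical.

```python
TOR_SOCKS = "socks5://127.0.0.1:9050"  # assumed local Tor SOCKS proxy


def chrome_arguments(proxy=TOR_SOCKS, headless=True):
    """Return the Chrome command-line switches for a Tor-proxied session."""
    args = [f"--proxy-server={proxy}"]
    if headless:
        args.append("--headless=new")  # headless operation is an assumption
    return args


def build_driver(driver_path="/usr/local/bin/chromedriver"):
    """Create a Chrome WebDriver whose traffic goes through the proxy."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in chrome_arguments():
        options.add_argument(arg)
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

Remember to call `driver.quit()` when a scrape finishes so Chrome processes do not accumulate.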
For S3 integration, create a .env file in the project root with the following content:
S3_BUCKET_NAME=tgscraper-matches
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
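One way these settings might be consumed is sketched below. This is an illustration under assumptions, not the project's code: the function names are hypothetical, the object key is simply the file's base name, and the real script may rely on boto3's default credential lookup instead of passing the keys explicitly.

```python
import os

REQUIRED_KEYS = ("S3_BUCKET_NAME", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def read_s3_settings(env=None):
    """Return the three S3 settings, raising KeyError if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


def upload_match_file(path, settings):
    """Upload one matched-results file to the configured bucket."""
    import boto3  # imported lazily so the settings check works without boto3

    client = boto3.client(
        "s3",
        aws_access_key_id=settings["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=settings["AWS_SECRET_ACCESS_KEY"],
    )
    # Assumption: use the file's base name as the S3 object key.
    client.upload_file(path, settings["S3_BUCKET_NAME"], os.path.basename(path))
```

To populate the environment from the .env file first, call `load_dotenv()` from the `dotenv` package (installed as python-dotenv) before `read_s3_settings()`. Keep the .env file out of version control, since it holds credentials.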
- TeleScrape.py: The main script, encapsulating the scraping logic, Flask application, and Tor setup.
- keywords.txt: Text file listing the keywords for content scraping.
- /templates: Folder containing HTML templates for the Flask-based dashboard.
- /static: Folder containing static files like images (e.g., logo).
- Keyword Configuration: Populate keywords.txt with your desired keywords.
- Script Execution: Launch TeleScrape.py to start scraping and activate the Flask dashboard.
python3 TeleScrape.py
- Dashboard Navigation: Open http://127.0.0.1:8081/ in your browser to view the scraping progress and results live.
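To make the keyword step concrete, here is a minimal sketch of reading keywords.txt and checking a message against it. It assumes one keyword per line and case-insensitive whole-word matching. The real script may match differently (for example, using NLTK), and both function names are hypothetical.

```python
import re
from pathlib import Path


def load_keywords(path="keywords.txt"):
    """Read one keyword per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def find_matches(text, keywords):
    """Return the keywords that occur in `text` as whole words, ignoring case."""
    found = []
    for keyword in keywords:
        # \b anchors work for keywords that start and end with word characters.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found
```

For example, `find_matches("New Malware campaign spotted", ["malware", "ransom"])` returns `["malware"]`.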
- Real-Time Refresh: Automatically updates to display the latest scraping data.
- Keyword Visualization: Keywords and matches are highlighted within the content for better clarity.
- File Download: Download matched result files directly from the dashboard.
- S3 Integration: Automatically uploads matched files to an S3 bucket if configured.
- Adaptive Design: Ensures a consistent experience across various devices and resolutions.
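Keyword highlighting of the kind listed above could be done server-side before the content reaches the template. The sketch below is only one possible approach and is not taken from the project. It wraps matches in `<mark>` tags and HTML-escapes everything else so scraped text cannot inject markup. The function name and the choice of `<mark>` are assumptions.

```python
import html
import re


def highlight(text, keywords):
    """Return HTML-escaped `text` with each keyword occurrence wrapped in <mark>."""
    # Longest keywords first so a longer keyword wins over one of its prefixes.
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    if not ordered:
        return html.escape(text)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)

    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then the match itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group(0))}</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Because the result already contains markup, a Jinja template would need to render it with the `safe` filter. That is acceptable here only because every other part of the string has been escaped.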
The dashboard includes download buttons for retrieving the matched-keyword files.
Contributions are highly appreciated! If you have improvements or suggestions, please fork this repository, commit your changes, and submit a pull request for review.
This project is distributed under the MIT License, fostering widespread use and contribution by providing a lenient framework for software distribution and modification.
This README.md file reflects the current state and new features of the project, including the dashboard interface updates, S3 integration, and improved instructions for setup and usage.