A professional web scraping application built with Streamlit, powered by Firecrawl and OpenAI GPT-4.
- Overview
- Features
- Prerequisites
- Installation
- Usage
- API Keys
- [Suggested Project Structure](#suggested-project-structure)
- Technologies Used
- Limitations & Legal Considerations
- Contributing
- Author
- License
Web Scraper Pro combines Firecrawl for data extraction with OpenAI's GPT-4 for intelligent data processing. Its Streamlit interface makes the application accessible to users of all technical levels.
- User-friendly web interface
- Secure API key management
- AI-powered data extraction
- Data preview functionality
- Multiple export formats (CSV, Excel)
- Customizable extraction fields
- Comprehensive error handling
- Responsive design
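To illustrate how customizable extraction fields and CSV export can fit together, here is a minimal sketch using only the Python standard library. It is not taken from `app.py` (which may well rely on pandas instead); the function name `records_to_csv` and the sample records are hypothetical.

```python
import csv
import io


def records_to_csv(records, fields):
    """Serialize a list of dicts to CSV text, keeping only the chosen fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # drop keys the user did not ask for
        restval="",             # leave a blank cell when a field is missing
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical scraped records; real output depends on the target page.
sample = [
    {"title": "Widget", "price": "9.99", "sku": "W-1"},
    {"title": "Gadget", "price": "19.50"},
]
csv_text = records_to_csv(sample, ["title", "price"])
print(csv_text)
```

The returned string could be handed to a download widget or written to disk. An Excel export would need an additional library such as openpyxl, which is already in the install list.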
- Python 3.8 or higher
- pip (Python package installer)
- Required API keys:
  - Firecrawl API key
  - OpenAI API key
- Clone the repository:
git clone https://github.com/arad1367/Scraper_Pro_AI.git
cd Scraper_Pro_AI
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install required packages:
pip install streamlit pandas openai firecrawl-py requests openpyxl xlsxwriter
- Run the Streamlit app:
streamlit run app.py
- Open your web browser and navigate to the provided local URL (typically http://localhost:8501)
- Enter your API keys in the sidebar
- Input the target URL and configure scraping settings
- Click "Start Scraping" to begin the extraction process
The application requires two API keys to function:
- Firecrawl API Key: Used for web scraping functionality
- OpenAI API Key: Required for intelligent data processing
Store your API keys securely and never commit them to version control.
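One common way to keep keys out of version control is to read them from environment variables rather than hard-coding them. The sketch below assumes the variable names `FIRECRAWL_API_KEY` and `OPENAI_API_KEY`; the app itself takes keys through the sidebar, so the helper `load_api_keys` is a hypothetical addition, not part of the project.

```python
import os


def load_api_keys(env=os.environ):
    """Return both API keys from a mapping, failing loudly if one is absent."""
    # Assumed variable names; adjust them to match your own setup.
    names = ("FIRECRAWL_API_KEY", "OPENAI_API_KEY")
    missing = [name for name in names if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name].strip() for name in names}


# Demonstration with placeholder values instead of the real environment.
demo_env = {
    "FIRECRAWL_API_KEY": "fc-placeholder",
    "OPENAI_API_KEY": "sk-placeholder",
}
keys = load_api_keys(demo_env)
print(sorted(keys))
```

Calling `load_api_keys()` with no argument reads from the real process environment. If you keep the keys in a local file such as `.env`, add that file to `.gitignore`.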
- (I did not use the assets folder, but you can!)
web-scraper-pro/
├── app.py              # Main application file
├── README.md           # Project documentation
├── requirements.txt    # Python dependencies
├── assets/             # Project assets
│   └── logo.png        # Company logo
└── .gitignore          # Git ignore file
- Streamlit: Frontend framework
- Firecrawl: Web scraping engine
- OpenAI GPT-4o mini: Data processing
- Pandas: Data manipulation
- Python: Programming language
- This tool is for educational purposes only
- Always obtain permission before scraping any website
- Respect robots.txt files
- Follow rate limiting best practices
- Comply with websites' terms of service
- Do not scrape personal or sensitive information
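Two of the points above, respecting robots.txt and rate limiting, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the app's own behavior: `build_robots_checker` and `MinIntervalLimiter` are hypothetical names, and the robots.txt text is a made-up sample passed in as a string so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Made-up robots.txt content for demonstration purposes.
sample_robots = "User-agent: *\nDisallow: /private/\n"
allowed = build_robots_checker(sample_robots)

limiter = MinIntervalLimiter(0.1)  # at most one request every 0.1 seconds
for url in ("https://example.com/public", "https://example.com/private/data"):
    limiter.wait()
    print(url, "allowed" if allowed(url) else "blocked by robots.txt")
```

In practice the robots.txt text would be downloaded from the target site before scraping, and the interval would be chosen to suit that site. A check like this complements, but does not replace, obtaining permission and reading the terms of service.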
Contributions are welcome! Please feel free to submit a Pull Request.
Dr. Pejman Ebrahimi
Research Assistant at University of Liechtenstein
Contact:
- Academic: pejman.ebrahimi@uni.li
- Personal: pejman.ebrahimi77@gmail.com
This project is licensed under the MIT License - see the LICENSE file for details.