CorgiBrowser is an open-source Python framework focused on simplifying web crawling and scraping. Built with scalability, efficiency, and ethical data collection in mind, it is designed for researchers, developers, and analysts who require robust data acquisition capabilities.
CorgiBrowser’s Documentation
- Introduction
- Key Features
- Dependencies
- Getting Started
- Demos
- Background
- Contributing to CorgiBrowser
- License
CorgiBrowser grew out of the need for a scalable solution to the challenges of modern web crawling and scraping. As the volume of data on the internet grows exponentially, existing frameworks often fall short in scalability and customizability. CorgiBrowser is an all-in-one framework that focuses on ethical data practices and introduces a distributed approach to crawling and data management.
- Scalability: Supports large-scale data collection with a microservices architecture, enabling horizontal scaling on cloud platforms.
- Distributed Crawling: Offers configurable crawlers with priority settings for tailored crawling strategies.
- Custom Scraping Templates: Supports user-defined templates for precise data extraction.
- Ethical Crawling: Complies with robots.txt standards and employs throttling to minimize the impact on web resources.
- Cloud Integration: Works with cloud storage solutions for efficient data management and scalability.
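The ethical-crawling behavior listed above (robots.txt compliance and throttling) can be illustrated with Python's standard library alone. The following is a minimal sketch under stated assumptions, not CorgiBrowser's own implementation: `USER_AGENT`, `build_parser`, `is_allowed`, and `Throttle` are hypothetical names, and the robots.txt content is made up for the example.

```python
import time
from typing import Optional
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-bot"  # hypothetical user agent, chosen for illustration


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if USER_AGENT may fetch the URL under the parsed rules."""
    return parser.can_fetch(USER_AGENT, url)


class Throttle:
    """Keep at least `delay_seconds` between consecutive requests."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay_seconds = delay_seconds
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            remaining = self.delay_seconds - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_request = time.monotonic()
        return slept


# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(ROBOTS_TXT)
# Honor the site's Crawl-delay when present, else fall back to one second.
throttle = Throttle(parser.crawl_delay(USER_AGENT) or 1.0)
```

In a crawl loop built on this sketch, each request would be preceded by `throttle.wait()`, and any URL for which `is_allowed` returns `False` would be skipped.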
CorgiBrowser depends on the following:
- Python 3.9+
- Linux or Windows
- An Azure Storage Account (support for local storage is planned)
To install CorgiBrowser, run the following command:
pip install corgibrowser
To initialize a Crawler instance:
import os
from dotenv import load_dotenv
from corgibrowser.corgi_cloud_integration.cloud_integration import CloudIntegration
from corgibrowser.corgi_datasets.DataSetsManager import DataSetsManager
from corgibrowser.corgi_settings.SettingsManager import SettingsManager
from corgibrowser.corgi_crawler.crawler import WebCrawler
# Load Settings Manager
settings_manager = SettingsManager()
load_dotenv()
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_NAME"] = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_KEY"] = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
# Set Up cloud
cloud_integration = CloudIntegration(settings_manager=settings_manager)
cloud_integration.initialize()
# Add Initial URLs
for url in DataSetsManager.load_usa_newspaper_urls():
    cloud_integration.add_url_to_queue(url)
# Crawl
crawler = WebCrawler(cloud_integration = cloud_integration, settings_manager=settings_manager )
crawler.initialize()
crawler.start()
To initialize a Scraper instance:
import os
from dotenv import load_dotenv
from corgibrowser.corgi_cloud_integration.cloud_integration import CloudIntegration
from corgibrowser.corgi_settings.SettingsManager import SettingsManager
from corgibrowser.corgi_webscraping.scraper import Scraper
# Load Settings Manager
settings_manager = SettingsManager()
load_dotenv()
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_NAME"] = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_KEY"] = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
# Set Up cloud
cloud_integration = CloudIntegration(settings_manager=settings_manager)
cloud_integration.initialize()
# Scrape
scraper = Scraper(cloud_integration = cloud_integration, settings_manager=settings_manager )
scraper.initialize()
scraper.start()
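Both snippets above read the Azure credentials from environment variables, which `load_dotenv()` can populate from a `.env` file of `KEY=value` lines. When a variable is not set, `os.getenv` returns `None`, so it can help to check for that before starting a crawler or scraper. The helper below is a small sketch for doing so; `REQUIRED_VARS`, `missing_settings`, and `require_settings` are hypothetical names and are not part of CorgiBrowser.

```python
import os
from typing import List, Mapping

# The two variables read by the crawler and scraper snippets.
REQUIRED_VARS = ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_ACCOUNT_KEY")


def missing_settings(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


def require_settings(environ: Mapping[str, str] = os.environ) -> None:
    """Raise RuntimeError naming every required variable that is missing."""
    missing = missing_settings(environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_settings()` right after `load_dotenv()` would surface a missing or empty credential as a clear error before any cloud call is attempted.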
Link to demo applications and tutorials.
Developed for Jose Enriquez's Master's thesis in Computer Engineering, CorgiBrowser aims to democratize access to web data through ethical and efficient crawling. Its objective is to represent a significant step in merging web crawling, cloud technologies, and data analysis. This integration improves scalability and efficiency and supports comprehensive data processing, with the goal of establishing a new benchmark in data collection technologies.
Contributors are welcome! Check out the Open Issues on GitHub for starting points.
CorgiBrowser is released under the MIT License, promoting open and unrestricted use and contribution.