/tmp

Primary LanguagePython

Doctolib Scraper

A professional web scraper for extracting doctor information from Doctolib using crawl4ai. This scraper is designed to be configurable, respectful to the website, and easy to use.

🆕 NEW: Multi-LLM Schema Generation

The latest version includes automatic schema generation using multiple LLM providers! This provides:

  • 🧠 One-time schema generation that creates reusable extraction schemas
  • ⚡ LLM-free extractions after initial schema creation
  • 🎯 Improved accuracy through AI-powered page structure analysis
  • 🔧 Multiple LLM Options: Groq (default), OpenAI, or local Ollama models
  • 💰 Cost-effective: Default Groq integration with built-in API key

Features

Core Features

  • Configurable Parameters: URL and number of pages can be easily configured
  • Multiple Output Formats: Saves data in both JSON and CSV formats
  • Respectful Scraping: Includes delays between requests and handles cookie consent
  • Error Handling: Robust error handling and logging
  • Clean Data Extraction: Properly extracts and cleans doctor information
  • Command Line Interface: Easy to use from command line with arguments

AI-Powered Features

  • 🧠 Multi-LLM Support: Groq (default), OpenAI, or Ollama integration
  • 📋 Schema Reuse: Save and reuse generated schemas for fast extractions
  • 🔄 Fallback Support: Manual schema fallback if LLM generation fails
  • 🎯 Smart Analysis: AI analyzes page structure to create optimal selectors
  • 🔧 Configurable LLM: Parametrable LLM provider, model, and API keys

Installation

Basic Installation

  1. Install the required dependencies:
pip install crawl4ai
  1. Install playwright browsers:
playwright install
playwright install-deps

Ollama Installation (for AI-powered schema generation)

  1. Install Ollama (for AI-powered features):
# On macOS
brew install ollama

# On Linux
curl -fsSL https://ollama.ai/install.sh | sh

# On Windows
# Download from https://ollama.ai/download
  1. Pull a model (recommended: llama3.2):
ollama pull llama3.2
# or
ollama pull llama3.1
ollama pull codellama
  1. Start Ollama service:
ollama serve

Usage

Basic Usage (Default Configuration)

Run with default settings (3 pages of gastroenterologists in Paris 12th):

python final_doctolib_scraper.py

Command Line Usage

# Basic usage with custom URL
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue"

# Scrape 5 pages
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue" --pages 5

# Custom output files
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue" --json my_doctors.json --csv my_doctors.csv

# Scrape dentists in Paris 1st
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75001-paris&speciality=dentiste" --pages 2

Configuration File Usage

You can also use the configuration file approach:

  1. Edit config.json with your parameters:
{
  "base_url": "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue&availabilitiesBefore=14",
  "max_pages": 3,
  "output": {
    "json_file": "doctolib_doctors.json",
    "csv_file": "doctolib_doctors.csv"
  }
}
  1. Run the config-based scraper:
python config_based_scraper.py

🧠 AI-Powered Usage (Recommended)

The new LLM-powered scraper provides better accuracy through AI-generated schemas:

# Basic usage with default Groq LLM (generates schema automatically)
python llm_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue"

# Use different Groq model
python llm_doctolib_scraper.py --url "URL" --llm-model "meta-llama/llama-3.1-70b-versatile" --pages 5

# Use OpenAI GPT-4
python llm_doctolib_scraper.py --url "URL" --llm-provider openai --llm-model gpt-4 --llm-api-key YOUR_OPENAI_KEY

# Use local Ollama model
python llm_doctolib_scraper.py --url "URL" --llm-provider ollama --llm-model llama3.2

# Load existing schema (skip generation for faster execution)
python llm_doctolib_scraper.py --url "URL" --load-schema doctolib_schema.json

# Generate and save schema for reuse
python llm_doctolib_scraper.py --url "URL" --save-schema my_schema.json

Schema Generation Example

To understand how schema generation works:

# Run the schema generation example
python schema_generator_example.py

This will:

  1. Load a sample Doctolib page
  2. Use Ollama to analyze the page structure
  3. Generate a CSS extraction schema
  4. Test the schema and save results
  5. Save the schema for reuse

Command Line Arguments

Standard Scraper (final_doctolib_scraper.py)

  • --url: Base URL for Doctolib search (required)
  • --pages: Number of pages to scrape (default: 3)
  • --json: Output JSON filename (default: doctolib_doctors.json)
  • --csv: Output CSV filename (default: doctolib_doctors.csv)

LLM-Powered Scraper (llm_doctolib_scraper.py)

  • --url: Base URL for Doctolib search (required)
  • --pages: Number of pages to scrape (default: 3)
  • --llm-provider: LLM provider to use (default: groq, options: groq, openai, ollama)
  • --llm-model: LLM model to use (default: meta-llama/llama-4-scout-17b-16e-instruct)
  • --llm-api-key: API key for the LLM provider (default: built-in Groq key)
  • --json: Output JSON filename (default: llm_doctors.json)
  • --csv: Output CSV filename (default: llm_doctors.csv)
  • --save-schema: Save generated schema to file (default: doctolib_schema.json)
  • --load-schema: Load existing schema from file (skips generation)

Output Format

The scraper extracts the following information for each doctor:

  • name: Doctor's name
  • specialty: Medical specialty
  • address: Full address
  • distance: Distance from search location
  • sector_info: Insurance sector information (e.g., "Conventionné secteur 1")
  • profile_url: Link to doctor's profile (when available)

JSON Output Example

[
  {
    "name": "Dr Natanel BENABOU",
    "specialty": "Gastro-entérologue et hépatologue",
    "address": "5 Rue Hippolyte Pinson, 94340 Joinville-le-Pont",
    "distance": "3,2 km",
    "sector_info": "Conventionné secteur 2",
    "profile_url": null
  }
]

CSV Output

The CSV file contains the same data in tabular format, suitable for Excel or data analysis tools.

How to Build Doctolib URLs

To create search URLs for different specialties and locations:

  1. Go to Doctolib.fr
  2. Search for your desired specialty and location
  3. Copy the URL from the results page
  4. Remove the &page=X parameter if present

Example URLs:

  • Gastroenterologists in Paris 12th: https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue
  • Dentists in Lyon: https://www.doctolib.fr/search?location=lyon&speciality=dentiste
  • Cardiologists in Marseille: https://www.doctolib.fr/search?location=marseille&speciality=cardiologue

Technical Details

  • Framework: Built with crawl4ai and Playwright
  • Browser: Runs in headless mode for efficiency
  • Cookie Handling: Automatically handles cookie consent dialogs
  • Rate Limiting: 2-second delay between page requests
  • Error Recovery: Continues scraping even if individual pages fail

Available Scrapers

This repository includes multiple scraper implementations:

1. 🧠 ollama_doctolib_scraper.py (Recommended)

  • AI-powered schema generation using Ollama
  • Best accuracy through intelligent page analysis
  • Reusable schemas for fast subsequent extractions
  • No API costs (uses local Ollama models)

2. 🔧 final_doctolib_scraper.py (Standard)

  • Manual CSS selectors with robust fallbacks
  • No dependencies on external AI services
  • Proven reliability with extensive testing
  • Fast execution (no schema generation overhead)

3. ⚙️ config_based_scraper.py (Configuration-driven)

  • JSON configuration for easy parameter management
  • Batch processing support
  • Template-based approach

4. 🔍 debug_scraper.py (Development)

  • Debug mode with HTML output
  • Troubleshooting website structure changes
  • Development and testing tool

5. 📚 schema_generator_example.py (Educational)

  • Learn how schema generation works
  • Understand Ollama integration
  • Test schema creation process

Limitations and Considerations

  • Respectful Usage: The scraper includes delays and respects robots.txt
  • Website Changes: May need updates if Doctolib changes their HTML structure
  • Rate Limiting: Don't scrape too aggressively to avoid being blocked
  • Legal Compliance: Ensure your usage complies with Doctolib's terms of service
  • Ollama Requirement: AI-powered features require Ollama installation and models

Troubleshooting

Common Issues

  1. No doctors found:

    • Check if the URL is correct
    • Verify the search has results on the website
    • Website structure may have changed
  2. Browser errors:

    • Make sure playwright browsers are installed: playwright install
    • Install system dependencies: playwright install-deps
  3. Permission errors:

    • Ensure you have write permissions in the output directory
  4. Ollama-specific issues:

    • Ollama not running: Make sure ollama serve is running
    • Model not found: Pull the model with ollama pull llama3.2
    • Schema generation fails: Try a different model or use --load-schema with a manual schema
    • Connection errors: Check if Ollama is accessible on localhost:11434

Debug Mode

For debugging, you can use the debug scraper:

python debug_scraper.py

This will save the raw HTML content for inspection.

For Ollama debugging, use the schema generator example:

python schema_generator_example.py

This will show the schema generation process step by step.

Files in this Package

Core Scrapers

  • ollama_doctolib_scraper.py: 🧠 AI-powered scraper with Ollama schema generation (Recommended)
  • final_doctolib_scraper.py: 🔧 Standard scraper with manual CSS selectors
  • config_based_scraper.py: ⚙️ Configuration file-based scraper

Utilities and Examples

  • schema_generator_example.py: 📚 Educational example showing schema generation
  • debug_scraper.py: 🔍 Debug version for troubleshooting
  • example_usage.py: 📖 Programmatic usage examples

Configuration and Documentation

  • config.json: Example configuration file
  • README.md: This comprehensive documentation
  • pyproject.toml: Project dependencies

Example Output

When run successfully, you'll see output like:

🚀 Starting to scrape 3 pages from Doctolib
📄 Scraping page 1: https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue&page=1
✅ Successfully loaded page 1
👨‍⚕️ Found 24 doctors on page 1
...
🎉 Total doctors found: 32
💾 Saved 32 doctors to doctolib_doctors.json
💾 Saved 32 doctors to doctolib_doctors.csv
✅ Scraping completed successfully!

License

This tool is for educational and research purposes. Please respect Doctolib's terms of service and use responsibly.