A professional web scraper for extracting doctor information from Doctolib using crawl4ai. This scraper is designed to be configurable, respectful to the website, and easy to use.
The latest version includes automatic schema generation using multiple LLM providers! This provides:
- 🧠 One-time schema generation that creates reusable extraction schemas
- ⚡ LLM-free extractions after initial schema creation
- 🎯 Improved accuracy through AI-powered page structure analysis
- 🔧 Multiple LLM Options: Groq (default), OpenAI, or local Ollama models
- 💰 Cost-effective: Default Groq integration with built-in API key
Core scraper features:

- ✅ Configurable Parameters: URL and number of pages can be easily configured
- ✅ Multiple Output Formats: Saves data in both JSON and CSV formats
- ✅ Respectful Scraping: Includes delays between requests and handles cookie consent
- ✅ Error Handling: Robust error handling and logging
- ✅ Clean Data Extraction: Properly extracts and cleans doctor information
- ✅ Command Line Interface: Easy to use from command line with arguments
LLM integration features:

- 🧠 Multi-LLM Support: Groq (default), OpenAI, or Ollama integration
- 📋 Schema Reuse: Save and reuse generated schemas for fast extractions
- 🔄 Fallback Support: Manual schema fallback if LLM generation fails
- 🎯 Smart Analysis: AI analyzes page structure to create optimal selectors
- 🔧 Configurable LLM: Parametrable LLM provider, model, and API keys
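
The workflow behind those bullets is: generate a schema once with an LLM, save it, then reuse it for LLM-free runs. Below is a minimal sketch of that idea, assuming a recent crawl4ai release that exposes `LLMConfig` and `JsonCssExtractionStrategy.generate_schema` (names and signatures may differ across versions; this is illustrative, not the scraper's actual source):

```python
# Sketch of the generate-once / reuse-forever flow (assumes a recent crawl4ai
# that exposes LLMConfig and JsonCssExtractionStrategy.generate_schema).
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

SCHEMA_FILE = "doctolib_schema.json"  # same default filename the CLI uses

async def get_schema(sample_url: str) -> dict:
    # Reuse a saved schema if one exists: no LLM call needed.
    if os.path.exists(SCHEMA_FILE):
        with open(SCHEMA_FILE) as f:
            return json.load(f)

    # Otherwise fetch one sample page and let the LLM derive CSS selectors.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=sample_url)

    schema = JsonCssExtractionStrategy.generate_schema(
        html=result.html,
        llm_config=LLMConfig(provider="groq/meta-llama/llama-4-scout-17b-16e-instruct"),
    )
    with open(SCHEMA_FILE, "w") as f:
        json.dump(schema, f, indent=2)
    return schema

if __name__ == "__main__":
    schema = asyncio.run(get_schema(
        "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue"
    ))
    print(json.dumps(schema, indent=2))
```

On the first run this pays one LLM call; every later run loads the saved schema and extracts with plain CSS selectors.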
- Install the required dependencies:

  ```bash
  pip install crawl4ai
  ```

- Install playwright browsers:

  ```bash
  playwright install
  playwright install-deps
  ```

- Install Ollama (for AI-powered features):

  ```bash
  # On macOS
  brew install ollama

  # On Linux
  curl -fsSL https://ollama.ai/install.sh | sh

  # On Windows
  # Download from https://ollama.ai/download
  ```

- Pull a model (recommended: llama3.2):

  ```bash
  ollama pull llama3.2
  # or
  ollama pull llama3.1
  ollama pull codellama
  ```

- Start Ollama service:

  ```bash
  ollama serve
  ```

Run with default settings (3 pages of gastroenterologists in Paris 12th):
```bash
python final_doctolib_scraper.py
```

More examples:

```bash
# Basic usage with custom URL
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue"

# Scrape 5 pages
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue" --pages 5

# Custom output files
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue" --json my_doctors.json --csv my_doctors.csv

# Scrape dentists in Paris 1st
python final_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75001-paris&speciality=dentiste" --pages 2
```

You can also use the configuration file approach:
- Edit `config.json` with your parameters:

  ```json
  {
    "base_url": "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue&availabilitiesBefore=14",
    "max_pages": 3,
    "output": {
      "json_file": "doctolib_doctors.json",
      "csv_file": "doctolib_doctors.csv"
    }
  }
  ```

- Run the config-based scraper:

  ```bash
  python config_based_scraper.py
  ```
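The config-based scraper reads its parameters from that file. A sketch of how such a config file might be loaded and merged with defaults (illustrative; the key names follow the example above, not necessarily the script's real code):

```python
# Illustrative config loader matching the config.json layout above
# (a sketch, not the script's actual implementation).
import json

def load_config(path: str = "config.json") -> dict:
    defaults = {
        "base_url": None,
        "max_pages": 3,
        "output": {
            "json_file": "doctolib_doctors.json",
            "csv_file": "doctolib_doctors.csv",
        },
    }
    with open(path) as f:
        user_cfg = json.load(f)
    # Shallow-merge user values over the defaults.
    merged = {**defaults, **user_cfg}
    merged["output"] = {**defaults["output"], **user_cfg.get("output", {})}
    return merged

if __name__ == "__main__":
    cfg = load_config()
    print(cfg["base_url"], cfg["max_pages"])
```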
The new LLM-powered scraper provides better accuracy through AI-generated schemas:

```bash
# Basic usage with default Groq LLM (generates schema automatically)
python llm_doctolib_scraper.py --url "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue"

# Use different Groq model
python llm_doctolib_scraper.py --url "URL" --llm-model "meta-llama/llama-3.1-70b-versatile" --pages 5

# Use OpenAI GPT-4
python llm_doctolib_scraper.py --url "URL" --llm-provider openai --llm-model gpt-4 --llm-api-key YOUR_OPENAI_KEY

# Use local Ollama model
python llm_doctolib_scraper.py --url "URL" --llm-provider ollama --llm-model llama3.2

# Load existing schema (skip generation for faster execution)
python llm_doctolib_scraper.py --url "URL" --load-schema doctolib_schema.json

# Generate and save schema for reuse
python llm_doctolib_scraper.py --url "URL" --save-schema my_schema.json
```
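Under the hood, schema reuse amounts to feeding a saved schema into crawl4ai's `JsonCssExtractionStrategy`, with no LLM in the loop. A minimal sketch, assuming a recent crawl4ai and the `doctolib_schema.json` produced by `--save-schema`:

```python
# Sketch: LLM-free extraction from a previously saved schema.
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    with open("doctolib_schema.json") as f:
        schema = json.load(f)

    strategy = JsonCssExtractionStrategy(schema)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
    # extracted_content is a JSON string of records matching the schema fields.
    doctors = json.loads(result.extracted_content)
    print(f"Extracted {len(doctors)} doctors")

asyncio.run(main())
```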
To understand how schema generation works, run the schema generation example:

```bash
python schema_generator_example.py
```

This will:
- Load a sample Doctolib page
- Use Ollama to analyze the page structure
- Generate a CSS extraction schema
- Test the schema and save results
- Save the schema for reuse
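
For reference, a generated schema is a plain JSON description of a base selector plus per-field selectors, in the shape crawl4ai's `JsonCssExtractionStrategy` consumes. The selectors below are purely hypothetical; the real ones depend on Doctolib's live markup:

```json
{
  "name": "Doctolib search results",
  "baseSelector": "div.search-result",
  "fields": [
    { "name": "name", "selector": "h2.dl-text", "type": "text" },
    { "name": "specialty", "selector": "p.specialty", "type": "text" },
    { "name": "address", "selector": "div.address", "type": "text" },
    { "name": "profile_url", "selector": "a.profile-link", "type": "attribute", "attribute": "href" }
  ]
}
```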
Options for `final_doctolib_scraper.py`:

- `--url`: Base URL for Doctolib search (required)
- `--pages`: Number of pages to scrape (default: 3)
- `--json`: Output JSON filename (default: `doctolib_doctors.json`)
- `--csv`: Output CSV filename (default: `doctolib_doctors.csv`)
Options for `llm_doctolib_scraper.py`:

- `--url`: Base URL for Doctolib search (required)
- `--pages`: Number of pages to scrape (default: 3)
- `--llm-provider`: LLM provider to use (default: `groq`; options: `groq`, `openai`, `ollama`)
- `--llm-model`: LLM model to use (default: `meta-llama/llama-4-scout-17b-16e-instruct`)
- `--llm-api-key`: API key for the LLM provider (default: built-in Groq key)
- `--json`: Output JSON filename (default: `llm_doctors.json`)
- `--csv`: Output CSV filename (default: `llm_doctors.csv`)
- `--save-schema`: Save generated schema to file (default: `doctolib_schema.json`)
- `--load-schema`: Load existing schema from file (skips generation)
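
For orientation, here is a minimal `argparse` sketch of how these flags could be declared; it mirrors the defaults listed above but is not the scraper's actual source:

```python
# Illustrative argparse wiring for the flags listed above (a sketch only).
import argparse

parser = argparse.ArgumentParser(description="LLM-powered Doctolib scraper")
parser.add_argument("--url", required=True, help="Base URL for Doctolib search")
parser.add_argument("--pages", type=int, default=3, help="Number of pages to scrape")
parser.add_argument("--llm-provider", choices=["groq", "openai", "ollama"], default="groq")
parser.add_argument("--llm-model", default="meta-llama/llama-4-scout-17b-16e-instruct")
parser.add_argument("--llm-api-key", default=None, help="Falls back to the built-in Groq key")
parser.add_argument("--json", default="llm_doctors.json", help="Output JSON filename")
parser.add_argument("--csv", default="llm_doctors.csv", help="Output CSV filename")
parser.add_argument("--save-schema", default="doctolib_schema.json")
parser.add_argument("--load-schema", default=None, help="Skip generation, load this schema")
args = parser.parse_args()
```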
The scraper extracts the following information for each doctor:
- `name`: Doctor's name
- `specialty`: Medical specialty
- `address`: Full address
- `distance`: Distance from search location
- `sector_info`: Insurance sector information (e.g., "Conventionné secteur 1")
- `profile_url`: Link to doctor's profile (when available)
```json
[
  {
    "name": "Dr Natanel BENABOU",
    "specialty": "Gastro-entérologue et hépatologue",
    "address": "5 Rue Hippolyte Pinson, 94340 Joinville-le-Pont",
    "distance": "3,2 km",
    "sector_info": "Conventionné secteur 2",
    "profile_url": null
  }
]
```

The CSV file contains the same data in tabular format, suitable for Excel or data analysis tools.
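The conversion from the JSON records to that CSV needs nothing beyond the standard library; a sketch with `csv.DictWriter`:

```python
# Sketch: write the extracted records to CSV with the same columns as the JSON.
import csv
import json

FIELDS = ["name", "specialty", "address", "distance", "sector_info", "profile_url"]

with open("doctolib_doctors.json") as f:
    doctors = json.load(f)

with open("doctolib_doctors.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for doc in doctors:
        writer.writerow({k: doc.get(k) for k in FIELDS})
```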
To create search URLs for different specialties and locations:
- Go to Doctolib.fr
- Search for your desired specialty and location
- Copy the URL from the results page
- Remove the `&page=X` parameter if present
Example URLs:

- Gastroenterologists in Paris 12th: `https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue`
- Dentists in Lyon: `https://www.doctolib.fr/search?location=lyon&speciality=dentiste`
- Cardiologists in Marseille: `https://www.doctolib.fr/search?location=marseille&speciality=cardiologue`
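
You can also assemble such URLs programmatically with the standard library. The parameter names come from the examples above (`availabilitiesBefore` from the config example):

```python
# Build a Doctolib search URL from its query parameters.
from urllib.parse import urlencode

def search_url(location: str, speciality: str, **extra) -> str:
    params = {"location": location, "speciality": speciality, **extra}
    return "https://www.doctolib.fr/search?" + urlencode(params)

print(search_url("75012-paris", "gastro-enterologue"))
print(search_url("lyon", "dentiste", availabilitiesBefore=14))
```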
- Framework: Built with crawl4ai and Playwright
- Browser: Runs in headless mode for efficiency
- Cookie Handling: Automatically handles cookie consent dialogs
- Rate Limiting: 2-second delay between page requests
- Error Recovery: Continues scraping even if individual pages fail
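
Putting those pieces together, here is a hedged sketch of the per-page crawl loop. The cookie-consent selector is an assumption about Doctolib's consent dialog, and the crawl4ai parameter names assume a recent release:

```python
# Sketch of the page loop: cookie consent click, per-page delay, error recovery.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

CONSENT_JS = """
// Assumed consent-button selector; adjust to the live dialog if it differs.
const btn = document.querySelector('#didomi-notice-agree-button');
if (btn) btn.click();
"""

async def crawl_pages(base_url: str, max_pages: int = 3):
    browser_cfg = BrowserConfig(headless=True)          # headless for efficiency
    run_cfg = CrawlerRunConfig(js_code=CONSENT_JS)      # dismiss cookie dialog
    results = []
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        for page in range(1, max_pages + 1):
            url = f"{base_url}&page={page}"
            try:
                results.append(await crawler.arun(url=url, config=run_cfg))
            except Exception as exc:                    # keep going if one page fails
                print(f"Page {page} failed: {exc}")
            await asyncio.sleep(2)                      # 2-second delay between requests
    return results
```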
This repository includes multiple scraper implementations:

- `ollama_doctolib_scraper.py`:
  - AI-powered schema generation using Ollama
  - Best accuracy through intelligent page analysis
  - Reusable schemas for fast subsequent extractions
  - No API costs (uses local Ollama models)
- `final_doctolib_scraper.py`:
  - Manual CSS selectors with robust fallbacks
  - No dependencies on external AI services
  - Proven reliability with extensive testing
  - Fast execution (no schema generation overhead)
- `config_based_scraper.py`:
  - JSON configuration for easy parameter management
  - Batch processing support
  - Template-based approach
- `debug_scraper.py`:
  - Debug mode with HTML output
  - Troubleshooting website structure changes
  - Development and testing tool
- `schema_generator_example.py`:
  - Learn how schema generation works
  - Understand Ollama integration
  - Test schema creation process
- Respectful Usage: The scraper includes delays and respects robots.txt (see the sketch after this list)
- Website Changes: May need updates if Doctolib changes their HTML structure
- Rate Limiting: Don't scrape too aggressively to avoid being blocked
- Legal Compliance: Ensure your usage complies with Doctolib's terms of service
- Ollama Requirement: AI-powered features require Ollama installation and models
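
If you want to verify the robots.txt point yourself, the standard library's `urllib.robotparser` does it in a few lines:

```python
# Minimal robots.txt check before crawling a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.doctolib.fr/robots.txt")
rp.read()

url = "https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue"
if rp.can_fetch("*", url):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt; skip this URL")
```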
- No doctors found:
  - Check if the URL is correct
  - Verify the search has results on the website
  - Website structure may have changed
- Browser errors:
  - Make sure playwright browsers are installed: `playwright install`
  - Install system dependencies: `playwright install-deps`
- Permission errors:
  - Ensure you have write permissions in the output directory
- Ollama-specific issues:
  - Ollama not running: Make sure `ollama serve` is running
  - Model not found: Pull the model with `ollama pull llama3.2`
  - Schema generation fails: Try a different model or use `--load-schema` with a manual schema
  - Connection errors: Check if Ollama is accessible on localhost:11434 (see the check below)
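
A quick way to run that last check: Ollama's REST API answers on `localhost:11434`, and `GET /api/tags` lists the locally installed models:

```python
# Quick connectivity check against the local Ollama server.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
        models = [m["name"] for m in json.load(resp).get("models", [])]
    print("Ollama is up. Installed models:", models or "none")
except OSError as exc:
    print("Ollama unreachable on localhost:11434:", exc)
```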
For debugging, you can use the debug scraper:

```bash
python debug_scraper.py
```

This will save the raw HTML content for inspection.

For Ollama debugging, use the schema generator example:

```bash
python schema_generator_example.py
```

This will show the schema generation process step by step.
- `ollama_doctolib_scraper.py`: 🧠 AI-powered scraper with Ollama schema generation (Recommended)
- `final_doctolib_scraper.py`: 🔧 Standard scraper with manual CSS selectors
- `config_based_scraper.py`: ⚙️ Configuration file-based scraper
- `schema_generator_example.py`: 📚 Educational example showing schema generation
- `debug_scraper.py`: 🔍 Debug version for troubleshooting
- `example_usage.py`: 📖 Programmatic usage examples
- `config.json`: Example configuration file
- `README.md`: This documentation
- `pyproject.toml`: Project dependencies
When run successfully, you'll see output like:

```
🚀 Starting to scrape 3 pages from Doctolib
📄 Scraping page 1: https://www.doctolib.fr/search?location=75012-paris&speciality=gastro-enterologue&page=1
✅ Successfully loaded page 1
👨‍⚕️ Found 24 doctors on page 1
...
🎉 Total doctors found: 32
💾 Saved 32 doctors to doctolib_doctors.json
💾 Saved 32 doctors to doctolib_doctors.csv
✅ Scraping completed successfully!
```
This tool is for educational and research purposes. Please respect Doctolib's terms of service and use responsibly.