This is a Scrapy spider designed to extract data about doctors from the website medicosdoc.com.
- Start URLs: The spider begins scraping from a list of provided URLs. By default, it starts from `https://medicosdoc.com/categoria/instituciones-barranquilla`, but this can be changed in the `start_urls` variable.
- MongoDB Connection: The spider connects to a MongoDB database to store scraped data. The connection details (host, port, username, password, database name) are specified in the script.
- MongoDB Collection: The spider saves scraped data to a MongoDB collection named `instituciones-scrapeops`. Each doctor is represented as a document with the following fields (a sketch of how these pieces fit together follows this list):
  - `doctor_id`: A unique ID generated from the doctor's name using an MD5 hash.
  - `nombre`: The doctor's name.
  - `telefono`: A list of phone numbers associated with the doctor.
  - `especializacion`: A list of the doctor's specializations.
  - `seguro`: The insurance accepted by the doctor, if available.
  - `direccion`: The doctor's address.
  - `source`: The URL from which the doctor's information was scraped.
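
For orientation, here is a minimal sketch of how a spider like this might wire together `start_urls`, the MongoDB connection, and the document schema. The class name, CSS selectors, and connection values are illustrative assumptions, not the actual script:

```python
import hashlib

import pymongo
import scrapy


class MedicosSpider(scrapy.Spider):
    """Sketch of the spider described above; names and selectors are assumptions."""

    name = "medicos"
    # Default entry point; edit start_urls to scrape other category pages.
    start_urls = [
        "https://medicosdoc.com/categoria/instituciones-barranquilla",
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Placeholder connection details; replace with your own setup.
        client = pymongo.MongoClient("mongodb://user:password@localhost:27017")
        self.collection = client["medicos"]["instituciones-scrapeops"]

    def parse(self, response):
        # Hypothetical selectors; the real script's selectors depend on
        # medicosdoc.com's actual markup.
        for card in response.css("div.doctor-card"):
            nombre = card.css("h2::text").get(default="").strip()
            doc = {
                # Unique ID derived from the doctor's name via MD5.
                "doctor_id": hashlib.md5(nombre.encode("utf-8")).hexdigest(),
                "nombre": nombre,
                "telefono": card.css(".telefono::text").getall(),
                "especializacion": card.css(".especializacion::text").getall(),
                "seguro": card.css(".seguro::text").get(),
                "direccion": card.css(".direccion::text").get(),
                "source": response.url,
            }
            self.collection.insert_one(doc)
            yield doc
```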
- Install Dependencies: Ensure that all required Python packages are installed, including `Scrapy`, `pymongo`, and `requests`.
- Configure MongoDB Connection: Update the MongoDB connection details in the script to match your specific MongoDB setup.
- Run the Spider: Execute the command `scrapy runspider medicos_spider.py` to start the scraping process. The spider will traverse the website, extract doctor data, and store it in the MongoDB collection (a programmatic alternative is sketched after this list).
- Monitor the Process: The spider logs errors to the console. Check the log messages to monitor the scraping process and identify any issues.
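
If you prefer launching the crawl from Python rather than the `scrapy` command line, a minimal sketch using Scrapy's `CrawlerProcess` (assuming the `MedicosSpider` class from the sketch above is importable from the script) could look like this:

```python
from scrapy.crawler import CrawlerProcess

# Assumes the spider class from the earlier sketch lives in medicos_spider.py.
from medicos_spider import MedicosSpider

process = CrawlerProcess(
    settings={
        # Keep console output focused on problems while monitoring a run.
        "LOG_LEVEL": "ERROR",
    }
)
process.crawl(MedicosSpider)
process.start()  # Blocks until the crawl finishes.
```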
- The spider utilizes `Scrapy-UserAgents` and `scrapy_proxy_pool` to rotate user agents and proxies, respectively, improving scraping reliability and helping to bypass bot detection.
- The spider uses `ScrapeOps` for monitoring and rotating proxies.
- The spider is limited to scraping 50 pages. This can be changed in the `CLOSESPIDER_PAGECOUNT` setting (see the sketch below).
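
A sketch of the relevant settings, suitable for a spider's `custom_settings` attribute or the project's `settings.py`. The middleware paths and priorities follow each package's documented configuration, but treat them as assumptions and verify against your installed versions:

```python
# Settings sketch: page limit plus user-agent and proxy rotation.
custom_settings = {
    # Stop the crawl after 50 pages; raise or lower as needed.
    "CLOSESPIDER_PAGECOUNT": 50,
    # Enable scrapy_proxy_pool's proxy rotation.
    "PROXY_POOL_ENABLED": True,
    "DOWNLOADER_MIDDLEWARES": {
        # Disable Scrapy's default user-agent middleware in favor of
        # the rotating one from Scrapy-UserAgents.
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
        # scrapy_proxy_pool's rotation and ban-detection middlewares.
        "scrapy_proxy_pool.middlewares.ProxyPoolMiddleware": 610,
        "scrapy_proxy_pool.middlewares.BanDetectionMiddleware": 620,
    },
}
```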