
A Scrapy and Django API backed by PostgreSQL, with proxy IP configuration, that scrapes all medicine data (medicines, prices, generics, companies, indications) for Bangladesh (30k+ pages).

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

bd-medicine-scraper


Overview

Welcome to the bd-medicine-scraper repository!

In this project, I scraped medicine data from medex.com.bd using Scrapy and integrated it with Django REST Framework. The data is stored in a PostgreSQL database. I designed the scraper to preserve the relations between models.

I also customized the Django admin panels with additional features:

  • autocomplete lookup for relational fields
  • custom filtering (alphabetical, model property)
  • bulk actions (export to CSV)
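
As a rough illustration of the CSV export bulk action, the serialisation step can be sketched with the standard library alone. This is not the repository's code: the function name `export_rows_to_csv` and the placeholder objects are hypothetical. In a Django admin action, `objects` would be the selected queryset and `out` would typically be an `HttpResponse` created with `content_type="text/csv"`, which `csv.writer` can write to directly.

```python
import csv
import io
from types import SimpleNamespace


def export_rows_to_csv(objects, field_names, out):
    """Write a header row plus one row per object to the file-like `out`.

    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(field_names)
    count = 0
    for obj in objects:
        # getattr mirrors how an admin action would read model fields;
        # missing attributes become empty cells.
        writer.writerow([getattr(obj, name, "") for name in field_names])
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder objects standing in for a selected Django queryset.
    selected = [
        SimpleNamespace(name="Example Manufacturer A"),
        SimpleNamespace(name="Example Manufacturer B"),
    ]
    buffer = io.StringIO()
    export_rows_to_csv(selected, ["name"], buffer)
    print(buffer.getvalue())
```

Keeping the writer independent of the request/response objects means the same helper could back both an admin action and a management command.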

Other Customizations:

  • a custom Scrapy command to run spiders from the Django command line (e.g. python manage.py <spider_name>)
  • custom Django commands
    • to export models to CSV (python manage.py <export_model_name> <export_data_path>)
      python manage.py export_medicine_data /home/ahmed/Desktop/medicine_data.csv
      
    • to export generic monograph PDFs
      python manage.py export_generics_monograph
      

I also added proxy configuration to Scrapy.
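
One common way to route Scrapy requests through proxies is a small downloader middleware that sets `request.meta["proxy"]`, which Scrapy's built-in `HttpProxyMiddleware` honours. The sketch below is an assumption about how such a setup might look, not the configuration used in this repository: the class name `RotatingProxyMiddleware`, the `PROXY_LIST` setting, and the proxy URLs are all hypothetical.

```python
import random
from types import SimpleNamespace


class RotatingProxyMiddleware:
    """Downloader-middleware sketch: assign a random proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting name, not a Scrapy built-in.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Keep a proxy that was set explicitly on the request; otherwise pick one.
        request.meta.setdefault("proxy", random.choice(self.proxies))
        # Returning None tells Scrapy to continue processing the request.
        return None


if __name__ == "__main__":
    # A stand-in for scrapy.Request, which also exposes a `meta` dict.
    fake_request = SimpleNamespace(meta={})
    middleware = RotatingProxyMiddleware(
        ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]
    )
    middleware.process_request(fake_request, spider=None)
    print(fake_request.meta["proxy"])
```

To use a class like this in a real project, it would be registered in the `DOWNLOADER_MIDDLEWARES` setting alongside the proxy list.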

Run

Create a Python virtual environment and run these commands from the root directory:

pip install -r requirements.txt

This runs the Django app:

python manage.py runserver

NB: Apply the migrations before running the app:

python manage.py makemigrations && python manage.py migrate

To run all spiders:

python run_crawler.py

To run a specific spider:

python manage.py <spider_name>

For example: python manage.py med

Data Analytics

Dataset

The scraped dataset is available on Kaggle.

The dataset has 6 CSV files. Here is a list of the files with their featured columns:

  1. medicine.csv (21k+ entries)
    • brand name
    • medicine type (allopathic or herbal)
    • generic
    • strength
    • manufacturer
    • package container (unit price and pack info)
    • package size (unit price)
  2. manufacturer.csv (245 entries)
    • name
  3. indication.csv (2k+ entries)
    • name
  4. generic.csv (~1700-1800 entries)
    • name
    • monographic link (PDF URL)
    • drug class
    • indication
    • generic details such as "Indication description", "Pharmacology description", "Dosage & Administration description" etc.
  5. drug class.csv (~400 entries)
    • name
  6. dosage form.csv (~120 entries)
    • name
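
Since the medicine file carries a generic column and the generic file carries a drug class, the two can be combined with a simple lookup. The sketch below assumes the CSV headers are exactly `brand name`, `generic`, `name` and `drug class`, and that the `generic` column of medicine.csv holds the generic's name; check the actual headers on Kaggle before relying on it. The function name `attach_drug_class` and the sample rows are placeholders, not real dataset entries.

```python
import csv
import io


def attach_drug_class(medicine_file, generic_file):
    """Return medicine rows with a 'drug class' looked up via the generic name."""
    class_by_generic = {
        row["name"]: row["drug class"] for row in csv.DictReader(generic_file)
    }
    joined = []
    for row in csv.DictReader(medicine_file):
        # Medicines whose generic is missing from generic.csv get an empty class.
        row["drug class"] = class_by_generic.get(row["generic"], "")
        joined.append(row)
    return joined


if __name__ == "__main__":
    # Placeholder rows, not real entries from the dataset.
    generic_csv = io.StringIO("name,drug class\nGeneric X,Class 1\n")
    medicine_csv = io.StringIO(
        "brand name,generic,strength\n"
        "Brand A,Generic X,10 mg\n"
        "Brand B,Generic Y,5 mg\n"
    )
    for row in attach_drug_class(medicine_csv, generic_csv):
        print(row["brand name"], "|", row["generic"], "|", row["drug class"])
```

With the downloaded files, the in-memory buffers would be replaced by `open("medicine.csv", newline="", encoding="utf-8")` and the equivalent call for generic.csv.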

Analytics

Bangladesh Medicine Analytics - Notebook on Kaggle

Tests

Workflow script: django-ci.yml

Run the tests with coverage tracking:

coverage run --omit='*/venv/*' manage.py test

or without it:

python manage.py test

Generate the HTML coverage report:

coverage html

Built With

Django==3.2.12
djangorestframework==3.12.2
django-admin-autocomplete-filter==0.7.1
django-filter==21.1
coverage==6.2
Scrapy==2.4.1
scrapy-djangoitem==1.1.1
psycopg2==2.9.3

Preview

django_admin_generics

django_admin_medicine

django_admin_dosage_form

django_admin_manufacturer