RateMDs

A quantitative and qualitative analysis of the RateMDs.com dataset of scraped online physician reviews.

TL;DR

This repository contains data and scripts pertaining to work done by Avijit Thawani while at Northeastern University (summer 2018) under the guidance of Dr. Byron C. Wallace (College of Computer and Information Science, Northeastern University, Boston, MA).

Thawani, A., Paul, M. J., Sarkar, U., and Wallace, B. C.
Are Online Reviews of Physicians Biased Against Female Providers?
In Proceedings of Machine Learning Research, 106:1-17, 2019.

The paper was presented at the Machine Learning for Healthcare Conference (MLHC 2019) in Ann Arbor, Michigan. A poster summarizing our work, slides from the talk, and a video presentation are also available.

If you use this work, please cite us. For feedback, errors, ideas for future work, or just to say hi, email me at thawani@usc.edu.

This Repository

  1. raw data: parsed HTML files from RateMDs.com
  2. unclean.csv: id, review, physician specialty, physician gender, physician name, document label
  3. processed_1.csv: review id, physician id, physician specialty, physician gender, rating staff, rating punctuality, rating helpfulness, rating knowledgeability, review text (tokenized)
  4. all_Github.csv: physician_id.review_id, physician_id, physician name, physician specialty, physician gender, rating staff, rating punctuality, rating helpfulness, rating knowledgeability, review text
  5. scripts: Jupyter Notebooks to reproduce our results (corresponding section of the paper in parentheses):
  • clean.ipynb: Data preprocessing (Section 2.1)
  • regression.ipynb: Rating Analysis (Section 2.2)
  • LR.ipynb: Lexical Regression (Section 2.3.1)
  • match.ipynb: Embeddings (Section 2.3.2)
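
To give a feel for the layout of all_Github.csv, here is a minimal Python sketch that splits the composite physician_id.review_id key and averages the four rating columns per physician gender. It is not taken from the notebooks. The header names in COLUMNS, the "f"/"m" gender codes, and the rows in SAMPLE are assumptions made up for illustration; check them against the real header row before using this on the actual file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Assumed header names: the real header row of all_Github.csv may differ.
COLUMNS = [
    "id", "physician_id", "physician_name", "physician_specialty",
    "physician_gender", "rating_staff", "rating_punctuality",
    "rating_helpfulness", "rating_knowledgeability", "review_text",
]
RATING_COLUMNS = [c for c in COLUMNS if c.startswith("rating_")]

# Invented placeholder rows (not real data), used only to exercise the code.
SAMPLE = "\n".join([
    ",".join(COLUMNS),
    "101.1,101,Physician A,Family Doctor,f,5,4,5,5,placeholder text",
    "101.2,101,Physician A,Family Doctor,f,3,4,3,5,placeholder text",
    "202.1,202,Physician B,Family Doctor,m,4,4,4,4,placeholder text",
])


def split_composite_id(composite_id):
    """Split 'physician_id.review_id' at the last dot into its two parts."""
    physician_id, review_id = composite_id.rsplit(".", 1)
    return physician_id, review_id


def read_reviews(handle):
    """Read rows from an open CSV handle whose first line is the header."""
    return list(csv.DictReader(handle))


def mean_ratings_by_gender(rows):
    """Average each rating column separately for each physician gender."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col in RATING_COLUMNS:
            value = row.get(col)
            if value:  # skip missing or empty ratings
                grouped[row["physician_gender"]][col].append(float(value))
    return {
        gender: {col: mean(values) for col, values in cols.items()}
        for gender, cols in grouped.items()
    }


if __name__ == "__main__":
    reviews = read_reviews(io.StringIO(SAMPLE))
    for gender, ratings in mean_ratings_by_gender(reviews).items():
        print(gender, ratings)
```

To run this on the real file, replace io.StringIO(SAMPLE) with open("all_Github.csv", newline="", encoding="utf-8"); the encoding is also an assumption.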

Contributors

Avijit Thawani, University of Southern California (work done while interning at Northeastern University in summer 2018).
Michael J. Paul, University of Colorado Boulder.
Urmimala Sarkar, University of California San Francisco.
Byron C. Wallace, Northeastern University.