Drug Prescription Sentiment Analysis

1. | Introduction 👋

  • Dataset Problems 🤔
    • 👉 The Drug Review Dataset is taken from the UCI Machine Learning Repository. It provides patient reviews of specific drugs, the related conditions, and a 10-star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites.
    • 👉 The Drug Review Dataset has shape (161297, 7), i.e. 161297 data points (entries) and 7 features, including the review.
    • 👉 The goal is to recommend the top 5 most useful drugs for each medical condition, based on the reviews and on a usefulness score calculated as rating * usefulCount * eff_score (the normalized average rating).
    • 👉 We treat this as a multi-class classification problem.
    • 👉 Choose the machine learning or deep learning models with the highest accuracy and F1 score to recommend the top 5 most useful drugs for each medical condition.
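The usefulness score above can be sketched in plain Python. This is only an illustration under stated assumptions: eff_score is taken to be a drug's mean rating for the condition divided by 10, per-review usefulness is summed per drug, and the rows, drug names and the helper `top_drugs` are hypothetical.

```python
from collections import defaultdict

# Toy rows standing in for the real dataset (values are made up).
reviews = [
    {"drugName": "DrugA", "condition": "Pain", "rating": 9, "usefulCount": 20},
    {"drugName": "DrugA", "condition": "Pain", "rating": 7, "usefulCount": 5},
    {"drugName": "DrugB", "condition": "Pain", "rating": 4, "usefulCount": 30},
    {"drugName": "DrugC", "condition": "Acne", "rating": 10, "usefulCount": 2},
]

def top_drugs(rows, condition, k=5):
    """Rank the drugs reviewed for one condition by summed usefulness."""
    selected = [r for r in rows if r["condition"] == condition]

    # Assumption: eff_score = mean rating of the drug for this condition, scaled to [0, 1].
    ratings = defaultdict(list)
    for r in selected:
        ratings[r["drugName"]].append(r["rating"])
    eff_score = {drug: sum(v) / len(v) / 10 for drug, v in ratings.items()}

    # usefulness = rating * usefulCount * eff_score, accumulated per drug (assumed aggregation).
    totals = defaultdict(float)
    for r in selected:
        drug = r["drugName"]
        totals[drug] += r["rating"] * r["usefulCount"] * eff_score[drug]

    return sorted(totals, key=totals.get, reverse=True)[:k]

print(top_drugs(reviews, "Pain"))  # ['DrugA', 'DrugB']
```

Summing favors drugs with many helpful reviews; averaging instead would favor consistently well-rated drugs regardless of review volume.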
  • Dataset Description 🧾
    👉 There are 7 variables in this dataset:
    • FEATURES
    1. uniqueID | An identifier for each post.
    2. drugName | The name of the drug being reviewed.
    3. review | The patient's review of a particular medicine.
    4. rating | The patient's rating of the medicine on a 10-point scale, where 10 represents maximum efficacy.
    5. date | Date of review entry.
    6. usefulCount | The number of users who found the review useful.
    7. condition | The name of the medical condition for which the medicine is used.
  • Machine Learning Models 👨‍💻
    👉 The models used in this notebook:
    1. Linear Support Vector Machine (LinearSVC)
    2. Multinomial Naive Bayes (MNB)
    3. Light Gradient Boosting Machine (LGBM)
    4. Passive Aggressive Classifier
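To give a feel for how a Passive Aggressive classifier learns, here is a from-scratch binary sketch of the PA-I update rule on bag-of-words features. It is not the library implementation the notebook relies on; the helper names, the toy reviews and the +1/-1 labels are assumptions made for illustration.

```python
from collections import Counter

def featurize(text):
    """Bag-of-words counts as a sparse dict (a stand-in for TF-IDF features)."""
    return Counter(text.lower().split())

def train_pa(samples, C=1.0, epochs=5):
    """Binary Passive Aggressive (PA-I) training; labels are +1 or -1."""
    w = {}
    for _ in range(epochs):
        for text, y in samples:
            x = featurize(text)
            if not x:
                continue
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            loss = max(0.0, 1.0 - margin)  # hinge loss
            if loss > 0.0:
                # Aggressive step: move just far enough to fix the margin, capped by C.
                tau = min(C, loss / sum(v * v for v in x.values()))
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + tau * y * v
            # Passive step: weights are left untouched when the margin is already >= 1.
    return w

def predict(w, text):
    score = sum(w.get(f, 0.0) * v for f, v in featurize(text).items())
    return 1 if score >= 0 else -1

# Made-up training reviews: +1 = positive, -1 = negative.
samples = [
    ("this drug worked great", 1),
    ("no side effects and great relief", 1),
    ("terrible side effects", -1),
    ("made my pain worse", -1),
]
weights = train_pa(samples)
print(predict(weights, "great relief"), predict(weights, "terrible pain"))
```

The cap `C` limits how far a single noisy review can move the weights, which is the main tuning knob of this family of models.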
  • Outcome ✅
    • 👉 Recommend the top 5 most useful drugs for each medical condition by calculated usefulness, displayed interactively with Python's @interact.
    • 👉 Recommend the top 5 most useful drugs for each medical condition using the best machine learning or deep learning model.

2. | File Descriptions 👓

  • drugsComTrain_raw.csv: the train dataset file.
  • drugsComTest_raw.csv: the test dataset file.
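A minimal sketch of reading one of these files with the standard library follows. It assumes a comma-separated file with a header row naming the 7 columns described earlier; the helper `load_reviews` and the in-memory sample row are hypothetical, and the delimiter is a parameter in case the files are tab-separated.

```python
import csv
import io

def load_reviews(handle, delimiter=","):
    """Read review rows from an open text file into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["rating"] = float(row["rating"])        # numeric rating
        row["usefulCount"] = int(row["usefulCount"])
        rows.append(row)
    return rows

# In-memory stand-in for a real file (one made-up review).
sample = io.StringIO(
    "uniqueID,drugName,condition,review,rating,date,usefulCount\n"
    '1,DrugA,Pain,"Helped, but slowly.",8,1-Jan-15,12\n'
)
rows = load_reviews(sample)
print(len(rows), sorted(rows[0]))
```

For the real data, the same function can be called with `open("drugsComTrain_raw.csv", newline="", encoding="utf-8")` in place of the sample.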

3. | Accuracy of Best Model 🧪

Passive Aggressive Classifier

  • Accuracy achieved: 93.91%
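For reference, accuracy and F1 can be computed from predictions as sketched below. The labels are made up, and macro averaging of the per-class F1 is an assumption, since the averaging used for the reported scores is not stated.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for four reviews.
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
print(accuracy(y_true, y_pred))  # 0.75
print(round(macro_f1(y_true, y_pred), 4))
```

On imbalanced data, accuracy alone can look high while a minority class is poorly predicted, which is why F1 is reported alongside it.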

4. | Conclusion 📤

  • In this study:
  • We framed drug recommendation as a multi-class classification problem on the Drug Review Dataset and tried a variety of models to recommend the top 5 drugs for each medical condition based on the reviews given.
  • We performed a detailed exploratory data analysis (EDA).

    The 'condition' column has missing values; rows containing missing values were dropped.

  • The top 10 most common medical conditions:

    "Birth Control", "Depression", "Pain", "Anxiety", "Acne", "Bipolar Disorder", "Insomnia", "Weight Loss", "Obesity" and "ADHD".

  • The top 10 most commonly used drug names:

    "Levonorgestrel", "Etonogestrel", "Ethinyl Estradiol / Norethindrone", "Nexplanon", "Ethinyl Estradiol / Norgestimate", "Ethinyl Estradiol / Levonorgestrel", "Phentermine", "Sertraline", "Escitalopram" and "Mirena".

  • The top 10 conditions with the most drugs available:

    "Pain", "Birth Control", "High Blood Pressure", "Acne", "Depression", "Rheumatoid Arthritis", "Diabetes Type 2", "Allergic Rhinitis", "Osteoarthritis" and "Bipolar Disorder".

  • The drugs used for the most conditions:

    "Prednisone", "Gabapentin", "Ciprofloxacin", "Doxycycline", "Amitriptyline", "Metronidazole", "Venlafaxine", "Neurontin", "Dexamethasone" and "Lyrica".

  • Review length has no clear effect on drug ratings.
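The exploratory counts listed above (dropping rows with a missing condition, the most common conditions and drugs, and the number of distinct drugs per condition and conditions per drug) can be sketched as follows. The rows and the helper `count_distinct` are hypothetical and only illustrate the counting logic.

```python
from collections import Counter, defaultdict

# Toy rows (made up); the real data has one row per review.
rows = [
    {"drugName": "DrugA", "condition": "Pain"},
    {"drugName": "DrugB", "condition": "Pain"},
    {"drugName": "DrugA", "condition": "Pain"},
    {"drugName": "DrugA", "condition": "Acne"},
    {"drugName": "DrugC", "condition": ""},  # missing condition
]

# Drop rows whose condition is missing (empty or None).
clean = [r for r in rows if r["condition"]]

# Most frequently reviewed conditions and drug names.
top_conditions = Counter(r["condition"] for r in clean).most_common(10)
top_drug_names = Counter(r["drugName"] for r in clean).most_common(10)

def count_distinct(records, key, value):
    """Map each `key` to its number of distinct `value`s, largest first."""
    groups = defaultdict(set)
    for r in records:
        groups[r[key]].add(r[value])
    counts = [(k, len(v)) for k, v in groups.items()]
    return sorted(counts, key=lambda kv: kv[1], reverse=True)

drugs_per_condition = count_distinct(clean, "condition", "drugName")[:10]
conditions_per_drug = count_distinct(clean, "drugName", "condition")[:10]

print(top_conditions)       # [('Pain', 3), ('Acne', 1)]
print(drugs_per_condition)  # [('Pain', 2), ('Acne', 1)]
print(conditions_per_drug)  # [('DrugA', 2), ('DrugB', 1)]
```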
  • Performed data cleaning, wrangling and feature engineering:
    • Remove all hyperlinks, HTML tags, punctuation, numbers, symbols, special characters, extra spaces, accented characters and stop words
    • Decontract text (expand contractions)
    • Tokenize using NLTK's word_tokenize
    • Convert all text to lowercase
    • Lemmatize tokens
    • Join the tokens back into a text string
    • Clean the medical condition column of entries containing redundant information
    • Remove person names, location names, organization names, dates and times, money amounts, etc. using named entity recognition (NER)
    • Remove DataFrame rows where the medical condition has only 1 drug
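Several of the cleaning steps above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's pipeline: the contraction map and stop-word list are tiny hypothetical subsets, accents are stripped rather than the characters being deleted, `str.split` stands in for NLTK's word_tokenize, and lemmatization and NER are left out.

```python
import html
import re
import unicodedata

# Small illustrative subsets; the real lists would be much longer.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not", "'ve": " have", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "i", "it", "is", "and", "to", "of", "for", "my", "this"}

def clean_review(text):
    """Return a lowercased, stripped-down version of one review."""
    text = html.unescape(text)                            # decode HTML entities such as &#039;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove hyperlinks
    text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
    text = text.lower()
    for short, full in CONTRACTIONS.items():              # expand contractions
        text = text.replace(short, full)
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove punctuation, numbers, symbols
    tokens = text.split()                                 # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return " ".join(tokens)                               # join back into one string

print(clean_review("I can&#039;t sleep <br/> see http://example.com NOW, 24/7!"))
# cannot sleep see now
```

Lowercasing before expanding contractions keeps the contraction map small, since only lowercase keys are needed.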
  • The suitable n-grams for building the machine learning models are 2-grams and 4-grams.
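As a reminder of what 2-grams and 4-grams look like, the sketch below extracts word n-grams from a tokenized review. The helper `ngrams` and the example sentence are hypothetical.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "no side effects after two weeks".split()
print(ngrams(tokens, 2))
# ['no side', 'side effects', 'effects after', 'after two', 'two weeks']
print(ngrams(tokens, 4))
# ['no side effects after', 'side effects after two', 'effects after two weeks']
```

Longer n-grams keep phrases such as "no side effects" intact, where single words like "no" or "effects" would be ambiguous on their own.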
  • Positive reviews make up 70% of the data, so the dataset is imbalanced.
  • The best model is the Passive Aggressive Classifier with 92.75% accuracy.
  • Provided 2 ways of analyzing the 3/5 most useful drugs per medical condition:

    1. Through the calculated effectiveness and usefulness of drugs, displayed using Python's @interact function.
    2. Through the machine learning drug recommendation system. (The keywords extracted from reviews bring meaningful context to each medical condition, as shown by the most informative review features for each class.) -> Top 5 drugs recommended for each condition.

5. | Reference 🔗