mysql-cloudmanaged

HHA504 / Assignment 8 / Cloud-managed MySQL DB + ERD + Dummy Data

This repo aims to:

  • create a cloud-managed MySQL database on GCP
  • use Python to create SQL tables and reproduce relationships between tables without repeating commands
  • experiment with various ways of using MySQL (remote MySQL instance, local MySQL client, MySQL Workbench)
  • create an ERD using MySQL Workbench
  • create realistic patient dummy data using the faker Python package

Note: Initially, I manually created dummy patient data to insert into the tables in my db. However, I have since revised the Python scripts to use the faker package to auto-generate realistic dummy patient data for my tables. I have included screenshots of my old dummy data versus my new dummy data in my old images folder.


TASKS

1. setup.md:

  • How I set up and connect my GCP Compute Engine VM instance to a MySQL dev env
  • How I set up a GCP MySQL instance
  • How I set up MySQL Workbench
  • Lists the corresponding dependencies for each module

2. CREATE A NEW DB IN MYSQL INSTANCE CALLED patient_portal

  1. Log back into the MySQL server, then create and verify the new db using:

     mysql -u root -h [MySQL instance IP address] -p
     [enter password at prompt]
     create database patient_portal;
     show databases \G

  2. Switch to the new db to create tables in it:

     use patient_portal;
    

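The same connection can be made from Python, which the scripts in this repo rely on. A minimal sketch of building the connection string (the env var names here are assumptions; in the real scripts the credentials are loaded from a .env file via python-dotenv, as covered in setup.md):

```python
import os

# Sketch: build the SQLAlchemy connection string for the GCP MySQL instance.
# Env var names are illustrative; the real scripts load them from .env.
user = os.getenv("MYSQL_USER", "root")
password = os.getenv("MYSQL_PASSWORD", "changeme")
host = os.getenv("MYSQL_HOST", "127.0.0.1")  # the MySQL instance IP address
database = os.getenv("MYSQL_DATABASE", "patient_portal")

connection_string = f"mysql+pymysql://{user}:{password}@{host}:3306/{database}"
# from sqlalchemy import create_engine
# db = create_engine(connection_string)
# pd.read_sql_query("show databases;", db)  # verify the connection works
print(connection_string)
```

Keeping the credentials in .env (and out of version control) is the point of the python-dotenv dependency listed below.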

3. CREATING TABLES AND TABLE RELATIONSHIPS IN PYTHON

[Part 3.1] sql_table_creation.py: Create table schemas

  1. Packages Used:

        import dbm
        import pandas as pd 
        import sqlalchemy
        from sqlalchemy import create_engine
        from dotenv import load_dotenv # pip install python-dotenv
        import os
    
  2. Create tables (various methods):

    • If using MySQL Workbench: Paste raw SQL Query into workbench

    • If using terminal: Run code through instance console or local cmd terminal using MySQL client

       production_patients
       production_medications
       production_conditions
       production_treatment_procedures
       production_social_determinants
      
    • create table if not exists production_patients (
      id int auto_increment,
      mrn varchar(255) default null unique,
      first_name varchar(255) default null,
      last_name varchar(255) default null,
      zip_code varchar(255) default null,
      dob varchar(255) default null,
      gender varchar(255) default null,
      contact_mobile varchar(255) default null,
      contact_home varchar(255) default null,
      PRIMARY KEY (id)
      );

    • id int auto_increment: the index; auto-generates ID values

    • PRIMARY KEY (id): the auto_increment column must be defined as a key, so declare it as the PRIMARY KEY explicitly

  3. Verify that tables have been created in the MySQL console

    • In MySQL Workbench:

       Refresh "Tables" tab under Schema
      
    • In SQL Instance Terminal:

       show tables;
      

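Since the five production_* tables share the same auto-increment id / PRIMARY KEY pattern, the CREATE TABLE statements can be generated in a loop rather than pasted one by one. A sketch of that idea (the column lists and the create_table_sql helper are illustrative, not the exact schemas in sql_table_creation.py; `db` would be the SQLAlchemy engine from the connection step):

```python
# Illustrative sketch: generate the repeated CREATE TABLE statements in a loop.
# Column lists below are examples, not the exact schemas from sql_table_creation.py.
tables = {
    "production_medications": "med_ndc varchar(255) default null unique, med_human_name varchar(255) default null",
    "production_conditions": "icd10_code varchar(255) default null unique, icd10_description varchar(255) default null",
}

def create_table_sql(name, columns):
    # every table gets the same auto-increment id / PRIMARY KEY boilerplate
    return (
        f"create table if not exists {name} ("
        f"id int auto_increment, {columns}, PRIMARY KEY (id));"
    )

statements = [create_table_sql(name, cols) for name, cols in tables.items()]
# for stmt in statements:
#     db.execute(stmt)  # run each statement against the MySQL instance
print(statements[0])
```

This is one way to "reproduce relationships between tables without repeating commands" as stated in the repo goals.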
[Part 3.2] sql_dummy_data.py: Create realistic dummy patient data

  1. Resources used to obtain real codes: see the raw .csv URLs in step 3 below.

  2. Packages Used:

    •   import dbm
        import pandas as pd 
        import sqlalchemy
        from sqlalchemy import create_engine
        from dotenv import load_dotenv
        import os
        from faker import Faker # https://faker.readthedocs.io/en/master/
        import uuid # used to generate mrn numbers
        import random # creates randomness
      
  3. Use .csv of ICD10, NDC, CPT, LOINC codes to insert real codes to corresponding tables.

    • Take 1000 random codes and create a new df
    • Drop duplicate codes to create uniqueness
    • #### real icd10 codes (conditions table)
      icd10codes = pd.read_csv('https://raw.githubusercontent.com/Bobrovskiy/ICD-10-CSV/master/2020/diagnosis.csv')
      list(icd10codes.columns)
      icd10codesShort = icd10codes[['CodeWithSeparator', 'ShortDescription']]
      icd10codesShort.head(30) # view output
      # take 1000 random icd10 codes and create a new df of it
      icd10codesShort_1k = icd10codesShort.sample(n=1000)
      # drop duplicates
      icd10codesShort_1k = icd10codesShort_1k.drop_duplicates(subset=['CodeWithSeparator'], keep='first')
      icd10codesShort_1k # view output
      #### real ndc codes (medications table)
      ndc_codes = pd.read_csv('https://raw.githubusercontent.com/hantswilliams/FDA_NDC_CODES/main/NDC_2022_product.csv')
      # take 1000 random ndc codes
      ndc_codes_1k = ndc_codes.sample(n=1000, random_state=1)
      # drop duplicates from ndc_codes_1k
      ndc_codes_1k = ndc_codes_1k.drop_duplicates(subset=['PRODUCTNDC'], keep='first')
      ndc_codes_1k # view output
      #### real cpt codes (treatment procedures table)
      cpt_codes = pd.read_csv('https://gist.githubusercontent.com/lieldulev/439793dc3c5a6613b661c33d71fdd185/raw/25c3abcc5c24e640a0a5da1ee04198a824bf58fa/cpt4.csv')
      # take 1000 random cpt codes
      cpt_codes_1k = cpt_codes.sample(n=1000, random_state=1)
      #drop duplicates from cpt_codes_1k
      cpt_codes_1k = cpt_codes_1k.drop_duplicates(subset=['com.medigy.persist.reference.type.clincial.CPT.code'], keep='first')
      cpt_codes_1k # view output
      ### real loinc codes (social determinants table)
      loinc_codes = pd.read_csv('data/Loinc.csv') # local file; forward slashes avoid backslash-escape issues
      # take 1000 random loinc codes
      loinc_codes_1k = loinc_codes.sample(n=1000, random_state=1)
      #drop duplicates from loinc_codes_1k
      loinc_codes_1k = loinc_codes_1k.drop_duplicates(subset=['LOINC_NUM'], keep='first')
      loinc_codes_1k # view output
  4. Insert fake patients

    • insertQuery = "INSERT INTO production_patients (mrn, first_name, last_name, zip_code, dob, gender, contact_mobile, contact_home) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
      for index, row in df_fake_patients.iterrows():
          db.execute(insertQuery, (row['mrn'], row['first_name'], row['last_name'], row['zip_code'], row['dob'], row['gender'], row['contact_mobile'], row['contact_home']))
          print("inserted row: ", index)
      # query the db to see if the data is there
      df_gcp = pd.read_sql_query("SELECT * FROM production_patients", db)
      df_gcp # verify the dummy data in MySQL Workbench: select * from patient_portal.production_patients
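The df_fake_patients dataframe used above comes from the faker package. A rough stand-alone sketch of the generation step (the real script uses Faker for names and DOBs; those calls are shown commented out here so the stdlib uuid/random parts run on their own, and the field list is abbreviated):

```python
import random
import uuid

# Sketch of generating fake patients. The real sql_dummy_data.py uses the
# faker package for names/DOBs; those calls are commented out so the
# stdlib parts (uuid for MRNs, random for the rest) run standalone.
# from faker import Faker
# fake = Faker()

def make_fake_patient():
    return {
        "mrn": str(uuid.uuid4())[:8],  # short pseudo-unique MRN
        # "first_name": fake.first_name(),
        # "last_name": fake.last_name(),
        # "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        "zip_code": str(random.randint(10000, 99999)),
        "gender": random.choice(["male", "female"]),
    }

fake_patients = [make_fake_patient() for _ in range(10)]
# df_fake_patients = pd.DataFrame(fake_patients)  # then insert as shown above
print(fake_patients[0]["mrn"])
```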
  5. Insert fake conditions (ICD10 codes)

    • ########## INSERTING IN FAKE CONDITIONS ##########
      insertQuery = "INSERT INTO production_conditions (icd10_code, icd10_description) VALUES (%s, %s)"
      startingRow = 0
      for index, row in icd10codesShort_1k.iterrows():
          startingRow += 1
          print('startingRow: ', startingRow)
          db.execute(insertQuery, (row['CodeWithSeparator'], row['ShortDescription']))
          print("inserted row db_gcp: ", index)
          # stop once we have 100 rows
          if startingRow == 100:
              break
      # query the db to see if the data is there
      df_gcp = pd.read_sql_query("SELECT * FROM production_conditions", db)
      df_gcp
  6. Create fake patient conditions and insert randomly to patients

    • ####### CREATING FAKE PATIENT CONDITIONS ###########
      # first, query production_conditions and production_patients to get the ids
      df_conditions = pd.read_sql_query("SELECT icd10_code FROM production_conditions", db)
      df_patients = pd.read_sql_query("SELECT mrn FROM production_patients", db)
      # create a stacked dataframe that gives each patient a random number of conditions between 1 and 5
      df_patient_conditions = pd.DataFrame(columns=['mrn', 'icd10_code'])
      # for each patient, take a random sample of conditions from df_conditions and place it in df_patient_conditions
      for index, row in df_patients.iterrows():
          # get a random sample of 1-5 conditions from df_conditions
          df_conditions_sample = df_conditions.sample(n=random.randint(1, 5))
          # add the mrn to the df_conditions_sample
          df_conditions_sample['mrn'] = row['mrn']
          # append the df_conditions_sample to df_patient_conditions
          # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
          df_patient_conditions = pd.concat([df_patient_conditions, df_conditions_sample])
      print(df_patient_conditions.head(20))
      ## INSERTING RANDOM CONDITIONS INTO PATIENTS ##
      insertQuery = "INSERT INTO production_patient_conditions (mrn, icd10_code) VALUES (%s, %s)"
      for index, row in df_patient_conditions.iterrows():
          db.execute(insertQuery, (row['mrn'], row['icd10_code']))
          print("inserted row: ", index)
  7. Repeat for the medications, treatment procedures, and social determinants tables

  8. Verify tables are created in the MySQL console

      show tables;
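The repetition in step 7 can also be factored into a single helper that pairs each code dataframe with its target table. A sketch (the table/column pairings are assumptions about the scripts, not their exact contents; `db` would be the SQLAlchemy engine):

```python
# Illustrative sketch of factoring out step 7: one helper builds the
# parameterized INSERT for each code table, and one commented-out loop
# would drive the inserts. Table/column names are assumptions.

def build_insert_query(table, columns):
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

# def insert_rows(db, df, table, columns, limit=100):
#     query = build_insert_query(table, columns)
#     for i, (_, row) in enumerate(df.iterrows()):
#         db.execute(query, tuple(row[c] for c in columns))
#         if i + 1 == limit:  # stop once we have `limit` rows, as in step 5
#             break

query = build_insert_query("production_medications", ["med_ndc", "med_human_name"])
print(query)
```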
    


4. Create an Entity Relationship Diagram (ERD) for my DB design using MySQL Workbench

  • Must have at least two foreign keys representing a relationship between at least 2 tables.

  • Shows the rows and columns created within the tables

  • Shows an elevated perspective of the Schemas view

  • In MySQL Workbench, select:

    •   Database > Reverse Engineer
        Load the stored host credentials and enter the password
        Select the patient_portal db
        Select Import MySQL Table Objects
        Check: Place imported objects on a diagram
        Execute
        Select: Reverse Engineer Selected Objects and Place Objects on Diagram
        Reorganize the diagram by dragging tables
        Save and export as .PNG
      


5. images folder

- screenshot of an ERD of my proposed setup (made with either PopSQL or MySQL Workbench)
- screenshots of me connected to the SQL server, performing the following queries:
    - Query 1: show databases (e.g., show databases;)
    - Query 2: all of the tables from my database (e.g., show tables;)
    - Query 3: select * from patient_portal.medications
    - Query 4: select * from patient_portal.treatment_procedures
    - Query 5: select * from patient_portal.conditions


6. Consider backups

  • Hot backups: read replicas on GCP; you can fail over to the second database immediately
  • Cold backups: typically .csv exports that need to be restored into another MySQL environment
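For the cold-backup option, the tables could be dumped to .csv with pandas. A sketch (assumes `db` is the SQLAlchemy engine used by the dummy-data scripts; the file-name pattern is illustrative):

```python
# Sketch of a cold backup: export each table to a .csv that can later be
# loaded into another MySQL environment. Assumes `db` is the SQLAlchemy
# engine used elsewhere in this repo.
tables = [
    "production_patients",
    "production_medications",
    "production_conditions",
    "production_treatment_procedures",
    "production_social_determinants",
]

backup_files = [f"backup_{t}.csv" for t in tables]
# for t, path in zip(tables, backup_files):
#     pd.read_sql_query(f"SELECT * FROM {t}", db).to_csv(path, index=False)
print(backup_files[0])
```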