mysql-cloudmanaged

HHA504 / Assignment 8 / Cloud-managed MySQL DB + ERD + Dummy Data

This repo aims to:

  • create a cloud-managed MySQL database on GCP
  • use Python to create SQL tables and reproduce relationships between tables without repeating commands
  • experiment with various ways of using MySQL (remote MySQL instance, local MySQL client, MySQL Workbench)
  • create an ERD using MySQL Workbench
  • create realistic patient dummy data using the faker Python package

Note: Initially, I manually created dummy patient data to insert into the tables in my db. However, I have since revised the Python scripts to use the faker package to auto-generate realistic dummy patient data for my tables. I have included screenshots of my old dummy data versus my new dummy data in my old images folder.


TASKS

1. setup.md:

  • How I set up and connect my GCP Compute Engine VM instance to a MySQL dev env
  • How I set up a GCP MySQL instance
  • How I set up MySQL Workbench
  • Lists the corresponding dependencies for each module

2. CREATE A NEW DB IN MYSQL INSTANCE CALLED patient_portal

  1. Log back into the MySQL server, then create and verify the new db using:

     mysql -u root -h [MySQL instance IP address] -p
     [enter password at prompt]
     create database patient_portal;
     show databases \G

  2. Switch to the new db to create tables in it:

     use patient_portal;
    

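The same connection can be made from Python, which the scripts in this repo rely on. A minimal sketch of building the connection string (the env var names here are assumptions; in the real scripts the credentials are loaded from a .env file via python-dotenv, as covered in setup.md):

```python
import os

# Sketch: build the SQLAlchemy connection string for the GCP MySQL instance.
# Env var names are illustrative; the real scripts load them from .env.
user = os.getenv("MYSQL_USER", "root")
password = os.getenv("MYSQL_PASSWORD", "changeme")
host = os.getenv("MYSQL_HOST", "127.0.0.1")  # the MySQL instance IP address
database = os.getenv("MYSQL_DATABASE", "patient_portal")

connection_string = f"mysql+pymysql://{user}:{password}@{host}:3306/{database}"
# from sqlalchemy import create_engine
# db = create_engine(connection_string)
# pd.read_sql_query("show databases;", db)  # verify the connection works
print(connection_string)
```

Keeping the credentials in .env (and out of version control) is the point of the python-dotenv dependency listed below.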

3. CREATING TABLES AND TABLE RELATIONSHIPS IN PYTHON

[Part 3.1] sql_table_creation.py: Create table schemas

  1. Packages Used:

        import dbm
        import pandas as pd 
        import sqlalchemy
        from sqlalchemy import create_engine
        from dotenv import load_dotenv # pip install python-dotenv
        import os
    
  2. Create tables (various methods):

    • If using MySQL Workbench: Paste raw SQL Query into workbench

    • If using terminal: Run code through instance console or local cmd terminal using MySQL client

       production_patients
       production_medications
       production_conditions
       production_treatment_procedures
       production_social_determinants
      
    • create table if not exists production_patients (
      id int auto_increment,
      mrn varchar(255) default null unique,
      first_name varchar(255) default null,
      last_name varchar(255) default null,
      zip_code varchar(255) default null,
      dob varchar(255) default null,
      gender varchar(255) default null,
      contact_mobile varchar(255) default null,
      contact_home varchar(255) default null,
      PRIMARY KEY (id)
      );

    • id int auto_increment: the index; auto-generates ID values

    • PRIMARY KEY (id): the auto_increment column must be defined as a key, so declare it as the PRIMARY KEY explicitly

  3. Verify that tables have been created in the MySQL console

    • In MySQL Workbench:

       Refresh "Tables" tab under Schema
      
    • In SQL Instance Terminal:

       show tables;
      

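Since the five production_* tables share the same auto-increment id / PRIMARY KEY pattern, the CREATE TABLE statements can be generated in a loop rather than pasted one by one. A sketch of that idea (the column lists and the create_table_sql helper are illustrative, not the exact schemas in sql_table_creation.py; `db` would be the SQLAlchemy engine from the connection step):

```python
# Illustrative sketch: generate the repeated CREATE TABLE statements in a loop.
# Column lists below are examples, not the exact schemas from sql_table_creation.py.
tables = {
    "production_medications": "med_ndc varchar(255) default null unique, med_human_name varchar(255) default null",
    "production_conditions": "icd10_code varchar(255) default null unique, icd10_description varchar(255) default null",
}

def create_table_sql(name, columns):
    # every table gets the same auto-increment id / PRIMARY KEY boilerplate
    return (
        f"create table if not exists {name} ("
        f"id int auto_increment, {columns}, PRIMARY KEY (id));"
    )

statements = [create_table_sql(name, cols) for name, cols in tables.items()]
# for stmt in statements:
#     db.execute(stmt)  # run each statement against the MySQL instance
print(statements[0])
```

This is one way to "reproduce relationships between tables without repeating commands" as stated in the repo goals.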
[Part 3.2] sql_dummy_data.py: Create realistic dummy patient data

  1. Resources used to obtain real codes: see the raw .csv URLs in step 3 below.

  2. Packages Used:

    •   import dbm
        import pandas as pd 
        import sqlalchemy
        from sqlalchemy import create_engine
        from dotenv import load_dotenv
        import os
        from faker import Faker # https://faker.readthedocs.io/en/master/
        import uuid # used to generate mrn numbers
        import random # creates randomness
      
  3. Use .csv of ICD10, NDC, CPT, LOINC codes to insert real codes to corresponding tables.

    • Take 1000 random codes and create a new df
    • Drop duplicate codes to create uniqueness
    • #### real icd10 codes (conditions table)
      icd10codes = pd.read_csv('https://raw.githubusercontent.com/Bobrovskiy/ICD-10-CSV/master/2020/diagnosis.csv')
      list(icd10codes.columns)
      icd10codesShort = icd10codes[['CodeWithSeparator', 'ShortDescription']]
      icd10codesShort.head(30) # view output
      # take 1000 random icd10 codes and create a new df of it
      icd10codesShort_1k = icd10codesShort.sample(n=1000)
      # drop duplicates
      icd10codesShort_1k = icd10codesShort_1k.drop_duplicates(subset=['CodeWithSeparator'], keep='first')
      icd10codesShort_1k # view output
      #### real ndc codes (medications table)
      ndc_codes = pd.read_csv('https://raw.githubusercontent.com/hantswilliams/FDA_NDC_CODES/main/NDC_2022_product.csv')
      # take 1000 random ndc codes
      ndc_codes_1k = ndc_codes.sample(n=1000, random_state=1)
      # drop duplicates from ndc_codes_1k
      ndc_codes_1k = ndc_codes_1k.drop_duplicates(subset=['PRODUCTNDC'], keep='first')
      ndc_codes_1k # view output
      #### real cpt codes (treatment procedures table)
      cpt_codes = pd.read_csv('https://gist.githubusercontent.com/lieldulev/439793dc3c5a6613b661c33d71fdd185/raw/25c3abcc5c24e640a0a5da1ee04198a824bf58fa/cpt4.csv')
      # take 1000 random cpt codes
      cpt_codes_1k = cpt_codes.sample(n=1000, random_state=1)
      #drop duplicates from cpt_codes_1k
      cpt_codes_1k = cpt_codes_1k.drop_duplicates(subset=['com.medigy.persist.reference.type.clincial.CPT.code'], keep='first')
      cpt_codes_1k # view output
      ### real loinc codes (social determinants table)
      loinc_codes = pd.read_csv('data/Loinc.csv') # local file; forward slashes avoid backslash-escape issues
      # take 1000 random loinc codes
      loinc_codes_1k = loinc_codes.sample(n=1000, random_state=1)
      #drop duplicates from loinc_codes_1k
      loinc_codes_1k = loinc_codes_1k.drop_duplicates(subset=['LOINC_NUM'], keep='first')
      loinc_codes_1k # view output
  4. Insert fake patients

    • insertQuery = "INSERT INTO production_patients (mrn, first_name, last_name, zip_code, dob, gender, contact_mobile, contact_home) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
      for index, row in df_fake_patients.iterrows():
          db.execute(insertQuery, (row['mrn'], row['first_name'], row['last_name'], row['zip_code'], row['dob'], row['gender'], row['contact_mobile'], row['contact_home']))
          print("inserted row: ", index)
      # query the db to see if the data is there
      df_gcp = pd.read_sql_query("SELECT * FROM production_patients", db)
      df_gcp # verify the dummy data in MySQL Workbench: select * from patient_portal.production_patients
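The df_fake_patients dataframe used above comes from the faker package. A rough stand-alone sketch of the generation step (the real script uses Faker for names and DOBs; those calls are shown commented out here so the stdlib uuid/random parts run on their own, and the field list is abbreviated):

```python
import random
import uuid

# Sketch of generating fake patients. The real sql_dummy_data.py uses the
# faker package for names/DOBs; those calls are commented out so the
# stdlib parts (uuid for MRNs, random for the rest) run standalone.
# from faker import Faker
# fake = Faker()

def make_fake_patient():
    return {
        "mrn": str(uuid.uuid4())[:8],  # short pseudo-unique MRN
        # "first_name": fake.first_name(),
        # "last_name": fake.last_name(),
        # "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        "zip_code": str(random.randint(10000, 99999)),
        "gender": random.choice(["male", "female"]),
    }

fake_patients = [make_fake_patient() for _ in range(10)]
# df_fake_patients = pd.DataFrame(fake_patients)  # then insert as shown above
print(fake_patients[0]["mrn"])
```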
  5. Insert fake conditions (ICD10 codes)

    • ########## INSERTING IN FAKE CONDITIONS ##########
      insertQuery = "INSERT INTO production_conditions (icd10_code, icd10_description) VALUES (%s, %s)"
      startingRow = 0
      for index, row in icd10codesShort_1k.iterrows():
          startingRow += 1
          print('startingRow: ', startingRow)
          db.execute(insertQuery, (row['CodeWithSeparator'], row['ShortDescription']))
          print("inserted row db_gcp: ", index)
          # stop once we have 100 rows
          if startingRow == 100:
              break
      # query the db to see if the data is there
      df_gcp = pd.read_sql_query("SELECT * FROM production_conditions", db)
      df_gcp
  6. Create fake patient conditions and insert randomly to patients

    • ####### CREATING FAKE PATIENT CONDITIONS ###########
      # first, query production_conditions and production_patients to get the ids
      df_conditions = pd.read_sql_query("SELECT icd10_code FROM production_conditions", db)
      df_patients = pd.read_sql_query("SELECT mrn FROM production_patients", db)
      # create a stacked dataframe that gives each patient a random number of conditions between 1 and 5
      df_patient_conditions = pd.DataFrame(columns=['mrn', 'icd10_code'])
      # for each patient, take a random sample of conditions from df_conditions and place it in df_patient_conditions
      for index, row in df_patients.iterrows():
          # get a random sample of 1-5 conditions from df_conditions
          df_conditions_sample = df_conditions.sample(n=random.randint(1, 5))
          # add the mrn to the df_conditions_sample
          df_conditions_sample['mrn'] = row['mrn']
          # append the df_conditions_sample to df_patient_conditions
          # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
          df_patient_conditions = pd.concat([df_patient_conditions, df_conditions_sample])
      print(df_patient_conditions.head(20))
      ## INSERTING RANDOM CONDITIONS INTO PATIENTS ##
      insertQuery = "INSERT INTO production_patient_conditions (mrn, icd10_code) VALUES (%s, %s)"
      for index, row in df_patient_conditions.iterrows():
          db.execute(insertQuery, (row['mrn'], row['icd10_code']))
          print("inserted row: ", index)
  7. Repeat for the medications, treatment procedures, and social determinants tables

  8. Verify tables are created in the MySQL console

      show tables;
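The repetition in step 7 can also be factored into a single helper that pairs each code dataframe with its target table. A sketch (the table/column pairings are assumptions about the scripts, not their exact contents; `db` would be the SQLAlchemy engine):

```python
# Illustrative sketch of factoring out step 7: one helper builds the
# parameterized INSERT for each code table, and one commented-out loop
# would drive the inserts. Table/column names are assumptions.

def build_insert_query(table, columns):
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

# def insert_rows(db, df, table, columns, limit=100):
#     query = build_insert_query(table, columns)
#     for i, (_, row) in enumerate(df.iterrows()):
#         db.execute(query, tuple(row[c] for c in columns))
#         if i + 1 == limit:  # stop once we have `limit` rows, as in step 5
#             break

query = build_insert_query("production_medications", ["med_ndc", "med_human_name"])
print(query)
```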
    


4. Create an Entity Relationship Diagram (ERD) for my DB design using MySQL Workbench

  • Must have at least two foreign keys representing a relationship between at least 2 tables.

  • Shows the rows and columns created within the tables

  • Shows an elevated perspective of the Schemas view

  • In MySQL Workbench, select:

    •   Database > Reverse Engineer
        Load the stored host credentials and enter the password
        Select the patient_portal db
        Select Import MySQL Table Objects
        Check: Place imported objects on a diagram
        Execute
        Select: Reverse Engineer Selected Objects and Place Objects on Diagram
        Reorganize the diagram by dragging tables
        Save and export as .PNG
      


5. images folder

- screenshot of an ERD of my proposed setup (made with either PopSQL or MySQL Workbench)
- screenshots of me connected to the SQL server, performing the following queries:
    - Query 1: show databases (e.g., show databases;)
    - Query 2: all of the tables from my database (e.g., show tables;)
    - Query 3: select * from patient_portal.medications
    - Query 4: select * from patient_portal.treatment_procedures
    - Query 5: select * from patient_portal.conditions


6. Consider backups

  • Hot backups: read replicas on GCP; you can fail over to the second database immediately
  • Cold backups: typically .csv exports that need to be restored into another MySQL environment
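For the cold-backup option, the tables could be dumped to .csv with pandas. A sketch (assumes `db` is the SQLAlchemy engine used by the dummy-data scripts; the file-name pattern is illustrative):

```python
# Sketch of a cold backup: export each table to a .csv that can later be
# loaded into another MySQL environment. Assumes `db` is the SQLAlchemy
# engine used elsewhere in this repo.
tables = [
    "production_patients",
    "production_medications",
    "production_conditions",
    "production_treatment_procedures",
    "production_social_determinants",
]

backup_files = [f"backup_{t}.csv" for t in tables]
# for t, path in zip(tables, backup_files):
#     pd.read_sql_query(f"SELECT * FROM {t}", db).to_csv(path, index=False)
print(backup_files[0])
```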