Centre for Computational Biology - University of Birmingham (03.2020 - 06.2020)

Stage 1

  1. Data visualization, NLP
  2. Take raw text as input and analyse it with NLP methods (tokenization, punctuation removal, stop words, stemming, lemmatization, etc.); see the sketch after this list
  3. Text visualization
  4. Find how much the text is reduced by the preprocessing
  5. Find how big the dictionary (vocabulary) is
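
A minimal preprocessing sketch using NLTK, assuming an English corpus; the example sentence is only illustrative. It shows the tokenize/stop-word/lemmatize pipeline and the two measurements (text reduction and dictionary size):

```python
# Minimal NLP preprocessing sketch with NLTK (example text is a placeholder).
# One-time downloads may be needed: nltk.download("punkt"), "stopwords", "wordnet".
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    """Tokenize, drop punctuation and stop words, then lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

raw = "Antibiotic resistance is rising to dangerously high levels in all parts of the world."
clean = preprocess(raw)

# How much the text was reduced, and how big the dictionary (vocabulary) is.
print("reduction:", 1 - len(clean) / len(word_tokenize(raw)))
print("dictionary size:", len(set(clean)))
```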

Stage 2

  1. Train word2vec models to find robust word networks
  2. Measure the variation in word distances
  3. Vectorize words
  4. Sentence -> tokenize -> count frequency
  5. Train a word2vec neural network
  6. Visualize the results
  7. n-dimensional vectors -> 2-dimensional vectors -> visualize (see the sketch after this list)
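
A small sketch of this stage, assuming gensim 4 and a toy tokenized corpus: train word2vec, query a word distance, then project the n-dimensional vectors to 2D with PCA for plotting.

```python
# Sketch: train word2vec on tokenized sentences and project the vectors to 2D.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

sentences = [
    ["antibiotic", "resistance", "is", "a", "global", "threat"],
    ["bacteria", "develop", "resistance", "to", "antibiotics"],
    ["new", "antibiotics", "are", "needed"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100, seed=1)

# Word distance: cosine similarity between two embeddings.
print(model.wv.similarity("antibiotic", "resistance"))

# n-dimensional vectors -> 2 dimensions -> scatter plot.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```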

Stage 3

  1. Fetch all articles from PubMed matching the keyword "antibiotic resistant"
  2. Parse the articles (title, year, abstract)
  3. Save the data as JSON (see the sketch after this list)
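
A sketch of the fetch step using Biopython's Entrez and Medline modules; the email address, the retmax limit, and the output file name are placeholders, and pagination over the full result set is omitted.

```python
# Sketch: search PubMed, fetch MEDLINE records, keep title/year/abstract, save as JSON.
import json
from Bio import Entrez, Medline

Entrez.email = "your.name@example.com"  # placeholder; NCBI requires a contact email

search = Entrez.read(Entrez.esearch(db="pubmed", term="antibiotic resistant", retmax=200))
ids = search["IdList"]

handle = Entrez.efetch(db="pubmed", id=ids, rettype="medline", retmode="text")
records = Medline.parse(handle)

articles = [
    {"title": r.get("TI"), "year": (r.get("DP") or "")[:4], "abstract": r.get("AB")}
    for r in records
]

with open("articles.json", "w") as f:
    json.dump(articles, f, indent=2)
```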

Stage 4

  1. Read all articles and count country mentions with geotext and pycountry (see the sketch after this list)
  2. Filter by publication type and exclude reviews
  3. Train a word2vec model on the filtered text to vectorize it and obtain word embeddings
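
A sketch of the country-counting and review-filtering steps, assuming the JSON layout from Stage 3; the `publication_types` field is hypothetical and stands in for however the publication type was stored.

```python
# Sketch: count country mentions per article with geotext, normalise names with
# pycountry, and skip reviews.
import json
from collections import Counter
from geotext import GeoText
import pycountry

with open("articles.json") as f:
    articles = json.load(f)

counts = Counter()
for art in articles:
    # Hypothetical field: skip reviews if a publication type was stored with the article.
    if "Review" in (art.get("publication_types") or []):
        continue
    text = " ".join(filter(None, [art.get("title"), art.get("abstract")]))
    for name in GeoText(text).countries:
        try:
            counts[pycountry.countries.lookup(name).alpha_3] += 1
        except LookupError:
            pass  # mention could not be resolved to an ISO country

print(counts.most_common(10))
```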

Stage 5

  1. Visualise the data and extract insights from it
  2. Write an interactive geographical map that shows the number of studies per country
  3. Use bokeh and seaborn to develop the visualization tool (see the sketch after this list)
  4. How much is antimicrobial resistance reported at different geographical scales over time?
  5. How does the emergence of AMR vary across time for different classes of antimicrobials?
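
A minimal sketch of the map, assuming a world-boundaries GeoJSON file (here `countries.geojson`, with `name` and `iso_a3` properties) and a placeholder study-count dict; seaborn would then cover the time-series views.

```python
# Sketch: join per-country study counts onto world geometry and render a bokeh choropleth.
import geopandas as gpd
from bokeh.io import show
from bokeh.models import GeoJSONDataSource, LinearColorMapper, HoverTool
from bokeh.palettes import Viridis256
from bokeh.plotting import figure

counts = {"GBR": 120, "USA": 300, "IND": 95}  # placeholder: ISO alpha-3 -> study count

world = gpd.read_file("countries.geojson")  # placeholder path to a world boundaries file
world["studies"] = world["iso_a3"].map(counts).fillna(0)

source = GeoJSONDataSource(geojson=world.to_json())
mapper = LinearColorMapper(palette=Viridis256, low=0, high=float(world["studies"].max()))

p = figure(title="Number of AMR studies per country", width=900, height=500)
p.patches("xs", "ys", source=source, line_color="grey",
          fill_color={"field": "studies", "transform": mapper})
p.add_tools(HoverTool(tooltips=[("Country", "@name"), ("Studies", "@studies")]))
show(p)
```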

Stage 6

  1. Data preparation and data integration
  2. Unsupervised clustering
  3. Simple linear regression
  4. How can we relate a country's publication output to its GDP? (See the sketch after this list.)
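
A sketch of the clustering and regression steps with scikit-learn; the small country table is placeholder data standing in for the merged publication/GDP table.

```python
# Sketch: cluster countries on publications vs GDP, then fit a simple linear regression.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "country": ["GBR", "USA", "IND", "BRA", "ZAF"],
    "publications": [120, 300, 95, 40, 25],          # placeholder values
    "gdp_per_capita": [42000, 63000, 2100, 8700, 6000],
})

# Unsupervised clustering on the two numeric columns.
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    df[["publications", "gdp_per_capita"]]
)

# Simple linear regression: publications as a function of GDP per capita.
reg = LinearRegression().fit(df[["gdp_per_capita"]], df["publications"])
print("slope:", reg.coef_[0], "R^2:", reg.score(df[["gdp_per_capita"]], df["publications"]))
```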

Stage 7

  1. Prepare the input data
  2. Read all WDI (World Development Indicators) Excel files, filter by country, and merge them
  3. Create a correlation matrix and visualize it with R (Spearman test and hierarchical clustering); see the sketch after this list
  4. Compute the correlation between vector distance and metadata difference
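
The correlation step was done in R in the project; below is an equivalent Python sketch of the merge-and-correlate workflow. The directory `wdi/*.xlsx`, the `Country Code` column name, and the country list are assumptions about the WDI file layout.

```python
# Sketch: read WDI Excel files, filter to the countries of interest, merge, then compute a
# Spearman correlation matrix and show it as a hierarchically clustered heatmap.
import glob
import pandas as pd
import seaborn as sns

countries = ["GBR", "USA", "IND"]  # placeholder selection

frames = []
for path in glob.glob("wdi/*.xlsx"):          # assumed location of the WDI files
    df = pd.read_excel(path)
    frames.append(df[df["Country Code"].isin(countries)])
merged = pd.concat(frames, ignore_index=True)

# Spearman correlation between the numeric indicators.
corr = merged.select_dtypes("number").corr(method="spearman")

# Hierarchically clustered heatmap of the correlation matrix.
sns.clustermap(corr, cmap="vlag", center=0)
```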

Stage 8

  1. Create and train random forest and decision tree models
  2. RMSE measures the model's prediction error; since this is regression rather than classification, the aim is to predict a continuous value
  3. Compute feature importance: how much each feature contributes to the model's predictions
  4. See which parameters matter most and visualize the importance of the features (see the sketch after this list)
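
A sketch of the random forest regression, the RMSE calculation, and the feature-importance plot; `X` and `y` here are synthetic placeholders for the merged country/indicator table, and the column names are illustrative only.

```python
# Sketch: fit a random forest regressor, report RMSE on a held-out split,
# and plot feature importances.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["gdp_per_capita", "health_spend", "urban_pop", "literacy"])
y = 2 * X["gdp_per_capita"] + 0.5 * X["health_spend"] + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# RMSE: root of the mean squared error on the test split.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print("RMSE:", rmse)

# Which features the model relies on most for its predictions.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values()
importance.plot.barh(title="Feature importance")
```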

Stage 9

  1. Find local importance using SHAP
  2. First run a clustering algorithm to capture information about the clusters, then a dimension-reduction algorithm
  3. Check which factors are most important for each country
  4. Create a SHAP summary plot
  5. Visualize local importance for selected countries (see the sketch after this list)
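
A sketch of the SHAP step, assuming the `model` and `X_test` from the Stage 8 sketch; TreeExplainer gives one row of SHAP values per sample, so the global summary plot and the per-country (local) explanations come from the same array.

```python
# Sketch: explain the random forest with SHAP, globally and for a single row.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global view: which factors matter most overall.
shap.summary_plot(shap_values, X_test)

# Local view: why the model predicts what it does for one country (one row of X_test).
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0], matplotlib=True)
```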