Centre for Computational Biology - University of Birmingham (03.2020 - 06.2020)

Stage 1

  1. Data visualization, NLP
  2. Take raw text as input and analyse it with NLP methods (tokenization, punctuation removal, stop words, stemming, lemmatization, etc.); see the sketch after this list
  3. Text visualization
  4. Find how much the text is reduced by the preprocessing
  5. Find how big the dictionary (vocabulary) is
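
A minimal preprocessing sketch using NLTK, assuming an English corpus; the example sentence is only illustrative. It shows the tokenize/stop-word/lemmatize pipeline and the two measurements (text reduction and dictionary size):

```python
# Minimal NLP preprocessing sketch with NLTK (example text is a placeholder).
# One-time downloads may be needed: nltk.download("punkt"), "stopwords", "wordnet".
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    """Tokenize, drop punctuation and stop words, then lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in string.punctuation]
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

raw = "Antibiotic resistance is rising to dangerously high levels in all parts of the world."
clean = preprocess(raw)

# How much the text was reduced, and how big the dictionary (vocabulary) is.
print("reduction:", 1 - len(clean) / len(word_tokenize(raw)))
print("dictionary size:", len(set(clean)))
```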

Stage 2

  1. Train word2vec models to find robust word networks
  2. Measure the variation in word distances
  3. Vectorize words
  4. Sentence -> tokenize -> count frequency
  5. Train a word2vec neural network
  6. Visualize the results
  7. n-dimensional vectors -> 2-dimensional vectors -> visualize (see the sketch after this list)
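
A small sketch of this stage, assuming gensim 4 and a toy tokenized corpus: train word2vec, query a word distance, then project the n-dimensional vectors to 2D with PCA for plotting.

```python
# Sketch: train word2vec on tokenized sentences and project the vectors to 2D.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

sentences = [
    ["antibiotic", "resistance", "is", "a", "global", "threat"],
    ["bacteria", "develop", "resistance", "to", "antibiotics"],
    ["new", "antibiotics", "are", "needed"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100, seed=1)

# Word distance: cosine similarity between two embeddings.
print(model.wv.similarity("antibiotic", "resistance"))

# n-dimensional vectors -> 2 dimensions -> scatter plot.
words = list(model.wv.index_to_key)
coords = PCA(n_components=2).fit_transform(model.wv[words])
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```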

Stage 3

  1. Fetch all articles from PubMed matching the keyword "antibiotic resistant"
  2. Parse the articles (title, year, abstract)
  3. Save the data as JSON (see the sketch after this list)
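
A sketch of the fetch step using Biopython's Entrez and Medline modules; the email address, the retmax limit, and the output file name are placeholders, and pagination over the full result set is omitted.

```python
# Sketch: search PubMed, fetch MEDLINE records, keep title/year/abstract, save as JSON.
import json
from Bio import Entrez, Medline

Entrez.email = "your.name@example.com"  # placeholder; NCBI requires a contact email

search = Entrez.read(Entrez.esearch(db="pubmed", term="antibiotic resistant", retmax=200))
ids = search["IdList"]

handle = Entrez.efetch(db="pubmed", id=ids, rettype="medline", retmode="text")
records = Medline.parse(handle)

articles = [
    {"title": r.get("TI"), "year": (r.get("DP") or "")[:4], "abstract": r.get("AB")}
    for r in records
]

with open("articles.json", "w") as f:
    json.dump(articles, f, indent=2)
```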

Stage 4

  1. Read all articles and count country mentions with geotext and pycountry (see the sketch after this list)
  2. Filter by publication type and exclude reviews
  3. Train a word2vec model on the filtered text to vectorize it and obtain word embeddings
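
A sketch of the country-counting and review-filtering steps, assuming the JSON layout from Stage 3; the `publication_types` field is hypothetical and stands in for however the publication type was stored.

```python
# Sketch: count country mentions per article with geotext, normalise names with
# pycountry, and skip reviews.
import json
from collections import Counter
from geotext import GeoText
import pycountry

with open("articles.json") as f:
    articles = json.load(f)

counts = Counter()
for art in articles:
    # Hypothetical field: skip reviews if a publication type was stored with the article.
    if "Review" in (art.get("publication_types") or []):
        continue
    text = " ".join(filter(None, [art.get("title"), art.get("abstract")]))
    for name in GeoText(text).countries:
        try:
            counts[pycountry.countries.lookup(name).alpha_3] += 1
        except LookupError:
            pass  # mention could not be resolved to an ISO country

print(counts.most_common(10))
```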

Stage 5

  1. Visualise the data and extract insights from it
  2. Write an interactive geographical map that shows the number of studies per country
  3. Use bokeh and seaborn to develop the visualization tool (see the sketch after this list)
  4. How much is antimicrobial resistance reported at different geographical scales over time?
  5. How does the emergence of AMR vary across time for different classes of antimicrobials?
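
A minimal sketch of the map, assuming a world-boundaries GeoJSON file (here `countries.geojson`, with `name` and `iso_a3` properties) and a placeholder study-count dict; seaborn would then cover the time-series views.

```python
# Sketch: join per-country study counts onto world geometry and render a bokeh choropleth.
import geopandas as gpd
from bokeh.io import show
from bokeh.models import GeoJSONDataSource, LinearColorMapper, HoverTool
from bokeh.palettes import Viridis256
from bokeh.plotting import figure

counts = {"GBR": 120, "USA": 300, "IND": 95}  # placeholder: ISO alpha-3 -> study count

world = gpd.read_file("countries.geojson")  # placeholder path to a world boundaries file
world["studies"] = world["iso_a3"].map(counts).fillna(0)

source = GeoJSONDataSource(geojson=world.to_json())
mapper = LinearColorMapper(palette=Viridis256, low=0, high=float(world["studies"].max()))

p = figure(title="Number of AMR studies per country", width=900, height=500)
p.patches("xs", "ys", source=source, line_color="grey",
          fill_color={"field": "studies", "transform": mapper})
p.add_tools(HoverTool(tooltips=[("Country", "@name"), ("Studies", "@studies")]))
show(p)
```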

Stage 6

  1. Data preparation and data integration
  2. Unsupervised clustering
  3. Simple linear regression
  4. How can we relate a country's publication output to its GDP? (See the sketch after this list.)
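
A sketch of the clustering and regression steps with scikit-learn; the small country table is placeholder data standing in for the merged publication/GDP table.

```python
# Sketch: cluster countries on publications vs GDP, then fit a simple linear regression.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "country": ["GBR", "USA", "IND", "BRA", "ZAF"],
    "publications": [120, 300, 95, 40, 25],          # placeholder values
    "gdp_per_capita": [42000, 63000, 2100, 8700, 6000],
})

# Unsupervised clustering on the two numeric columns.
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    df[["publications", "gdp_per_capita"]]
)

# Simple linear regression: publications as a function of GDP per capita.
reg = LinearRegression().fit(df[["gdp_per_capita"]], df["publications"])
print("slope:", reg.coef_[0], "R^2:", reg.score(df[["gdp_per_capita"]], df["publications"]))
```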

Stage 7

  1. Prepare the input data
  2. Read all WDI (World Development Indicators) Excel files, filter by country, and merge them
  3. Create a correlation matrix and visualize it with R (Spearman test and hierarchical clustering); see the sketch after this list
  4. Compute the correlation between vector distance and metadata difference
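
The correlation step was done in R in the project; below is an equivalent Python sketch of the merge-and-correlate workflow. The directory `wdi/*.xlsx`, the `Country Code` column name, and the country list are assumptions about the WDI file layout.

```python
# Sketch: read WDI Excel files, filter to the countries of interest, merge, then compute a
# Spearman correlation matrix and show it as a hierarchically clustered heatmap.
import glob
import pandas as pd
import seaborn as sns

countries = ["GBR", "USA", "IND"]  # placeholder selection

frames = []
for path in glob.glob("wdi/*.xlsx"):          # assumed location of the WDI files
    df = pd.read_excel(path)
    frames.append(df[df["Country Code"].isin(countries)])
merged = pd.concat(frames, ignore_index=True)

# Spearman correlation between the numeric indicators.
corr = merged.select_dtypes("number").corr(method="spearman")

# Hierarchically clustered heatmap of the correlation matrix.
sns.clustermap(corr, cmap="vlag", center=0)
```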

Stage 8

  1. Create and train random forest and decision tree models
  2. RMSE measures the model's prediction error; since this is regression rather than classification, the aim is to predict a continuous value
  3. Compute feature importance: how much each feature contributes to the model's predictions
  4. See which parameters matter most and visualize the importance of the features (see the sketch after this list)
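
A sketch of the random forest regression, the RMSE calculation, and the feature-importance plot; `X` and `y` here are synthetic placeholders for the merged country/indicator table, and the column names are illustrative only.

```python
# Sketch: fit a random forest regressor, report RMSE on a held-out split,
# and plot feature importances.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)),
                 columns=["gdp_per_capita", "health_spend", "urban_pop", "literacy"])
y = 2 * X["gdp_per_capita"] + 0.5 * X["health_spend"] + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# RMSE: root of the mean squared error on the test split.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print("RMSE:", rmse)

# Which features the model relies on most for its predictions.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values()
importance.plot.barh(title="Feature importance")
```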

Stage 9

  1. Find local importance using SHAP
  2. First run a clustering algorithm to capture information about the clusters, then a dimension-reduction algorithm
  3. Check which factors are most important for each country
  4. Create a SHAP summary plot
  5. Visualize local importance for selected countries (see the sketch after this list)
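
A sketch of the SHAP step, assuming the `model` and `X_test` from the Stage 8 sketch; TreeExplainer gives one row of SHAP values per sample, so the global summary plot and the per-country (local) explanations come from the same array.

```python
# Sketch: explain the random forest with SHAP, globally and for a single row.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global view: which factors matter most overall.
shap.summary_plot(shap_values, X_test)

# Local view: why the model predicts what it does for one country (one row of X_test).
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0], matplotlib=True)
```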