MetaClean

The purpose of metaclean is to use large language models to clean up the semi-structured and free-text metadata found in NCBI BioSample records.

There's a long-winded explanation of the motivations here.

Outline of (proposed) work

Workflow

  • Fetch all BioSample data using fetch_biosample.py and validate it
  • Create a table of fields and the database tables using create_db.py
  • Create a prompt for each record using create_prompt.py
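As a rough illustration of the last workflow step, the sketch below turns one record into a prompt. It is not the contents of create_prompt.py, which are not shown here: the function name `build_prompt`, the prompt wording and the example field names are all assumptions made for illustration.

```python
def build_prompt(record):
    """Render one metadata record (a dict) as a plain-text prompt.

    Hypothetical helper: the real create_prompt.py may work differently.
    """
    # Skip empty fields and sort keys so the prompt is reproducible.
    lines = [f"{key}: {value}" for key, value in sorted(record.items()) if value]
    header = (
        "Clean up the following BioSample metadata. "
        "Return one normalised value per field."
    )
    return header + "\n\n" + "\n".join(lines)


# Illustrative record; the field names and values are made up for this example.
example_record = {
    "isolation_source": "chkn breast (retail)",
    "host": "Homo sapiens",
    "collection_date": "",
}
prompt = build_prompt(example_record)
print(prompt)
```

Sorting the keys and dropping empty values keeps prompts short and makes the same record always produce the same prompt, which helps when caching model responses.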

Plan

  • Explore inherent clusters with TF-IDF
  • Review text preprocessing. Tokenising may need to be adjusted.
  • Explore inherent clusters from other vectorisation approaches: gpt, word2vec, glove, mpnet.
  • Tweak data processing based on the clusters above; perhaps some existing EB ones are a little arbitrary.
  • Compare classification results for 'Logistic_Regression', 'Support_Vector_Machine', 'Random_Forest' and 'Decision_Tree', using each vectoriser above (5 vectorisers * 4 classifiers = 20 comparisons).
  • Create a training set of true labels.
  • Repeat classification testing above.
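
The first item of the plan could start from something like the sketch below: a minimal, standard-library-only TF-IDF vectoriser with cosine similarity, enough to see which free-text values sit close together. The tokeniser, the smoothed IDF variant and the sample strings are assumptions chosen for illustration; a real run would more likely use an existing library and the actual BioSample text.

```python
import math
import re
from collections import Counter


def tokenise(text):
    # Deliberately simple: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenised = [tokenise(doc) for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample values standing in for free-text metadata.
samples = [
    "chicken breast, retail",
    "raw chicken thigh",
    "river water sample",
    "surface water from lake",
]
vectors = tfidf_vectors(samples)
for i in range(len(samples)):
    for j in range(i + 1, len(samples)):
        print(f"{samples[i]!r} vs {samples[j]!r}: {cosine(vectors[i], vectors[j]):.2f}")
```

Pairs that share a term (the two chicken samples, the two water samples) get a non-zero similarity, while unrelated pairs score zero. This also shows why the tokenising step matters: misspellings or abbreviations would not match under this tokeniser.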

https://derrickofori015.medium.com/gpt-3-vs-other-text-embeddings-techniques-for-text-classification-a-performance-evaluation-b3a3e6e84cb7

https://www.cdc.gov/foodsafety/ifsac/projects/food-categorization-scheme.html