nih_reporter_ML

Using NIH RePORTER data as a machine learning playground for Databricks, NLP, Azure tools, and collaborative development

Stream Labels

This repo is intended to contain multiple streams (sub-projects or research ideas). Unique stream labels are used as directory names to organise the streams and to match related content across directories. The label "shared" is reserved for code and features common to all streams.

Stream labels should also be used as branch names to aid code management.

Key directories

  doc/                          - documentation
  src/                          - source code
    |_  pipelines/[stream]/     - data / ML pipelines
    |_  notebooks/[stream]/     - exploratory / experimental notebooks
    |_  utils/                  - utility scripts
  test/                         - code for unit or regression testing
    |_ [stream]/                - organised by stream
  out/[stream]/                 - small output files (e.g. plots) generated by code
  data/[stream]/                - small resources or files used by your program
  models/[stream]/              - saved models for deployment
  README.md
  requirements.txt              - use if applicable

Note: Large files (say, > 1 MB) should reside in an external file system such as Databricks DBFS or OneDrive.

Notes for contributors

  1. FORK: Create a fork from the main repo [jtjli/nih_reporter] unless you want to develop on top of an existing fork.
  2. BRANCH: Use a branch name that represents your development, such as a Stream Label. Avoid developing on the main branch.
  3. PULL REQUEST: Create a Pull Request when your code is ready to be merged into the main repo.
  4. Wherever appropriate, use Stream Labels as section headings in files such as .gitignore, the global requirements.txt, and README.

ML from here onwards

A few things I would like to do using NLP:

  1. Use Latent Dirichlet Allocation (LDA) to cluster the grants in an unbiased way and see whether they fall into distinct topics (a.k.a. labels); see the first sketch after this list.
  2. Which topics are trending (based on success rate)?
  3. Can I build a classifier to identify the keywords associated with a successful grant? See the second sketch after this list.
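As a rough starting point for idea 1, here is a minimal LDA sketch using scikit-learn. It assumes the grant records have already been pulled from NIH RePORTER into a pandas DataFrame; the column name abstract_text and the parquet path are hypothetical placeholders, not fields defined elsewhere in this repo.

```python
# Minimal LDA topic-modelling sketch (scikit-learn).
# Assumes grant abstracts live in a DataFrame column named "abstract_text" (hypothetical).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def fit_lda(df: pd.DataFrame, n_topics: int = 10):
    """Fit LDA on grant abstracts; return the model, vectorizer, and doc-topic matrix."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
    dtm = vectorizer.fit_transform(df["abstract_text"].fillna(""))

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(dtm)  # each row is the topic mixture for one grant
    return lda, vectorizer, doc_topics


def top_words(lda, vectorizer, n_words: int = 10):
    """Print the highest-weighted words per topic as candidate topic labels."""
    vocab = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        terms = [vocab[i] for i in weights.argsort()[::-1][:n_words]]
        print(f"Topic {k}: {', '.join(terms)}")


if __name__ == "__main__":
    df = pd.read_parquet("data/shared/grants.parquet")  # hypothetical path
    lda, vec, doc_topics = fit_lda(df, n_topics=12)
    top_words(lda, vec)
```

On Databricks the same idea could be expressed with Spark ML's CountVectorizer and LDA for larger datasets; the scikit-learn version above is just the smallest thing that runs on a sample.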
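For idea 3, one simple baseline is TF-IDF features with a logistic regression classifier, where the largest positive coefficients hint at keywords associated with funded grants. The binary label column "funded" is a hypothetical placeholder; the real outcome field would need to be derived from the RePORTER data.

```python
# Sketch of a "successful grant" keyword baseline (TF-IDF + logistic regression).
# Assumes hypothetical columns "abstract_text" (text) and "funded" (0/1 outcome).
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_keyword_classifier(df: pd.DataFrame, n_keywords: int = 20):
    """Train the classifier, report held-out accuracy, and return top candidate keywords."""
    texts = df["abstract_text"].fillna("")
    y = df["funded"].astype(int)

    vectorizer = TfidfVectorizer(stop_words="english", min_df=5)
    X = vectorizer.fit_transform(texts)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")

    # Words with the largest positive coefficients are most associated with funded grants.
    vocab = vectorizer.get_feature_names_out()
    top_idx = np.argsort(clf.coef_[0])[::-1][:n_keywords]
    return [vocab[i] for i in top_idx]
```

The linear model is chosen for interpretability: its coefficients map directly back to vocabulary terms, which is what the keyword question asks for.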