Using NIH RePORTER data as a machine learning playground for Databricks, NLP, Azure tools, and collaborative development
This repo is intended to contain multiple streams (sub-projects or research ideas). Each stream has a unique label, which is used as a directory name so that a stream's files are organised and can be matched across directories. The label "shared" is reserved for code and features common to all streams.
Stream labels should also be used as branch names to aid code management.
doc/ - documentation
src/ - source code
|_ pipelines/[stream]/ - data / ml pipelines
|_ notebooks/[stream]/ - exploratory / experimental notebooks
|_ utils - utility scripts
test/ - code for unit or regression testing
|_ [stream]/ - organised by streams
out/[stream]/ - small output files (e.g. plots) generated by the code
data/[stream]/ - small resources or files used by your program
models/[stream]/ - saved models for deployment
README.md
requirements.txt - use if applicable
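
As a rough sketch, the per-stream directories in the layout above could be created with a small helper like the one below. The function name `scaffold_stream` and the example label `lda_topics` are hypothetical and not part of this repo, and the notebooks directory is assumed to be spelled `notebooks`.

```python
from pathlib import Path

# Per-stream directory templates, taken from the layout above.
# "{stream}" is replaced by the stream label.
STREAM_DIRS = [
    "src/pipelines/{stream}",
    "src/notebooks/{stream}",
    "test/{stream}",
    "out/{stream}",
    "data/{stream}",
    "models/{stream}",
]


def scaffold_stream(repo_root, stream):
    """Create the per-stream directories under repo_root and return their paths."""
    created = []
    for template in STREAM_DIRS:
        path = Path(repo_root) / template.format(stream=stream)
        # exist_ok=True makes the helper safe to re-run on an existing stream.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `scaffold_stream(".", "lda_topics")` run from the repo root would create `src/pipelines/lda_topics/`, `out/lda_topics/`, and so on. Git does not track empty directories, so they only show up in the repo once they contain files.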
Note: Large files (say, > 1 MB) should reside in an external file system such as Databricks DBFS or OneDrive.
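
One possible way to spot files that exceed the size guideline before committing is sketched below. The helper name `find_large_files` is hypothetical, and the 1,000,000-byte threshold is only an assumed reading of "say, > 1 MB".

```python
from pathlib import Path

# Assumed threshold: the note only says "say, > 1 MB", so treat this as approximate.
MAX_BYTES = 1_000_000


def find_large_files(root, max_bytes=MAX_BYTES):
    """Return (path, size) pairs for files under root larger than max_bytes.

    Anything inside a .git directory is skipped. Results are sorted largest first.
    """
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the repo root lists candidates to move to external storage.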
- FORK: Create a fork of the main repo [jtjli/nih_reporter], unless you want to develop on top of an existing fork.
- BRANCH: Use a branch name that represents your work, such as a Stream Label. Avoid developing on the main branch.
- Create a Pull Request when your code is ready for merging into the main repo.
- Wherever appropriate, use Stream Labels as section headings in files such as .gitignore, the global requirements.txt, and README.
A few things I would like to do using NLP:
- Use Latent Dirichlet Allocation (LDA) to cluster the data (grants) in an unbiased way, and see whether they separate into distinct topics (a.k.a. labels).
- Which topics are trending (based on success rate)?
- Can I build a classifier to identify the keywords of a successful grant?