This repository contains various NLP-related examples.
There are still a lot of TODOs to make this better; here is a list of things to be aware of.
To make the image processing pipeline easier to run, I need to do the following:
- Ensure we have a cluster spun up by Terraform. This is currently blocked by a chicken-and-egg situation where we need to compile poppler as a DBFS dependency.
- Move this into a subdirectory; it should not live at the top level.
- Build a cluster for PyTorch / Tesseract and refactor dependencies. Right now we're hit by the "non-deterministic dependency" bug in our Terraform provider.
- Add README.md
The audio pipeline is simpler than the image processing pipeline, but it still needs a few improvements:
- Add some ML to the end of the pipeline (topic modelling).
- Build some simple analytics.
- Download each MP3 only once.
- Add README.md
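
For the "download once" item in the audio list, one possible approach is a small helper that skips the download when the file is already on disk. This is only a sketch: `download_once`, the `fetch` parameter, and the `.part` temporary suffix are hypothetical names and are not part of the existing pipeline.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def _fetch_bytes(url: str) -> bytes:
    """Default fetcher: read the whole response body into memory."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_once(
    url: str,
    dest: Path,
    fetch: Callable[[str], bytes] = _fetch_bytes,  # hypothetical hook, handy for tests
) -> Path:
    """Download url to dest unless a non-empty file already exists there."""
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # already downloaded, nothing to do

    dest.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(fetch(url))
    tmp.replace(dest)
    return dest
```

Calling `download_once(url, Path("data/audio/episode.mp3"))` twice would hit the network only the first time. The default fetcher holds the whole file in memory, so very large audio files would likely need a streaming variant instead.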