Build a data product that processes streaming data through an end-to-end pipeline that can be scaled easily on request.
- Training a TF-IDF and random forest model using a Pipeline in Spark ML
- Saving models to S3
- Collecting real-time Twitter streams through Kafka
- Integrating Kafka with Spark Streaming
- Loading the saved model to predict on incoming streams in Spark Streaming
- Storing incoming streams in MongoDB from Spark Streaming
- Fetching data from MongoDB and publishing results in a web application via Flask
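
The training step and the shape of a stored record could look roughly like the sketch below. It is a minimal illustration and not the project's actual code: the column names (`text`, `label`), the hyperparameters, the helper names `build_training_pipeline` and `tweet_to_document`, and the tweet JSON fields (`id_str`, `text`) are all assumptions.

```python
import json


def build_training_pipeline(num_features=1 << 16, num_trees=50):
    """Chain Tokenizer -> HashingTF -> IDF -> RandomForestClassifier in one Spark ML Pipeline.

    Assumes an active SparkSession and a training DataFrame with
    "text" and "label" columns (hypothetical names).
    """
    # Imported lazily so the pure-Python helper below works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=num_features)
    idf = IDF(inputCol="tf", outputCol="features")
    rf = RandomForestClassifier(
        featuresCol="features", labelCol="label", numTrees=num_trees
    )
    return Pipeline(stages=[tokenizer, tf, idf, rf])


def tweet_to_document(raw_value, prediction):
    """Turn one Kafka message value (assumed to be a JSON-encoded tweet)
    into a dict that can be inserted into MongoDB."""
    tweet = json.loads(raw_value.decode("utf-8"))
    return {
        "tweet_id": tweet.get("id_str"),
        "text": tweet.get("text", ""),
        "prediction": float(prediction),
    }
```

Under these assumptions, `build_training_pipeline().fit(train_df)` returns a `PipelineModel` that can be written with `model.write().overwrite().save(path)` and restored in the streaming job with `PipelineModel.load(path)`. Pointing `path` at an `s3a://` location assumes the cluster has the Hadoop S3 connector configured.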