/Ensemble_ML_Cloud

Project repo for DSCI_525 Web and Cloud Computing

Primary language: Jupyter Notebook. License: MIT.

Ensemble ML Cloud Computing

About

This project builds and deploys ensemble machine learning models in the cloud to predict daily rainfall in Australia from a large dataset (~6 GB). The features are the outputs of different climate models, and the target is the observed rainfall.

The dataset used in this work is available on figshare and was put together by Dr. Tomas Beuzen of UBC MDS. It contains daily rainfall data from 1889 to 2014 for New South Wales, Australia.
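To make the feature and target layout concrete, here is a minimal sketch of the simplest possible ensemble: averaging the climate model outputs for each day and scoring the result against the observation. The numbers are made up for illustration, and the helper names `rmse` and `mean_ensemble` are hypothetical. This is a baseline under stated assumptions, not the models trained in this repository, which may well use an ensemble regressor from scikit-learn instead.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mean_ensemble(rows):
    """Average the climate model outputs in each row (one row per day)."""
    return [sum(row) / len(row) for row in rows]


# Made-up values for illustration only: each row is one day, each column
# is the rainfall (mm) predicted by one climate model.
features = [
    [0.0, 1.0, 2.0],
    [3.0, 5.0, 4.0],
    [10.0, 8.0, 12.0],
    [0.0, 0.0, 3.0],
]
# Made-up observed rainfall (mm) for the same four days.
observed = [1.5, 4.0, 9.0, 1.0]

ensemble = mean_ensemble(features)
print("Ensemble prediction:", ensemble)
print("Ensemble RMSE:", rmse(ensemble, observed))

# Score each climate model on its own for comparison.
for column in range(len(features[0])):
    single = [row[column] for row in features]
    print(f"Model {column} RMSE:", rmse(single, observed))
```

Comparing the ensemble score with the per-model scores is a quick way to check whether combining the climate models helps at all before training anything heavier.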

This project includes four milestones as described below:

Milestone 1: Getting the data from the web using an API, processing it, and converting it to an efficient file format

Milestone 2: Moving the data to the cloud, setting up the cloud infrastructure, and training the ML model

Milestone 3: Setting up distributed infrastructure (Spark) in the cloud and running the same ML model with Spark

Milestone 4: Deploying the ML model in the cloud so that other consumers can use it
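As a rough illustration of the first step in Milestone 1, the sketch below looks up a file's download link through the public figshare API, which returns article metadata as JSON with a `files` list. The helper names and the article ID passed to them are hypothetical, since the README does not state the dataset's ID. This is one possible approach and not necessarily the code used in the project notebooks.

```python
import json
from urllib.request import urlopen

# Base URL of the public figshare API (v2).
FIGSHARE_API = "https://api.figshare.com/v2/articles"


def article_url(article_id):
    """Build the metadata URL for a figshare article."""
    return f"{FIGSHARE_API}/{article_id}"


def find_download_url(metadata, filename):
    """Return the download URL of `filename` from article metadata.

    `metadata` is the decoded JSON for one article. Its "files" entry is
    a list of dicts with "name" and "download_url" keys.
    """
    for entry in metadata.get("files", []):
        if entry["name"] == filename:
            return entry["download_url"]
    raise KeyError(f"{filename!r} not found in article files")


def fetch_metadata(article_id):
    """Request article metadata from figshare (needs network access)."""
    with urlopen(article_url(article_id)) as response:
        return json.load(response)


# Offline example with hand-written metadata (hypothetical file name and URL).
sample = {
    "files": [
        {"name": "data.zip", "download_url": "https://example.org/data.zip"},
    ]
}
print(find_download_url(sample, "data.zip"))
```

Once the archive is downloaded and extracted, the CSV files could be combined with pandas and written out with `DataFrame.to_parquet`, which relies on pyarrow from the dependency list, to get a compact columnar format.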

Report

The report for Milestone 1: "Tackling big data on a laptop" can be found in a notebook here.

Dependencies

  • Python

  • pandas

  • dask

  • rpy2

  • pyarrow

  • R

  • scikit-learn

License

Contributors

Core contributor      GitHub handle
Neel Phaterpekar      @nphaterp
Arash Shamseddini     @arashshams
Charles Suresh        @charlessuresh

References