ESIP Lab 2021 – Cloud-based Open Science Machine Learning Tutorials for Earth Science

This project is funded through the ESIP Lab Spring 2021 Request for Proposal. It is led by Yuhan (Douglas) Rao at the North Carolina Institute for Climate Studies in collaboration with Chris Slocum.

Project Description

Cloud computing is beginning to accelerate the scientific process by removing barriers associated with collecting and quality-controlling data. It also provides the means to improve scientific workflows and promote research sharing through open-source literate programming tools such as notebooks (e.g., Jupyter and R Markdown). However, this is a seismic shift in how researchers in the Earth science community do their work. Many researchers are reluctant to adopt cloud computing because of the hurdles of getting started, the lack of domain-specific examples, unclear costs, and the plethora of vendor-specific Application Programming Interfaces (APIs).

There are some existing efforts to create such notebooks to promote the adoption of cloud computing and open-source Artificial Intelligence (AI) tools. However, these efforts are usually by-products of a specific research project and are developed by the researchers themselves. The notebook development process typically does not directly engage potential users, which may reduce the value and impact of the final notebooks. While developing interactive machine learning tutorials supported by ESIP Funding Friday, we found that training materials were more useful and impactful when potential users were engaged in the development process. Additionally, many existing notebooks do not necessarily follow best practices in cloud computing and AI applications (e.g., provenance, reproducibility, and content accessibility).

The project proposes to create well-documented notebooks that show how to collect, distribute, process, and analyze geophysical datasets with open-source AI tools. The development process will actively engage potential users to identify the learning topics in highest demand and will seek user feedback throughout development. Additionally, all notebooks will follow and highlight community best practices for cloud computing and AI applications. This project will build a scalable workflow and infrastructure on the open science ecosystem (i.e., Jupyter, Python, R, Google Colaboratory, Binder Project, and GitHub) and will enable community contributions through notebook templates, contribution guidelines, and automated evaluation tools.

To demonstrate the diversity of cloud computing resources and public Earth science data, we will develop notebooks that use services and geophysical data available through several cloud computing vendors' APIs (e.g., Amazon AWS, Google Cloud Storage/Earth Engine) and datasets from government agencies that have moved portions of their data holdings to the cloud (e.g., NOAA’s Big Data Project, NASA’s Earthdata Cloud Evolution, USGS’s Cloud Hosting Solutions). We will also leverage community-driven tools for open, reproducible, and scalable science, such as the Pangeo software ecosystem, in the notebook development process.

To create notebooks that are relevant to users with different technical backgrounds, the project will follow the concept of a “learning journey”: a series of progressive notebooks, each suited to a different level of technical knowledge. The learning journey lets us break a complicated learning process into manageable pieces so that potential users can learn more effectively. Users can start their own learning journey at the entry point of their choice. The main learning objective of the project team is to identify the best practices and tools for making interactive notebooks accessible to all users by incorporating the Web Content Accessibility Guidelines (WCAG) developed by the World Wide Web Consortium (W3C).
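As one illustration of what an accessibility check for notebooks might look like, the sketch below scans the markdown cells of a notebook for images that lack alternative text, which WCAG asks for under its "Non-text Content" success criterion (1.1.1). It is a minimal sketch and not the project's actual tooling: the function names, the regular expressions, and the example notebook are all hypothetical, and it assumes only the standard notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`).

```python
import re

# Markdown image syntax: ![alt text](target "optional title")
MD_IMAGE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)[^)]*\)")
# Raw HTML <img ...> tags embedded in markdown cells
HTML_IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# An alt="..." or alt='...' attribute inside an <img> tag
HTML_ALT = re.compile(r"\balt\s*=\s*(['\"])(?P<alt>.*?)\1", re.IGNORECASE | re.DOTALL)


def cell_text(cell):
    """Return a cell's source as one string (it may be stored as a list of lines)."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def images_missing_alt(notebook):
    """Hypothetical helper: list (cell_index, snippet) for images without alt text.

    Empty alt text can be valid for purely decorative images, so the results
    are candidates for human review rather than definite errors.
    """
    problems = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue  # only markdown cells are inspected in this sketch
        text = cell_text(cell)
        for match in MD_IMAGE.finditer(text):
            if not match.group("alt").strip():
                problems.append((index, match.group(0)))
        for match in HTML_IMG.finditer(text):
            alt = HTML_ALT.search(match.group(0))
            if alt is None or not alt.group("alt").strip():
                problems.append((index, match.group(0)))
    return problems


# Made-up notebook content, shaped like the JSON stored in an .ipynb file.
example_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "![](figures/sst_map.png)\n"]},
        {"cell_type": "markdown", "source": "![Map of sea surface temperature](figures/sst_map.png)"},
        {"cell_type": "code", "source": "print('![](ignored.png)')"},
        {"cell_type": "markdown", "source": "<img src=\"figures/loss.png\" width=\"400\">"},
    ]
}

for cell_index, snippet in images_missing_alt(example_notebook):
    print(f"cell {cell_index}: image without alt text: {snippet}")
```

For a real notebook, the dictionary could be obtained with `json.load` on the `.ipynb` file. A check like this covers only one narrow aspect of WCAG; figures produced by code cells, color contrast, and heading structure would each need their own treatment.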