/oci-data-science-ai-samples

This repo contains a series of tutorials and code examples highlighting different features of the OCI Data Science and AI services, along with a release vehicle for experimental programs.

Primary LanguageJupyter NotebookUniversal Permissive License v1.0UPL-1.0

Oracle Cloud Infrastructure Data Science and AI services Examples

The Oracle Cloud Infrastructure (OCI) Data Science service has created this repo to make demos, tutorials, and code examples that highlight various features of the OCI Data Science service and AI services. We welcome your feedback and would like to know what content is useful and what content is missing. Open an issue to do this. We know that a lot of you are creating great content and we would like to help you share it. See the contributions document.

Oracle Cloud Infrastructure (OCI) Data Science Services provide a powerful suite of tools for data scientists, enabling faster machine learning model development and deployment. With features like the Accelerated Data Science (ADS) SDK, distributed training, batch processing and machine learning pipelines, OCI Data Science Services offer the scalability and flexibility needed to tackle complex data science and machine learning challenges. Whether you're a beginner or an experienced machine learning practitioner or data scientist, OCI Data Science Services provide the resources you need to build, train, and deploy your models with ease.

Topics

The Accelerated Data Science (ADS) SDK is a data scientist-friendly library that speeds up common data science tasks and provides an interface to other OCI services. In this section, we provide JupyterLab notebooks that offer tutorials on how to use ADS. For example, the vault.ipynb notebook demonstrates how easy it is to store your secrets in the OCI Vault service.

The OCI Data Science service uses conda environments to manage available libraries that a notebook can use. OCI Data Science provides several conda environments designed to offer the best libraries for common data science tasks. Each family of conda environments has notebooks that demonstrate how to perform various data science tasks. This section is organized around these conda environment families and provides notebooks to help you get started quickly.

This section provides examples of how to train machine learning models and deploy them on the OCI Data Science service, making it ideal for anyone looking to walk through an end-to-end problem.

OCI Data Science supports LLMs in several ways:

  • Fine-tune, deploy and evaluate without writing code via AI Quick Actions
  • LangChain integration via ADS
  • Directly by coding Python in the service. You can find more information here

The Model Catalog offers a managed and centralized storage space for models. ADS helps you create the artifacts you need to use this service. However, you must provide a score.py file that loads the model and a function that makes predictions. The runtime.yaml provides information about the runtime conda environment if you want to deploy the model. You can also document a comprehensive set of metadata about the provenance of the model. This section provides examples of how to create your score.py and runtime.yaml files for various common machine learning models and configurations.

Oracle Cloud Infrastructure (OCI) Data Science Jobs is a powerful tool that allows you to define and run repeatable machine learning tasks on a fully managed infrastructure. With Jobs, you have the flexibility to apply custom tasks to meet your specific use cases, such as data preparation, model training, hyperparameter optimization, batch inference, large model training and more.

On-demand jobs and batch processing are especially important for businesses that need to process large volumes of data on a regular basis, as they enable companies to automate data processing workflows, reduce the need for manual intervention, and save costs associated with running compute resources for extended periods of time. With the ability to define and schedule jobs to run at specific times, businesses can automate their data processing workflows and reduce the need for manual intervention. This helps to improve efficiency, reduce errors, and save valuable time and resources. Additionally, by using a fully managed infrastructure, businesses can ensure that their data processing workflows are secure and compliant with industry regulations. Overall, OCI Data Science Jobs is a powerful tool that can help businesses to scale their machine learning workflows and improve their data processing capabilities.

Distributed training support with Jobs for machine learning for faster and more efficient model training on large datasets, allowing for more complex models and larger workloads to be handled. Distributed training could be used when the size of the dataset or the complexity of the model makes it difficult or impossible to train on a single machine, and when there is a need for faster model training to keep up with changing data or business requirements. This section describes our support for distributed training with Jobs for the following frameworks: Dask, Horovod, TensorFlow Distributed, and PyTorch Distributed.

Pipelines are essential for complex machine learning and data science tasks as they streamline and automate the model building and deployment process, enabling faster and more consistent results. They could be used when there is a need to build, train, and deploy complex models with multiple components and steps, and when there is a need to automate the machine learning process to reduce manual labor and errors. The Oracle Cloud Infrastructure Data Science Pipelines services helps automates and streamlines the process of building and deploying machine learning pipelines.

The data labeling service helps identify properties (labels) of documents, text, and images (records) and annotates (labels) them with those properties. This section contains Python and Java scripts to annotate bulk numbers of records in OCI Data Labeling Service (DLS).

The OCI Data Science service offers managed notebook(jupyterlab) sessions. Notebook lifecycle script features execute the customer provided scripts during CREATE/ACTIVATE/DEACTIVATE/DELETE notebook session lifecycle. This folder contains the examples script which needs little to no editing and ready to be used as lifecycle scripts input.

The Feature store service solves many of the problems because it is a centralized way to transform and access data for training and serving time, Feature stores help define a standardised pipeline for ingestion of data and querying of data.

Documentation

Check out the following resources for more information about the OCI Data Science and AI services:

Need Help?

Contributing

This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.

Security

The Security Guide contains information about security vulnerability disclosure process. If you discover a vulnerability, consider filing an issue.

License

Copyright (c) 2021, 2023 Oracle and/or its affiliates.

Released under the Universal Permissive License v1.0 as shown at https://oss.oracle.com/licenses/upl/.