- Currently, data lakes comprising Oracle Data Warehouse and Apache Spark have these characteristics:
  - They have separate data catalogs, even if they access the same data in an object store.
  - Applications built entirely on Spark have to compensate for gaps in data management.
  - Applications that federate across Spark and Oracle usually suffer from inefficient data movement.
  - Operating Spark clusters is expensive because Spark lacks administration tooling and has gaps in data management. As a result, the price-performance advantages of Spark are overstated.
This project fixes those issues:
- It provides a single catalog: Oracle Data Dictionary.
- Oracle is responsible for data management, including:
  - Consistency
  - Isolation
  - Security
  - Storage layout
  - Data lifecycle
  - Data in an object store, managed by Oracle as external tables
- It provides support for a full Spark programming model.
- Spark on Oracle has these characteristics:
  - Full pushdown of SQL workloads: queries, DML on all tables, and DDL for external tables.
  - Pushdown of the SQL operations in other workloads.
  - Oracle capabilities such as machine learning and streaming are surfaced in the Spark programming model.
  - A co-processor on Oracle instances runs certain kinds of Scala code. Co-processors are isolated and limited, and are therefore easy to manage.
  - Simpler, smaller Spark clusters.
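To make the idea of full pushdown concrete, here is a minimal conceptual sketch in Python. It is not the project's translator: `PlanNode`, `TRANSLATABLE`, `fully_pushable`, and `to_sql` are hypothetical names, and the plan model is a toy linear chain of operators. It only shows the general principle that a plan can be collapsed into one SQL statement when every operator in it has a SQL equivalent.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical set of operators this toy translator can express in SQL.
TRANSLATABLE = {"scan", "filter", "project"}


@dataclass
class PlanNode:
    """One operator in a linear (single-child) toy query plan."""
    op: str                              # "scan", "filter", "project", ...
    arg: str                             # table name, predicate, or column list
    child: Optional["PlanNode"] = None   # operator that feeds this one


def fully_pushable(node: Optional[PlanNode]) -> bool:
    """Return True only if every operator in the plan is translatable."""
    while node is not None:
        if node.op not in TRANSLATABLE:
            return False
        node = node.child
    return True


def to_sql(node: PlanNode) -> str:
    """Collapse a fully pushable plan into a single SELECT statement."""
    if not fully_pushable(node):
        raise ValueError("plan contains an operator that cannot be pushed down")
    columns: Optional[str] = None
    predicates: List[str] = []
    table: Optional[str] = None
    while node is not None:
        if node.op == "project" and columns is None:
            columns = node.arg           # the outermost projection wins
        elif node.op == "filter":
            predicates.append(node.arg)
        elif node.op == "scan":
            table = node.arg
        node = node.child
    sql = f"SELECT {columns or '*'} FROM {table}"
    if predicates:
        # Predicates were collected top-down; emit them innermost first.
        sql += " WHERE " + " AND ".join(f"({p})" for p in reversed(predicates))
    return sql


# A project over a filter over a table scan (illustrative table and columns).
plan = PlanNode("project", "c_name",
                PlanNode("filter", "c_balance > 100",
                         PlanNode("scan", "customer")))
pushed_sql = to_sql(plan)
```

In this toy model, a plan that contains any operator outside `TRANSLATABLE` is not fully pushable, so the corresponding work would have to run in Spark instead of in the database.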
Feature summary:
- Catalog integration. (See this page.)
- Significant support for SQL pushdown: more than 95 of the 99 TPC-DS queries are pushed completely to the Oracle instance. (See the Operator and Expression translation pages.)
- Deployable as a Spark extension jar for Spark 3 environments.
- Language integration beyond SQL and DML support.
See Project Wiki for complete documentation.
Spark on Oracle can be deployed in any environment running Spark 3.1 or later. See the Quick Start Guide.
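As a rough illustration of what deploying a Spark extension jar usually involves, the Python sketch below assembles the relevant Spark properties and renders them as `--conf` arguments. The keys `spark.jars`, `spark.sql.extensions`, and `spark.sql.catalog.<name>` are standard Spark settings. Everything else is an assumption made for illustration: the helper names, the `url` catalog option, the jar path, the class names, and the JDBC URL are placeholders, not values taken from this project. The Quick Start Guide is the authoritative source for the real settings.

```python
from typing import Dict, List


def extension_conf(jar_path: str, extension_class: str, catalog_name: str,
                   catalog_class: str, jdbc_url: str) -> Dict[str, str]:
    """Build Spark properties for loading an extension jar and a catalog plugin."""
    return {
        # Standard Spark setting: extra jars to put on the classpath.
        "spark.jars": jar_path,
        # Standard Spark setting: session extension class to load.
        "spark.sql.extensions": extension_class,
        # Standard Spark setting: register a catalog plugin under a name.
        f"spark.sql.catalog.{catalog_name}": catalog_class,
        # Assumed option name; catalog plugins define their own options.
        f"spark.sql.catalog.{catalog_name}.url": jdbc_url,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render properties as a list of --conf arguments for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# All values below are hypothetical placeholders.
conf = extension_conf(
    jar_path="/path/to/spark-oracle-extension.jar",
    extension_class="com.example.PlaceholderSessionExtensions",
    catalog_name="oracle",
    catalog_class="com.example.PlaceholderOracleCatalog",
    jdbc_url="jdbc:oracle:thin:@//dbhost:1521/service",
)
submit_args = to_submit_args(conf)
```

Keeping the properties in one dictionary makes it straightforward to reuse them, for example by writing them to `spark-defaults.conf` instead of passing them on the command line.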
See the wiki.
The demo script walks you through the features of the library.
Please file GitHub issues.
This project welcomes contributions from the community. Before submitting a pull request, please review our contribution guide.
Please consult the security guide for our responsible security vulnerability disclosure process.
Copyright (c) 2022 Oracle and/or its affiliates.
Released under the Universal Permissive License v1.0 as shown at https://oss.oracle.com/licenses/upl/.