

CDE Tour ACE Workshop HOL

⚠ Warning
This guide was written for CDE versions up to 1.18. CDE 1.19 was released in May 2023 and includes some important updates. If you have a CDE 1.19 Virtual Cluster at your disposal, we recommend using the updated version of the HOL available at this GitHub repository.

Objective

CDE is the Cloudera Data Engineering Service, a containerized managed service for Cloudera Data Platform designed for Large Scale Batch Pipelines with Spark, Airflow and Iceberg. It allows you to submit batch jobs to auto-scaling virtual clusters. As a Cloud-Native service, CDE enables you to spend more time on your applications, and less time on infrastructure.

CDE allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters. With CDE, you define virtual clusters with a range of CPU and memory resources, and the cluster scales up and down as needed to run your Spark workloads, helping to control your cloud costs.

This Hands-On Lab is designed to walk you through the Service's main capabilities. Throughout the exercises you will:

  1. Deploy an Ingestion, Transformation and Reporting pipeline with Spark 3.2.
  2. Learn about Iceberg's most popular features.
  3. Orchestrate pipelines with Airflow.
  4. Use the CDE CLI to execute Spark Submits and more from your local machine.
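
As a rough illustration of the last item, the sketch below assembles a `cde spark submit` command from Python. It assumes the CDE CLI is already installed and configured on your machine; the script name `etl_job.py` and the helper `build_cde_spark_submit` are hypothetical, and the `--conf` flag should be checked against the CLI help for your CDE version.

```python
import shlex
from typing import Dict, List, Optional


def build_cde_spark_submit(
    app_file: str, spark_conf: Optional[Dict[str, str]] = None
) -> List[str]:
    """Assemble the argument list for a `cde spark submit` call.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = ["cde", "spark", "submit"]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted((spark_conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_file)
    return argv


command = build_cde_spark_submit(
    "etl_job.py",  # hypothetical application file
    {"spark.sql.shuffle.partitions": "200"},
)
print(shlex.join(command))

# To actually submit the job (requires a configured CDE CLI on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if you later pass it to `subprocess.run`.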

Step by Step Instructions

Detailed instructions in English are provided in the step_by_step_guides folder.

Next Steps

CDE is the Cloudera Data Engineering Service, a containerized managed service for Spark and Airflow.

If you are exploring CDE you may find the following tutorials relevant:

  • Spark 3 & Iceberg: A quick introduction to Iceberg Time Travel capabilities with Spark 3.

  • Simple Intro to the CDE CLI: An introduction to the CDE CLI for the CDE beginner.

  • CDE CLI Demo: A more advanced CDE CLI reference with additional details for the CDE user who wants to move beyond the basics.

  • CDE Resource 2 ADLS: An example integration between ADLS and CDE Resource. This pattern is applicable to AWS S3 as well and can be used to pass execution scripts, dependencies, and virtually any file from CDE to third-party systems and vice versa.

  • Using CDE Airflow: A guide to Airflow in CDE including examples to integrate with 3rd party systems via Airflow Operators such as BashOperator, HttpOperator, PythonOperator, and more.

  • GitLab2CDE: A CI/CD pipeline to orchestrate Cross-Cluster Workflows for Hybrid/Multicloud Data Engineering.

  • CML2CDE: An API to create and orchestrate CDE Jobs from any Python-based environment, including CML. Relevant for ML Ops or any Python users who want to leverage the power of Spark in CDE via Python requests.

  • Postman2CDE: An example of the Postman API to bootstrap CDE Services with the CDE API.

  • Oozie2CDEAirflow API: An API to programmatically convert Oozie workflows and dependencies into CDE Airflow and CDE Jobs. This API is designed to easily migrate from Oozie to CDE Airflow and not just Open Source Airflow.
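
Several of the tutorials above drive CDE through its REST API from Python. The sketch below shows roughly what preparing such a call could look like; it is not taken from any of those projects. It uses the standard library's `urllib` instead of the `requests` package so that it runs without extra dependencies. The `/jobs` path, the payload fields, the example URL, and the helper name `build_job_request` are all assumptions to verify against the Jobs API documentation of your Virtual Cluster.

```python
import json
import urllib.request


def build_job_request(jobs_api_url, token, job_name, app_file, resource_name):
    """Prepare (but do not send) a POST request that defines a Spark job.

    Hypothetical helper: the payload layout below is an assumption.
    """
    payload = {
        "name": job_name,
        "type": "spark",
        "mounts": [{"resourceName": resource_name}],
        "spark": {"file": app_file},
    }
    return urllib.request.Request(
        url=jobs_api_url.rstrip("/") + "/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_job_request(
    "https://example.invalid/dex/api/v1",  # placeholder Jobs API URL
    "PLACEHOLDER_TOKEN",                   # placeholder access token
    "demo-job",
    "etl_job.py",
    "demo-resource",
)
print(request.get_method(), request.full_url)

# Sending the request requires a real endpoint and token:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Separating request construction from sending makes it easy to inspect or log the payload before anything reaches the cluster.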

For more information on the Cloudera Data Platform and its form factors please visit this site.

For more information on migrating Spark jobs to CDE, please reference this guide.

If you have any questions about CDE or would like to see a demo, please reach out to your Cloudera Account Team or send a message through this portal and we will be in contact with you soon.
