
Workshop: SQL Server Big Data Clusters - Architecture

A Microsoft Course from the SQL Server team

About this Workshop
Business Applications of this Workshop
Technologies used in this Workshop
Before Taking this Workshop
Workshop Details
Related Workshops
Workshop Modules
Next Steps

Welcome to this Microsoft solutions workshop on the architecture of SQL Server Big Data Clusters. In this workshop, you'll learn how SQL Server Big Data Clusters (BDC) implements large-scale data processing and machine learning, how to select and plan the proper architecture so you can train machine learning models using Python, R, Java, or SparkML and operationalize those models, and how to deploy your intelligent apps side-by-side with their data.

The focus of this workshop is to understand how to deploy an on-premises or local environment of a big data cluster, and understand the components of the big data solution architecture.

You'll start by understanding the concepts of big data analytics, and you'll get an overview of the technologies (such as containers, container orchestration, Spark and HDFS, machine learning, and other technologies) that you will use throughout the workshop. Next, you'll understand the architecture of a BDC. You'll learn how to create external tables over other data sources to unify your data, and how to use Spark to run big queries over your data in HDFS or do data preparation. You'll review a complete solution for an end-to-end scenario, with a focus on how to extrapolate what you have learned to create other solutions for your organization.
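For example, a Spark data-preparation step might look like the following sketch. This is only an illustration under assumptions: a PySpark session available inside the cluster (for instance, from a notebook in Azure Data Studio), and a hypothetical CSV file already uploaded to HDFS.

```python
# Minimal PySpark sketch: read raw CSV from HDFS, do light preparation,
# and write the result back as Parquet for faster downstream queries.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bdc-data-prep").getOrCreate()

# Hypothetical HDFS path; substitute a file you have uploaded to the cluster.
raw = spark.read.option("header", "true").csv("/clickstream/raw/clicks.csv")

prepared = (
    raw.dropna(subset=["user_id"])                       # drop incomplete rows
       .withColumn("click_date", F.to_date("click_ts"))  # derive a date column
)

prepared.write.mode("overwrite").parquet("/clickstream/curated/clicks")
print(prepared.count(), "rows written")
```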

This GitHub README.md file explains how the workshop is laid out, what you will learn, and the technologies you will use in this solution. To download this lab to your local computer, click the Clone or Download button at the top right of this page. More about that process is here.

You can view all of the courses and other workshops our team has created at this link - open in a new tab to find out more.

Learning Objectives

In this workshop you'll learn:

  • When to use Big Data technology
  • The components and technologies of Big Data processing
  • Abstractions such as Containers and Container Management as they relate to SQL Server and Big Data
  • Planning and architecting an on-premises, in-cloud, or hybrid big data solution with SQL Server
  • How to install SQL Server big data clusters on-premises and in the Azure Kubernetes Service (AKS)
  • How to work with Apache Spark
  • The Data Science Process to create an end-to-end solution
  • How to work with the tooling for BDC (Azure Data Studio)
  • Monitoring and managing the BDC
  • Security considerations

Starting in SQL Server 2019, big data clusters allow for large-scale, near real-time processing of data over the HDFS file system and other data sources. They also leverage the Apache Spark framework, integrated into a single environment for management, monitoring, and security. This means that organizations can implement everything from queries to analysis to Machine Learning and Artificial Intelligence within SQL Server, over large-scale, heterogeneous data. SQL Server big data clusters can be implemented fully on-premises, in the cloud using a Kubernetes service such as Azure's AKS, or in a hybrid fashion, allowing for full, partial, or mixed security and control as desired.
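As a small illustration of this "everything through SQL Server" idea, the sketch below queries a hypothetical external table from Python with pyodbc. The endpoint, port, credentials, and the web_clickstreams table are assumptions you would replace with your own deployment's values.

```python
# Minimal sketch: querying a (hypothetical) external table in the BDC
# SQL Server master instance with plain T-SQL from Python via pyodbc.
import pyodbc

# Assumed values: replace with your master instance endpoint and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-endpoint>,31433;"
    "DATABASE=sales;UID=<user>;PWD=<password>;TrustServerCertificate=yes"
)

# web_clickstreams is a hypothetical external table over files in HDFS;
# to SQL Server it looks and queries like any other table.
cursor = conn.cursor()
cursor.execute(
    "SELECT TOP 10 wcs_user_sk, COUNT(*) AS clicks "
    "FROM dbo.web_clickstreams GROUP BY wcs_user_sk ORDER BY clicks DESC"
)
for row in cursor.fetchall():
    print(row.wcs_user_sk, row.clicks)
conn.close()
```

Because the external table virtualizes data held in HDFS or another source, the T-SQL is identical to a query over a regular table.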

The goal of this workshop is to train the team tasked with architecting and implementing SQL Server big data clusters in the planning, creation, and delivery of a system designed to be used for large-scale data analytics. Since there are multiple technologies and concepts within this solution, the workshop uses multiple types of exercises to prepare the students for this implementation.

The concepts and skills taught in this workshop form the starting points for:

  • Data Professionals and DevOps teams, to implement and operate a SQL Server big data cluster system.
  • Solution Architects and Developers, to understand how to put together an end-to-end solution.
  • Data Scientists, to understand the environment used to analyze and solve specific predictive problems.

Businesses require near real-time insights from ever-larger sets of data from a variety of sources. Large-scale data ingestion requires scale-out storage and processing in ways that allow fast response times. In addition to simply querying this data, organizations want full analysis and even predictive capabilities over their data.

Some industry examples of big data processing are in Retail (Demand Prediction, Market-Basket Analysis), Finance (Fraud Detection, Customer Segmentation), Healthcare (Fiscal Control Analytics, Disease Prevention Prediction and Classification, Clinical Trials Optimization), Public Sector (Revenue Prediction, Education Effectiveness Analysis), Manufacturing (Predictive Maintenance, Anomaly Detection), and Agriculture (Food Safety Analysis, Crop Forecasting), to name just a few.

The solution includes the following technologies - although you are not limited to these, they form the basis of the workshop. At the end of the workshop you will learn how to extrapolate these components into other solutions. You will cover these at an overview level, with references to much deeper training provided.

  • Linux: Operating system used in Containers and Container Orchestration
  • Containers: Encapsulation level for the SQL Server big data cluster architecture
  • Container Orchestration (such as Kubernetes): Management and control plane for Containers
  • Microsoft Azure: Cloud environment for services
  • Azure Kubernetes Service (AKS): Kubernetes as a Service
  • Apache HDFS: Scale-out storage subsystem
  • Apache Knox: The Knox Gateway provides a single access point for all REST interactions, used for security
  • Apache Livy: Job submission system for Apache Spark (see the sketch after this list)
  • Apache Spark: In-memory, large-scale, scale-out data processing architecture used by SQL Server
  • Python, R, Java, SparkML: ML/AI programming languages used for Machine Learning and AI model creation
  • Azure Data Studio: Tooling for SQL Server, HDFS, big data cluster management, and the T-SQL, R, Python, and SparkML languages
  • SQL Server Machine Learning Services: R, Python, and Java extensions for SQL Server
  • Microsoft Team Data Science Process (TDSP): Project, development, control, and management framework
  • Monitoring and Management: Dashboards, logs, APIs, and other constructs to manage and monitor the solution
  • Security: RBAC, keys, secrets, VNETs, and compliance for the solution
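To make the roles of Apache Knox and Apache Livy listed above more concrete, here is a sketch of submitting a Spark batch job through Livy's REST API. The gateway URL, port, credentials, and the job file path are placeholders and will differ in your deployment; in a big data cluster, the request typically passes through the Knox gateway rather than reaching Livy directly.

```python
# Sketch: submit a Spark batch to Livy over REST. Endpoint, port, and
# credentials are assumed placeholders; in a big data cluster the request
# is routed through the Knox gateway for security.
import requests

livy_url = "https://<knox-gateway>:30443/gateway/default/livy/v1/batches"  # assumed endpoint
payload = {
    "file": "/jobs/prep_clicks.py",   # hypothetical PySpark script already stored in HDFS
    "name": "clickstream-prep",
}

resp = requests.post(
    livy_url,
    json=payload,
    auth=("<user>", "<password>"),    # basic authentication at the gateway
    verify=False,                     # demo only: skip TLS verification for self-signed certs
)
resp.raise_for_status()
print("Submitted batch id:", resp.json()["id"])
```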

Condensed Lab: If you have already completed the prerequisites for this course and are familiar with the technologies listed above, you can jump to a Jupyter Notebook-based tutorial located here. Load the notebooks with Azure Data Studio, starting with bdc_tutorial_00.ipynb.

You'll need a local system on which you can install software. The workshop demonstrations use Microsoft Windows as the operating system, and all examples are shown on Windows. Optionally, you can install the software on a Microsoft Azure Virtual Machine (VM) and work with the solution there.

You must have a Microsoft Azure account with the ability to create assets, specifically the Azure Kubernetes Service (AKS).

This workshop expects that you understand data structures, working with SQL Server, and computer networks. It does not expect any prior data science knowledge, but a basic knowledge of statistics and data science is helpful in the Data Science sections. Knowledge of SQL Server, Azure Data and AI services, Python, and Jupyter Notebooks is recommended. AI techniques are implemented in Python packages. Solution templates are implemented using Azure services, development tools, and SDKs. You should have a basic understanding of working with the Microsoft Azure platform.

If you are new to these, here are a few references you can complete prior to class:

Setup

A full prerequisites document is located here. These instructions should be completed before the workshop starts, since you will not have time to cover them in class. Remember to turn off any Virtual Machines from the Azure Portal when not taking the class so that you do not incur charges (shutting down the machine from within the VM itself is not sufficient).
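If you use an Azure VM, note that only deallocation stops compute billing; shutting down the guest operating system does not. As an optional illustration (not part of the official prerequisites), the sketch below deallocates a VM with the Azure Python SDK. The subscription, resource group, and VM names are placeholders, and it assumes the azure-identity and azure-mgmt-compute packages are installed and you are signed in to Azure.

```python
# Sketch: deallocate (not just shut down) an Azure VM so compute charges stop.
# Subscription id, resource group, and VM name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"
client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# begin_deallocate releases the compute resources; a guest-OS shutdown alone
# keeps the VM allocated and billed.
poller = client.virtual_machines.begin_deallocate("<resource-group>", "<vm-name>")
poller.wait()
print("VM deallocated.")
```

This has the same effect as using Stop on the VM in the Azure Portal.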

This workshop uses Azure Data Studio, Microsoft Azure AKS, and SQL Server (2019 and higher) with a focus on architecture and implementation.

Primary Audience: System Architects and Data Professionals tasked with implementing Big Data, Machine Learning and AI solutions
Secondary Audience: Security Architects, Developers, and Data Scientists
Level: 300
Type: In-Person
Length: 8-9 hours

This is a modular workshop, and in each section, you'll learn concepts, technologies and processes to help you complete the solution.

  • 01 - The Big Data Landscape: Overview of the workshop, problem space, solution options and architectures
  • 02 - SQL Server BDC Components: Abstraction levels, frameworks, architectures and components within SQL Server big data clusters
  • 03 - Planning, Installation and Configuration: Mapping the requirements to the architecture design, constraints, and diagrams
  • 04 - Operationalization: Connecting applications to the solution; DDL, DML, DCL
  • 05 - Management and Monitoring: Tools and processes to manage the big data cluster
  • 06 - Security: Access and Authentication to the various levels of the solution

Next Steps

Next, continue to the prerequisites.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Legal Notices

License

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.