Azure Data Platform/Analytics Platform
This project provides an end-to-end data platform on Azure, with all of the required services deployed and fully connected. It is the empty shell of a solution, ready for you to "bring-your-own-data".
Pre-requisites:
- An Azure Resource Group on which you have Contributor access (remember that Azure role assignments are inherited). Owner access is required for the managed identity approach, which is optional but discussed in this guide.
How to set up the project
Step 1: Click Deploy To Azure below to get started
Fill in the corresponding fields:
Click Deploy and you should see the following:
The deployment takes approximately 5 minutes, most of which is the Virtual Machine (IaaS) deployment. You can continue with the next steps while the Virtual Machine finishes deploying.
If you navigate to your newly created resource group, or click Go To Deployment, you should see all of these resources (the names may differ based on the parameters you supplied when deploying the template):
The final step looks like this:
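If you prefer to verify the deployment from code rather than the portal, here is a minimal sketch using the Azure SDK for Python; the subscription ID and resource group name are placeholders for your own values.

```python
# Optional check: list the resources in the new resource group with the Azure SDK for Python.
# pip install azure-identity azure-mgmt-resource
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder: the group you deployed into

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Print each deployed resource's type and name (Data Factory, storage account, VM, etc.)
for resource in client.resources.list_by_resource_group(resource_group):
    print(f"{resource.type}  {resource.name}")
```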
Step 2: Configure the Azure Data Factory to use a Private Endpoint
- Navigate to your Azure Data Factory that has been created in your resource group
- Go to the networking tab on the left
- Change the Networking Access from Public Endpoint to Private Endpoint
- Select Private endpoint connections at the top
- Add a Private Endpoint by clicking +Private Endpoint
Add an endpoint:
On the Resource tab (tab 2), target your own resource
Navigate to your Azure Data Factory and launch it
Note that you also want to make sure this private endpoint is configured on the correct VNet; otherwise Azure Data Factory will resolve to the public endpoint.
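As an optional check, you can confirm the factory's networking setting from code. The sketch below assumes a recent version of the azure-mgmt-datafactory package and uses placeholder subscription, resource group, and factory names.

```python
# Optional check: confirm public network access is disabled on the Data Factory.
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder
factory_name = "my-data-factory"             # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
factory = adf_client.factories.get(resource_group, factory_name)

# "Disabled" means the factory is only reachable through the private endpoint.
print("Public network access:", factory.public_network_access)
```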
Step 3: Managed VNET
Creating a managed VNet allows us to access our PaaS resources over private endpoints.
In Manage, go to Integration runtimes and click New to create a new integration runtime.
Choose Azure, Self-Hosted, then click Continue.
Then choose Azure and click Continue again.
- Give it a good name, in this case I've named it managed-vnet
- Enable Virtual network configuration, then click Create
- Click Manage Private Endpoints
- New
Choose Azure Data Lake Storage Gen2 (ADLS Gen2)
Give it a name, then choose your Azure subscription and the storage account that you created during the initial deployment stage
Go to your storage account and approve the private endpoint
Go back to Linked Services, and click new
Choose Azure Data Lake Storage Gen2
For Connect via integration runtime, select the managed-vnet runtime you created in the previous step from the dropdown
If successful, this should show up and your private endpoint should be approved
Test the connection
This approach uses an account key. If you have Owner access to your subscription, you can instead grant access via managed identity by configuring Access Control (IAM) on the storage account and selecting Managed Identity in the steps above.
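If you do have Owner access, the role assignment can also be created from code. The sketch below grants the built-in Storage Blob Data Contributor role to the factory's managed identity; the resource names and the managed identity object ID are placeholders for your own values.

```python
# Sketch: grant the Data Factory's managed identity "Storage Blob Data Contributor"
# on the storage account, so the linked service can use Managed Identity instead of a key.
# Requires Owner (or User Access Administrator) on the scope.
# pip install azure-identity azure-mgmt-authorization
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<your-subscription-id>"                      # placeholder
storage_scope = (                                               # placeholder resource IDs
    f"/subscriptions/{subscription_id}/resourceGroups/my-data-platform-rg"
    "/providers/Microsoft.Storage/storageAccounts/mydatalake"
)
adf_principal_id = "<data-factory-managed-identity-object-id>"  # from the factory's Identity blade

# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
    "roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
auth_client.role_assignments.create(
    scope=storage_scope,
    role_assignment_name=str(uuid.uuid4()),
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=adf_principal_id,
        principal_type="ServicePrincipal",
    ),
)
```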
- Go to integration runtimes
- Click new
Choose Azure, Self-Hosted
Under Network Environment select Self-Hosted and then click continue
In this case I've named it self-hosted-vm
Follow either Option 1 or Option 2. I used Option 1: log on to the VM, install the integration runtime application, and register it with one of the keys.
The status of the integration runtime on the self-hosted VM should now show as Running
You will need the VM in a few steps, so don't log off yet
Add another linked service by going to Linked services and clicking New
Search for file and choose File System
Under Connect via integration runtime, choose your self-hosted-vm
On your VM, create a new folder on one of your drives
- Create a new file in that folder
- Copy the path of the folder
Under Host, paste the folder path
Enter your username and password that you set in the initial ARM resource template deployment stage
As a best practice, you can create a linked service to Azure Key Vault
- Linked Services
- New
Select Key Vault
Choose the Key Vault that you created during the deployment stage. It can then be used to store secrets instead of entering your password for the self-hosted VM directly, as in the previous stage.
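If you want to populate the vault from code rather than the portal, here is a minimal sketch using the Key Vault secrets client; the vault URL and secret name are placeholders.

```python
# Sketch: store the VM password as a Key Vault secret so linked services can
# reference it instead of embedding the password directly.
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://my-data-platform-kv.vault.azure.net"   # placeholder vault name
secret_client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Store the VM admin password once...
secret_client.set_secret("vm-admin-password", "<the-password-from-the-arm-deployment>")

# ...and read it back wherever it is needed.
print(secret_client.get_secret("vm-admin-password").value)
```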
Go to your storage account and create a new container
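The container can also be created with the storage SDK rather than the portal. In this sketch the account URL and container name are placeholders, and the signed-in identity needs data-plane access (for example Storage Blob Data Contributor) on the account.

```python
# Sketch: create the container programmatically instead of through the portal.
# pip install azure-identity azure-storage-blob
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://mydatalake.blob.core.windows.net"   # placeholder account name
container_name = "landing"                                 # placeholder container name

blob_service = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
blob_service.create_container(container_name)
```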
Step 4: Creating connections for the pipeline
- Click on Pipelines in the side menu on the left
- Click the plus button
- Choose datasets
Choose Azure Data Lake Storage Gen2
For format choose DelimitedText/CSV
Choose your linked service for your Azure Storage Account
Next you can click on the browse button
Choose your container that you previously created
For the import schema, click on None
Next do the same for your file server on the VM
Search for file and choose File System
Again choose DelimitedText/CSV
You can then choose your linked service for your file server
Browse for the file path
Select the file you created
Change the import schema to none
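For reference, a roughly equivalent definition of the ADLS Gen2 delimited-text dataset can be pushed with the management SDK instead of the Studio UI; the linked service name, container, and dataset name below are placeholders matching whatever you chose above.

```python
# Sketch: the ADLS Gen2 delimited-text dataset defined through the management SDK.
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLocation,
    DatasetResource,
    DelimitedTextDataset,
    LinkedServiceReference,
)

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder
factory_name = "my-data-factory"             # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# CSV dataset pointing at the container created earlier, via the ADLS linked service.
adls_dataset = DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="AzureDataLakeStorage1",          # placeholder linked service name
    ),
    location=AzureBlobFSLocation(file_system="landing"),  # placeholder container
    column_delimiter=",",
)

adf_client.datasets.create_or_update(
    resource_group, factory_name, "DelimitedText1", DatasetResource(properties=adls_dataset)
)
```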
Step 5: Create a new pipeline
All of the stages before this were primarily setup. This is the most customisable step of the process; here we will simply copy a file from your virtual machine to the container in your storage account.
First, drag the Copy data activity onto your canvas
At the bottom of the page:
- Click Source
- Choose DelimitedText2 (the Virtual Machine Dataset)
- Click Sink
- Choose DelimitedText1 (the Storage Account Dataset)
Click Debug
It should then show the status as Succeeded
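Once the pipeline is published, you can also trigger it and poll its status from code instead of using Debug; the pipeline name below is a placeholder for whatever yours is called.

```python
# Sketch: trigger the pipeline and poll its status with the management SDK.
# pip install azure-identity azure-mgmt-datafactory
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder
factory_name = "my-data-factory"             # placeholder
pipeline_name = "pipeline1"                  # placeholder: the pipeline created above

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = adf_client.pipelines.create_run(resource_group, factory_name, pipeline_name)

# Poll until the copy finishes; expect "Succeeded" if everything is wired up correctly.
while True:
    pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print("Pipeline run status:", pipeline_run.status)
```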
Complete! I hope you all enjoyed.
Feel free to create an issue on this GitHub page if you run into any problems setting it up, or message me directly on LinkedIn or by email at t-schish@microsoft.com