Azure Data Platform/Analytics Platform
This project provides an end-to-end data platform on Azure, with all of the required services deployed and fully connected. It is the empty shell of a solution, ready for you to "bring-your-own-data".
Pre-requisites:
- An Azure Resource Group on which you have Contributor access (remember that Azure role assignments are inherited). Owner access is required for the managed identity approach, which is optional but discussed in this guide.
How to set up the project
Step 1: Click Deploy To Azure below to get started
Fill in the corresponding fields:
Click Deploy and you should see the following:
The deployment takes approximately 5 minutes, most of which is the Virtual Machine (IaaS) deployment. You can continue with the next steps while the Virtual Machine finishes deploying.
If you navigate to your newly created resource group, or click Go To Deployment, you should see all of these resources (the names may differ based on the parameters you supplied when deploying the template):
The final step looks like this:
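If you prefer to verify the deployment from code rather than the portal, here is a minimal sketch using the Azure SDK for Python; the subscription ID and resource group name are placeholders for your own values.

```python
# Optional check: list the resources in the new resource group with the Azure SDK for Python.
# pip install azure-identity azure-mgmt-resource
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder: the group you deployed into

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Print each deployed resource's type and name (Data Factory, storage account, VM, etc.)
for resource in client.resources.list_by_resource_group(resource_group):
    print(f"{resource.type}  {resource.name}")
```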
Step 2: Configure the Azure Data Factory to use a Private Endpoint
- Navigate to your Azure Data Factory that has been created in your resource group
- Go to the networking tab on the left
- Change the Networking Access from Public Endpoint to Private Endpoint
- Select Private endpoint connections at the top
- Add a Private Endpoint by clicking +Private Endpoint
Add an endpoint:
On the Resource tab (tab 2), target your own resource
Navigate to your Azure Data Factory and launch it
Note that you also want to make sure this private endpoint is configured on the correct VNet; otherwise Azure Data Factory will resolve to the public endpoint.
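As an optional check, you can confirm the factory's networking setting from code. The sketch below assumes a recent version of the azure-mgmt-datafactory package and uses placeholder subscription, resource group, and factory names.

```python
# Optional check: confirm public network access is disabled on the Data Factory.
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder
factory_name = "my-data-factory"             # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
factory = adf_client.factories.get(resource_group, factory_name)

# "Disabled" means the factory is only reachable through the private endpoint.
print("Public network access:", factory.public_network_access)
```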
Step 3: Managed VNET
Creating a managed VNet allows us to access our PaaS resources over private endpoints.
In Manage, go to Integration runtimes and click New to create a new integration runtime.
Choose Azure, Self-Hosted, then click Continue.
Then choose Azure and click Continue again.
- Give it a good name, in this case I've named it managed-vnet
- Enable Virtual network configuration, then click Create
- Click Manage Private Endpoints
- New
Choose Azure Data Lake Storage Gen2 (ADLS Gen2)
Give it a name, then choose your Azure subscription and the storage account that you created during the initial deployment stage
Go to your storage account and approve the private endpoint
Go back to Linked Services, and click new
Choose Azure Data Lake Storage Gen2
For Connect via integration runtime, select the managed-vnet runtime you created in the previous step from the dropdown
If successful, this should show up and your private endpoint should be approved
Test the connection
This approach uses an account key. If you have Owner access to your subscription, you can instead grant access via managed identity by configuring Access Control (IAM) on the storage account and selecting Managed Identity in the steps above.
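If you do have Owner access, the role assignment can also be created from code. The sketch below grants the built-in Storage Blob Data Contributor role to the factory's managed identity; the resource names and the managed identity object ID are placeholders for your own values.

```python
# Sketch: grant the Data Factory's managed identity "Storage Blob Data Contributor"
# on the storage account, so the linked service can use Managed Identity instead of a key.
# Requires Owner (or User Access Administrator) on the scope.
# pip install azure-identity azure-mgmt-authorization
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<your-subscription-id>"                      # placeholder
storage_scope = (                                               # placeholder resource IDs
    f"/subscriptions/{subscription_id}/resourceGroups/my-data-platform-rg"
    "/providers/Microsoft.Storage/storageAccounts/mydatalake"
)
adf_principal_id = "<data-factory-managed-identity-object-id>"  # from the factory's Identity blade

# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization/"
    "roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
auth_client.role_assignments.create(
    scope=storage_scope,
    role_assignment_name=str(uuid.uuid4()),
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id=adf_principal_id,
        principal_type="ServicePrincipal",
    ),
)
```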
- Go to integration runtimes
- Click new
Choose Azure, Self-Hosted
Under Network Environment select Self-Hosted and then click continue
In this case I've named it self-hosted-vm
Follow either Option 1 or Option 2. I used Option 1: log on to the VM, install the integration runtime application, and register it with one of the keys.
The status of the integration runtime on the self-hosted VM should now show as Running
You will need the VM in a few steps, so don't log off yet
Add another linked service by going to Linked services and clicking New
Search for file and choose File System
Under Connect via integration runtime, choose your self-hosted-vm
On your VM, create a new folder on one of your drives
- Create a new file in that folder
- Copy the path of the folder
Under Host, paste the folder path
Enter your username and password that you set in the initial ARM resource template deployment stage
As a best practice, you can create a linked service to Azure Key Vault
- Linked Services
- New
Select Key Vault
Choose the Key Vault that you created during the deployment stage. It can then be used to store secrets instead of entering your password for the self-hosted VM directly, as in the previous stage.
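If you want to populate the vault from code rather than the portal, here is a minimal sketch using the Key Vault secrets client; the vault URL and secret name are placeholders.

```python
# Sketch: store the VM password as a Key Vault secret so linked services can
# reference it instead of embedding the password directly.
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://my-data-platform-kv.vault.azure.net"   # placeholder vault name
secret_client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Store the VM admin password once...
secret_client.set_secret("vm-admin-password", "<the-password-from-the-arm-deployment>")

# ...and read it back wherever it is needed.
print(secret_client.get_secret("vm-admin-password").value)
```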
Go to your storage account and create a new container
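The container can also be created with the storage SDK rather than the portal. In this sketch the account URL and container name are placeholders, and the signed-in identity needs data-plane access (for example Storage Blob Data Contributor) on the account.

```python
# Sketch: create the container programmatically instead of through the portal.
# pip install azure-identity azure-storage-blob
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://mydatalake.blob.core.windows.net"   # placeholder account name
container_name = "landing"                                 # placeholder container name

blob_service = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
blob_service.create_container(container_name)
```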
Step 4: Creating connections for the pipeline
- Click on Pipelines in the side menu on the left
- Click the plus button
- Choose datasets
Choose Azure Data Lake Storage Gen2
For format choose DelimitedText/CSV
Choose your linked service for your Azure Storage Account
Next you can click on the browse button
Choose your container that you previously created
For the import schema, click on None
Next do the same for your file server on the VM
Search for file and choose File System
Again choose DelimitedText/CSV
You can then choose your linked service for your file server
Browse for the file path
Select the file you created
Change the import schema to none
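For reference, a roughly equivalent definition of the ADLS Gen2 delimited-text dataset can be pushed with the management SDK instead of the Studio UI; the linked service name, container, and dataset name below are placeholders matching whatever you chose above.

```python
# Sketch: the ADLS Gen2 delimited-text dataset defined through the management SDK.
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLocation,
    DatasetResource,
    DelimitedTextDataset,
    LinkedServiceReference,
)

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder
factory_name = "my-data-factory"             # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# CSV dataset pointing at the container created earlier, via the ADLS linked service.
adls_dataset = DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="AzureDataLakeStorage1",          # placeholder linked service name
    ),
    location=AzureBlobFSLocation(file_system="landing"),  # placeholder container
    column_delimiter=",",
)

adf_client.datasets.create_or_update(
    resource_group, factory_name, "DelimitedText1", DatasetResource(properties=adls_dataset)
)
```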
Step 5: Create a new pipeline
All of the stages before this were primarily setup. This is the most customisable step of the process; here we will simply copy a file from your virtual machine to the container in your storage account.
First, drag the Copy data activity onto your canvas
At the bottom of the page:
- Click Source
- Choose DelimitedText2 (the Virtual Machine Dataset)
- Click Sink
- Choose DelimitedText1 (the Storage Account Dataset)
Click Debug
It should then show the status as Succeeded
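Once the pipeline is published, you can also trigger it and poll its status from code instead of using Debug; the pipeline name below is a placeholder for whatever yours is called.

```python
# Sketch: trigger the pipeline and poll its status with the management SDK.
# pip install azure-identity azure-mgmt-datafactory
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "my-data-platform-rg"       # placeholder
factory_name = "my-data-factory"             # placeholder
pipeline_name = "pipeline1"                  # placeholder: the pipeline created above

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = adf_client.pipelines.create_run(resource_group, factory_name, pipeline_name)

# Poll until the copy finishes; expect "Succeeded" if everything is wired up correctly.
while True:
    pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print("Pipeline run status:", pipeline_run.status)
```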
Complete! I hope you all enjoyed.
Feel free to create an issue on this GitHub page if you run into any problems setting it up, or message me directly on LinkedIn or by email at t-schish@microsoft.com