Azure/spark-cdm-connector

Attaching to the CDM folder as external tables in a Synapse database


Hi,

I hope this is an appropriate question for this repo (if not I will happily close the issue).

I have run the sample PySpark notebook to generate the CDM folder and parquet files... all good so far. I now have the folders and parquet files in a container in my ADLS Gen2 storage account.

I'd now like to understand how to connect to this newly created folder and its parquet files from the Synapse SQL pool, so I can query the data from SQL directly.

My thinking is that I'd like to use the CDM connector to generate data in CDM format, and then have the option to query that data using Synapse SQL.
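For example (just a sketch with hypothetical names; mystorageaccount, mycontainer, and the Person entity folder are stand-ins for my actual setup), ideally I could run an ad-hoc query like this from the serverless SQL pool:

SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/cdm-folder/Person/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;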

What I "think" I need to do is to first create a database and then create an external table such as the examples in this article:

https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table#create-an-external-table-backed-by-parquet-in-spark-and-query-from-serverless-sql-pool

The sample SQL is:

CREATE TABLE mytestdb.myexternalparquettable
USING Parquet
LOCATION "abfss://@arcadialake.dfs.core.windows.net/synapse/workspaces/<synapse_ws>/warehouse/mytestdb.db/myparquettable/"

The docs state:

"Replace the placeholder with the file system name that is the workspace default file system and the placeholder <synapse_ws> with the name of the synapse workspace you're using to run this example."

I am struggling to understand what the specific values need to be.

The Synapse workspace name is the name of the resource in Azure; basically, whatever comes before "-ondemand.sql.azuresynapse.net" in the serverless SQL endpoint.
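In your case, since the parquet files already sit in your own container rather than the workspace warehouse path, you could also point LOCATION at the CDM folder directly from a Spark pool notebook, and the table then becomes visible to the serverless SQL pool through the shared metadata. A rough sketch, where the database, container, storage account, and entity folder names are all hypothetical and would need to match your setup:

CREATE DATABASE IF NOT EXISTS cdmdb;

CREATE TABLE cdmdb.person
USING Parquet
LOCATION "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/cdm-folder/Person/";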
That said, this seems like a Synapse configuration issue rather than a Spark CDM connector issue, so I'd recommend the following post, specifically the section under the "Setup external data source in Azure SQL" heading:
https://devblogs.microsoft.com/azure-sql/read-azure-storage-files-using-synapse-sql-external-tables/
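The gist of that approach, sketched here with hypothetical names (and leaving out the database scoped credential setup the post walks through for secured storage), is to create an external data source, a parquet file format, and an external table over the entity folder. The column list below is made up and would need to match your actual parquet schema:

CREATE EXTERNAL DATA SOURCE CdmStorage
WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.Person (
    [id] INT,
    [firstName] NVARCHAR(100),
    [lastName] NVARCHAR(100)
)
WITH (
    LOCATION = 'cdm-folder/Person/*.parquet',
    DATA_SOURCE = CdmStorage,
    FILE_FORMAT = ParquetFormat
);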

Thanks @drinkingbird, I have been able to get this working now.