A pretty simple stored procedure that can be run in Synapse SQL Serverless and returns a full CREATE EXTERNAL TABLE statement which can be run against Synapse SQL Dedicated.
The Stored Procedure, called MakeExternalTableFromParquet takes 6 arguements:
- InputFilePath - the location of the Parquet file / files being analysed e.g. https://[storage account].dfs.core.windows.net/[directories]/*.snappy.parquet
- ExternalTableFileLocation - to be used in the LOCATION arguement within the generated EXTERNAL TABLE statement, generally this will need to be written as '/directory/subdirectory/'.
- ExternalTableFileFormat - External Data Source to be used within the generated EXTERNAL TABLE statement.
- ExternalTableSchema - Schema in which the EXTERNAL TABLE will be created in.
- ExternalTableName - Table name of the EXTERNAL TABLE.
NOTE - ExternalTableFileLocation, ExternalTableFileFormat, ExternalTableSchema and ExternalTableName are purely used for generating the CREATE EXTERNAL TABLE statement, there is no validation etc on this.
Generating a CREATE EXTERNAL TABLE statement can be a pain, whilst tools like the Synapse Workspace can do this automatically (through the GUI) and Synapse Pipelines / Azure Data Factory can also do this as part of a COPY activity - that is not always suitable for all needs.
Test coveragge for this is laughably bad (sorry), so your mileage my vary but in the 2 dozen different parquet files I tested with, it worked well enough. I think there is plenty of scope for improvements (see below) and making it more efficient - but hopefully in it's current form it helps.
This stored procedure is built around a couple of steps:
- Create a view using OPENROWSET using a NEWID() as the view name.
- Generate the first part of the create external table statement - basically up to the where each column name and it's data type is specified, this part uses the ExternalTableSchema and ExternalTableName parameters defined when calling the stored procedure.
- Generate the last part of the create external table statement - basically everything after where each column name and it's data type is specified, this part uses the ExternalTableFileLocation and ExternalTableFileFormat parameters defined when calling the stored procedure.
- Extract details from INFORMATION_SCHEMA.COLUMNS based on the VIEW created in step 1, with some logic for capturing VARCHARs.
- Using a combination of CONCAT and STRING_AGG to bring together step 2, step 4 and step 3 (n.b. step 4 and this step actually happen in a single step).
- Drop the view
The output from Step 5 is all that should be returned from the Stored Procedure, you can then run this on a Synapse SQL Dedicated environment.
Unfortunately using sp_describe_first_result_set isn't very workable, so the creation of the view seems like the easiest approach - let me know if you can find a better way (noting Synapse SQL Serverless doesn't have temp tables!).
Oh so very much (not prioritized)
- The documentation needs to be improved - probably a diagram would help
- Error handling / rollback of the view if this errors
- Simplify the inputs?
- Look at efficiency options - variable is very lazily a VARCHAR(MAX).
- Not sure I love the name of the Stored Procedure
- Clean up the code (properly comment it, remove invalid calls which I have commented out)
- Plenty more testing.