
Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Azure Data Lake Storage Log Analysis

This repository contains artifacts for processing the 'classic' diagnostic log files generated by Azure Data Lake Storage (ADLS) and Azure Blob Storage. The artifacts use Azure Databricks Spark clusters to perform scale-out analysis.

Enabling log emission

ADLS and Blob Storage share the same mechanism for emitting logs into the $logs container. See the Azure Storage documentation for details on how to enable generation of diagnostic logs.

Create Azure Databricks workspace

Once logs have been enabled on the ADLS account, create an Azure Databricks workspace. In their current form, the artifacts require a Premium workspace, because Azure Active Directory (AAD) credential passthrough is used to authenticate to ADLS.

If you already have an Azure Databricks workspace, this can be reused.

Importing the notebook

Using the Azure Databricks portal, import the notebook from this repository by specifying its raw GitHub URL.

Once imported, follow the instructions in the notebook. The basic steps are:

  1. Mount the $logs container into DBFS using AAD Credential Passthrough.

  2. Execute the DDL statement to create an external table definition that wraps the data being emitted as logs. No data needs to be inserted into this table directly; it simply acts as a mechanism that allows the Spark query engine to process queries against the log data. This step also creates a view, AdlsLogs, which includes some computed columns and will be the main query target.

    Execute this step only once: dropping and re-creating the table loses all accumulated partition information.

  3. Discover hourly partitions. Close observation of the CREATE TABLE statement reveals a partitioning structure of Year, Month, Day, Hour. Unfortunately, because the log directory names do not follow the naming convention that Spark requires for partition directories, the partitions must be added manually via a series of ALTER TABLE ADD PARTITION statements. Step 3 processes the directory structure and adds any newly discovered partitions.

    Run this step before each querying 'session', because new partitions are created for every hour of access.

  4. Query the logs using partition columns. In addition to the raw partition columns, a computed column PartitionedRequestDate allows the specification of query predicates that efficiently restrict the amount of data read to satisfy any given query. Without using partition columns, the query engine must process ALL data which, depending on the volume of log data, can be very time consuming. Specifying partition columns enables partition pruning to limit the amount of data read.
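
The partition discovery in step 3 can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code. It assumes that hourly log directories end in `YYYY/MM/DD/hhmm` (the layout of classic Storage Analytics logs), that the partition columns Year, Month, Day and Hour are integers, and it uses a hypothetical table name `adls_logs_raw` and a hypothetical mount point `dbfs:/mnt/logs`.

```python
import re

# Assumed directory layout under the mounted $logs container:
#   <service>/YYYY/MM/DD/hhmm   e.g. blob/2020/03/15/1400
_HOUR_DIR = re.compile(
    r"(?:^|/)(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})\d{2}/?$"
)


def partition_statements(directories, known,
                         table="adls_logs_raw",        # hypothetical table name
                         mount_point="dbfs:/mnt/logs"):  # hypothetical mount point
    """Build ALTER TABLE statements for hourly directories not yet registered.

    directories: paths relative to the mount point.
    known: set of (year, month, day, hour) tuples already added as partitions.
    """
    statements = []
    for directory in sorted(directories):
        match = _HOUR_DIR.search(directory)
        if match is None:
            continue  # not an hourly log directory
        year, month, day, hour = (
            int(match[name]) for name in ("year", "month", "day", "hour")
        )
        if (year, month, day, hour) in known:
            continue  # partition already registered
        location = f"{mount_point}/{directory.strip('/')}"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (Year={year}, Month={month}, Day={day}, Hour={hour}) "
            f"LOCATION '{location}'"
        )
    return statements


if __name__ == "__main__":
    found = ["blob/2020/03/15/1400/", "blob/2020/03/15/1500", "blob/notes"]
    already_added = {(2020, 3, 15, 14)}
    for statement in partition_statements(found, already_added):
        print(statement)
```

In a Databricks notebook, each generated statement would typically be passed to `spark.sql(...)`, and the directory list would come from walking the mounted container. Both are left out here so the sketch runs without a cluster.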