Knowledge Mining - AzureML Solution Accelerator


Introduction

This repo walks through the process of creating a knowledge mining solution that enriches your data by identifying custom entities in a corpus of documents using an AI custom skill. We'll use several capabilities in Azure Cognitive Search and Azure Machine Learning to extract entities from documents.

The solution will show you how to:

1. Create a labeled dataset using your documents (Steps 1-3)

  1. Create a custom skill to label data for named entity recognition (NER)
  2. Create an enrichment pipeline with Azure Cognitive Search that integrates the labeling skill to create labeled data from a corpus of documents.
  3. Project the labeled data as a new dataset into the Cognitive Search Knowledge Store so that it can be used for training.

2. Train a BERT NER model (Steps 4-5)

If you already have labeled data or want to use the sample data provided with this repo, you can skip ahead to Step 4.

  1. Use the labeled data to train an NER model in Azure Machine Learning using a BERT model designed to extract entities from documents. The code used to train the model was derived from the NLP Recipes Repo.
  2. Integrate the BERT NER custom skill with Azure Cognitive Search to project the identified entities and content of each document into the knowledge store and the search index.
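A BERT NER model of this kind emits one BIO tag per token (e.g. `B-ORG`, `I-ORG`, `O`); the entities projected into the index are recovered by collapsing those tag sequences into spans. A minimal, stdlib-only sketch of that post-processing step (illustrative, not the repo's actual code):

```python
def bio_to_entities(tokens, tags):
    """Collapse BIO tags (B-ORG, I-ORG, O, ...) into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

spans = bio_to_entities(
    ["Contoso", "Ltd", "hired", "Jane", "Doe"],
    ["B-ORG", "I-ORG", "O", "B-PER", "I-PER"],
)
# → [('Contoso Ltd', 'ORG'), ('Jane Doe', 'PER')]
```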


This is designed to be used in conjunction with the Knowledge Mining Solution Accelerator. After you train and deploy the model, you can easily integrate the model with the solution accelerator to showcase the results in a Web App.

Prerequisites

The directions provided in this guide assume you have a working understanding of Azure Machine Learning and Azure Cognitive Search. It's also important to understand the concepts of Custom Skills, Knowledge Store, and Projections in Cognitive Search. You'll also need access to an Azure Subscription.

Getting Started

This repo helps you to (1) label data for Named Entity Recognition (NER) and (2) train a model using that labeled data. Steps 1 through 3 walk you through the process of labeling the data, while Steps 4 and 5 walk you through training and deploying the model.

If you already have labeled data or want to use the sample data provided with this repo, you can skip ahead to Step 4.

First, deploy the necessary resources onto Azure:

| Description | ARM Template |
| ----------- | ------------ |
| Recommended: Deploy everything (Cognitive Search, Cognitive Services, Container Registry, and Azure Machine Learning) | |
| Deploy only Azure Machine Learning | |

Next, walk through the folders sequentially and follow the steps outlined in each README:

Two custom skills will be deployed to generate NER-labeled data. First, deploy a custom skill that labels data in CoNLL format based on a predetermined list of entities (see labels.json). The custom label skill is built as a Flask API app, wrapped in a Docker container, and deployed to a Web App for Containers.
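A custom skill receives and returns the `values`/`recordId` JSON envelope that Cognitive Search defines for the custom skill web API. Below is a hedged, stdlib-only sketch of the skill's core logic: the entity list and function names are illustrative stand-ins (the repo drives this from labels.json and wraps it in Flask); only the envelope shape is the documented contract.

```python
# Stand-in for the repo's labels.json; entries here are illustrative.
KNOWN_ENTITIES = {"contoso": "ORG", "fabrikam": "ORG"}

def conll_label(text: str) -> str:
    """Tag each whitespace token with its entity type (or O), one 'token tag' line each."""
    lines = []
    for token in text.split():
        tag = KNOWN_ENTITIES.get(token.strip(".,").lower(), "O")
        lines.append(f"{token} {tag}")
    return "\n".join(lines)

def run_skill(payload: dict) -> dict:
    """Handle one Cognitive Search custom-skill request and build the response envelope."""
    values = []
    for record in payload["values"]:
        text = record["data"].get("text", "")
        values.append({
            "recordId": record["recordId"],
            "data": {"labels": conll_label(text)},
            "errors": None,
            "warnings": None,
        })
    return {"values": values}

request = {"values": [{"recordId": "0", "data": {"text": "Contoso acquired Fabrikam."}}]}
response = run_skill(request)
```

In the repo this handler sits behind a Flask route so the skillset can call it over HTTPS; the labeling logic itself stays a pure function.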

Walks you through the process of integrating the custom skill with an Azure Cognitive Search index. You'll create the index and project the documents to the knowledge store using a Jupyter notebook. Download sample data from the SampleData folder to generate NER label data.
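The plumbing in this step amounts to the skillset body the notebook sends to the Cognitive Search REST API: the label skill is attached as a `WebApiSkill` and its output is projected to the knowledge store via the `knowledgeStore` section. A sketch of that body, assuming placeholder names, URIs, and field paths rather than the repo's exact ones:

```python
def build_skillset(skill_uri: str, storage_conn: str) -> dict:
    """Minimal skillset body with one custom WebApiSkill and an object
    projection into the knowledge store. All names are placeholders."""
    return {
        "name": "ner-labeling-skillset",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "description": "Custom skill that emits CoNLL-style labels",
                "uri": skill_uri,
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "labels", "targetName": "labels"}],
            }
        ],
        "knowledgeStore": {
            "storageConnectionString": storage_conn,
            "projections": [
                {
                    "tables": [],
                    "objects": [
                        {
                            "storageContainer": "labeled-data",
                            "source": "/document/labels",
                        }
                    ],
                    "files": [],
                }
            ],
        },
    }

body = build_skillset(
    "https://my-label-skill.azurewebsites.net/api/label",  # placeholder URI
    "<storage-connection-string>",
)
```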

Aggregates all the sentences from the different documents in the knowledge store into a single document that will be used to train the NER model.
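The aggregation itself is simple string processing: concatenate the CoNLL-labeled text projected for each document into one training corpus, with blank lines separating sentences as CoNLL expects. A sketch (downloading the projected blobs from Azure Storage is omitted; `docs` here is just a list of already-labeled strings):

```python
def aggregate_conll(docs):
    """Join CoNLL-labeled documents into one training corpus.

    Each input string holds newline-separated 'token tag' lines; a blank
    line between documents keeps sentence boundaries intact for training.
    """
    return "\n\n".join(doc.strip() for doc in docs) + "\n"

corpus = aggregate_conll([
    "Contoso ORG\nacquired O",
    "Fabrikam ORG\nexpanded O",
])
```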

Trains a BERT model to extract entities from documents and then deploys the model to an AKS cluster using a custom skill format.
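Deploying through Azure Machine Learning uses an entry (scoring) script with the `init()`/`run()` shape Azure ML expects; returning the custom-skill JSON envelope from `run()` is what makes the deployed model callable as a Cognitive Search skill. A hedged sketch with a trivial stand-in for the trained BERT model (the real script would load the registered model in `init()`):

```python
import json

model = None

def init():
    """Called once when the container starts; the real script loads the BERT model here."""
    global model
    model = lambda text: [("Contoso", "ORG")] if "Contoso" in text else []

def run(raw_data: str) -> str:
    """Called per request with the custom-skill JSON envelope as a string."""
    payload = json.loads(raw_data)
    values = []
    for record in payload["values"]:
        text = record["data"].get("text", "")
        values.append({
            "recordId": record["recordId"],
            "data": {"entities": model(text)},
        })
    return json.dumps({"values": values})

init()
response = run(json.dumps(
    {"values": [{"recordId": "1", "data": {"text": "Contoso grows"}}]}
))
```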

At this point, you'll want to have your Azure Machine Learning Workspace set up so that you can walk through these instructions to set up a notebook VM and clone this repo onto it.

Walks through the process of integrating the AML Skill with the search index and spinning up the user interface. At the end of this step, you'll have a Cognitive Search index containing the entities extracted using your BERT model.
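The final index needs fields for both the document content and the extracted entities; the entities field is typically a searchable, facetable string collection so results can be filtered by entity. A sketch of a minimal index body, with field names that are illustrative rather than necessarily the repo's:

```python
def build_index(name: str) -> dict:
    """Minimal Cognitive Search index body; field names are placeholders."""
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Entities extracted by the BERT custom skill land here.
            {"name": "entities", "type": "Collection(Edm.String)",
             "searchable": True, "facetable": True},
        ],
    }

index = build_index("km-ner-index")
```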