This repo walks through building a knowledge mining solution that enriches your data by identifying custom entities in a corpus of documents using a custom AI skill. We'll use capabilities in Azure Cognitive Search and Azure Machine Learning to extract entities from documents.
The solution will show you how to:
- Create a custom skill to label data for named entity recognition (NER)
- Create an enrichment pipeline with Azure Cognitive Search that integrates the label skill to create labeled data from a corpus of documents.
- Project the labeled data as a new dataset into the Cognitive Search Knowledge Store so that it can be used for training.
- Use the labeled data to train a Named Entity Recognition (NER) Model in Azure Machine Learning using a BERT model designed to extract entities from documents. The code used to train the model was derived from the NLP Recipes Repo.
- Integrate the BERT NER custom skill with Azure Cognitive Search to project the identified entities and content of each document into the knowledge store and the search index.
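Both custom skills in the steps above follow Cognitive Search's custom Web API skill contract: the search service POSTs a batch of records and expects a response that echoes each `recordId` with its enriched `data`. A minimal sketch of that shape (the field names under `data` and the placeholder enrichment are illustrative, not this repo's actual skill):

```python
# Sketch of the Cognitive Search custom Web API skill contract.
# The search service POSTs a batch of records; the skill must echo
# each recordId back with its enriched "data" payload.

def run_skill(request_body: dict) -> dict:
    """Apply a (placeholder) enrichment to every record in the batch."""
    results = []
    for record in request_body["values"]:
        text = record["data"].get("text", "")
        results.append({
            "recordId": record["recordId"],          # must match the input record
            "data": {"labeledText": text.upper()},   # placeholder enrichment
            "errors": [],
            "warnings": [],
        })
    return {"values": results}

request = {"values": [{"recordId": "0", "data": {"text": "contoso ltd"}}]}
response = run_skill(request)
```

The same request/response shape applies whether the skill runs as a Web App (the label skill) or behind an AKS endpoint (the BERT skill).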
This is designed to be used in conjunction with the Knowledge Mining Solution Accelerator. After you train and deploy the model, you can easily integrate the model with the solution accelerator to showcase the results in a Web App.
The directions provided in this guide assume you have a working understanding of Azure Machine Learning and Azure Cognitive Search. It's important to also understand the concept of Custom Skills, Knowledge Store and Projections in Cognitive Search. You'll also need access to an Azure Subscription.
This repo helps you to (1) label data for Named Entity Recognition (NER) and (2) train a model using that labeled data. Steps 1 through 3 walk you through the process of labeling the data, while Steps 4 and 5 walk you through training and deploying the model.
If you already have labeled data or want to use the sample data provided with this repo, you can skip ahead to Step 4.
First, deploy the necessary resources onto Azure:
| Description | ARM Template |
| --- | --- |
| Recommended: Deploy everything | |
| Deploy Cognitive Search, Cognitive Services, Container Registry, and Azure Machine Learning | |
| Deploy only Azure Machine Learning | |
Next, walk through the folders sequentially and follow the steps outlined in each README:
There are two custom skills that will be deployed to generate NER-labeled data. First, deploy a custom skill that labels data in CoNLL format based on a predetermined list of entities (see labels.json). The custom Label skill is implemented as a Flask API app, packaged into a Docker container, and deployed to a Web App for Containers.
Walks you through the process of integrating the custom skill with an Azure Cognitive Search index. You'll create the index and project the documents to the knowledge store using a Jupyter Notebook. Download the sample data to generate NER label data from the SampleData folder.
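The notebook wires the custom skill into the pipeline by defining a skillset through the Cognitive Search REST API. A hedged sketch of what that payload looks like (the skillset name, field mappings, and the skill URI are placeholders for your deployed label Web App, not values from this repo):

```python
# Sketch of a skillset payload referencing the custom label skill as a
# WebApiSkill. Names, paths, and the URI are illustrative placeholders.
import json

skillset = {
    "name": "ner-label-skillset",
    "description": "Calls the custom label skill to produce CoNLL data",
    "skills": [{
        "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
        "uri": "https://<your-label-app>.azurewebsites.net/api/label",
        "httpMethod": "POST",
        "context": "/document",
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "labeledText", "targetName": "labeledText"}],
    }],
}

# The notebook would PUT this to
# https://<search-service>.search.windows.net/skillsets/<name>?api-version=<version>
payload = json.dumps(skillset)
```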
Aggregates all the sentences from the different documents in the knowledge store into a single document that will be used to train the NER model.
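Conceptually, this step concatenates the per-document CoNLL fragments into one training file, with a blank line separating sentences (the CoNLL convention). A minimal sketch under that assumption (the function name and inputs are illustrative):

```python
# Sketch of the aggregation step: join the per-document CoNLL fragments
# projected to the knowledge store into one training file, separating
# sentences with a blank line (the CoNLL convention).

def aggregate_conll(fragments: list) -> str:
    """Each fragment is the CoNLL text for one document or sentence."""
    cleaned = [f.strip() for f in fragments if f.strip()]
    return "\n\n".join(cleaned) + "\n"

docs = ["Contoso ORG\nopened O", "Seattle LOC"]
training_file = aggregate_conll(docs)
```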
Trains a BERT model to extract entities from documents and then deploys the model to an AKS cluster using a custom skill format.
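One detail the training code must handle: BERT's WordPiece tokenizer splits words into subwords, so the word-level CoNLL labels have to be realigned to subword tokens. A common scheme (used in the original BERT NER recipe) keeps the label on the first piece and marks continuation pieces with a padding tag such as `X`. The toy tokenizer below stands in for a real WordPiece tokenizer; it is illustrative only:

```python
# Sketch of subword label alignment for BERT-style NER training.
# toy_wordpiece is a stand-in for a real WordPiece tokenizer.

def toy_wordpiece(word: str) -> list:
    """Pretend tokenizer: split every 4 characters, '##'-prefix the remainder pieces."""
    return [word[:4]] + [f"##{word[i:i + 4]}" for i in range(4, len(word), 4)]

def align_labels(words: list, labels: list):
    """Keep the word label on the first subword; pad continuation pieces with 'X'."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = toy_wordpiece(word)
        tokens.extend(pieces)
        aligned.extend([label] + ["X"] * (len(pieces) - 1))
    return tokens, aligned

tokens, aligned = align_labels(["Contoso", "wins"], ["ORG", "O"])
```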
At this point, you'll want to have your Azure Machine Learning Workspace set up so that you can walk through these instructions to set up a notebook VM and clone this repo onto it.
Walks through the process of integrating the AML Skill with the search index and spinning up the user interface. At the end of this step, you'll have a Cognitive Search index containing the entities extracted using your BERT model.
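Getting the extracted entities into both the knowledge store and the index relies on a knowledge store projection in the indexer's skillset definition. A hedged sketch of that configuration (the container name, source path, and connection string are placeholders; the notebook in this step defines the exact shape):

```python
# Sketch of a knowledge store projection that lands each document's
# extracted entities in a blob container. Container names, paths, and
# the connection string are illustrative placeholders.

knowledge_store = {
    "storageConnectionString": "<storage-connection-string>",
    "projections": [{
        "objects": [{
            "storageContainer": "ner-entities",
            "source": "/document/entities",   # enrichment output of the BERT skill
        }],
        "tables": [],
        "files": [],
    }],
}
```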