/govuk-content-metadata

GovNER: an encoder-based language model (RoBERTa) fine-tuned to perform Named Entity Recognition (NER) on GOV.UK content

🔍 GovNER 🧐 : extracting Named Entities from GOV.UK

Repository for the GovNER project.

GovNER systematically extracts key metadata from the content of the GOV.UK website. GovNER is an encoder-based language model (RoBERTa) that has been fine-tuned to perform Named Entity Recognition (NER) on "govspeak", the language(s) specific to the GOV.UK content estate.

The repository consists of 5 main stand-alone components, each contained in its own sub-directory and described in the sections that follow.

Tech Stack 🍒

  • Python
  • FastAPI / uvicorn
  • Docker
  • Google Cloud Platform (Compute Engine, Vertex AI, Workflows, Cloud Run, BigQuery, Cloud Storage, Scheduler)
  • GitHub Actions
  • bash

Named Entity Recognition (NER) and Entity Schema

Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique: a type of multi-class supervised machine-learning method that identifies 'entities' (real-world things such as people, organisations or events) in text and sorts them into categories.
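To make the "multi-class" framing concrete, here is a minimal sketch of one common way NER is posed: each token receives a tag (B- for the beginning of an entity, I- for inside, O for outside), and consecutive tags are grouped into entities. The BIO scheme, the function name bio_to_spans and the label strings PERSON and DATE are assumptions for illustration; they are not taken from the GovNER code.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-token BIO tags into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # Close any open entity, then open a new one.
            # An I- tag with no matching open entity is treated as a start.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity, so close the open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical tokens and tags, as a token classifier might emit them.
tokens = ["Contact", "Jane", "Smith", "before", "5", "April", "2023"]
tags = ["O", "B-PERSON", "I-PERSON", "O", "B-DATE", "I-DATE", "I-DATE"]
print(bio_to_spans(tokens, tags))
```

The classifier only ever chooses one tag per token; the entity boundaries fall out of the grouping step.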

The Named Entity Schema is the set of all entity types (i.e., categories) that the NER model is trained to extract, together with their definitions and annotation instructions. For GovNER, we built as much as possible on schema.org. Using an agile approach, delivery was broken down into three phases, corresponding to three sets of entity types, for which we fine-tuned separate NER models. We have so far completed two phases. Predictions from these models are combined at the inference stage.

Phase-1 entities

  • Money (amount)
  • Form (government forms)
  • Person
  • Date
  • Postcode
  • Email
  • Phone (number)

Phase-2 entities

  • Occupation
  • Role
  • Title
  • GPE
  • Location (non-GPE)
  • Facility
  • Organisation
  • Event
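Because the phase-1 and phase-2 entity types are extracted by separate models, their predictions have to be merged for each text. The sketch below shows one possible way to do that, keeping the longer span when two predictions overlap (similar in spirit to spaCy's spacy.util.filter_spans). The Entity type, the combine_predictions function, the label strings and the overlap rule are all assumptions for illustration; the repository may combine predictions differently.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def combine_predictions(phase1: List[Entity], phase2: List[Entity]) -> List[Entity]:
    """Merge two models' entities, preferring longer spans when they overlap."""
    # Consider the longest spans first; ties go to the earlier span.
    candidates = sorted(phase1 + phase2, key=lambda e: (-(e.end - e.start), e.start))
    kept: List[Entity] = []
    for ent in candidates:
        # Keep the span only if it does not overlap anything already kept.
        if all(ent.end <= k.start or ent.start >= k.end for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e.start)


def span(text: str, phrase: str, label: str) -> Entity:
    """Helper for the example: locate the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), label)


text = "Email Jane Smith at HM Revenue and Customs by 5 April 2023."

# Hypothetical outputs of the two models for the same text.
phase1 = [span(text, "Jane Smith", "PERSON"), span(text, "5 April 2023", "DATE")]
phase2 = [
    span(text, "HM Revenue and Customs", "ORGANISATION"),
    span(text, "Smith", "LOCATION"),  # spurious, overlaps the PERSON span
]

combined = combine_predictions(phase1, phase2)
for ent in combined:
    print(text[ent.start:ent.end], ent.label)
```

In this example the spurious "Smith" prediction is dropped because it lies inside the longer "Jane Smith" span.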

Daily 'new content only' inference pipeline 🚀

Complete code, requirements and documentation in inference_pipeline_new_content.

Inference pipeline scheduled to run daily to extract named entities from the content items on GOV.UK that substantially changed or were newly created the day before.

Vertex AI Batch Predictions are served via the HTTP POST method, as part of a scheduled Google Cloud Workflow.
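As a rough illustration of the "day before" selection described above, the sketch below filters a list of content items down to those whose last-update timestamp falls on the previous calendar day. The field names url and updated_at, the function name and the sample data are hypothetical, and the sketch does not attempt to judge whether a change was "substantial"; the actual pipeline's selection logic is documented in its own sub-directory.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Dict, List, Optional


def changed_the_day_before(
    items: List[Dict[str, str]], today: Optional[date] = None
) -> List[Dict[str, str]]:
    """Return the items whose 'updated_at' timestamp falls on the previous day."""
    if today is None:
        today = datetime.now(timezone.utc).date()
    yesterday = today - timedelta(days=1)
    return [
        item
        for item in items
        if datetime.fromisoformat(item["updated_at"]).date() == yesterday
    ]


# Hypothetical content items with ISO 8601 timestamps.
items = [
    {"url": "/example-page-a", "updated_at": "2023-04-05T09:30:00+00:00"},
    {"url": "/example-page-b", "updated_at": "2023-04-03T14:00:00+00:00"},
    {"url": "/example-page-c", "updated_at": "2023-04-05T23:59:59+00:00"},
]

selected = changed_the_day_before(items, today=date(2023, 4, 6))
print([item["url"] for item in selected])
```

Passing today explicitly keeps the function deterministic, which makes a scheduled job easier to re-run for a missed day.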

Serving the model in production via FastAPI and uvicorn 🦄

Complete code, requirements and documentation in fast_api_model_serving.

Containerised code to deploy and run an HTTP server to serve predictions via an API for our custom-trained fine-tuned NER models.
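When a model server runs as a Vertex AI custom container, prediction requests arrive as a JSON body with an "instances" list and the response carries a "predictions" list of the same length. The sketch below shows that request/response shape as a plain Python function that a FastAPI route could wrap. The handle_predict and toy_model names, the {"text": ...} instance layout and the regular-expression stand-in for the NER model are assumptions for illustration, not the repository's implementation.

```python
import re
from typing import Any, Dict, List

# Stand-in for the fine-tuned NER model: finds email-like strings only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def toy_model(text: str) -> List[Dict[str, Any]]:
    """Return entities as dictionaries with character offsets and a label."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "label": "EMAIL"}
        for m in EMAIL_PATTERN.finditer(text)
    ]


def handle_predict(body: Dict[str, Any]) -> Dict[str, Any]:
    """Map {"instances": [...]} to {"predictions": [...]}, one entry per instance."""
    instances = body.get("instances", [])
    return {"predictions": [toy_model(instance["text"]) for instance in instances]}


# Hypothetical request body, as it would arrive in an HTTP POST.
request_body = {
    "instances": [
        {"text": "Write to help@example.gov.uk for support."},
        {"text": "No entities here."},
    ]
}

response = handle_predict(request_body)
print(response)
```

Keeping the handler free of web-framework code makes it straightforward to unit test without starting a server.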

Bulk inference pipeline 🏋️

Complete code, requirements and documentation in bulk_inference_pipeline.

Inference pipeline to extract named entities from the whole GOV.UK content estate (in "bulk"). The pipeline is deployed in a Docker container onto a Virtual Machine (VM) instance with GPU on Google Compute Engine (GCE).

The bulk pipeline is intended to be run as a one-off whenever either the phase-1 or the phase-2 entity model is retrained and redeployed.

Training pipeline 🏃

Complete code, requirements and documentation in training_pipe.

Pipeline to fine-tune the encoder-style transformer roberta-base for custom NER on Google Vertex AI, using a custom container training workflow and spaCy Projects for the training application.

Annotation workflow 📝

Complete code, requirements and documentation in prodigy_annotation.

Containerised code to create an annotation environment for annotators, using the proprietary software Prodigy.

GovNER web app 💻

Complete code, requirements and documentation in src/ner_streamlit_app.

Containerised code to build the interactive web application aimed at helping prospective users understand how NER works via visualisation and user interaction.

Developing 🏗️

Where we refer to the root directory we mean where this README.md is located.

Requirements 🚧

In addition:

Credentials

Access to the project on Google Cloud Platform.

Python requirements and pre-commit hooks

To install the Python requirements and pre-commit hooks, open your terminal and enter:

make requirements

or, alternatively, to only install the necessary Python packages using pip:

pip install -r requirements.txt

To update the Python requirements file, add any new dependencies that your code imports directly to the requirements-original.txt file, and then run:

pip freeze -r requirements-original.txt > requirements.txt

Tests 🚦

Tests are run as part of a GitHub action.

To run the tests locally:

pytest

Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.