Repository for the GovNER project.
GovNER systematically extracts key metadata from the content of the GOV.UK website. GovNER is an encoder-based language model (RoBERTa) that has been fine-tuned to perform Named Entity Recognition (NER) on "govspeak", the language(s) specific of the GOV.UK content estate.
The repository consists of 5 main stand-alone components, each contained in their own sub-directory:
- Daily NER inference pipeline
- Bulk NER inference pipeline
- Model fine-tuning pipeline
- Model serving via REST API
- GovNER web app
- Python
- FastApi / uvicorn
- Docker
- Google Cloud Platform (Cloud Engine, Vertex AI, Workflows, Cloud Run, BigQuery, Cloud Storage, Scheduler)
- Github Actions
- bash
Named Entity Recognition (NER) is an Natural Language Processing (NLP) technique, a type of multi-class supervised machine-learning method that identifies and sorts 'entities', real-world things like people, organisations or events, from text.
The Named Entity Schema is the set of all entity types (i.e., categories) that the NER model is trained to extract, together with their definitions and annotation instructions. For GovNER, we built as much as possible on schema.org. Using an agile approach, delivery was broken down into 3 phases, corresponding to three sets of entity types, for which we fine-tuned separate NER models. We have so far completed 2 phases. Predictions from these models were combined at inference stage.
- Money (amount)
- Form (government forms)
- Person
- Date
- Postcode
- Phone (number)
- Occupation
- Role
- Title
- GPE
- Location (non-GPE)
- Facility
- Organisation
- Event
Complete code, requirements and documentation in inference_pipeline_new_content.
Inference pipeline scheduled to run daily to extract named entities from the content items on GOV.UK that substantially changed or were newly created the day before.
Vertex AI Batch Predictions are served via HTTP POST method, as part of a scheduled Google Cloud Workflow.
Complete code, requirements and documentation in fast_api_model_serving.
Containerised code to deploy and run an HTTP server to serve predictions vis API for our custom-trained fine-tuned NER models.
Complete code, requirements and documentation in bulk_inference_pipeline.
Inference pipeline to extract named entities from the whole GOV.UK content estate (in "bulk"). The pipeline is deployed in a Docker container onto a Virtual Machine (VM) instance with GPU on Google Compute Engine (GCE).
The bulk pipeline is intended to be executed as a one-off, if either of the phase-1 entity or phase-2 entity models is retrained and re-deployed.
Complete code, requirements and documentation in training_pipe.
Pipeline to fine-tune the encoder-style transformer roberta-base
for custom NER on Google Vertex AI, using a custom container training workflow and spaCy Projects for the training application.
Complete code, requirements and documentation in prodigy_annotation.
Containerised code to create an annotation environment for annotators, using the proprietary software Prodigy.
Complete code, requirements and documentation in src/ner_streamlit_app.
Containerised code to build the interactive web application aimed at helping prospective users understand how NER works via visualisation and user interaction.
Where we refer to the root directory we mean where this README.md is located.
- Git
- pre-commit
- Make
- gcloud CLI
- Docker
- Python (v3.9+) - Only for local development without Docker
In addition:
- a
.secrets
file in this repository's root directory - loaded environment variables from
.envrc
Access to the project on Google Cloud Platform.
To install the Python requirements and pre-commit hooks, open your terminal and enter:
make requirements
or, alternatively, to only install the necessary Python packages using pip:
pip install -r requirements.txt
To add to the Python requirement file, add any new dependencies actually imported in your code to the requirements-original.txt
file, and then run:
pip freeze -r requirements-original.txt > requirements.txt
Tests are run as part of a GitHub action.
To run test locally:
pytest
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is © Crown copyright and available under the terms of the Open Government 3.0 licence.