This project was developed to create a code template and to explore different Named Entity Recognition (NER) methods for custom entities. It includes training notebooks for building different kinds of custom NER models, as well as code to build a productionized NER API using standard MLOps practices.
- collect, clean, and annotate text data
- implement different methods of NER models
- build inference api
- create streamlit application
- write unit test cases and performance test cases
- code documentation
- code formatting
- code deployment using Docker and CircleCI
This code can be used for end-to-end NER project development as well as deployment.
If you are only looking to learn/use the model-building techniques, jump directly to the notebooks:
1. Custom NER using spaCy
2. Custom NER using transformers
3. Custom NER using a custom neural network in PyTorch
The basic code template for this project is derived from my other repo, code template.
The project covers the following phases of the ML project development lifecycle:
Requirement
Data Collection
Model Building
Inference
Testing
Deployment
Create an NER API which accepts a news article, or a sentence from one, and identifies the named entities defined by the business user: ORG, PLACE, PERSON, ROLE.
I've chosen these entities, but any new type of custom entity can be added and a model trained for it.
News article data is collected using my other repo, news api.
A sample of 250 sentences was then annotated using doccano, a data annotation tool. Please note that since this is just a demo project, we have not used a large dataset; only 250 sentences. In reality, the data might be much larger and any other data annotation technique can be used.
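A doccano JSONL export can be converted into the (text, annotations) tuples that spaCy's training pipeline expects. A minimal sketch, assuming a doccano-style export where each line has `text` and `label` fields (adjust the field names to match your actual export):

```python
import json

def doccano_to_spacy(jsonl_lines):
    """Convert doccano-exported JSONL lines into spaCy training tuples.

    Each line is assumed to look like:
    {"text": "...", "label": [[start, end, "CUSTOM_ORG"], ...]}
    """
    examples = []
    for line in jsonl_lines:
        record = json.loads(line)
        # character-offset spans: (start, end, label)
        entities = [(start, end, label) for start, end, label in record.get("label", [])]
        examples.append((record["text"], {"entities": entities}))
    return examples

sample = ['{"text": "Sundar Pichai leads Google.", '
          '"label": [[0, 13, "CUSTOM_PERSON"], [20, 26, "CUSTOM_ORG"]]}']
examples = doccano_to_spacy(sample)
```

Each resulting tuple can then be turned into a spaCy `Doc`/`DocBin` for training.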
Model techniques used:
1. spaCy
Create spaCy-formatted data and train using the spaCy library.
2. transformers
Create BILOU-format data at the sentence level (one sentence per row) and train using the Hugging Face transformers Trainer API.
3. custom neural network
Create BILOU-format data at the word level (one word per row). The BERT tokenizer adds extra tokens such as sub-words, CLS, and SEP. Can we treat this sequence of tokens for a single word as a sequence classification problem for NER?
I've explored this in the custom_ner_dl notebook.
However, I am not sure about this approach and need to do more research, so it is not included in the final models list.
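The BILOU conversion used by approaches 2 and 3 can be sketched as follows. This is a simplified illustration that tokenizes on whitespace and assumes entity boundaries align with token boundaries; the actual notebooks may tokenize differently:

```python
def spans_to_bilou(text, entities):
    """Turn character-offset entity spans into BILOU tags over
    whitespace tokens. B=begin, I=inside, L=last, O=outside, U=unit."""
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)   # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            if start == ent_start and end == ent_end:
                tag = f"U-{label}"       # single-token entity
            elif start == ent_start and end < ent_end:
                tag = f"B-{label}"       # first token of a multi-token entity
            elif start > ent_start and end == ent_end:
                tag = f"L-{label}"       # last token
            elif ent_start < start and end < ent_end:
                tag = f"I-{label}"       # middle token
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

tokens, tags = spans_to_bilou(
    "Sundar Pichai leads Google",
    [(0, 13, "CUSTOM_PERSON"), (20, 26, "CUSTOM_ORG")],
)
```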
Model Name | Library | Description | Named Entities |
---|---|---|---|
custom_ner_spacy | spaCy | Train spaCy for custom NER | CUSTOM_ORG, CUSTOM_PERSON, CUSTOM_PLACE, CUSTOM_ROLE |
custom_ner_transformers | transformers | Fine-tune DistilBertForTokenClassification for custom NER in BILOU format | B-CUSTOM_ORG, I-CUSTOM_ORG, L-CUSTOM_ORG, U-CUSTOM_ORG, B-CUSTOM_PERSON, I-CUSTOM_PERSON, L-CUSTOM_PERSON, U-CUSTOM_PERSON, B-CUSTOM_ROLE, I-CUSTOM_ROLE, L-CUSTOM_ROLE, U-CUSTOM_ROLE, B-CUSTOM_PLACE, I-CUSTOM_PLACE, L-CUSTOM_PLACE, U-CUSTOM_PLACE |
How do we handle imbalanced data?
In BILOU format, O stands for non-entity, and most words are non-entities, so our dataset is imbalanced. We therefore use class weights to handle this.
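One common way to derive such class weights is inverse frequency, as in scikit-learn's "balanced" heuristic. A sketch (the actual notebook may compute weights differently); the resulting weights would typically be passed to a weighted cross-entropy loss during training:

```python
from collections import Counter

def compute_class_weights(tag_sequences):
    """Inverse-frequency class weights: rare entity tags get larger
    weights than the dominant 'O' tag."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    n_classes = len(counts)
    # weight_c = total / (n_classes * count_c)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}

weights = compute_class_weights([
    ["O", "O", "O", "U-CUSTOM_ORG"],
    ["O", "O", "B-CUSTOM_PERSON", "L-CUSTOM_PERSON"],
])
```

Here the frequent `O` tag receives a small weight while each rare entity tag receives a larger one, so misclassifying entities costs more during training.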
There are 2 ways to deploy this application.
- API using FastAPI.
- Streamlit application
Unit test cases are written
Deployment is done locally using docker.
Like any production code, this code is organized in the following way:
- Keep all requirement-gathering documents in the docs folder.
- Keep data collection and exploration notebooks in the src/training folder: data_collection_eda.ipynb
- Keep datasets in data folder.
Raw data is kept in a raw_data CSV. Cleaned sentences are stored in sentences_clean_data.jsonl.
Annotated sentences, stored in 250_sentences_annotated_data.jsonl, are used for spaCy training.
Annotated data for transformers training is stored in BIOUL_data.csv.
- Keep model building notebooks in the src/training folder.
- Keep generated model files at src/models.
- Write and keep inference code in src/inference.
- Write Logging and configuration code in src/utility.
- Write unit test cases in the tests folder (pytest, pytest-cov).
- Write performance test cases in the tests folder (locust).
- Build a docker image (Docker).
- Use and configure a code formatter (black).
- Use and configure a code linter (pylint).
- Use CircleCI for CI/CD.
Clone this repo locally and add/update/delete as per your requirement.
Since we have used design patterns like singleton and factory, it is easy to add or remove a model in this code. You can remove the code files for all models except the one you want to keep as final.
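As an illustration of how the factory and singleton patterns make models swappable, here is a minimal sketch; the class and registry names are hypothetical, not the project's actual code:

```python
class ModelFactory:
    """Factory + singleton sketch: models are registered by name,
    built once on first request, and cached for reuse."""
    _registry = {}    # name -> model class
    _instances = {}   # name -> cached instance (singleton per model)

    @classmethod
    def register(cls, name):
        def decorator(model_cls):
            cls._registry[name] = model_cls
            return model_cls
        return decorator

    @classmethod
    def get(cls, name):
        if name not in cls._instances:           # build only once
            cls._instances[name] = cls._registry[name]()
        return cls._instances[name]

@ModelFactory.register("custom_ner_spacy")
class SpacyNerModel:
    def predict(self, text):
        return []  # placeholder for real inference

model_a = ModelFactory.get("custom_ner_spacy")
model_b = ModelFactory.get("custom_ner_spacy")
```

Removing a model then only means deleting its class and registration line; the rest of the code keeps looking models up by name.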
Please note that this template is in no way complete or the best way for your project structure.
This template is just to get you started quickly with almost all basic phases covered in creating production ready code.
├── README.md <- top-level README for developers using this project.
├── pyproject.toml <- black code formatting configurations.
├── .dockerignore <- Files to be ignored in docker image creation.
├── .gitignore <- Files to be ignored in git check in.
├── .circleci/config.yml <- Circleci configurations
├── .pylintrc <- Pylint code linting configurations.
├── Dockerfile <- A file to create docker image.
├── environment.yml <- stores all the dependencies of this project
├── main.py <- A main file to run API server.
├── main_streamlit.py <- A main file to run the Streamlit application.
├── src <- Source code files to be used by project.
│ ├── inference <- model output generator code
│ ├── model <- model files
│ ├── training <- model training code
│ ├── utility <- contains utility and constant modules.
├── logs <- log file path
├── config <- config file path
├── data <- datasets files
├── docs <- documents from requirements, team collaboration etc.
├── tests <- unit and performance test cases files.
│ ├── cov_html <- Unit test cases coverage report
Development Environment used to create this project:
Operating System: Windows 10 Home
Anaconda: 4.8.5 (Anaconda installation)
Go to location of environment.yml file and run:
conda env create -f environment.yml
Here we have created an ML inference FastAPI server with dummy model output.
- Go inside 'custom_ner_api' folder on command line.
Run:
conda activate custom_ner_api
python main.py
Open 'http://localhost:5000/docs' in a browser.
- Or to start Streamlit application
- Run:
conda activate custom_ner_api
streamlit run main_streamlit.py
- Go inside 'tests' folder on command line.
- Run:
pytest -vv
pytest --cov-report html:tests/cov_html --cov=src tests/
- Open 2 terminals and start the main application in one terminal:
python main.py
- In the second terminal, go inside the 'tests' folder on the command line.
- Run:
locust -f locust_test.py
- Go inside 'custom_ner_api' folder on command line.
- Run:
black src
- Go inside 'custom_ner_api' folder on command line.
- Run:
pylint src
- Go inside 'custom_ner_api' folder on command line.
- Run:
docker build -t myimage .
docker run -d --name mycontainer -p 5000:5000 myimage
- Add project on circleci website then monitor build on every commit.
1. custom_ner_dl is not complete and needs more research, so it is not included in the final inference. The notebook is present, however.
2. Models are not checked in because of their size. You can generate the models by running the corresponding notebooks.
3. You'll need a news api key to get news data, so create one and update the API key in the data_collection notebook.
Please create a pull request for any change.
NOTE: This software depends on other packages that are licensed under different open source licenses.