This project leverages BERT for Named Entity Recognition (NER) on a medical dataset. The notebook provides a step-by-step guide, from dataset preparation to fine-tuning and saving the trained model for medical NER tasks.
-
Dataset Loading
Upload and load your medical dataset in JSON format:

    from google.colab import files
    import json

    # Upload and load JSON data
    uploaded = files.upload()
    file_name = list(uploaded.keys())[0]
    with open(file_name, 'r') as f:
        dataset = json.load(f)
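The later steps read `item["sentence"]` and `item["entities"]` from every record, so it can help to check the file's shape right after loading. The sketch below is a hypothetical validator, not part of the notebook. It assumes the JSON file holds a list of records and that each entity carries `"text"` and `"label"` keys, as in the entity-level output shown under NER Inferences; adjust the key names if your dataset differs.

```python
import json


def load_ner_dataset(path):
    """Load a JSON list of records and check the keys used by later steps."""
    with open(path, "r", encoding="utf-8") as f:
        dataset = json.load(f)

    if not isinstance(dataset, list):
        raise ValueError("expected a JSON list of records")

    for i, item in enumerate(dataset):
        if "sentence" not in item or "entities" not in item:
            raise ValueError(f"record {i} lacks 'sentence' or 'entities'")
        for entity in item["entities"]:
            # Assumed entity schema: {"text": ..., "label": ...}
            if "text" not in entity or "label" not in entity:
                raise ValueError(f"record {i} has an entity without 'text' or 'label'")
    return dataset
```

Outside Colab, calling `load_ner_dataset("your_file.json")` can stand in for the upload cell.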
-
Tokenization and BIO Tagging
Use BERT's tokenizer to tokenize sentences and assign BIO tags to each token:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    def tokenize_and_create_labels(sentence, entities, tokenizer):
        # Tokenize and label tokens
        tokenized_sentence = tokenizer.tokenize(sentence)
        labels = ['O'] * len(tokenized_sentence)
        # Assign 'B-' and 'I-' labels as per BIO tagging format
        # (Sample logic for locating and labeling entities)
        return tokenized_sentence, labels

    # Process each sentence in the dataset
    tokenized_data = []
    for item in dataset:
        tokens, labels = tokenize_and_create_labels(item["sentence"], item["entities"], tokenizer)
        tokenized_data.append({"tokens": tokens, "labels": labels})
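The labeling step inside `tokenize_and_create_labels` is left as a placeholder, so here is one possible way to fill it in. This is a minimal sketch under stated assumptions, not the notebook's actual logic: each entity is assumed to be a dict with `"text"` and `"label"` keys, and an entity is located by tokenizing its text and searching for that token sequence in the sentence. `simple_tokenize` and `bio_label_tokens` are hypothetical names; in the notebook you would pass `tokenizer.tokenize` as the `tokenize` argument.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bio_label_tokens(sentence, entities, tokenize=simple_tokenize):
    """Return (tokens, labels), tagging the first match of each entity with B-/I-."""
    tokens = tokenize(sentence)
    labels = ["O"] * len(tokens)

    for entity in entities:
        entity_tokens = tokenize(entity["text"])
        n = len(entity_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_matches = tokens[start:start + n] == entity_tokens
            window_is_free = all(tag == "O" for tag in labels[start:start + n])
            if window_matches and window_is_free:
                labels[start] = "B-" + entity["label"]
                for i in range(start + 1, start + n):
                    labels[i] = "I-" + entity["label"]
                break  # label only the first unlabeled occurrence
    return tokens, labels


tokens, labels = bio_label_tokens(
    "Mrs. Williams attends physical therapy.",
    [
        {"text": "Mrs. Williams", "label": "Person"},
        {"text": "physical therapy", "label": "Service"},
    ],
)
for token, label in zip(tokens, labels):
    print(token, label)
```

Matching on token sequences rather than character offsets keeps the sketch short, but it labels only one occurrence per entity entry; a dataset with character offsets would allow exact alignment.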
-
Tensor Conversion
Convert tokenized data to tensors with padding for uniform input size:

    import torch
    from sklearn.model_selection import train_test_split

    # Convert tokens to IDs, create attention masks, and pad inputs
    def convert_to_tensors(tokenized_data, tokenizer, max_length=128):
        # Example function for tensor conversion and padding
        # (body omitted: it must build input_ids, attention_masks, and label_ids)
        return input_ids, attention_masks, label_ids

    input_ids, attention_masks, label_ids = convert_to_tensors(tokenized_data, tokenizer)

    # train_test_split returns a (train, test) pair for each array, in argument order
    train_inputs, test_inputs, train_labels, test_labels, train_masks, test_masks = train_test_split(
        input_ids, label_ids, attention_masks, test_size=0.2
    )
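Since the body of `convert_to_tensors` is not shown, the following sketch illustrates the padding and masking it would need to perform. It is an assumption-laden illustration rather than the notebook's code: tokens and labels are assumed to be integer IDs already (for instance via `tokenizer.convert_tokens_to_ids` and a label-to-ID dictionary), `pad_example` and `pad_batch` are hypothetical helpers, and the `[CLS]`/`[SEP]` special tokens are left out for brevity. Padded label positions use `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so they do not contribute to training.

```python
def pad_example(token_ids, label_ids, max_length=128, pad_token_id=0, ignore_label_id=-100):
    """Truncate or pad one example and build its attention mask."""
    if len(token_ids) != len(label_ids):
        raise ValueError("token_ids and label_ids must have the same length")

    token_ids = list(token_ids[:max_length])
    label_ids = list(label_ids[:max_length])
    n_pad = max_length - len(token_ids)

    input_ids = token_ids + [pad_token_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    padded_labels = label_ids + [ignore_label_id] * n_pad
    return input_ids, attention_mask, padded_labels


def pad_batch(examples, max_length=128):
    """Pad every (token_ids, label_ids) pair to the same length."""
    input_ids, attention_masks, label_ids = [], [], []
    for token_ids, labels in examples:
        ids, mask, padded = pad_example(token_ids, labels, max_length)
        input_ids.append(ids)
        attention_masks.append(mask)
        label_ids.append(padded)
    return input_ids, attention_masks, label_ids


# Made-up IDs, purely for illustration
batch = pad_batch([([5, 6, 7], [1, 2, 0]), ([8], [0])], max_length=4)
print(batch)
```

Each of the three returned lists is rectangular, so it can be wrapped with `torch.tensor(...)` before the train/test split.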
-
Model Training
Fine-tune the BERT model on the medical dataset using Hugging Face’s Trainer. The snippet assumes that unique_labels, train_dataset, and eval_dataset were defined in earlier cells:

    from transformers import BertForTokenClassification, Trainer, TrainingArguments

    model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(unique_labels))

    training_args = TrainingArguments(
        output_dir='./results',
        # Evaluate at the end of every epoch
        # (newer transformers releases rename this argument to eval_strategy)
        evaluation_strategy="epoch",
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=15,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=100,
    )

    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
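The training snippet uses `unique_labels`, `train_dataset`, and `eval_dataset` without showing how they are built. The sketch below is one plausible construction, offered as an assumption rather than the notebook's code. `build_label_vocab` and `TokenClassificationDataset` are hypothetical names. The dictionary keys `input_ids`, `attention_mask`, and `labels` match the argument names of `BertForTokenClassification.forward`. To stay runnable without PyTorch, the class does not subclass `torch.utils.data.Dataset`; in the notebook you would typically subclass it and pass tensors.

```python
def build_label_vocab(tokenized_data):
    """Collect the distinct BIO tags and map each one to an integer ID."""
    unique_labels = sorted({label for item in tokenized_data for label in item["labels"]})
    label2id = {label: i for i, label in enumerate(unique_labels)}
    return unique_labels, label2id


class TokenClassificationDataset:
    """Map-style dataset that yields one dict of model inputs per example."""

    def __init__(self, input_ids, attention_masks, label_ids):
        if not (len(input_ids) == len(attention_masks) == len(label_ids)):
            raise ValueError("all three inputs must hold the same number of examples")
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.label_ids = label_ids

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_masks[idx],
            "labels": self.label_ids[idx],
        }


# Tiny made-up example
unique_labels, label2id = build_label_vocab(
    [{"tokens": ["hip", "surgery", "."], "labels": ["B-MedicalProcedure", "I-MedicalProcedure", "O"]}]
)
demo_dataset = TokenClassificationDataset([[5, 6, 7]], [[1, 1, 1]], [[0, 1, 2]])
print(unique_labels, label2id, demo_dataset[0])
```

With the split from the Tensor Conversion step, `train_dataset` could then be `TokenClassificationDataset(train_inputs, train_masks, train_labels)`, and `eval_dataset` the same with the test arrays.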
-
Saving the Model
Save the trained model weights for future use:

    model_save_path = "model.pt"
    torch.save(model.state_dict(), model_save_path)
    print(f"Model saved as {model_save_path}")
-
Evaluation Metrics
| Metric | BERT Model |
|---|---|
| Accuracy | 96% |
| Precision (Average) | 96% |
| Recall (Average) | 96% |
| F1 Score (Average) | 96% |

-
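The Evaluation Metrics table reports averaged precision, recall, and F1 without saying which average was used or how the scores were computed. As an illustration only, the sketch below computes token-level accuracy together with macro-averaged precision, recall, and F1 from two flat lists of tags. Macro averaging and the helper name `token_metrics` are assumptions; scikit-learn's `precision_recall_fscore_support` with `average="macro"` is an alternative to this hand-rolled version.

```python
def token_metrics(y_true, y_pred):
    """Token-level accuracy plus macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []

    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1_scores) / len(labels),
    }


# Made-up tags, purely for illustration
y_true = ["O", "B-Person", "O", "O"]
y_pred = ["O", "B-Person", "B-Person", "O"]
metrics = token_metrics(y_true, y_pred)
print(metrics)
```

Padding positions should be filtered out before calling the function, otherwise they inflate the scores.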
NER Inferences
Enter a sentence for entity recognition: Mrs. Williams attends physical therapy to improve her mobility after hip replacement surgery.

Token-Level Predictions:

    [
      { "token": "Mrs", "label": "B-Person" },
      { "token": ".", "label": "I-Person" },
      { "token": "Williams", "label": "I-Person" },
      { "token": "attends", "label": "O" },
      { "token": "physical", "label": "B-Service" },
      { "token": "therapy", "label": "I-Service" },
      { "token": "to", "label": "O" },
      { "token": "improve", "label": "B-Outcome" },
      { "token": "her", "label": "O" },
      { "token": "mobility", "label": "B-Outcome" },
      { "token": "after", "label": "O" },
      { "token": "hip", "label": "B-MedicalProcedure" },
      { "token": "replacement", "label": "I-MedicalProcedure" },
      { "token": "surgery", "label": "I-MedicalProcedure" },
      { "token": ".", "label": "O" }
    ]

Entity-Level JSON Output:

    {
      "sentence": "Mrs. Williams attends physical therapy to improve her mobility after hip replacement surgery.",
      "entities": [
        { "text": "Mrs . Williams", "label": "Person" },
        { "text": "physical therapy", "label": "Service" },
        { "text": "improve", "label": "Outcome" },
        { "text": "mobility", "label": "Outcome" },
        { "text": "hip replacement surgery", "label": "MedicalProcedure" }
      ]
    }
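The step from token-level predictions to entity-level output can be sketched as follows. This is a minimal illustration, not necessarily how the notebook does it: `group_entities` is a hypothetical helper that starts a new entity at each `B-` tag, extends it while `I-` tags of the same type follow, and joins tokens with single spaces (which reproduces the "Mrs . Williams" spacing shown above). An `I-` tag that does not continue the current entity is treated leniently as the start of a new one, and WordPiece `##` continuation pieces are not merged.

```python
def group_entities(predictions):
    """Merge token-level BIO predictions into entity-level spans."""
    entities = []
    tokens, label = [], None

    def close():
        # Emit the entity collected so far, if any, and reset the buffer
        nonlocal tokens, label
        if tokens:
            entities.append({"text": " ".join(tokens), "label": label})
        tokens, label = [], None

    for item in predictions:
        tag = item["label"]
        if tag == "O":
            close()
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == label:
            tokens.append(item["token"])  # continue the current entity
        else:
            close()  # a B- tag, or an I- tag with no matching open entity
            tokens, label = [item["token"]], entity_type
    close()
    return entities


predictions = [
    {"token": "Mrs", "label": "B-Person"},
    {"token": ".", "label": "I-Person"},
    {"token": "Williams", "label": "I-Person"},
    {"token": "attends", "label": "O"},
    {"token": "physical", "label": "B-Service"},
    {"token": "therapy", "label": "I-Service"},
    {"token": ".", "label": "O"},
]
entities = group_entities(predictions)
print(entities)
```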
- Python 3.x
- Libraries: transformers, torch, scikit-learn
- Install Requirements
pip install transformers torch scikit-learn
- Upload the Dataset
Place your JSON dataset in the same directory or upload it in the notebook.
- Run the Notebook
Execute each cell sequentially to preprocess data, train, and save the model.
- Hugging Face transformers
- PyTorch for tensor manipulation