Insurance Q&A Intent Classification with Databricks & Hugging Face


TL;DR: this repo contains code that showcases the process of:

  • Ingesting data related to Insurance questions and answers (Insurance QA Dataset) into Delta Lake
  • Basic cleaning and preprocessing
  • Creating a custom PyTorch Lightning DataModule and LightningModule to wrap our dataset and our backbone model (distilbert_en_uncased), respectively
  • Training on multiple GPUs while logging metrics to MLflow and registering model assets in the Databricks Model Registry
  • Running inference on both a single node and multiple nodes
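
To make the "basic cleaning and preprocessing" step more concrete, here is a minimal sketch in plain Python. It is not the repo's actual code: the function names (`clean_text`, `preprocess`), the cleaning rules, and the sample rows and intent labels are all assumptions chosen for illustration. It lowercases the text (a reasonable default for an uncased backbone), collapses whitespace, drops empty and duplicate rows, and maps intent labels to integer ids that a classifier head could be trained on.

```python
import re
from typing import Dict, List, Tuple


def clean_text(text: str) -> str:
    """Lowercase, drop non-printable characters, and collapse whitespace."""
    text = text.lower()
    # Keep printable characters and whitespace (newlines, tabs) only.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def preprocess(
    rows: List[Tuple[str, str]]
) -> Tuple[List[Tuple[str, int]], Dict[str, int]]:
    """Clean (question, intent) rows and encode intents as integer ids.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    cleaned: List[Tuple[str, str]] = []
    seen = set()
    for question, intent in rows:
        q = clean_text(question)
        label = intent.strip().lower()
        # Skip rows with missing text or label, and exact duplicates.
        if not q or not label or (q, label) in seen:
            continue
        seen.add((q, label))
        cleaned.append((q, label))

    # Sort labels so the label-to-id mapping is deterministic across runs.
    label2id = {
        label: idx
        for idx, label in enumerate(sorted({label for _, label in cleaned}))
    }
    examples = [(q, label2id[label]) for q, label in cleaned]
    return examples, label2id


# Made-up sample rows; the real dataset's columns and labels may differ.
rows = [
    ("  What does  my HOME policy cover? ", "home-insurance"),
    ("what does my home policy cover?", "Home-Insurance"),
    ("", "auto-insurance"),
    ("Is a rental car covered?\n", "auto-insurance"),
]
examples, label2id = preprocess(rows)
```

On a cluster, the same logic would more likely be expressed as Spark DataFrame transformations over the Delta table; the plain-Python version is only meant to show the shape of the step.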

Additional References

  1. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou. Applying Deep Learning to Answer Selection: A Study and An Open Task
  2. Fine-tune Transformers Models with PyTorch Lightning
  3. PyTorch Lightning MLflow Logger
  4. dbx by Databricks Labs