MathBERT

MathBERT is a BERT model pre-trained on the following mathematics text:

  • pre-K to high school math curriculum from engageny.org
  • G6-8 math curriculum from utahmiddleschoolmath.org
  • G6 to high school math from illustrativemathematics.org
  • high school to college math textbooks from openculture.com
  • G6-8 math curriculum from ck12.org
  • college to graduate-level MOOC math course syllabi from classcentral.com
  • math paper abstracts from arxiv.org

MathBERT has its own vocabulary (mathVocab), built with the BertTokenizer to best match the training corpus. For comparison, we also trained MathBERT with the original BERT vocabulary (baseVocab). Both models are uncased.
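
As a rough illustration of how a custom WordPiece vocabulary such as mathVocab can be built from a math corpus with the Hugging Face tokenizers library, here is a minimal sketch; the corpus file name, vocabulary size, and output directory are placeholders, not the exact settings we used.

from tokenizers import BertWordPieceTokenizer

# Placeholder corpus file and settings; not the exact ones used for mathVocab.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=['math_corpus.txt'], vocab_size=30522, min_frequency=2)

# Writes vocab.txt, which BertTokenizer can load directly.
tokenizer.save_model('mathvocab')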

Downloading Trained Models

We release both TensorFlow and PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the original BERT release from Google Research, and the PyTorch version is created with the Hugging Face Transformers library. A short usage sketch follows the download instructions below.

  • TensorFlow download
    • note: to download the mathbert-mathvocab version, change the model name to mathbert-mathvocab-uncased in the commands below
    wget http://tracy-nlp-models.s3.amazonaws.com/mathbert-basevocab-uncased/bert_config.json
    wget http://tracy-nlp-models.s3.amazonaws.com/mathbert-basevocab-uncased/vocab.txt
    wget http://tracy-nlp-models.s3.amazonaws.com/mathbert-basevocab-uncased/bert_model.ckpt.index
    wget http://tracy-nlp-models.s3.amazonaws.com/mathbert-basevocab-uncased/bert_model.ckpt.meta
    wget http://tracy-nlp-models.s3.amazonaws.com/mathbert-basevocab-uncased/bert_model.ckpt.data-00000-of-00001
    
  • PyTorch download
from transformers import AutoTokenizer, AutoModel

# baseVocab version
tokenizer = AutoTokenizer.from_pretrained('tbs17/MathBERT')
model = AutoModel.from_pretrained('tbs17/MathBERT')

# mathVocab (custom vocabulary) version
tokenizer = AutoTokenizer.from_pretrained('tbs17/MathBERT-custom')
model = AutoModel.from_pretrained('tbs17/MathBERT-custom')
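
As a quick sanity check after loading, the following minimal sketch encodes a math sentence and inspects the resulting contextual embeddings; the example sentence is an arbitrary placeholder.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('tbs17/MathBERT')
model = AutoModel.from_pretrained('tbs17/MathBERT')

# Encode an arbitrary math sentence and run a forward pass.
inputs = tokenizer('the derivative of x squared is 2x', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)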

Pretraining and fine-tuning

The pretraining code is located in /mathbert/, and the fine-tuning notebook is at /scripts/MathBERT_finetune.ipynb. Unfortunately, we can't release the fine-tuning data set per the data owner's request. All the packages we use are listed in the requirements.txt file.
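
Since the fine-tuning data can't be shared, here is a minimal fine-tuning sketch assuming a generic sentence-classification task; the toy texts, labels, and hyperparameters are placeholders, not the settings used in our experiments.

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy placeholder data; replace with your own fine-tuning set.
texts = ['solve 2x + 3 = 7 for x', 'the slope of a horizontal line is zero']
labels = [0, 1]

tokenizer = AutoTokenizer.from_pretrained('tbs17/MathBERT')
model = AutoModelForSequenceClassification.from_pretrained('tbs17/MathBERT', num_labels=2)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir='mathbert-finetuned', num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=ToyDataset(texts, labels))
trainer.train()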