mm_extraction

This repository contains the code and pointers to the trained models used to extract theorems and proofs from scientific articles.


Multimodal Machine Learning for Extraction of Theorems and Proofs in the Scientific Literature



This repository provides code, data, and models supplementing the research article Multimodal Machine Learning for Extraction of Theorems and Proofs in the Scientific Literature by Shrey Mishra, Antoine Gauquier and Pierre Senellart.


This work is part of the TheoremKB project.

What is Multimodal Extraction 🤔❓

The goal of this implementation is to provide access to trained machine learning models that can extract theorems and proofs from raw PDF articles. The classifiers cover several modalities, including text-, vision- and font-based approaches, as well as a hybrid multimodal approach that looks at all three modalities jointly to better predict the label of a paragraph.


Our classifiers can also exploit sequential information from the paper (e.g., the label of the previous block, the vertical/horizontal distance from the previous block, the page number, etc.), which plays a crucial role in deciding the label of a paragraph.

class_names:

  1. BASIC (Neither Proof nor Theorem)
  2. Proof
  3. Theorem
  4. Overlap between Proof and Theorem
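
The integer indices predicted by the classifiers (e.g., the argmax printed in the code examples below) map back to these class names. Here is a minimal sketch, assuming the index order follows the list above (verify against the model card of the model you use):

# Hypothetical index-to-label mapping -- the ordering is an assumption, check the model card.
class_names = {
    0: "BASIC",    # neither proof nor theorem
    1: "Proof",
    2: "Theorem",
    3: "Overlap",  # overlap between proof and theorem
}

print(class_names[2])  # e.g. a predicted argmax of 2 would mean "Theorem"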

Using finetuned models on several modalities

To use these models, first select the type of model from the list of models in the table 👇 and then download it from the link provided below:

| Model name 🚀 | Modality | Size 🏋️‍♂️ | Mean Accuracy (%) 📈 | Mean F1 Score (%) 🎯 | Download ⬇️ |
|---|---|---|---|---|---|
| Roberta-pretrained_from_scratch-ft | Text 💬 | 498.9 MB | 76.45 | 72.33 | Download from 🤗 |
| Scibert-ft | Text 💬 | 440.9 MB | 76.89 | 71.33 | Download from 🤗 |
| LSTM | Font Seq 🖊️ | 21 MB | 65.00 | 45.00 | Download from 🤗 |
| Bi-LSTM | Font Seq 🖊️ | 22 MB | 68.26 | 45.66 | Download from 🤗 |
| EfficientnetB4 | Vision 👁️ | 211.5 MB | 68.46 | 54.33 | Download from 🤗 |
| EfficientNetV2m | Vision 👁️ | 638.3 MB | 69.43 | 60.33 | Download from 🤗 |
| GMU model 🔱 | Text 💬 + Font Seq 🖊️ + Vision 👁️ | 783.5 MB | 76.86 | 73.87 | Download from 🤗 |
| CRF-GMU model 🚀 | Sequential blocks of GMU embeddings 🔗 | 76 KB | 84.38 | 83.01 | Download from 🤗 |
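
For models distributed as raw weight files (for example the LSTM/Bi-LSTM font models or the CRF-GMU weights), one possible way to fetch them is with the huggingface_hub client; the repository id and filename below are placeholders to adapt to the model you picked from the table:

# Sketch: download a raw weight file from the Hugging Face Hub.
# "your-org/your-model" and "lstm_font.h5" are placeholder names, not verified repo/file ids.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(repo_id="your-org/your-model", filename="lstm_font.h5")
print(local_path)  # local cache path of the downloaded file

The transformer-based text models can instead be loaded directly by name with the transformers library, as in the following example:
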
# For the NLP-based (text) model
import numpy as np
import tensorflow as tf

from transformers import AutoTokenizer, AutoModel, utils
from transformers import TFAutoModelForSequenceClassification

# Choose a model path (Hugging Face Hub id)
load_path = "InriaValda/roberta_from_scratch_ft"

# Load the tokenizer and the finetuned classifier
tokenizer = AutoTokenizer.from_pretrained(load_path)
loaded_model = TFAutoModelForSequenceClassification.from_pretrained(load_path, output_attentions=True)

# Example paragraph (raw text extracted from a PDF)
sample1 = """Proof. A feasible solution to this linear
program will define (setting p i = e x i )
a sequence p = (p 1 , . . . , p n ) ∈ (0, 1] n such that"""

input_text_tokenized = tokenizer.encode(sample1,
                                        truncation=True,
                                        padding=True,
                                        return_tensors="tf")

print(input_text_tokenized)

prediction = loaded_model(input_text_tokenized)

prediction_logits = prediction[0]
prediction_probs = tf.nn.softmax(prediction_logits,axis=1).numpy()

np.set_printoptions(suppress=True)
print(f'The prediction probs are: {prediction_probs}')
print("rounded label(argmax) :{}".format(np.argmax(prediction_probs)))
# Font sequence based model
import pickle

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

font_seq= 'cmr10 cmr10 cmr10 cmmi10 cmmi8 cmr10 cmr10'

filepath = "tokenizer_52000_v1.pkl"

#load font encoder
with open(filepath, 'rb') as f:
    tokenizer = pickle.load(f)

val_tokenized_train = tokenizer.texts_to_sequences([font_seq])  # expects a list of strings

max_length = 1000  # padding length
tokenized_font = pad_sequences(val_tokenized_train, maxlen=max_length)

#load model
model_path="lstm_font.h5"
model = tf.keras.models.load_model(model_path)

predictions = model.predict(tokenized_font)

y_pred = np.argmax(predictions, axis=1)

print(y_pred)
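
For the vision models, a possible inference sketch is shown below. It assumes the EfficientNet checkpoint has been downloaded locally as a Keras model and that block images are inverted and brought to the 400 × 1400 input size mentioned in the data pipeline section; the file names and exact preprocessing are assumptions, not verified:

# Vision based model (sketch -- file names and input size are assumptions)
import numpy as np
import tensorflow as tf

model_path = "efficientnet_b4.h5"  # placeholder for the downloaded vision checkpoint
vision_model = tf.keras.models.load_model(model_path)

# Load a rendered block image, invert it and bring it to the expected input size
img = tf.keras.utils.load_img("block.png", color_mode="rgb")  # placeholder image path
img = tf.keras.utils.img_to_array(img)
img = 255.0 - img                                # invert, as described for the training data
img = tf.image.resize_with_pad(img, 400, 1400)   # assumed (height, width) order

probs = vision_model.predict(np.expand_dims(img, axis=0))
print(np.argmax(probs, axis=1))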

Dataset Overview

  1. Pretraining data 📚: The dataset contains ~196k papers collected via the arXiv bulk API. The list of all arXiv ids used for the pretraining part is given in ./assets/pretrained_language_models/pretraining_code/pretrain_ids.text. 🔥🔥 Try the pretraining demo (MLM task) at 🤗 👉 Open in Spaces

  2. Finetuning data 🔍: This section contains the list of arXiv ids for the PDFs, divided into training and validation sets, which are used for finetuning the pretrained models and evaluating them.

Each batch contains ~1k PDFs that are fed to the model as a single unit; scores are reported based on incremental growth per ~1k PDFs, named in the following order:

* The batch order is exactly the order in which the batches are fed to the model; however, internal shuffling (of the PDF order within a batch) was allowed to avoid overfitting.

The validation data remains constant for each modality and consists of 3682 PDFs; see:

Data Pipeline 🚰 and code 💻

A rough dataset pipeline implementation is provided below:


A. The Data pipeline notebook walks through the entire process of regenerating the labelled ground truth from the obtained LaTeX sources.

It assumes certain dependencies, including GROBID to extract semantically parsed paragraphs and Pdfalto to extract font names and font information.

Here is a description of the .py files used in the Data pipeline notebook:

  1. tex2pdf.py: Converts LaTeX sources to a PDF while injecting them with the LaTeX plugin

  2. grobid_clean.py: Applies GROBID to the PDFs to produce a .tei.xml file

  3. Pdfalto_on_pdfs.py: Applies pdfalto to the PDFs to generate annotations and extract font information

  4. labelling.py: Visualizes the GROBID blocks, which are then labelled with colors to denote their labels

  5. dataframe_for_eval.py: Generates the data.csv file that contains the merged output from GROBID and Pdfalto
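
As an illustration only, these scripts could be chained roughly as follows; the command-line options are hypothetical placeholders, so check each script's argument parser for the actual interface:

# Hypothetical driver for the data pipeline -- all command-line flags are placeholders.
import subprocess

steps = [
    ["python", "tex2pdf.py", "--sources", "latex_sources/", "--out", "pdfs/"],
    ["python", "grobid_clean.py", "--pdfs", "pdfs/", "--out", "tei/"],
    ["python", "Pdfalto_on_pdfs.py", "--pdfs", "pdfs/", "--out", "alto/"],
    ["python", "labelling.py", "--tei", "tei/", "--out", "labels/"],
    ["python", "dataframe_for_eval.py", "--tei", "tei/", "--alto", "alto/", "--out", "data.csv"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop on the first failing step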

B. Filter privacy information 🙈: This notebook generates the text files from scientific papers, filtering out author information and references. It then generates the tokenizer on this large pretraining dataset vocabulary.
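
A minimal sketch of how such a tokenizer could be trained on the filtered text files is given below, using the Hugging Face tokenizers library; the file paths, vocabulary size and special tokens are assumptions, not the exact settings used in the notebook:

# Sketch: train a byte-level BPE tokenizer (RoBERTa-style) on the filtered text files.
# Paths, vocab_size and special tokens are assumptions -- see the notebook for the real settings.
from glob import glob
from tokenizers import ByteLevelBPETokenizer

files = glob("filtered_texts/*.txt")  # placeholder path to the privacy-filtered text files

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("pretraining_tokenizer")  # writes vocab.json and merges.txt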


C. Pretrain language model: This notebook walks through the process of pretraining both BERT- and RoBERTa-style models on scientific papers (see the full list of papers).

The same pretrained models are also available on the Hugging Face model hub 🤗; see the table below 👇

| Models 🔥 | Get Frozen Weights ❄️ | Data Type | Data size |
|---|---|---|---|
| BERT (Ours 🔥) | Download from 🤗 or here | Scientific Papers (197k) | 11 GB |
| RoBERTa (Ours @ ep01 🔥) | Download from 🤗 or here | Scientific Papers (197k) | 11 GB |
| RoBERTa (Ours @ ep10 🔥) | Download from 🤗 or here | Scientific Papers (197k) | 11 GB |
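
As a quick local check of the pretrained weights (the MLM demo linked in the dataset section does the same thing interactively), a fill-mask query can be run with the transformers pipeline; the model id below is a placeholder for one of the checkpoints in the table:

# Sketch: query a pretrained checkpoint on the masked-language-modelling task.
# The model id is a placeholder -- use the actual Hub id of the checkpoint you downloaded.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="your-org/roberta-pretrained-scientific", framework="tf")
for pred in unmasker("Proof. The result follows directly from the previous <mask>."):
    print(pred["token_str"], pred["score"])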

D. Finetuning models (Proof/theorem identification task): several notebooks demonstrate:

  1. Finetuning NLP model: This notebook demonstrates how to finetune a pretrained language model.

  2. Finetuning font based model: This notebook demonstrates how to train a font-based sequential model that captures the font sequence of a block to decide its label.

  3. Finetuning Vision model: This notebook demonstrates the training of the vision model.

* All images must be inverted, brought to an input dimension of 400 × 1400, and padded accordingly.

To generate the patches, refer to these notebooks for training and validation data respectively; to apply transformations on the generated patches, refer to this notebook.

  4. Finetuning Multimodal (GMU): The GMU model is based upon this paper; the font sequence model, the language model and the vision model feed into a gated network that decides the importance of each modality. Please refer to this notebook for the implementation (a minimal sketch of the gating idea follows this list).

  5. Finetuning Sequential models: The sequential model consists of a linear-chain CRF running on the features extracted from the frozen GMU model. To format the data in the required preprocessing format, refer to the Data Preprocessing notebook. For training and testing, refer to the training notebook and testing notebook respectively.
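
For intuition, here is a minimal sketch of the gated-fusion idea behind the GMU (following the general recipe of the cited paper, not the exact architecture or dimensions used in the notebook): each modality embedding is projected through a tanh layer, and a learned gate decides how much each modality contributes to the fused representation.

# Minimal GMU-style fusion sketch -- dimensions and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

class GatedMultimodalUnit(layers.Layer):
    """Fuses text, font and vision embeddings with a learned per-modality gate."""

    def __init__(self, dim, n_modalities=3, **kwargs):
        super().__init__(**kwargs)
        self.proj = [layers.Dense(dim, activation="tanh") for _ in range(n_modalities)]
        self.gate = layers.Dense(n_modalities, activation="softmax")

    def call(self, inputs):
        # inputs: list of modality embeddings, each of shape (batch, d_i)
        h = tf.stack([p(x) for p, x in zip(self.proj, inputs)], axis=1)  # (batch, m, dim)
        z = self.gate(tf.concat(inputs, axis=-1))                        # (batch, m)
        return tf.reduce_sum(tf.expand_dims(z, -1) * h, axis=1)          # (batch, dim)

# Example: fuse hypothetical 768-d text, 128-d font and 1792-d vision embeddings
text_in = tf.keras.Input(shape=(768,))
font_in = tf.keras.Input(shape=(128,))
vision_in = tf.keras.Input(shape=(1792,))
fused = GatedMultimodalUnit(512)([text_in, font_in, vision_in])
out = layers.Dense(4, activation="softmax")(fused)  # 4 classes: BASIC/Proof/Theorem/Overlap
gmu_classifier = tf.keras.Model([text_in, font_in, vision_in], out)
gmu_classifier.summary()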

Related

This project is part of the TheoremKB project and is related to its extension.

FAQ

Q1) Can I use these models to fine-tune on a similar task?

Certainly! 👍 Fine-tuning these models on a similar task is one of the valuable use cases. While we have primarily tested these models on proof/theorem extraction tasks, they can be adapted for other tasks as well. Finetuning on a different task can yield promising results. We encourage you to explore the potential and let us know if you achieve something remarkable with them 🤩.

Q2) Are these models available in PyTorch?

Regrettably, these models are currently only available in TensorFlow. At present, we do not provide direct support for PyTorch. However, we are continuously expanding our offerings, so please stay updated for any future developments.

Acknowledgements

This work has been funded by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

Pierre Senellart's work is also supported by his secondment to Institut Universitaire de France.

Contributing

Contributions and collaborations are always welcome!

We are always looking for interesting candidates. Please contact pierre@senellart.com 📧 if you are interested.

🔗 Reach out to us 💜

✅ Shrey Mishra: mishra@di.ens.fr

linkedin

✅ Antoine Gauquier: antoine.gauquier@ens.psl.eu

linkedin

✅ Pierre Senellart: pierre@senellart.com

portfolio

linkedin