MathGPT

A GPT-based generative LM for combined text and math formulas, leveraging tree-based formula encoding.

Code for the paper Tree-Based Representation and Generation of Natural and Mathematical Language

If you use this code in your research, please cite us as follows:

@inproceedings{scarlatos-lan-2023-tree,
    title = "Tree-Based Representation and Generation of Natural and Mathematical Language",
    author = "Scarlatos, Alexander and Lan, Andrew",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.205",
    pages = "3714--3730",
}

Setup

Python Environment

Ensure Python 3 is installed (this code was tested with v3.9.1).
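
You can verify the interpreter version before creating the environment:

python3 --version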

Create virtual environment

python3 -m venv <env_name>
source <env_name>/bin/activate

Install libraries

python3 -m pip install -r requirements.txt

Make TangentCFT available in Python path

export PYTHONPATH=..:../TangentCFT/

External Dependencies

The following need to be installed for full functionality. LaTeXML is only required to run pre-processing.

TangentCFT

Download TangentCFT to the folder above the root of this repo: https://github.com/BehroozMansouri/TangentCFT/tree/2b189dff67d6d3b15e323358921bdbca779bfcd9

Note that we made some small fixes to TangentCFT, so this repo contains a modified copy of its semantic_symbol.py file, which automatically shadows the original via Python's module resolution order.
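
As a quick sanity check of the import order (assuming the modified copy sits at the top level of this repo and you run the command from the repo root), the repo's copy should be the one that resolves:

python3 -c "import semantic_symbol; print(semantic_symbol.__file__)"

The printed path should point inside this repo rather than into TangentCFT.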

LaTeXML

https://math.nist.gov/~BMiller/LaTeXML/get.html

NLG-Eval

https://github.com/Maluuba/nlg-eval

Known installation issue: Maluuba/nlg-eval#61
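
After installation, NLG-Eval typically requires a one-time setup step to download its data files (see its README for details):

nlg-eval --setup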

Data

Here are links to the datasets required for the following tasks. For each, ensure the dataset's root folder is above the root of this repo.

The datasets for the following tasks cannot be released publicly:

  • Answer Scoring
  • Feedback Generation

Run

The starting point for all code is __main__.py, and you can see a list of command line options by running:

python3 __main__.py --help

Default values can be found in the TrainOptions constructor in utils.py.

Here is the typical workflow to replicate our experiments (a hypothetical command sketch follows this list):

  • Pre-process the Wikipedia dataset (this step also constructs the vocabulary, which is needed for all following steps)
  • Pre-train a MathGPT model
  • Pre-process the downstream dataset
  • Run cross-validation on the downstream dataset, which for each fold:
    • Fine-tunes the pre-trained MathGPT model on the downstream dataset
    • Runs evaluation on the downstream dataset's test set
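
Below is a hypothetical end-to-end sketch of this workflow. The subcommand and flag names are illustrative placeholders rather than the repo's actual interface; consult python3 __main__.py --help for the real options.

# Hypothetical sketch -- subcommand/flag names below are placeholders
python3 __main__.py preprocess_wiki          # build vocab + pre-training data
python3 __main__.py pretrain --name mathgpt  # pre-train a MathGPT model
python3 __main__.py preprocess_downstream    # pre-process a downstream dataset
python3 __main__.py crossval --name mathgpt  # per fold: fine-tune, then evaluate on the test set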