mg_algebras_parser

Minimalist Algebras and neural parsers

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

Minimalist Parsing into minimalist algebra terms

Minimalist Grammars without the features! Parse into algebra terms that generate whatever you like.

Public version for MG+2.3

Really, this README is probably not accurate.

Torr's MGBank corpus

The source of the data

  • Split
    • in data/processed/mg_bank/split
    • split based on the split used to train the supertagger
    • see minimalist_parser.convert_mgbank.create_data_split for details and to do it yourself
  • Original corpus in Autobank/MGParse, copied here into data/raw
    • wsj_MGBankSeed
    • wsj_MGBankAuto
  • to visualise a tree:
    • inside the MGParse folder:
    • get a tree from one of the corpora above as a string
      • format: JSON
      • keys are numbers as strings, e.g. "1". Each key is the sentence number within that section of the PTB
      • values are lists of strings. The first element is the full derivation tree, in a format that can be visualised as follows.
    • from gen_derived_tree import gen_derivation_tree, gen_derived_tree, gen_xbar_tree
    • derivation_tree = gen_derivation_tree(tree_as_string)
    • derivation_tree.visualize_tree()
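
The steps above can be put together roughly as follows. This is a minimal sketch, not code from this repository: the helper names `load_derivation_string` and `visualise` are made up for illustration, and the sketch assumes only what is described above (a JSON file whose keys are sentence numbers as strings and whose values are lists with the derivation tree first). The import of `gen_derived_tree` is deferred so that it only runs from inside the MGParse folder.

```python
import json
from pathlib import Path


def load_derivation_string(corpus_file, sentence_number):
    """Return the derivation-tree string for one sentence of a corpus JSON file.

    Hypothetical helper: assumes the JSON layout described in this README.
    """
    with Path(corpus_file).open(encoding="utf-8") as f:
        corpus = json.load(f)
    # Keys are sentence numbers stored as strings, e.g. "1".
    entry = corpus[str(sentence_number)]
    # Each value is a list of strings; the first one is the full derivation tree.
    return entry[0]


def visualise(tree_as_string):
    """Build and display the derivation tree (run from inside the MGParse folder)."""
    # Imported here so the rest of this sketch works outside MGParse.
    from gen_derived_tree import gen_derivation_tree

    derivation_tree = gen_derivation_tree(tree_as_string)
    derivation_tree.visualize_tree()
```

To use it, pass the path of one JSON file from wsj_MGBankSeed or wsj_MGBankAuto and a sentence number to `load_derivation_string`, then hand the result to `visualise`.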

Generalised Minimalist Algebras

The mathematical foundation of the project

Major modules:

  • algebras.algebra defines algebra terms and algebras
  • trees.trees defines trees
  • minimalism.minimalist_algebra defines generalised minimalist algebras with main structures as inner algebra terms
  • algebras.hm_algebra defines an abstract class for inner algebras, for an MG that works with the terms (rather than the objects) of the inner algebra
  • Some inner algebras:
    • algebras.hm_triple_algebra
    • algebras.hm_interval_pair_algebra

More algebras, MG

  • Some more algebras, not all of which have been checked for obsolescence:
    • string_algebra
    • tree_algebra.py makes "bare trees" à la Stabler 1997
    • tuple_algebra.py makes triples of (left of the head, head, right of the head) strings, allowing for head movement, as well as a wrap function like in TAGs
    • tag_algebra.py probably isn't up to date with psi_mga, but it's string pairs for the TAG string algebra
  • minimalism.minimalist_grammar implements the feature-driven grammar
    • (incomplete)
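
As a rough illustration of the (left of the head, head, right of the head) triples mentioned for tuple_algebra.py, the sketch below shows how keeping the head separate makes head movement possible. It is not the code in tuple_algebra.py: `Triple`, `merge_right`, `merge_left`, and `merge_right_head_move` are hypothetical names, and the wrap function is left out.

```python
from typing import NamedTuple


def join(*parts):
    """Concatenate the non-empty strings with single spaces."""
    return " ".join(p for p in parts if p)


class Triple(NamedTuple):
    left: str   # material to the left of the head
    head: str   # the head itself, kept separate so it can still move
    right: str  # material to the right of the head

    def spellout(self):
        return join(self.left, self.head, self.right)


def merge_right(functor, argument):
    """Attach the whole argument to the right of the functor."""
    return Triple(functor.left, functor.head, join(functor.right, argument.spellout()))


def merge_left(functor, argument):
    """Attach the whole argument to the left of the functor."""
    return Triple(join(argument.spellout(), functor.left), functor.head, functor.right)


def merge_right_head_move(functor, argument):
    """Head movement: the argument's head adjoins to the functor's head,
    and the rest of the argument goes to the functor's right."""
    return Triple(
        functor.left,
        argument.head + functor.head,
        join(functor.right, argument.left, argument.right),
    )


vp = merge_right(Triple("", "eat", ""), Triple("", "cake", ""))
tp = merge_right_head_move(Triple("", "s", ""), vp)
sentence = merge_left(tp, Triple("", "Kim", ""))
print(sentence.spellout())  # Kim eats cake
```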

Project organization

Legend:
PG: Project Generated
RO: Read Only
HW: Human Written

.
├── .gitignore
├── CITATION.md
├── LICENSE.md
├── README.md
├── requirements.txt    <- Currently in requirements folder, of dubious accuracy
├── config_files        <- Configuration files (HW)
│                           - yml files for Docker Compose
│                           - jsonnet files for allennlp
├── data                <- All project data, ignored by git
│   ├── processed       <- The final, canonical data sets for modeling. (PG)
│   │   ├── mg_bank     <- mgbank files that are not strictly raw. (PG)
│   │   │   └── split   <- train/dev/test split (official) (PG)
│   │   └── seq2seq     <- train and dev sets for seq2seq models
│   ├── raw_data        <- The original, immutable data dump. (RO)
│   └── temp            <- Intermediate data that has been transformed. (PG)
├── docs                <- Documentation notebook for users (HW)
│   ├── manuscript      <- Manuscript source, e.g., LaTeX, Markdown, etc. (HW)
│   └── reports         <- Other project reports and notebooks (e.g. Jupyter, .Rmd) (HW)
├── results
│   ├── analysis        <- analysis results (PG & HW)
│   ├── figures         <- Figures for the manuscript or reports (PG)
│   └── predictions     <- model predictions (PG)
├── scripts             <- bash scripts (HW)
├── tests               <- unit tests (HW)
├── trained_models      <- models and associated files (PG)
└── minimalist_parser   <- Source code for this project (HW)