GEM-benchmark/NL-Augmenter

Informal & Untested Suggestions for Possible Transformations

kaustubhdhole opened this issue · 6 comments

Here are some ideas, put informally, that could be used for perturbations and augmentations. @vgtomahawk is compiling a formal list in this branch.

Meanwhile, here is an informal list for the benefit of the participants.

  1. Interchange positions of SRL AM arguments for non-overlapping AM arguments:

    • Alex left for Delhi with his wife at 5 pm. --> Alex left for Delhi at 5 pm with his wife.
    • "at 5 pm" (AM-TMP) and "with his wife" (AM-COM) can be exchanged. This is safe only for non-core, non-overlapping arguments. Check what SRL is here.
  2. The ButterFingersPerturbation could be implemented for keyboard layouts other than English, such as Devanagari (Hindi, Marathi, Nepali), Shahmukhi (Urdu, Persian), South Indian languages (Tamil, Telugu, Kannada, Malayalam), Chinese, etc.

  3. Style transfer approaches could be interesting to look at, e.g. changing formal text to informal and vice versa. Check this model.

    • What the heck is going on? --> What is going on?
    • What you upto? --> What are you doing?
  4. Word order changes: active to passive and vice versa, topicalisation, extraposition, wh-fronting (and their reverses), and other operations used in constituency tests.
    • Scrambling (for German and Turkic languages)
    • John went to the store to buy bread. --> To buy bread, John went to the store.
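
The adjunct swap from the first item in the list can be sketched roughly as below. This is only an illustration: it assumes some SRL tagger has already produced labeled token spans, and the `(label, start, end)` span format and the function name `swap_adjuncts` are made up for this example, not part of NL-Augmenter.

```python
from typing import List, Tuple

# Hypothetical span format: (label, start, end), end exclusive,
# indexing into the token list. Assumes each label occurs at most once.
Span = Tuple[str, int, int]

# Prefixes treated as non-core (adjunct) arguments.
ADJUNCT_PREFIXES = ("AM-", "ARGM-")


def swap_adjuncts(tokens: List[str], spans: List[Span],
                  first: str, second: str) -> List[str]:
    """Swap the spans labeled `first` and `second`.

    Returns the tokens unchanged if either label is missing, is a core
    argument, or if the two spans overlap.
    """
    by_label = {label: (start, end) for label, start, end in spans}
    if first not in by_label or second not in by_label:
        return tokens
    if not (first.startswith(ADJUNCT_PREFIXES)
            and second.startswith(ADJUNCT_PREFIXES)):
        return tokens
    # Order the two spans by position in the sentence.
    (s1, e1), (s2, e2) = sorted([by_label[first], by_label[second]])
    if e1 > s2:  # overlapping spans are not safe to move
        return tokens
    # Rebuild: prefix + later span + material in between + earlier span + suffix.
    return tokens[:s1] + tokens[s2:e2] + tokens[e1:s2] + tokens[s1:e1] + tokens[e2:]


tokens = "Alex left for Delhi with his wife at 5 pm .".split()
spans = [("AM-COM", 4, 7), ("AM-TMP", 7, 10)]
print(" ".join(swap_adjuncts(tokens, spans, "AM-COM", "AM-TMP")))
# Alex left for Delhi at 5 pm with his wife .
```

Inside a SentenceOperation, the spans would come from whichever SRL model is used, and the token list would need to be detokenized before being returned.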

The above relate only to SentenceOperation. Other transformation types could be explored too.

For example, Adversarial SQuAD appends wrong but similar-looking facts to the end of the context in a QuestionAnswer setting, which leaves the QA pair unaffected.
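
A minimal sketch of that idea, assuming the distractor sentence is supplied by hand (generating a good distractor automatically is the hard part and is not attempted here). The function name `add_distractor`, the returned dictionary keys, and the toy sentences are all invented for illustration.

```python
def add_distractor(context: str, answer: str, distractor: str) -> dict:
    """Append a distractor sentence to the end of the context.

    Because the distractor goes at the end, the character offset of the
    gold answer in the context does not change.
    """
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer must appear in the context")
    if answer in distractor:
        raise ValueError("distractor must not contain the gold answer")
    new_context = context.rstrip() + " " + distractor.strip()
    return {"context": new_context, "answer": answer, "answer_start": start}


# Toy example (invented): the question "Where did Alex go?" still has
# the answer "Delhi" after the distractor is appended.
example = add_distractor(
    context="Alex left for Delhi at 5 pm.",
    answer="Delhi",
    distractor="Maria left for Mumbai at 6 pm.",
)
print(example["context"])
# Alex left for Delhi at 5 pm. Maria left for Mumbai at 6 pm.
```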

These two surveys give a good overview of previous approaches and are a great place to look for ideas:
https://github.com/AgaMiko/data-augmentation-review
https://arxiv.org/pdf/2105.03075.pdf

Another excellent set of paraphrases can be checked here: http://cognet.mit.edu/pdfviewer/journal/coli_a_00166

In particular, from the lists in this paper, "Converse Substitution", "Manipulator-Device Substitution" and "Metaphor Substitution" are three that I have seldom seen implemented properly in code.

There is interesting work on gapping worth looking at: https://arxiv.org/pdf/1804.06922.pdf
Paul likes coffee and Mary tea. (gapped sentence)
Paul likes coffee and Mary likes tea. (ungapped sentence)
It would be interesting to build rules that convert back and forth between these two forms.
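
As a starting point, here is a deliberately naive, regex-only sketch of such rules. It assumes one-word subjects, a one-word verb in second position, and a single "and" joining the two conjuncts; a usable version would need a parser to find the verb and the conjunct boundaries. The function names `gap` and `ungap` are made up for this example.

```python
import re

# Toy patterns: "<subj1> <verb> <obj1> and <subj2> [<verb>] <obj2>[punct]"
_UNGAPPED = re.compile(r"^(\w+) (\w+) (.+?) and (\w+) \2 (.+?)([.!?]?)$")
_GAPPED = re.compile(r"^(\w+) (\w+) (.+?) and (\w+) (.+?)([.!?]?)$")


def gap(sentence: str) -> str:
    """Drop the repeated verb from the second conjunct, if the pattern fits."""
    m = _UNGAPPED.match(sentence)
    if not m:
        return sentence
    subj1, verb, obj1, subj2, obj2, punct = m.groups()
    return f"{subj1} {verb} {obj1} and {subj2} {obj2}{punct}"


def ungap(sentence: str) -> str:
    """Copy the first conjunct's verb into the second conjunct, if missing."""
    m = _GAPPED.match(sentence)
    if not m:
        return sentence
    subj1, verb, obj1, subj2, rest, punct = m.groups()
    if rest.split()[0] == verb:  # verb already present: nothing to do
        return sentence
    return f"{subj1} {verb} {obj1} and {subj2} {verb} {rest}{punct}"


print(gap("Paul likes coffee and Mary likes tea."))
# Paul likes coffee and Mary tea.
print(ungap("Paul likes coffee and Mary tea."))
# Paul likes coffee and Mary likes tea.
```

Sentences that do not fit the toy pattern are returned unchanged, which keeps the rule conservative at the cost of very low coverage.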

This semi-syntactic paraphrasing algorithm by Tanya Goyal et al., based on reordering source word positions (part of the line of work following up on SCPNs, a.k.a. Syntactically Controlled Paraphrase Networks, Iyyer et al.), is a really interesting augmentation, particularly because of its reduced sensitivity to constituency parses.