google-research/retvec

Is the code to generate the augmented data available anywhere?


In the paper, the authors write:

> Augmentations. Token augmentation consists of randomly inserting up to 4 typos per token up to 25% of the token length. This is consistent with an observed maximum human error frequency of around 20% [11]. We use 22 distinct typo augmentations, which can be grouped into four categories: deletion, insertion, substitution, and transposition. For each token, we randomly select a target augmentation percentage between 0-25%, and for each augmentation step we randomly apply an augmentation from one of the four typo categories. The full list of augmentations used is reported in Appendix D.

Is the code to apply these augmentations available anywhere? I'd like to use & adapt it for my specific use-case.

I ask because the description in the paper is somewhat ambiguous.

> Token augmentation consists of randomly inserting up to 4 typos per token up to 25% of the token length.

How do you compare the original and augmented tokens to arrive at a percentage of length? The strings "wrong" and "wring" differ by one character (i substituted for o). Does that mean that the percentage of length is 0% (because they have the same length) or 20% (1 in 5 characters changed)?

Perhaps you mean that you're measuring some sort of a string metric (e.g. Levenshtein distance) between the two strings, and the percent change is distance("wrong", "wring") / length("wrong") * 100. Is this the case?
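To make this interpretation concrete, here is a minimal sketch of the metric I'm guessing at. It is only my assumption about what "percent of token length" could mean, not the paper's method; `levenshtein` and `percent_changed` are names I made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def percent_changed(original: str, augmented: str) -> float:
    """Hypothetical metric: edit distance as a percentage of original length."""
    return 100 * levenshtein(original, augmented) / len(original)


print(percent_changed("wrong", "wring"))  # 20.0
print(levenshtein("wrong", "wrnog"))      # 2
```

Note that under plain Levenshtein distance a transposition ("wrong" to "wrnog") counts as 2 edits, whereas Damerau-Levenshtein would count it as 1. Since transposition is one of the four typo categories, the choice of metric changes the answer here too.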

Things get more complex when we consider the emoji modifications. As an example, the pirate flag emoji is composed of several Unicode code points (black flag, zero width joiner, skull and crossbones [itself 2 Unicode code points]), even though it's displayed as a single emoji. How is the percent of token length measured in this case?
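As a small illustration of why the unit of "length" matters, the snippet below measures the pirate flag sequence in three different ways. The variable names are mine, and I don't know which of these units (if any) the paper uses.

```python
# Pirate flag: black flag, zero width joiner, skull and crossbones,
# variation selector-16.
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"

code_points = len(pirate_flag)                           # Python str length
utf8_bytes = len(pirate_flag.encode("utf-8"))            # bytes in UTF-8
utf16_units = len(pirate_flag.encode("utf-16-le")) // 2  # UTF-16 code units

print(code_points, utf8_bytes, utf16_units)  # 4 13 5
# A user sees one glyph, so a grapheme-cluster count would be 1.
```

Depending on the unit, one "character" changed in this token could be anywhere from under 10% to 100% of its length.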

> For each token, we randomly select a target augmentation percentage between 0-25%

Suppose the selected token has 5 characters and you have a 10% augmentation percentage. The smallest number of characters that you can change is 1 character, but this represents 20% of the length of the token. Is 1 augmentation still applied, even though 20% is greater than 10%?
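The answer depends on how the fractional typo count is rounded. Here is a rough sketch of three possible rules; `typo_budget` and its parameters are hypothetical, and the cap of 4 is taken from the "up to 4 typos per token" wording in the paper.

```python
import math


def typo_budget(token_length: int, target_percent: int, rounding: str) -> int:
    """Hypothetical: number of typos to apply under a given rounding rule."""
    raw = token_length * target_percent / 100
    if rounding == "floor":
        n = math.floor(raw)
    elif rounding == "ceil":
        n = math.ceil(raw)
    elif rounding == "round":
        n = round(raw)  # Python rounds halves to the nearest even integer
    else:
        raise ValueError(f"unknown rounding rule: {rounding}")
    return min(n, 4)  # cap from the paper: at most 4 typos per token


for rule in ("floor", "ceil", "round"):
    print(rule, typo_budget(5, 10, rule))
# floor 0
# ceil 1
# round 0
```

With a 5-character token and a 10% target, only the ceiling rule applies any augmentation at all, so knowing which rule was used matters for reproducing the training data.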

The paper's descriptions of some of the individual augmentations are also unclear. One of the insertion augmentations is:

> n-grams based prefix insertion for n = 3, 4, 5

Where do the n-grams come from? Are the n-grams chosen uniformly from among the set of all n-grams with respect to an alphabet, or is more weight given to the more frequent n-grams in some text corpus?
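To spell out the two alternatives I have in mind, here is a minimal sketch. Both functions, the alphabet, and the toy corpus are my own assumptions for illustration; neither is taken from the paper or the RETVec code.

```python
import random
from collections import Counter


def uniform_ngram(n: int, alphabet: str, rng: random.Random) -> str:
    """Option A: every n-gram over the alphabet is equally likely."""
    return "".join(rng.choice(alphabet) for _ in range(n))


def corpus_ngram(n: int, corpus_tokens: list[str], rng: random.Random) -> str:
    """Option B: n-grams sampled in proportion to their corpus frequency."""
    counts = Counter(
        tok[i:i + n] for tok in corpus_tokens for i in range(len(tok) - n + 1)
    )
    if not counts:
        raise ValueError("no token in the corpus has at least n characters")
    ngrams = list(counts)
    weights = [counts[g] for g in ngrams]
    return rng.choices(ngrams, weights=weights, k=1)[0]


rng = random.Random(0)
token = "wrong"
print(uniform_ngram(3, "abcdefghijklmnopqrstuvwxyz", rng) + token)
print(corpus_ngram(3, ["the", "then", "other"], rng) + token)
```

Option A mostly produces prefixes that look like noise, while option B produces prefixes that resemble real word fragments, so the two would yield quite different training data.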

Thank you for clarifying.