makcedward/nlpaug

Augment a batch of texts with Contextual Word Embeddings Augmenters

AliKarimi74 opened this issue · 6 comments

Hi,

Thank you for this excellent library.
Is it possible to augment a batch of texts with contextual word augmenters? I'm trying to use this type of augmenter during training, and augmenting examples one by one is frustrating. I'd appreciate any suggestions.

Thanks!

There are two possible scenarios. The first is a single input with multiple outputs. The second is a list of inputs with an output list of the same size.

For the first case, you can use the augment function like this:
augmenter.augment(text, n=2)
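
For reference, a minimal runnable version of the first case (the substitute action and sample sentence here are just placeholders):

import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")
text = 'The quick brown fox jumps over the lazy dog .'
augmented_texts = aug.augment(text, n=2)  # one input, two augmented outputs
print(augmented_texts)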

For the second case, you can use augments with a list as input:

import nlpaug.augmenter.word as naw

texts = [
    'The quick brown fox jumps over the lazy dog .',
    'It is proved that augmentation is one of the anchor to success of computer vision model.'
]

aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="insert")
augmented_texts = aug.augments(texts)

Just added sample code; you may take a look.

Thanks for your response.

My scenario is the second one. I took a look at the code, and it seems that the augments method iterates over the list of samples and runs the same augment method for each one, even when a GPU is available. So I think that if I want to augment a dataset, the execution time doesn't change significantly. Am I right?
For contextual (and maybe back-translation), is it possible to augment a batch of examples in a single forward pass?

Will enhance it to support a single forward pass for multiple inputs per augmentation step to speed up the process. For the multiple-input case it definitely helps: from my testing, it is around 3x faster when augmenting 3 inputs.

However, it still needs to go through the augmentation one position at a time per input. For example, given the input "A B C D E" where we want to augment B and D, it augments B first and then D, to prevent grammatical mistakes. In other words, there are two forward passes, one for B and one for D. The process for augmenting two words (B and D) looks like this:
Original: A (B) C (D) E
First Augmentation: A (X) C (D) E
Second Augmentation: A (X) C (Y) E

For multiple inputs, the process is like:
Original: [{A1 (B1) C1 (D1) E1}, {(A2) B2 C2 (D2) E2}]
First Augmentation: [{A1 (X1) C1 (D1) E1}, {(X2) B2 C2 (D2) E2}]
Second Augmentation: [{A1 (X1) C1 (Y1) E1}, {(X2) B2 C2 (Y2) E2}]
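
To make the flow above concrete, here is a rough sketch (not nlpaug's actual internals) of the idea using Hugging Face's fill-mask pipeline: each round masks one position per input, fills the whole batch in one call, then moves to the next position.

# Conceptual sketch only -- not nlpaug's implementation.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
mask = fill_mask.tokenizer.mask_token

batch = [
    ["a1", "b1", "c1", "d1", "e1"],   # augment positions 1 and 3 (B1, D1)
    ["a2", "b2", "c2", "d2", "e2"],   # augment positions 0 and 3 (A2, D2)
]
positions = [[1, 3], [0, 3]]

for round_idx in range(2):                        # one round per word to augment
    masked = []
    for tokens, pos in zip(batch, positions):
        tmp = list(tokens)
        tmp[pos[round_idx]] = mask                # mask one position per input
        masked.append(" ".join(tmp))
    preds = fill_mask(masked, batch_size=len(masked))  # one batched call
    for tokens, pos, pred in zip(batch, positions, preds):
        tokens[pos[round_idx]] = pred[0]["token_str"].strip()

print([" ".join(tokens) for tokens in batch])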

Fixed in version 1.0.0.

@makcedward is it also fixed in the latest version?

@rajat-tech-002
Yes. It is supported from v1.0.0.

You can do:

import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(batch_size=32)  # default batch_size is 32
aug_texts = aug.augment(texts)  # texts is a list of strings