Implementation of a Masking Stage with Random Masking Options
Oufattole commented
Problem
The absence of a dedicated masking stage in our pipeline limits our ability to handle incomplete or noisy data effectively during model training.
Proposed Solution
Introduce a masking stage designed to randomly mask a specified percentage of the data or subsequences within the data:
- Position: Place the masking stage after the input encoder and before the sequence model.
- Functionality:
  - Support two random masking modes: mask a random percentage of the tokens, or mask a randomly sampled contiguous subsequence.
  - Add a key to the batch holding the labels that the Model stage will use to compute the masked imputation loss.
- Configurability: Allow users to set the percentage of data to mask.
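
To make the proposal concrete, here is a minimal sketch of what such a stage could look like. It is not taken from the existing codebase: the batch layout (a dict with a `"tokens"` list of token-id sequences), the `"mask_labels"` key, the `MASK_ID` and `IGNORE_LABEL` values, and all function names are assumptions made for illustration. It uses plain Python lists rather than tensors to stay self-contained.

```python
import random
from typing import Dict, List, Optional

MASK_ID = 0          # hypothetical id used to replace masked tokens
IGNORE_LABEL = -100  # hypothetical label meaning "no loss at this position"


def random_token_mask(length: int, mask_fraction: float, rng: random.Random) -> List[bool]:
    """Mask a random subset of positions covering `mask_fraction` of the sequence."""
    n_masked = round(length * mask_fraction)
    chosen = set(rng.sample(range(length), n_masked))
    return [i in chosen for i in range(length)]


def random_span_mask(length: int, mask_fraction: float, rng: random.Random) -> List[bool]:
    """Mask one contiguous span covering `mask_fraction` of the sequence."""
    span = round(length * mask_fraction)
    start = rng.randint(0, length - span)  # randint bounds are inclusive
    return [start <= i < start + span for i in range(length)]


def apply_masking(
    batch: Dict[str, List[List[int]]],
    mask_fraction: float = 0.15,
    mode: str = "token",
    seed: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Return a new batch with masked tokens and a `mask_labels` key."""
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be in [0, 1]")
    make_mask = {"token": random_token_mask, "span": random_span_mask}[mode]
    rng = random.Random(seed)

    masked_tokens, labels = [], []
    for seq in batch["tokens"]:
        mask = make_mask(len(seq), mask_fraction, rng)
        # Masked positions are replaced by MASK_ID in the model input...
        masked_tokens.append([MASK_ID if m else t for t, m in zip(seq, mask)])
        # ...and keep their original value only in the labels.
        labels.append([t if m else IGNORE_LABEL for t, m in zip(seq, mask)])

    # Build a new dict so the caller's batch is left unmodified.
    return {**batch, "tokens": masked_tokens, "mask_labels": labels}
```

Under these assumptions, the stage would be called as `apply_masking(batch, mask_fraction=0.3, mode="span")`, and the Model stage would compute the imputation loss only at positions where `mask_labels` differs from `IGNORE_LABEL`. Since the proposal places the stage after the input encoder, a real implementation would probably mask embeddings or tensors instead of raw token ids; the selection logic would stay the same.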