How to choose minimum sequence length while avoiding truncation
marcospiau opened this issue · 0 comments
Hi,
I have a task that uses `seqio.TfdsDataSource` as its source and a preprocessing pipeline whose final steps look like this: `[..., seqio.preprocessors.tokenize, seqio.CacheDatasetPlaceholder(), seqio.preprocessors.append_eos_after_trim]`.
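For context, here is a minimal sketch of how the task is registered; the task name, TFDS dataset, SentencePiece model path, and the elided earlier preprocessors are placeholders standing in for my real setup:

```python
import seqio

# Placeholder vocabulary -- the model path is not my real one.
vocab = seqio.SentencePieceVocabulary("/path/to/spm.model")

seqio.TaskRegistry.add(
    "my_task",  # placeholder name
    source=seqio.TfdsDataSource(tfds_name="my_dataset:1.0.0"),  # placeholder
    preprocessors=[
        # ... earlier, task-specific preprocessors ...
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab, add_eos=True),
        "targets": seqio.Feature(vocabulary=vocab, add_eos=True),
    },
)
```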
I have cached this task, so I know the maximum token lengths for both inputs and targets.
My question is: when training a model with `t5.models.mesh_transformer_main` using this task and providing gin bindings for `utils.run.sequence_length`, should I use the values I see in the cached stats, or should I add 1 to account for the EOS token? My goal is to avoid the truncation that happens when the specified sequence lengths are smaller than what my data requires.
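To make the question concrete: my reading of `append_eos_after_trim` is that, when a sequence length is provided, it trims each feature to `sequence_length[key] - 1` tokens and only then appends EOS. The toy example below (made-up token IDs, and a `PassThroughVocabulary` purely for illustration) shows what I mean:

```python
import tensorflow as tf
import seqio

# Toy vocabulary and data, purely for illustration.
vocab = seqio.PassThroughVocabulary(size=100, eos_id=1)
ds = tf.data.Dataset.from_tensors(
    {"targets": tf.constant([5, 6, 7, 8], tf.int32)}  # pretend cached max: 4
)
ds = seqio.preprocessors.append_eos_after_trim(
    ds,
    output_features={"targets": seqio.Feature(vocabulary=vocab, add_eos=True)},
    sequence_length={"targets": 4},  # equal to the cached max
)
print(next(ds.as_numpy_iterator())["targets"])
# [5 6 7 1] -- the last real token (8) is dropped to make room for EOS.
```

If that reading is right, I would need to bind `utils.run.sequence_length` to the cached maxima plus one (e.g. `{"inputs": 513, "targets": 129}` for cached maxima of 512 and 128), but I would like to confirm.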
(P.S.: I know this is also related to the t5 repository, but I opened the issue here because I think my question concerns the `seqio.preprocessors.append_eos_after_trim` function. If it would be more appropriate to open this issue in another repository, please let me know and I can move it.)
Thanks in advance,
Marcos