huggingface/tokenizers

How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification

insookim43 opened this issue · 0 comments

Hello.

I'm using a tokenizer with a TemplateProcessing post-processor to encode sentence pairs with encode_batch.
One part of the documentation is confusing: it says the method takes two lists, one for sentence A and one for sentence B.

According to the guide documentation: "To process a batch of sentences pairs, pass two lists to the Tokenizer.encode_batch method: the list of sentences A and the list of sentences B."

Because it says to pass two lists, I read the expected input as [[A1, A2], [B1, B2]], which would be encoded as the pairs {A1, B1}, {A2, B2}.

However, encode_batch actually expects a single list in which each element is one sentence pair, not separate lists for the A sentences and the B sentences.
So the input should be [[A1, B1], [A2, B2]] to encode the pairs {A1, B1}, {A2, B2}.
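
To make the difference between the two layouts concrete, here is a minimal sketch of how I convert separate A and B lists into the per-pair layout before calling encode_batch. The helper names to_pair_batch and encode_pairs are my own (hypothetical, not part of the library), and I am assuming that `tokenizer` is a tokenizers.Tokenizer whose encode_batch accepts a list of 2-element pairs, which is the behavior I observed.

```python
from typing import Any, List, Sequence, Tuple


def to_pair_batch(
    sentences_a: Sequence[str], sentences_b: Sequence[str]
) -> List[Tuple[str, str]]:
    """Turn [A1, A2, ...] and [B1, B2, ...] into [(A1, B1), (A2, B2), ...].

    Hypothetical helper: it only reshapes the data and does not call the library.
    """
    if len(sentences_a) != len(sentences_b):
        raise ValueError("sentence A and sentence B lists must have the same length")
    return list(zip(sentences_a, sentences_b))


def encode_pairs(
    tokenizer: Any, sentences_a: Sequence[str], sentences_b: Sequence[str]
) -> Any:
    """Encode aligned A/B lists as a batch of sentence pairs.

    Assumes `tokenizer` exposes encode_batch(list_of_pairs), as
    tokenizers.Tokenizer appeared to in my tests.
    """
    return tokenizer.encode_batch(to_pair_batch(sentences_a, sentences_b))


# The reshaping step on its own:
pairs = to_pair_batch(["A1", "A2"], ["B1", "B2"])
print(pairs)  # [('A1', 'B1'), ('A2', 'B2')]
```

With a real tokenizer, the call would be encode_pairs(tokenizer, list_a, list_b), so the list passed to encode_batch has one element per pair.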

I've also confirmed that the length of the list passed to encode_batch grows with the number of pairs in the batch.

Since the guide says to pass the list of sentences A and the list of sentences B, this is where the confusion arises.
If I've misunderstood anything, could you clarify how paired inputs should be passed?