huggingface/transformers

TokenClassificationPipeline: support the is_split_into_words tokeniser parameter

swtb3 opened this issue · 2 comments

swtb3 commented

Feature request

The TokenClassificationPipeline currently sets a hardcoded tokeniser config within its sanitiser method. This prevents users from passing their own config through to the tokeniser.

It would be good to support some user input for the tokeniser config, especially is_split_into_words, since input data may already be split into words.

Motivation

It is common for token classification datasets to be split into words already so that they match their labels.
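To illustrate the motivation (a hypothetical, stdlib-only sketch; the words and labels are made up): in token classification datasets each pre-split word lines up one-to-one with a label, and re-joining the words into a single string, which is what the pipeline effectively forces today, can lose that alignment for inputs with unusual internal whitespace.

```python
# Hypothetical example of a pre-split token-classification sample:
# one label per word, so the lists must stay aligned.
words = ["Hugging", "Face", "is", "based", "in", "NYC"]
labels = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC"]
assert len(words) == len(labels)

# Joining and naively re-splitting round-trips only for simple
# whitespace; pre-split input sidesteps the problem entirely.
text = " ".join(words)
assert text.split() == words
```

With is_split_into_words=True, a tokeniser can consume `words` directly instead of `text`, keeping the word/label alignment intact.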

Your contribution

I naively anticipate this being a simple change, so I am happy to submit a PR for it. Though it would first be nice to see a discussion surrounding the feature and whether it fits with the goals of Transformers.

This makes sense to me, but I'm not super-familiar with that pipeline. I'd support a PR to allow some options to be passed through to the tokenizer, though, since that shouldn't have any backward compatibility issues!