Is it possible to get byte or character offsets of tokenized words / sentences?

Question

Is it possible to get byte or character offsets of tokenized words / sentences?


For example, if the following were tokenized:

hello, world!

Could we get tuples of (0,5), (7,12)? I'm flexible about the details, such as whether the numbers are bytes or characters, 0- or 1-based, or inclusive or exclusive. Thanks for the cool project! 🌴
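To make the request concrete, here is a minimal sketch of the kind of output meant, written with plain Python's `re` module rather than this project's tokenizer (whose API is not shown in this thread). The names `token_offsets` and `to_byte_offsets`, and the `\w+` word pattern, are assumptions made up for illustration. Offsets here are 0-based and end-exclusive.

```python
import re


def token_offsets(text, pattern=r"\w+"):
    """Return (start, end) character offsets for each match of `pattern`.

    Offsets are 0-based and end-exclusive. The default pattern is a
    stand-in for a real tokenizer, not this project's behavior.
    """
    return [m.span() for m in re.finditer(pattern, text)]


def to_byte_offsets(text, spans, encoding="utf-8"):
    """Convert character spans into byte spans for the given encoding."""
    return [
        (len(text[:start].encode(encoding)), len(text[:end].encode(encoding)))
        for start, end in spans
    ]


spans = token_offsets("hello, world!")
print(spans)  # [(0, 5), (7, 12)]
```

For ASCII input the character and byte offsets coincide; they differ only once the text contains multi-byte characters, which is why the two are kept as separate steps.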

Answer 1 · 2015-06-06T19:00:51.000Z

I'm in the middle of reworking some things right now, but this is definitely possible.

Answer 2 · 2015-06-24T21:37:09.000Z

Thanks! I'm excited to give this a shot sometime!