ferristseng/rust-punkt

Is it possible to get byte or character offsets of tokenized words / sentences?

Closed this issue · 2 comments

For example, if the following were tokenized:

hello, world!

Could we get tuples like (0,5), (7,12)? I'm flexible about the details, such as whether the numbers are bytes or characters, 0- or 1-based, or inclusive / exclusive. Thanks for the cool project! 🌴
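To make the shape of the output concrete, here is a minimal sketch using only the standard library. It is not rust-punkt's API: `word_offsets` is a hypothetical helper, and treating a "word" as a run of alphanumeric characters is an assumption made only for this illustration. It returns 0-based, half-open byte ranges.

```rust
/// Hypothetical helper for illustration only; not part of rust-punkt.
/// Returns 0-based, half-open byte ranges `(start, end)` for each run of
/// alphanumeric characters in `text`.
fn word_offsets(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    // Byte index where the current word began, if we are inside one.
    let mut start: Option<usize> = None;

    // `char_indices` yields the byte offset of each character.
    for (i, c) in text.char_indices() {
        if c.is_alphanumeric() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // A non-word character closes the current word.
            spans.push((s, i));
        }
    }

    // Close a word that runs to the end of the input.
    if let Some(s) = start {
        spans.push((s, text.len()));
    }
    spans
}
```

With this, `word_offsets("hello, world!")` gives `[(0, 5), (7, 12)]`. Since the ranges are byte offsets, `&text[start..end]` slices the original word back out, even when the text contains multi-byte characters.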

I'm in the middle of reworking some things right now, but this is definitely possible.

Thanks! I'm excited to give this a shot sometime!