Is it possible to get byte or character offsets of tokenized words / sentences?
Closed · 2 comments
shepmaster commented
For example, if the following were tokenized:
hello, world!
Could we get tuples such as (0,5), (7,12)? I'm flexible about the details, such as whether the numbers are bytes or characters, 0- or 1-based, or inclusive / exclusive. Thanks for the cool project! 🌴
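To make the request concrete, here is a minimal sketch of the kind of output I mean. It is not this project's API: `word_offsets` is a hypothetical helper that only uses the standard library, treats any run of alphanumeric characters as a "word", and reports 0-based, end-exclusive byte offsets.

```rust
/// Hypothetical helper (an assumption for illustration, not part of the
/// project): returns 0-based, end-exclusive byte offsets for each run of
/// alphanumeric characters in `text`.
fn word_offsets(text: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    // Byte offset where the current word started, if we are inside one.
    let mut start: Option<usize> = None;

    // `char_indices` yields the byte offset of each character.
    for (i, ch) in text.char_indices() {
        if ch.is_alphanumeric() {
            if start.is_none() {
                start = Some(i);
            }
        } else if let Some(s) = start.take() {
            // A non-word character ends the current word.
            spans.push((s, i));
        }
    }

    // Close a word that runs to the end of the input.
    if let Some(s) = start {
        spans.push((s, text.len()));
    }
    spans
}
```

For `"hello, world!"` this returns `[(0, 5), (7, 12)]`, and each pair can be used directly to slice the input, e.g. `&text[0..5]`. Because the offsets are bytes, they will differ from character counts whenever the text contains multi-byte UTF-8 characters.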
ferristseng commented
I'm in the middle of reworking some things right now, but this is definitely possible.
shepmaster commented
Thanks! I'm excited to give this a shot sometime!