unicode-rs/unicode-segmentation

`UWordBoundIndices` doesn't expose the indices

wez opened this issue · 4 comments

wez commented

As far as I can tell, UWordBoundIndices is just a wrapper around UWordBounds with an identical interface.

In my use case I have a line of text and, from a mouse double click, an index into the .chars() of that string. I need to obtain the start and end indices of the word that encloses that index.

It seemed to me that UWordBoundIndices is what I'd want here, but I don't see how to use it for this purpose. Is this an oversight, or is there a better way to get the result I'd like?

The UWordBoundIndices iterator definitely yields word indices, and the docs don't appear stale. However, you are right in that the current interface isn't suitable for identifying word boundaries from random access.

Graphemes suffered the same issue prior to the introduction of a cursor API in #21, and I suppose that word segmentation could be similarly updated.

wez commented

The problem I had was that that critical portion of the docs on that page:

type Item = (usize, &'a str)

is buried a bit further down in the page (that's just how they render), so I was left to fixate on the as_str() method. Would you mind expanding the doc comment to something like this to make it a little clearer?

External iterator for word boundaries and byte offsets.
Yields (usize, &str), the byte offset and string slice for each word.

I would love to have an API directed at random access! I have this somewhat clunky solution for the moment:

    for (x, word) in line.split_word_bound_indices() {
        if event.x < x {
            break;
        }
        if event.x <= x + word.len() {
            // this is the matching word
            return;
        }
    }
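For readers who land here with the same problem, the lookup above can be sketched as a small helper. This is only an illustration under stated assumptions, not an API of the crate: `char_to_byte_offset`, `segment_at`, and `stand_in_segments` are hypothetical names, and `stand_in_segments` is a std-only placeholder that splits after spaces so the example runs without dependencies. It is not UAX #29 segmentation. The sketch also converts the `.chars()` index to a byte offset first, since the iterator yields byte offsets.

```rust
/// Converts an index into `s.chars()` to a byte offset into `s`.
/// Returns `None` when the index is past the end of the string.
fn char_to_byte_offset(s: &str, char_idx: usize) -> Option<usize> {
    s.char_indices().nth(char_idx).map(|(byte_idx, _)| byte_idx)
}

/// Finds the segment containing `byte_idx` among ordered
/// `(byte offset, slice)` pairs and returns its half-open byte range.
fn segment_at<'a, I>(segments: I, byte_idx: usize) -> Option<(usize, usize)>
where
    I: IntoIterator<Item = (usize, &'a str)>,
{
    for (start, seg) in segments {
        if byte_idx < start {
            break; // segments are ordered, so the target has been passed
        }
        let end = start + seg.len();
        if byte_idx < end {
            return Some((start, end));
        }
    }
    None
}

/// Hypothetical stand-in segmenter (std only): splits after each space and
/// pairs every piece with its byte offset. NOT Unicode word segmentation.
fn stand_in_segments(line: &str) -> impl Iterator<Item = (usize, &str)> + '_ {
    line.split_inclusive(' ').scan(0, |offset, seg| {
        let start = *offset;
        *offset += seg.len();
        Some((start, seg))
    })
}

fn main() {
    let line = "héllo wörld";
    // Char index 7 is 'ö'; it sits at a different byte offset because
    // 'é' occupies two bytes in UTF-8.
    let byte_idx = char_to_byte_offset(line, 7).expect("index within line");
    let (start, end) =
        segment_at(stand_in_segments(line), byte_idx).expect("segment found");
    println!("{:?}", &line[start..end]);
}
```

With unicode-segmentation in scope, `line.split_word_bound_indices()` should be usable in place of `stand_in_segments(line)`, since it yields the same `(usize, &str)` item type. Note that whitespace runs are segments of their own there, so a click on a space would return the whitespace segment rather than a neighboring word.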

> that critical portion of the docs […] is buried a bit further down in the page (that's just how they render), so I was left to fixate on the as_str() method. Would you mind expanding the doc comment to something like this to make it a little clearer?

Yep, it can be pretty easy to miss things. Trait impls often look a bit lost in the rendered page, and the convention established by the standard library is that the behavior of iterators is documented on their builder method rather than the struct itself. I'll add a comment.

> I would love to have an API directed at random access! I have this somewhat clunky solution for the moment:

I'd be happy to work on it, but before that I wouldn't mind seeing some consolidation between the unicode-rs organization and the recent, seemingly more active unic. That's a conversation that should be started, although not here. @Manishearth, could I maybe ping you on IRC to get a sense of where we stand, or would you rather I opened an issue/forum thread directly?