why does this library return byte offsets instead of codepoint offsets?
gribml opened this issue · 12 comments
I'm using your package to match some patterns against long pieces of text, sometimes containing non-ASCII characters. When I investigate the matches, I find that the offsets correspond to byte offsets, not character offsets. This is confusing. It would be great to get character offsets instead, or at least as an option.
The character at position 36 is code point 8217 (U+2019, a right single quotation mark) in this example:
The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll
So a match for `eBook` correctly returns 22 as the offset, while a match for `Adventures` returns 41, though it should be 39.
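Here's a minimal Rust sketch that reproduces the two numbers (it uses only the standard library's `str::find`, not your crate, so it's just to make the discrepancy concrete):
fn main() {
    let doc = "The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll";
    // Byte offset of the match, as a substring search reports it.
    let byte_off = doc.find("Adventures").unwrap();
    // Codepoint offset, obtained by counting chars up to that byte offset.
    let char_off = doc[..byte_off].chars().count();
    // Prints: byte offset = 41, codepoint offset = 39
    println!("byte offset = {}, codepoint offset = {}", byte_off, char_off);
}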
Could you say why you want character offsets? (I assume you mean codepoint offsets.) What do you want to do with them?
Sure, in NLP, you usually want to train named entity recognition models on spans (start, end, label, text), where the start and end are code point offsets. In these applications, you don't want to predict bytes, but rather something from the input vocabulary. If I want to go through large amounts of text and auto-label some things based on known strings and their labels (e.g. Sudan is always a "location" entity), then it's important for me to know what offset that is into the text.
I don't understand why byte offsets inhibit that. If anything, they should make it easier to do that.
I think it would help if you showed a little code.
It's not necessarily a code thing. Suppose I'm trying to collect some labels about spans of text for a named-entity recognition task. I might go through and mark tokens as e.g. "person", "organization", or "location". I was hoping to use your package to match a ton of known names against the training corpus, but also have some labelers go through and label stuff by hand. The labels that come back from manual labeling are not byte offsets, they're code point offsets, so it's messy to combine the two sets of labels. Hope that makes more sense.
Also, let's say I already have a model that predicts entities in text. It outputs code point offsets and the label for each span of text, and I want to supplement it with Aho-Corasick matches against a large list of names. Getting byte offsets doesn't help in displaying the matches in the same piece of text.
OK, so the reason why I asked for code here is because, based on your responses so far, I think there is some confusion going on. Code is the ultimate forcing function, because it makes everything concrete to both of us in exactly the same way. Surely, this problem is concrete to you, but I actually can't tell what the shape of that problem is. For example, based on what you've said, I can't tell whether what you want are codepoint offsets or UTF-16 code unit offsets.
This is not a theoretical problem and it is not an obvious one. Language designers get this stuff wrong. For example, consider the following Javascript code:
const icons = 'a𝛃a';
console.log(icons.codePointAt(0));
console.log(icons.codePointAt(1));
console.log(icons.codePointAt(2));
console.log(icons.codePointAt(3));
The output of this program, in a browser, is:
> 97
> 120515
> 57027
> 97
Despite the fact that there are three codepoints in `a𝛃a`, Javascript permits indexing them as if there are four. In actuality, Javascript's `codePointAt` routine is really a `utf16CodeUnitAt` routine, because its index argument counts UTF-16 code units and therefore doesn't account for surrogate pairs. `codePointAt` is only correct (incidentally) when the entire text is within the basic multilingual plane. I note that this is a known problem, although the docs are awfully terse on the failure modes here.
So if your labeling program is written in Javascript and isn't careful to handle these sorts of cases, you might not actually be getting codepoint offsets at all!
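For comparison, here's a small illustrative Rust sketch (nothing specific to this library) showing how the same string counts out in each of the three views:
fn main() {
    let s = "a𝛃a";
    println!("codepoints:        {}", s.chars().count());        // 3
    println!("UTF-16 code units: {}", s.encode_utf16().count()); // 4
    println!("UTF-8 code units:  {}", s.len());                  // 6 (bytes)
}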
OK, with that interlude out of the way, let me address why this request is both wrong and won't happen. Namely, that this library returns byte offsets is intended, correct, the only possible choice and the only sensible choice.
See, in the Rust ecosystem, everything is UTF-8 everywhere. There is no other predominant encoding in use. While there is some support for transcoding to and from UTF-16 in the standard library, these routines are only available for interacting with APIs that still require UTF-16 (particularly on Windows).
In general, the offsets you want to return from substring search routines should always be code unit offsets. The reason why code unit offsets are correct is because they permit you to take a substring by offset in constant time. Any other offset type requires a linear time (or linear space) implementation. Let me repeat that because this is the crucial point. If this library returned codepoint offsets (or UTF-16 code unit offsets or anything other than UTF-8 code unit offsets), then the offsets would result in a performance and correctness footgun so severe that nobody would use this library.
So what exactly are code unit offsets? Well, they depend on the encoding used:
- For UTF-8, code unit offsets are equivalent to byte offsets, since each byte is a code unit. In general, the "concept" of code units isn't particularly useful in UTF-8 since they are all just single bytes. But talking about code units as a general concept itself is illustrative here.
- For UTF-16, code unit offsets are offsets into an array of 16-bit integers.
- For UTF-32, code unit offsets are offsets into an array of 32-bit integers. They are equivalent to codepoint offsets.
In all such cases, search routines should always return valid code unit offsets. In the case of UTF-8, this means never returning an offset inside the encoding of a single codepoint. In the case of UTF-16, this means never returning an offset that splits the encoding of a codepoint outside the basic multilingual plane (i.e., between two surrogate codepoints). In the case of UTF-32, all offsets are valid since it's impossible to split a codepoint in this encoding.
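To make "valid code unit offset" concrete in the UTF-8 case, here's a small illustrative snippet (a sketch, not library code):
fn main() {
    let s = "Alice’s";
    // Byte offset 5 is a codepoint boundary (the start of ’), so slicing there is fine:
    assert_eq!(&s[..5], "Alice");
    // Byte offset 6 falls inside the 3-byte UTF-8 encoding of ’ (U+2019), so it is
    // not a valid UTF-8 code unit offset; slicing at it would panic.
    assert!(s.is_char_boundary(5));
    assert!(!s.is_char_boundary(6));
}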
So for example, if I have a variable `string` in Rust with type `&str`, and `s` and `e` are byte offsets, then I can write `&string[s..e]` to get the substring at those offsets. This operation is done in constant time with just a little bit of pointer arithmetic. It is extremely fast. However, if `s` and `e` were actually codepoint offsets, then I'd need to do something much more elaborate to get the corresponding substring:
assert!(s <= e);
let (mut byte_start, mut byte_end) = (None, None);
for (codepoint_index, (byte_offset, _)) in string.char_indices().enumerate() {
    if codepoint_index == s {
        byte_start = Some(byte_offset);
    }
    if codepoint_index == e {
        byte_end = Some(byte_offset);
        break;
    }
}
&string[byte_start.unwrap()..byte_end.unwrap()]
The fact that this code is not easy to write is very much purposeful. It is not easy for the following reasons:
- There are serious costs to doing this, especially if one expects substring operations to be cheap.
- Slicing by codepoint is virtually never what you want.
And of course, if what you actually have are UTF-16 code unit offsets, then the code above is wrong for strings containing text outside of the BMP.
I hope this makes it clear why this library returns byte offsets, or more generally, "UTF-8 code unit offsets." It's because this library is encoding agnostic but assumes, by convention, that the bytes are UTF-8.
But why can't I make this an option in the library for cases like yours? In theory it is possible to do, but:
- It would be quite a lot of complexity to maintain both code paths, as the optimal way of doing this would be to perform UTF-8 decoding during the search routine.
- Since UTF-8 decoding must be done on every byte searched, this would also inhibit a lot of optimizations that are encoding agnostic. The vectorized Teddy multiple substring matcher and `memchr` are two examples that literally could not be used if codepoint offsets had to be returned.
The use cases that require codepoint offsets are rare and niche, and thus not worth the above costs.
OK OK, I hear you, but what should I do?
I think first, you have to figure out whether what you have are actually codepoint offsets or UTF-16 code unit offsets. In either case, the answer is simple: you must translate byte offsets back to codepoint offsets (or vice versa if necessary). If a search is a one-off, then this could be done after the search using a linear scan not too dissimilar from the one above. If you're searching the same text many times, then you might instead consider building a map from byte offset to codepoint offset, and then using that to do the translation in constant time for each search.
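As a sketch of the one-off linear scan (with a hypothetical helper name, and assuming the byte offset lies on a codepoint boundary, which search routines guarantee):
/// Translates a byte offset into `haystack` to the corresponding codepoint
/// offset with a single linear scan over the preceding text.
fn byte_to_codepoint_offset(haystack: &str, byte_offset: usize) -> usize {
    haystack[..byte_offset].chars().count()
}
For repeated searches over the same text, precomputing a map instead amortizes this cost to a constant-time lookup per match.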
You might say that this imposes an additional cost, and it's true. But if this library did it for you, it would impose additional costs there as well. Also, presumably, you are at least transcoding the text to UTF-8 before you use this library, and that could be the point at which you compute the translation table. If you can afford the transcoding, you might be able to afford the translation. If not, then yeah, maybe this library doesn't work for your use case.
In general, your problem here is that you have offsets in some text being reported and collaborated on by two different ecosystems. The best way, in my experience, to connect these is indeed to use codepoint offsets, because they are truly independent of encoding. The downside of them is that you'll have to translate between them and code unit offsets when you touch something that is coupled with a particular encoding, like this library is.
N.B. I have actually worked in exactly the space you're talking about for years. At my job, we do NER on web pages and have a UI in the browser. We had to solve a similar problem with offsets, and we did that by using codepoint offsets. In the Javascript world, those have to be translated to UTF-16 code unit offsets. But in our backend, which uses UTF-8, they have to be translated to UTF-8 code unit offsets (equivalent to byte offsets).
This is truly a superb explanation. You've won me over. Everything should be done at the byte level and then re-mapped to code points downstream (rather than the reverse). Please correct me if I'm wrong, but I think when you reference "UTF-8" above, you really mean ASCII, since UTF-8 is a potentially multi-byte encoding, while ASCII is a fixed-width 1-byte encoding.
For future projects, I'll definitely keep this in mind but for now, I've decided to use the unicode-segmentation crate to get the code point offsets from the output of AC.
Thanks!
Please correct me if I'm wrong, but I think when you reference "UTF-8" above, you really mean ASCII, since UTF-8 is a potentially multi-byte encoding, while ASCII is a fixed-width 1-byte encoding.
I reviewed my comment and every use of UTF-8 looks precisely correct to me. If you want to pick out a particular phrasing that is unclear, I would be happy to elaborate.
I've decided to use the unicode-segmentation crate to get the code point offsets from the output of AC.
So this isn't quite correct. That crate deals with grapheme clusters (among other things). Those are distinct from codepoints. Many codepoints may be a part of a single grapheme cluster. Getting codepoint offsets only requires the standard library.
If you have example inputs and outputs, along with some code, I would be happy to review it for you. It would be helpful if your input contained codepoints outside the basic multilingual plane, as it should help crystallize edge cases.
I'm not concerned with non basic multilingual plane characters, so the example from Alice in Wonderland at the beginning contains a non-ASCII character, Unicode decimal code 8217 (https://www.codetable.net/decimal/8217). Here's some example code. And I would want matches to contain (0, 39, 49) instead of (0, 41, 51). Pretty new to Rust so apologies in advance.
use aho_corasick::AhoCorasickBuilder;
// use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let doc = "The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll";
    let patterns = vec!["Adventures"];
    let ac = AhoCorasickBuilder::new().auto_configure(&patterns).build(patterns);
    // let indices = doc.grapheme_indices(false).collect::<Vec<(usize, &str)>>();
    let mut matches = vec![];
    let mut offset = 0usize;
    for mat in ac.find_iter(doc) {
        // find start offset
        // while indices[offset].0 < mat.start() {
        //     offset += 1;
        // }
        // let start = offset;
        // // find end offset
        // while indices[offset].0 < mat.end() {
        //     offset += 1;
        // }
        // let end = offset;
        matches.push((mat.pattern(), mat.start(), mat.end()));
        // push result to output array
        // matches.push((mat.pattern(), start, end));
    }
    println!("{:#?}", matches);
}
I've commented out the parts using segmentation.
I'm not concerned with non basic multilingual plane characters
Hmmm, okay, I guess I'm confused by this. Especially since you jumped to graphemes. Writing code that works within the BMP but not outside it is a really bad idea IMO. At the very least, please consider returning an error if non-BMP codepoints are detected. You might think your inputs will never have a non-BMP codepoint, but if that ever changes, bad things happen. In the best case, you'll get an out-of-bounds panic from a bad offset. Worst case, you won't get any overt error at all, and instead, the offsets will just be subtly incorrect.
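A cheap guard along those lines might look like this (a sketch with a hypothetical helper name, not part of any crate):
/// Returns an error carrying the first codepoint outside the Basic
/// Multilingual Plane, i.e. anything that needs a surrogate pair in UTF-16.
fn check_bmp_only(doc: &str) -> Result<(), char> {
    match doc.chars().find(|&c| c.len_utf16() == 2) {
        Some(c) => Err(c),
        None => Ok(()),
    }
}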
It's usually not hard to adapt whatever code you write to work with non-BMP codepoints. Typically, code is only broken in this respect when it assumes that UTF-16 is a fixed width encoding. (It's not. It's variable width just like UTF-8.)
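And if the offsets you ultimately need on the Javascript side are UTF-16 code unit offsets rather than codepoint offsets, the byte-offset-to-codepoint map idea works unchanged; just count `len_utf16()` per char instead of counting chars. Again, only a sketch, with a hypothetical name:
/// Maps each byte offset on a codepoint boundary in `haystack` to the offset
/// of that codepoint in the UTF-16 encoding of the same text. Correct for
/// non-BMP codepoints because it uses each char's actual UTF-16 length
/// instead of assuming it's always 1.
fn new_byte_to_utf16_map(haystack: &str) -> Vec<usize> {
    let mut map = vec![usize::MAX; haystack.len() + 1];
    let mut utf16_off = 0;
    for (byte_off, ch) in haystack.char_indices() {
        map[byte_off] = utf16_off;
        utf16_off += ch.len_utf16();
    }
    map[haystack.len()] = utf16_off;
    map
}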
What led you to want to use graphemes for this? This whole issue was about codepoint offsets, and then bam, in came graphemes. I'm very confused about that.
And I would want matches to contain (0, 39, 49) instead of (0, 41, 51). Pretty new to Rust so apologies in advance.
This should do it. And it works with non-BMP codepoints. Remember, this is returning codepoint offsets and not UTF-16 code unit offsets. They are two different things!
use aho_corasick::AhoCorasickBuilder;

fn main() {
    let doc = "The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll";
    let patterns = vec!["Adventures"];
    let ac = AhoCorasickBuilder::new().auto_configure(&patterns).build(patterns);
    let byte_to_codepoint = new_byte_to_codepoint_map(doc);
    let mut matches = vec![];
    for mat in ac.find_iter(doc) {
        let start = byte_to_codepoint[mat.start()];
        let end = byte_to_codepoint[mat.end()];
        matches.push((mat.pattern(), start, end));
    }
    println!("{:?}", matches);
}

/// The vector returned maps byte offsets in the given document to their
/// corresponding codepoint offsets. Byte offsets that fall inside the UTF-8
/// encoding of a codepoint should never be used, and in the map returned,
/// are mapped to usize::MAX.
fn new_byte_to_codepoint_map(doc: &str) -> Vec<usize> {
    let mut map = vec![usize::MAX; doc.len()];
    for (codepoint_off, (byte_off, _)) in doc.char_indices().enumerate() {
        map[byte_off] = codepoint_off;
    }
    map
}
For reference, the above code is slightly wrong (it's missing the final value), so here's a fixed version:
fn new_byte_to_codepoint_map(haystack: &str) -> Vec<usize> {
    let mut byte_to_code_point = vec![usize::MAX; haystack.len() + 1];
    let mut max_codepoint = 0;
    for (codepoint_off, (byte_off, _)) in haystack.char_indices().enumerate() {
        byte_to_code_point[byte_off] = codepoint_off;
        max_codepoint = codepoint_off;
    }
    // End index is exclusive (e.g. 0:3 is the first 3 characters), so handle
    // the case where the pattern is at the end of the string.
    if haystack.len() > 0 {
        byte_to_code_point[haystack.len()] = max_codepoint + 1;
    }
    byte_to_code_point
}
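A quick sanity check against the numbers from the top of the thread (a usage sketch, assuming the fixed function above is in scope):
fn main() {
    let doc = "The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll";
    let map = new_byte_to_codepoint_map(doc);
    // The byte offsets (41, 51) for "Adventures" translate back to the
    // codepoint offsets (39, 49) requested earlier in the thread.
    assert_eq!((map[41], map[51]), (39, 49));
}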
Added more bug fixes to the above, courtesy of property testing with Hypothesis.