WorksApplications/sudachi.rs

Python Exception Types

polm opened this issue · 5 comments

polm commented

I noticed that when the input is too long, the Python bindings raise an exception, but it's a plain Exception rather than a ValueError or something more specific. I see that the Rust code has a variety of specific error types.

I'm not familiar with Rust, but surely it's possible to have the Python code throw something more specific, like an InputTooLongException?
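To make the request concrete, here is a minimal sketch of what a more specific exception hierarchy could look like on the Python side. Everything in it is hypothetical: the class names `SudachiError` and `InputTooLongError`, the `tokenize` stand-in, and the `MAX_INPUT_BYTES` limit are illustrative only and are not taken from the actual bindings.

```python
from typing import List


class SudachiError(Exception):
    """Hypothetical base class for all tokenizer errors."""


class InputTooLongError(SudachiError, ValueError):
    """Hypothetical error raised when the input exceeds the maximum length."""

    def __init__(self, actual: int, limit: int) -> None:
        super().__init__(f"input is too long: {actual} bytes, limit is {limit}")
        self.actual = actual
        self.limit = limit


MAX_INPUT_BYTES = 16  # illustrative limit, not the library's real one


def tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: checks the UTF-8 length, then splits on whitespace."""
    size = len(text.encode("utf-8"))
    if size > MAX_INPUT_BYTES:
        raise InputTooLongError(size, MAX_INPUT_BYTES)
    return text.split()


try:
    tokenize("x" * 100)
except InputTooLongError as err:
    # Callers can react to this one case without catching every Exception.
    print(f"too long: {err.actual} > {err.limit}")
```

Because the hypothetical `InputTooLongError` derives from both a library base class and `ValueError`, existing code that catches `ValueError` would keep working, while new code could catch the narrower type.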

We will try to use better exception types later.

return Err(SudachiError::InputTooLong(sz, REALLY_MAX_LENGTH));

Is it possible to change this limit from u16 to u32 or larger?
As it is, I can't use it for large text.

polm commented

@gulldan Your question is not actually related to the main issue here, which is not about length but types. You should open a new issue.

Separately, from experience, tokenizers like this aren't designed for long inputs like that and you should split yours up into multiple calls.
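As a rough illustration of splitting input into multiple calls, the sketch below cuts a long string into pieces that each fit under a byte limit, preferring to cut right after a sentence delimiter. The function name `split_for_tokenizer`, the `max_bytes` parameter, and the default delimiters are assumptions made for this example; the real limit and the real tokenizer call are not shown here.

```python
from typing import Iterator


def split_for_tokenizer(
    text: str, max_bytes: int, delimiters: str = "。\n"
) -> Iterator[str]:
    """Yield pieces of text, each at most max_bytes long in UTF-8.

    Cuts after the last delimiter that fits; falls back to a hard cut
    on a character boundary when a piece contains no delimiter.
    """
    if max_bytes < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")
    start = 0
    while start < len(text):
        end = start
        size = 0
        last_break = -1
        # Greedily take characters while the piece still fits.
        while end < len(text):
            ch_size = len(text[end].encode("utf-8"))
            if size + ch_size > max_bytes:
                break
            size += ch_size
            end += 1
            if text[end - 1] in delimiters:
                last_break = end
        # If text remains, prefer to end the piece at a delimiter.
        if end < len(text) and last_break > start:
            end = last_break
        yield text[start:end]
        start = end


pieces = list(split_for_tokenizer("あい。うえお。か", max_bytes=12))
print(pieces)
```

Each yielded piece would then be passed to the tokenizer in its own call. Cutting at sentence boundaries keeps words from being split across calls; the hard-cut fallback can still split a word, so a delimiter set that matches the text matters.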

Adding to @polm's point: raising the maximum input length to u32::MAX would make it possible for Sudachi to crash with an out-of-memory error, because memory usage for long sentences is very significant. In the future it would be better to add an API for analyzing long text, as the Java version has.

Also, getting back to the original issue, I think I changed all usages of the plain Python Exception type to SudachiError over the last 3-4 versions. The next version will fix the last couple of usages.

SudachiError will be used instead.

polm commented

Using a single generic error feels a little too general, but it's much better than a full Exception - thanks! I'll go ahead and close this.