extract-file failed: 'attempt to subtract with overflow'
bact opened this issue · 7 comments
(in attempt to fix #133)
As an experiment, to see how the sentence extractor rules for Thai would work with a proper sentence splitter,
I extracted all the text from Wikipedia using this command:
cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt
Then I used an external sentence splitter ( https://pythainlp.github.io/docs/2.3/api/tokenize.html#module-pythainlp.tokenize.crfcut ) to get better-segmented sentences and stored them in another text file.
Then I tried to extract the sentences that match the rules from that line-break-separated file (one sentence per line),
and I got this error message:
thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63
The full error message and backtrace are below:
$ cargo run -- extract-file -l th -d ../texts/ >> wiki.th-new.txt
Finished dev [unoptimized + debuginfo] target(s) in 0.11s
Running `target/debug/common_voice_sentence_collector extract-file -l th -d ../texts/`
Loading rules at "./src/rules/th.toml"
Using Rules Rules { min_trimmed_length: 3, min_word_count: 1, max_word_count: 5, min_characters: 6, may_end_with_colon: false, quote_start_with_letter: true, needs_punctuation_end: false, needs_uppercase_start: false, needs_letter_start: true, allowed_symbols_regex: "[0-9 \u{200b}\u{200c}ก-ฮะ-\u{e39}เ-ๅ\u{e47}-\u{e4c}\\-\\.‚;:!\\?“”‘’\"'`]", disallowed_symbols: [], disallowed_words: {}, broken_whitespace: [String(" "), String(" ,"), String(" ."), String(" ;")], abbreviation_patterns: [String("[A-Z]{2,}"), String("[A-Z]+\\.*[A-Z]+"), String("[ก-ฮ]{1,3}\\.([ก-ฮ]{1,3}\\.)+")], other_patterns: [String("[\\.,:;-]$"), String("[,:;]\\S"), String("[\\.|\\?|!].+$"), String("^.{81,}$"), String("(^|\\s+)[ะาำๅ\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}\u{e38}\u{e39}\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}]"), String("[เแโใไ](\\s+|$)"), String("[\u{200b}\u{200c}ก-ฮะ-\u{e39}เ-\u{e4c}‘’‚;:“”\"'`\\-\\?\\.!]{55,}"), String("^[\u{200b}\u{200c}]*[^ณ]\\s"), String("^[\u{200b}\u{200c}]*[บ\u{e49}าง|ก\u{e48}อน|เลย|แล\u{e49}ว|หร\u{e37}อไม\u{e48}|ไหม|ล\u{e48}ะ|ด\u{e49}วย|อ\u{e35}ก|และ|หร\u{e37}อ|ก\u{e31}บ|ก\u{e47}]\\s"), String("^\\S{2,3}[\u{200b}\u{200c}]*\\s"), String("\\s\\S{1,3}[\u{200b}\u{200c}]*$"), String("\\s[และ|หร\u{e37}อ|ก\u{e31}บ|เช\u{e48}น][\u{200b}\u{200c}]*$"), String("[เแโใไ]{2,}"), String("[ะาำๅ]{2,}"), String("[\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}]{2,}"), String("[\u{e38}\u{e39}]{2,}"), String("[\u{e48}\u{e49}\u{e4a}\u{e4b}]{2,}"), String("\u{e3a}{2,}"), String("\u{e4c}{2,}"), String("\u{e4d}{2,}"), String("\u{e4e}{2,}"), String("[เแโใไะาำๅ][\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}]"), String("[\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}][\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}\u{e38}\u{e39}]")], replacements: [Array([String("\u{200b}"), String("")]), Array([String("\u{200c}"), String("")]), Array([String(" พ.ร.บ."), String(" พระราชบ\u{e31}ญญ\u{e31}ต\u{e34}")]), Array([String(" พ.ร.ก."), String(" 
พระราชกำหนด")]), Array([String(" พ.ศ. "), String(" พ\u{e38}ทธศ\u{e31}กราช ")]), Array([String(" ค.ศ. "), String(" คร\u{e34}สต\u{e4c}ศ\u{e31}กราช ")]), Array([String(" ม.ร.ว."), String(" หม\u{e48}อมราชวงศ\u{e4c}")]), Array([String(" ."), String(".")]), Array([String(" ,"), String(" ")]), Array([String(" :"), String(":")]), Array([String(" ;"), String(";")]), Array([String(" !"), String("!")]), Array([String(" ?"), String("?")]), Array([String(":"), String(": ")]), Array([String("?"), String("? ")]), Array([String("!"), String("! ")]), Array([String(","), String(" ")]), Array([String(".."), String(" ")]), Array([String("..."), String(" ")]), Array([String("...."), String(" ")]), Array([String(" ."), String(".")]), Array([String(" "), String(" ")]), Array([String(" "), String(" ")]), Array([String(" "), String(" ")]), Array([String("เเ"), String("แ")]), Array([String("\u{e4d}า"), String("ำ")]), Array([String("\u{e4d}\u{e48}า"), String("\u{e48}ำ")]), Array([String("\u{e4d}\u{e49}า"), String("\u{e49}ำ")]), Array([String("\u{e4d}\u{e4a}า"), String("\u{e4a}ำ")]), Array([String("\u{e4d}\u{e4b}า"), String("\u{e4b}ำ")]), Array([String("ฤา"), String("ฤๅ")]), Array([String("ฦา"), String("ฦๅ")])], even_symbols: [String("\""), String("'")], matching_symbols: [Array([String("‘"), String("’")]), Array([String("“"), String("”")])] }
Using disallowed_word_file = false
file_name = "../texts/wiki.th.all-filtered.txt"
thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63
stack backtrace:
0: 0x103142b64 - std::backtrace_rs::backtrace::libunwind::trace::h79c24a8108eef51e
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
1: 0x103142b64 - std::backtrace_rs::backtrace::trace_unsynchronized::hf491b9388f4887f5
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x103142b64 - std::sys_common::backtrace::_print_fmt::h5132bce5284c3ec2
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:67:5
3: 0x103142b64 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hba4e1e451ca8711d
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:46:22
4: 0x103160b4e - core::fmt::write::h7baaf1618474dae0
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/fmt/mod.rs:1094:17
5: 0x10314011a - std::io::Write::write_fmt::hd293de47cc154cdf
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/io/mod.rs:1580:15
6: 0x10314481f - std::sys_common::backtrace::_print::hb9d4bc7b9e0ae081
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:49:5
7: 0x10314481f - std::sys_common::backtrace::print::h82a68481004d7b57
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:36:9
8: 0x10314481f - std::panicking::default_hook::{{closure}}::h11b9cc5ac5c4d127
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:208:50
9: 0x103144329 - std::panicking::default_hook::hfe650a460287c541
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:225:9
10: 0x103144f75 - std::panicking::rust_panic_with_hook::h5212f5e986dcd234
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:591:17
11: 0x103144ac9 - std::panicking::begin_panic_handler::{{closure}}::hd4a4baba3ac1c064
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:495:13
12: 0x103143008 - std::sys_common::backtrace::__rust_end_short_backtrace::h5a76e76b61bd088d
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:141:18
13: 0x103144a5a - rust_begin_unwind
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:493:5
14: 0x10315f00f - core::panicking::panic_fmt::h6b7498085d32aaee
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/panicking.rs:92:14
15: 0x10315ef67 - core::panicking::panic::he65ad651ff2e7951
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/panicking.rs:50:5
16: 0x102da151b - common_voice_sentence_collector::extractor::pick_sentences::h5dff4df1baa07207
at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:101:63
17: 0x102da0f5f - common_voice_sentence_collector::extractor::choose::hf29c68337c2c6437
at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:68:9
18: 0x102da0612 - common_voice_sentence_collector::extractor::extract::ha227c2098ed10762
at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:27:29
19: 0x102d8cdd3 - common_voice_sentence_collector::app::start::h28d652a6aaf2528f
at /Users/arthit/projects/cv-sentence-extractor/src/app.rs:80:16
20: 0x102d6400d - common_voice_sentence_collector::app::run::he91c6de970f5ecc7
at /Users/arthit/projects/cv-sentence-extractor/src/app.rs:59:5
21: 0x102d77b26 - common_voice_sentence_collector::main::hfd4bf9963f894313
at /Users/arthit/projects/cv-sentence-extractor/src/main.rs:8:5
22: 0x102d77bc5 - core::ops::function::FnOnce::call_once::hacfea633331549bd
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/ops/function.rs:227:5
23: 0x102d668cc - std::sys_common::backtrace::__rust_begin_short_backtrace::h741cc0dfecc9cbff
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:125:18
24: 0x102d67f78 - std::rt::lang_start::{{closure}}::ha40e5aeaf02316c1
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:66:18
25: 0x1031452e4 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h88801ec30fa967bc
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/ops/function.rs:259:13
26: 0x1031452e4 - std::panicking::try::do_call::ha5838b1ed53bb3ce
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:379:40
27: 0x1031452e4 - std::panicking::try::h2c2c426e3f3c01a8
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:343:19
28: 0x1031452e4 - std::panic::catch_unwind::h383eb7eff10b175f
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panic.rs:431:14
29: 0x1031452e4 - std::rt::lang_start_internal::h09b48eb36ffca70d
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:51:25
30: 0x102d67f4e - std::rt::lang_start::hc9ed7f08068d5206
at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:65:5
31: 0x102d77b46 - _main
Note that this is not urgent for me.
But anyone who has an interest in extract-file
may like to know about this.
Good catch! Can you attach the txt file here so I can try to reproduce?
Here is the link to the txt file (51 MB):
https://drive.google.com/file/d/13GGr0wxwQXhWrTXTvmzdCodhJ9Atf9NJ/view?usp=sharing
Thank you. With the fix I just pushed I was able to run through the whole file (it took some time, but it worked).
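For readers who hit the same panic: the fix itself is not shown in this thread, so the following is only a sketch of the general class of bug. In debug builds, subtracting a larger `usize` from a smaller one panics with 'attempt to subtract with overflow'. The function names and the "budget" scenario below are hypothetical and are not taken from `src/extractor.rs`.

```rust
/// Hypothetical helper (not from the repository): how many more sentences
/// may still be picked given a limit.
///
/// Writing `limit - already_picked` directly on `usize` panics in debug
/// builds with "attempt to subtract with overflow" whenever
/// `already_picked > limit`.
fn remaining_budget(limit: usize, already_picked: usize) -> usize {
    // `saturating_sub` clamps the result at 0 instead of underflowing.
    limit.saturating_sub(already_picked)
}

/// Alternative: surface the underflow case to the caller as `None`
/// so it can be handled explicitly.
fn remaining_budget_checked(limit: usize, already_picked: usize) -> Option<usize> {
    limit.checked_sub(already_picked)
}
```

Whether clamping to zero or returning `None` is appropriate depends on what the surrounding code does with the result; `checked_sub` is the safer choice when a negative result would indicate a logic error.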
Thank you! That was quick!
Btw, the resulting file from this process will not pass the legal requirement, right? Since it doesn't guarantee that only 3 sentences will be picked from an article.
Just to confirm that we cannot submit the output to the Sentence Collector. thx
> Btw, the resulting file from this process will not pass the legal requirement, right? Since it doesn't guarantee that only 3 sentences will be picked from an article.
If no manual intervention is needed, we might be able to find a solution, even if it involves more than just the code in this repo. However, we definitely need to make sure we're not taking more than 3 sentences per article (and no sentences from articles with fewer than 3 sentences). For this case I'm not sure how we can guarantee that, though :/
> Just to confirm that we cannot submit the output to the Sentence Collector. thx
The output of the extraction wouldn't go through the Sentence Collector. Once extractor rule files get merged we can run an automatic extraction and then add the output directly to the Common Voice repo. The important thing here is that it's run through our process so we can guarantee that we indeed did not take more than 3 per article.
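To make the per-article constraint concrete, here is a minimal sketch, assuming article boundaries are still available (which is exactly what the one-sentence-per-line file above loses). The function name, its parameters, and the choice to take the first acceptable sentences rather than sampling are all assumptions made for illustration; this is not the extractor's actual implementation.

```rust
/// Hypothetical sketch (not the extractor's real code): pick at most
/// `max_per_article` acceptable sentences from each article, and pick
/// nothing from articles that have fewer than `max_per_article` sentences.
fn pick_per_article<'a>(
    articles: &[Vec<&'a str>],
    max_per_article: usize,
    is_acceptable: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    let mut picked = Vec::new();
    for sentences in articles {
        // Skip articles that are too short to take sentences from.
        if sentences.len() < max_per_article {
            continue;
        }
        // Keep only sentences that pass the rules, capped per article.
        picked.extend(
            sentences
                .iter()
                .copied()
                .filter(|s| is_acceptable(*s))
                .take(max_per_article),
        );
    }
    picked
}
```

Calling `pick_per_article(&articles, 3, |s| !s.is_empty())` would, under these assumptions, return no more than three sentences for any single article. The cap can only be enforced because the input is grouped by article; a flat list of sentences gives no way to check it.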
Thank you for the clarification.