common-voice/cv-sentence-extractor

extract-file failed: 'attempt to subtract with overflow'

bact opened this issue · 7 comments

bact commented

(in attempt to fix #133)

As an experiment, to see how the sentence extractor rules for Thai would work with a proper sentence splitter,
I extracted all the text from Wikipedia using this command:

cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt

Then I used an external sentence splitter ( https://pythainlp.github.io/docs/2.3/api/tokenize.html#module-pythainlp.tokenize.crfcut ) to get better-segmented sentences and stored them in another text file.

Then I tried to extract the sentences that match the rules from that line-break-separated file (one sentence per line),
and I got this error message:

thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63

The full error message and backtrace are here:

$ cargo run -- extract-file -l th -d ../texts/ >> wiki.th-new.txt
    Finished dev [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/common_voice_sentence_collector extract-file -l th -d ../texts/`
Loading rules at "./src/rules/th.toml"
Using Rules Rules { min_trimmed_length: 3, min_word_count: 1, max_word_count: 5, min_characters: 6, may_end_with_colon: false, quote_start_with_letter: true, needs_punctuation_end: false, needs_uppercase_start: false, needs_letter_start: true, allowed_symbols_regex: "[0-9 \u{200b}\u{200c}ก-ฮะ-\u{e39}เ-ๅ\u{e47}-\u{e4c}\\-\\.‚;:!\\?“”‘’\"'`]", disallowed_symbols: [], disallowed_words: {}, broken_whitespace: [String("  "), String(" ,"), String(" ."), String(" ;")], abbreviation_patterns: [String("[A-Z]{2,}"), String("[A-Z]+\\.*[A-Z]+"), String("[ก-ฮ]{1,3}\\.([ก-ฮ]{1,3}\\.)+")], other_patterns: [String("[\\.,:;-]$"), String("[,:;]\\S"), String("[\\.|\\?|!].+$"), String("^.{81,}$"), String("(^|\\s+)[ะาำๅ\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}\u{e38}\u{e39}\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}]"), String("[เแโใไ](\\s+|$)"), String("[\u{200b}\u{200c}ก-ฮะ-\u{e39}เ-\u{e4c}‘’‚;:“”\"'`\\-\\?\\.!]{55,}"), String("^[\u{200b}\u{200c}]*[^ณ]\\s"), String("^[\u{200b}\u{200c}]*[บ\u{e49}าง|ก\u{e48}อน|เลย|แล\u{e49}ว|หร\u{e37}อไม\u{e48}|ไหม|ล\u{e48}ะ|ด\u{e49}วย|อ\u{e35}ก|และ|หร\u{e37}อ|ก\u{e31}บ|ก\u{e47}]\\s"), String("^\\S{2,3}[\u{200b}\u{200c}]*\\s"), String("\\s\\S{1,3}[\u{200b}\u{200c}]*$"), String("\\s[และ|หร\u{e37}อ|ก\u{e31}บ|เช\u{e48}น][\u{200b}\u{200c}]*$"), String("[เแโใไ]{2,}"), String("[ะาำๅ]{2,}"), String("[\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}]{2,}"), String("[\u{e38}\u{e39}]{2,}"), String("[\u{e48}\u{e49}\u{e4a}\u{e4b}]{2,}"), String("\u{e3a}{2,}"), String("\u{e4c}{2,}"), String("\u{e4d}{2,}"), String("\u{e4e}{2,}"), String("[เแโใไะาำๅ][\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}]"), String("[\u{e48}\u{e49}\u{e4a}\u{e4b}\u{e3a}\u{e4c}\u{e4d}\u{e4e}][\u{e31}\u{e34}\u{e35}\u{e36}\u{e37}\u{e4d}\u{e47}\u{e38}\u{e39}]")], replacements: [Array([String("\u{200b}"), String("")]), Array([String("\u{200c}"), String("")]), Array([String(" พ.ร.บ."), String(" พระราชบ\u{e31}ญญ\u{e31}ต\u{e34}")]), Array([String(" พ.ร.ก."), String(" 
พระราชกำหนด")]), Array([String(" พ.ศ. "), String(" พ\u{e38}ทธศ\u{e31}กราช ")]), Array([String(" ค.ศ. "), String(" คร\u{e34}สต\u{e4c}ศ\u{e31}กราช ")]), Array([String(" ม.ร.ว."), String(" หม\u{e48}อมราชวงศ\u{e4c}")]), Array([String(" ."), String(".")]), Array([String(" ,"), String(" ")]), Array([String(" :"), String(":")]), Array([String(" ;"), String(";")]), Array([String(" !"), String("!")]), Array([String(" ?"), String("?")]), Array([String(":"), String(": ")]), Array([String("?"), String("? ")]), Array([String("!"), String("! ")]), Array([String(","), String(" ")]), Array([String(".."), String(" ")]), Array([String("..."), String(" ")]), Array([String("...."), String(" ")]), Array([String(" ."), String(".")]), Array([String("    "), String(" ")]), Array([String("   "), String(" ")]), Array([String("  "), String(" ")]), Array([String("เเ"), String("แ")]), Array([String("\u{e4d}า"), String("ำ")]), Array([String("\u{e4d}\u{e48}า"), String("\u{e48}ำ")]), Array([String("\u{e4d}\u{e49}า"), String("\u{e49}ำ")]), Array([String("\u{e4d}\u{e4a}า"), String("\u{e4a}ำ")]), Array([String("\u{e4d}\u{e4b}า"), String("\u{e4b}ำ")]), Array([String("ฤา"), String("ฤๅ")]), Array([String("ฦา"), String("ฦๅ")])], even_symbols: [String("\""), String("'")], matching_symbols: [Array([String("‘"), String("’")]), Array([String("“"), String("”")])] }
Using disallowed_word_file = false
file_name = "../texts/wiki.th.all-filtered.txt"
thread 'main' panicked at 'attempt to subtract with overflow', src/extractor.rs:101:63
stack backtrace:
   0:        0x103142b64 - std::backtrace_rs::backtrace::libunwind::trace::h79c24a8108eef51e
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:        0x103142b64 - std::backtrace_rs::backtrace::trace_unsynchronized::hf491b9388f4887f5
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x103142b64 - std::sys_common::backtrace::_print_fmt::h5132bce5284c3ec2
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x103142b64 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hba4e1e451ca8711d
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:46:22
   4:        0x103160b4e - core::fmt::write::h7baaf1618474dae0
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/fmt/mod.rs:1094:17
   5:        0x10314011a - std::io::Write::write_fmt::hd293de47cc154cdf
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/io/mod.rs:1580:15
   6:        0x10314481f - std::sys_common::backtrace::_print::hb9d4bc7b9e0ae081
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:49:5
   7:        0x10314481f - std::sys_common::backtrace::print::h82a68481004d7b57
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:36:9
   8:        0x10314481f - std::panicking::default_hook::{{closure}}::h11b9cc5ac5c4d127
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:208:50
   9:        0x103144329 - std::panicking::default_hook::hfe650a460287c541
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:225:9
  10:        0x103144f75 - std::panicking::rust_panic_with_hook::h5212f5e986dcd234
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:591:17
  11:        0x103144ac9 - std::panicking::begin_panic_handler::{{closure}}::hd4a4baba3ac1c064
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:495:13
  12:        0x103143008 - std::sys_common::backtrace::__rust_end_short_backtrace::h5a76e76b61bd088d
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:141:18
  13:        0x103144a5a - rust_begin_unwind
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:493:5
  14:        0x10315f00f - core::panicking::panic_fmt::h6b7498085d32aaee
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/panicking.rs:92:14
  15:        0x10315ef67 - core::panicking::panic::he65ad651ff2e7951
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/panicking.rs:50:5
  16:        0x102da151b - common_voice_sentence_collector::extractor::pick_sentences::h5dff4df1baa07207
                               at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:101:63
  17:        0x102da0f5f - common_voice_sentence_collector::extractor::choose::hf29c68337c2c6437
                               at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:68:9
  18:        0x102da0612 - common_voice_sentence_collector::extractor::extract::ha227c2098ed10762
                               at /Users/arthit/projects/cv-sentence-extractor/src/extractor.rs:27:29
  19:        0x102d8cdd3 - common_voice_sentence_collector::app::start::h28d652a6aaf2528f
                               at /Users/arthit/projects/cv-sentence-extractor/src/app.rs:80:16
  20:        0x102d6400d - common_voice_sentence_collector::app::run::he91c6de970f5ecc7
                               at /Users/arthit/projects/cv-sentence-extractor/src/app.rs:59:5
  21:        0x102d77b26 - common_voice_sentence_collector::main::hfd4bf9963f894313
                               at /Users/arthit/projects/cv-sentence-extractor/src/main.rs:8:5
  22:        0x102d77bc5 - core::ops::function::FnOnce::call_once::hacfea633331549bd
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/ops/function.rs:227:5
  23:        0x102d668cc - std::sys_common::backtrace::__rust_begin_short_backtrace::h741cc0dfecc9cbff
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/sys_common/backtrace.rs:125:18
  24:        0x102d67f78 - std::rt::lang_start::{{closure}}::ha40e5aeaf02316c1
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:66:18
  25:        0x1031452e4 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h88801ec30fa967bc
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/core/src/ops/function.rs:259:13
  26:        0x1031452e4 - std::panicking::try::do_call::ha5838b1ed53bb3ce
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:379:40
  27:        0x1031452e4 - std::panicking::try::h2c2c426e3f3c01a8
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panicking.rs:343:19
  28:        0x1031452e4 - std::panic::catch_unwind::h383eb7eff10b175f
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/panic.rs:431:14
  29:        0x1031452e4 - std::rt::lang_start_internal::h09b48eb36ffca70d
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:51:25
  30:        0x102d67f4e - std::rt::lang_start::hc9ed7f08068d5206
                               at /rustc/132b4e5d167b7e622fcc11fa2b67b931105b4de1/library/std/src/rt.rs:65:5
  31:        0x102d77b46 - _main
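
The code at src/extractor.rs:101 is not shown in this thread, so the following is only a minimal sketch of how this class of panic usually arises in Rust, not the extractor's actual logic. It assumes a hypothetical helper that computes `len() - 1` on a collection that can be empty; the function names are made up for illustration. Because `usize` is unsigned, `0 - 1` cannot be represented, and debug builds (such as the `dev` profile used above) panic with "attempt to subtract with overflow".

```rust
/// Hypothetical helper, not taken from the extractor: index of the last
/// element. For an empty slice this evaluates `0 - 1` on a `usize`, which
/// panics in debug builds with "attempt to subtract with overflow".
fn last_index_naive(items: &[&str]) -> usize {
    items.len() - 1
}

/// Same computation with the empty case made explicit: `checked_sub`
/// returns `None` instead of panicking when the result would be negative.
fn last_index_checked(items: &[&str]) -> Option<usize> {
    items.len().checked_sub(1)
}

/// Alternative that clamps at zero. Only appropriate when 0 is a sensible
/// fallback for the caller.
fn last_index_saturating(items: &[&str]) -> usize {
    items.len().saturating_sub(1)
}
```

Whether `checked_sub` or `saturating_sub` is the better choice depends on whether the caller needs to distinguish "nothing to pick" from "pick index 0".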

Note that this is not urgent for me.
But anyone who has an interest in extract-file may want to know about this.

Good catch! Can you attach the txt file here so I can try to reproduce?

Thank you. With the fix I just pushed, I was able to run through the whole file (it took some time, but it worked).

bact commented

Thank you! That was quick!

bact commented

Btw, the resulting file from this process will not pass the legal requirement, right? Since it doesn't guarantee that only 3 sentences will be picked from an article.

Just to confirm that we cannot submit the output to the Sentence Collector. thx

> Btw, the resulting file from this process will not pass the legal requirement, right? Since it doesn't guarantee that only 3 sentences will be picked from an article.

If no manual intervention is needed, we might be able to find a solution, even if it involves more than just the code in this repo. However, we definitely need to make sure we're not taking more than 3 sentences per article (and no sentences from articles with fewer than 3 sentences). For this case, I'm not sure how we can guarantee that though :/
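
To make the constraint concrete, here is a minimal sketch of the per-article rule as described in this comment. It assumes the input still carries article boundaries (one `Vec<String>` per article), which is presumably what a single file with one sentence per line no longer provides. The function and constant names are hypothetical, and taking the first three sentences is only a simplification for the example; how the real extractor chooses sentences is not specified here.

```rust
/// Limit taken from the discussion: at most 3 sentences per article.
const MAX_SENTENCES_PER_ARTICLE: usize = 3;

/// Hypothetical sketch: each inner Vec holds the candidate sentences of one
/// article. Articles with fewer than 3 sentences contribute nothing; every
/// other article contributes exactly 3 (the first three, for determinism).
fn pick_per_article(articles: &[Vec<String>]) -> Vec<String> {
    articles
        .iter()
        // Skip articles that are too short to take anything from.
        .filter(|sentences| sentences.len() >= MAX_SENTENCES_PER_ARTICLE)
        // Take at most the allowed number of sentences from each article.
        .flat_map(|sentences| sentences.iter().take(MAX_SENTENCES_PER_ARTICLE).cloned())
        .collect()
}
```

Once sentences from all articles are merged into one flat list, this grouping step has nothing to work with, so the limit cannot be checked after the fact.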

> Just to confirm that we cannot submit the output to the Sentence Collector. thx

The output of the extraction wouldn't go through the Sentence Collector. Once extractor rule files get merged we can run an automatic extraction and then add the output directly to the Common Voice repo. The important thing here is that it's run through our process so we can guarantee that we indeed did not take more than 3 per article.

bact commented

Thank you for the clarification.