lindera-morphology/lindera

Not using correct right_id to calculate cost?

BlueGreenMagick opened this issue · 1 comments

Is there a reason this crate only uses a word entry's left_id to calculate cost? left_id() and right_id() both return entry.cost_id, which is parsed from the second field of each row in lex.csv, the left cost ID. The third field, the right cost ID, is not used at all.

pub fn left_id(&self) -> u32 {
    self.cost_id as u32
}
pub fn right_id(&self) -> u32 {
    self.cost_id as u32
}

for (row_id, row) in rows.iter().enumerate() {
    word_entry_map
        .entry(row[0].to_string())
        .or_insert_with(Vec::new)
        .push(WordEntry {
            word_id: WordId(row_id as u32, true),
            word_cost: i16::from_str(row[3].trim()).map_err(|_err| {
                LinderaErrorKind::Parse
                    .with_error(anyhow::anyhow!("failed to parse word_cost"))
            })?,
            cost_id: u16::from_str(row[1].trim()).map_err(|_err| {
                LinderaErrorKind::Parse
                    .with_error(anyhow::anyhow!("failed to parse cost_id"))
            })?,
        });
}
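
To illustrate why the distinction matters, here is a minimal sketch of a connection-cost lookup. It assumes the MeCab-style convention in which the cost of joining two adjacent words is looked up by the previous word's right ID and the next word's left ID. `Entry`, `ConnectionMatrix`, and `step_cost` are hypothetical names for illustration, not Lindera's actual types.

```rust
/// Hypothetical, simplified dictionary entry (not Lindera's WordEntry).
#[derive(Debug, Clone, Copy)]
struct Entry {
    left_id: u16,
    right_id: u16,
    word_cost: i16,
}

/// Hypothetical connection-cost matrix, stored row-major and indexed by
/// (right ID of the previous word, left ID of the next word).
struct ConnectionMatrix {
    num_left: usize,
    costs: Vec<i16>,
}

impl ConnectionMatrix {
    /// Looks up the cost of placing a word with `next_left_id`
    /// directly after a word with `prev_right_id`.
    fn cost(&self, prev_right_id: u16, next_left_id: u16) -> i32 {
        let index = prev_right_id as usize * self.num_left + next_left_id as usize;
        self.costs[index] as i32
    }
}

/// Cost added to a path when `next` follows `prev`:
/// the connection cost plus the word cost of `next`.
fn step_cost(matrix: &ConnectionMatrix, prev: &Entry, next: &Entry) -> i32 {
    matrix.cost(prev.right_id, next.left_id) + next.word_cost as i32
}
```

If `right_id` is silently replaced by `left_id`, the lookup reads a different row of the matrix whenever the two IDs differ, so the path costs change for any dictionary where they are not identical.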

@BlueGreenMagick
In IPADIC, the right context ID and the left context ID have the same value, so storing a single ID is a trick to keep the binary size of the dictionary as small as possible.
Other dictionaries can have different values for the two IDs, so this code should be corrected.
Thanks for your comment. 👍
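
One possible shape for the correction, as a minimal sketch only: keep both IDs on the entry and parse them from separate fields. It assumes the row layout described in the issue (surface, left ID, right ID, word cost). The `WordEntry` fields and the `parse_entry` helper here are hypothetical, and plain `String` errors stand in for `LinderaErrorKind` to keep the example self-contained.

```rust
use std::str::FromStr;

/// Hypothetical revised entry that stores both context IDs.
#[derive(Debug, PartialEq)]
struct WordEntry {
    word_cost: i16,
    left_id: u16,
    right_id: u16,
}

impl WordEntry {
    fn left_id(&self) -> u32 {
        self.left_id as u32
    }

    fn right_id(&self) -> u32 {
        self.right_id as u32
    }
}

/// Parses one row laid out as: surface, left ID, right ID, word cost, ...
fn parse_entry(row: &[&str]) -> Result<WordEntry, String> {
    if row.len() < 4 {
        return Err(format!("expected at least 4 fields, got {}", row.len()));
    }
    let left_id = u16::from_str(row[1].trim())
        .map_err(|_| "failed to parse left_id".to_string())?;
    let right_id = u16::from_str(row[2].trim())
        .map_err(|_| "failed to parse right_id".to_string())?;
    let word_cost = i16::from_str(row[3].trim())
        .map_err(|_| "failed to parse word_cost".to_string())?;
    Ok(WordEntry {
        word_cost,
        left_id,
        right_id,
    })
}
```

The trade-off is the one mentioned above: each entry carries an extra `u16`, so the dictionary grows slightly in exchange for handling dictionaries whose left and right IDs differ.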