long-term vs long term
naymaraq opened this issue · 4 comments
Hi
I have a question regarding the transcripts.
There are many words in the reference containing a dash character (e.g., non-cash, year-to-date,...).
How should I deal with these words? One option is to post-process the hypothesis files, and another is to add these words to the list of synonyms.
Hi!
If you are using our tool, fstalign (https://github.com/revdotcom/fstalign), to calculate WER, hyphenated words are treated by default as synonyms of their non-hyphenated forms (if you don't want this, you can disable it with --disable-hyphen-ignore). If you are using other tools to calculate error, I would agree with your suggestion to just replace the hyphens with spaces on both the reference and hypothesis side.
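For the "other tools" case, the replacement could look roughly like the sketch below. It assumes plain-text transcripts (not the .nlp format), and the helper name split_hyphenated is made up for illustration; it is not part of fstalign. Only hyphens that sit between two word characters are replaced, so free-standing dashes are left alone.

```python
import re

# Matches a hyphen that sits directly between two word characters,
# e.g. the hyphen in "non-cash" but not the one in "a - b".
_INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(text: str) -> str:
    """Replace intra-word hyphens with spaces (hypothetical helper)."""
    return _INTRA_WORD_HYPHEN.sub(" ", text)


if __name__ == "__main__":
    reference = "non-cash items are up year-to-date"
    hypothesis = "non cash items are up year to date"
    # Apply the same normalization to both sides before scoring.
    print(split_hyphenated(reference))
    print(split_hyphenated(reference) == split_hyphenated(hypothesis))
```

Running the same function over both the reference and the hypothesis keeps the two sides consistent, whichever tool computes the error rate afterwards.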
I am using the fstalign tool.
Here is part of the side-by-side log, followed by the command-line arguments.
As you can see, hyphenated words are considered errors even though I didn't set the --disable-hyphen-ignore flag.
stores stores
and and
<ins> non ERR
non-comp comp ERR
stores stores
which which
helps helps
us us
to to
evaluate evaluate
their their
performance performance
<ins> post ERR
post-refresh refresh ERR
of of
our our
approximately approximately
args = [
    "build/fstalign",
    "wer",
    "--hyp", f"{out}/output/{file_id}.nlp",
    "--ref", f"{out}/transcripts/nlp_references/{file_id}.nlp",
    "--syn", syn,
    "--ref-json", f"{norms}/{file_id}.norm.json",
    "--json-log", f"{log_folder}/{file_id}.json",
    "--output-sbs", f"{sbs_folder}/{file_id}.txt",
]
Can you check which version of fstalign you are running? That feature was released in 1.4.0 (https://github.com/revdotcom/fstalign/releases/tag/1.4.0) in February. Thanks!
Yeah, that's it. My version is 1.2.0.
Thanks!