revdotcom/speech-datasets

long-term vs long term

naymaraq opened this issue · 4 comments

Hi

I have a question about the transcripts.

Many words in the reference contain a hyphen (e.g., non-cash, year-to-date, ...).
How should these words be handled? One option is to post-process the hypothesis files; another is to add these words to the list of synonyms.

qmac commented

Hi!

If you are using our tool https://github.com/revdotcom/fstalign to calculate WER, hyphenated words are by default treated as synonyms of their non-hyphenated forms (if you don't want this, you can disable it with --disable-hyphen-ignore). If you are using other tools to calculate error, I would agree with your suggestion to just replace the hyphens with spaces on both the reference and hypothesis sides.
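
For the non-fstalign case, the hyphen-to-space replacement could look roughly like the sketch below. It assumes plain-text transcripts with whitespace-separated tokens (not the .nlp format, where splitting a token would also mean splitting its row), and the function name `split_hyphenated` is made up for illustration. It only touches hyphens that sit between two word characters, so standalone dashes and leading minus signs are left alone.

```python
import re

# A hyphen with a word character on both sides, e.g. "non-cash" or "year-to-date".
_INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphenated(text: str) -> str:
    """Replace intra-word hyphens with spaces (hypothetical helper)."""
    return _INTRA_WORD_HYPHEN.sub(" ", text)


if __name__ == "__main__":
    print(split_hyphenated("non-comp stores and year-to-date results"))
```

Apply the same function to both the reference and the hypothesis text before scoring, so that neither side keeps hyphenated tokens the other side lacks.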

naymaraq commented

I am using the fstalign tool.
Below is part of the side-by-side log, followed by the command-line arguments I used.
As you can see, hyphenated words are counted as errors even though I did not set the --disable-hyphen-ignore flag.

          stores    stores
             and    and
           <ins>    non                     ERR
        non-comp    comp                    ERR
          stores    stores
           which    which
           helps    helps
              us    us
              to    to
        evaluate    evaluate
           their    their
     performance    performance
           <ins>    post                    ERR
    post-refresh    refresh                 ERR
              of    of
             our    our
   approximately    approximately
   
   
args = ["build/fstalign",
        "wer",
        "--hyp", 
        f"{out}/output/{file_id}.nlp",
        "--ref",
        f"{out}/transcripts/nlp_references/{file_id}.nlp",
        "--syn",
        syn,
        "--ref-json",
        f"{norms}/{file_id}.norm.json",
        "--json-log",
        f"{log_folder}/{file_id}.json",
        "--output-sbs",
        f"{sbs_folder}/{file_id}.txt",
        ]

qmac commented

Could you check which version of fstalign you are running? That feature was released in 1.4.0 (https://github.com/revdotcom/fstalign/releases/tag/1.4.0) in February. Thanks!
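
If the version string is already known, a small check like the following sketch can tell whether it is new enough. It assumes plain dotted numeric versions such as "1.2.0"; the helper names are hypothetical, and how the version string is obtained from the installed binary is left out on purpose.

```python
def parse_version(version: str):
    """Turn a dotted version such as "1.4.0" (or "v1.4.0") into a tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


# Release that introduced the default hyphen handling, per the link above.
MIN_HYPHEN_IGNORE = parse_version("1.4.0")


def supports_hyphen_ignore(version: str) -> bool:
    """True if the given version is at least 1.4.0 (hypothetical helper)."""
    return parse_version(version) >= MIN_HYPHEN_IGNORE


if __name__ == "__main__":
    for candidate in ("1.2.0", "1.4.0"):
        print(candidate, supports_hyphen_ignore(candidate))
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as "1.10.0" and "1.4.0".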

naymaraq commented

Yeah, that's it. My version is 1.2.0.

Thanks!