divvun/libdivvun

divvun-normaliser

snomos opened this issue · 6 comments

Draft specification here.

Tasks:

  • Add support for analyser
  • Add support for generator
  • Add support for normaliser
  • Add support for tag filtering
  • Proper output formatting
  • Replace the lemma with the normalized lemma, and store the original lemma in a tag string in the same reading
  • #58

The following works fine without divvun-normaliser:

echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin | cg-mwesplit 
"<Man>"
	"Man" N Prop Sem/Plc Sg Nom <W:0.0>
	"Man" N Prop Sem/Sur Sg Nom <W:0.0>
	"man" Adv <W:0.0>
	"mij" Pron Interr Sg Gen <W:0.0>
	"mij" Pron Interr Sg Ill Attr <W:0.0>
	"mij" Pron Interr Sg Ine Attr <W:0.0>
	"mij" Pron Rel Sg Gen <W:0.0>
	"mij" Pron Rel Sg Ill Attr <W:0.0>
	"mij" Pron Rel Sg Ine Attr <W:0.0>
: 
"<vuoras>"
	"vuoras" A Attr <W:0.0>
	"vuoras" A Sg Nom <W:0.0>
	"vuoras" Err/Orth A Attr <W:0.0>
	"vuoras" Err/Orth A Sg Nom <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
	":" CLB <W:0.0>
: 
"<23>"
	"23" A Arab Ord Attr CLBfinal <W:0.0>
	"23" Num Arab Sg Ela Attr <W:0.0>
	"23" Num Arab Sg Gen <W:0.0>
	"23" Num Arab Sg Ill Attr <W:0.0>
	"23" Num Arab Sg Ine Attr <W:0.0>
	"23" Num Arab Sg Nom <W:0.0>
	"23" Num Sem/ID <W:0.0>
:\n

But with divvun-normaliser I get a libdivvun error (and not the expected output format):

echo 'Man vuoras: 23' | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g tools/tokenisers/mwe-dis.bin \
| cg-mwesplit \
| divvun-normaliser -a src/analyser-gt-desc.hfst -n tools/tts/transcriptor-gt-desc.hfst -g src/generator-gt-norm.hfst 
libdivvun: ERROR: HfstException.
"<Man>"
: 
"<vuoras>"
"<:>"
: 
"<23>"
:\n

It seems I hadn't managed to set the default for -t tags, so it printed nothing. Now it should copy the input if no tags are set to be expanded.
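
To make the intended fallback concrete, here is a minimal Python sketch of the selection logic, not the actual libdivvun code. The function name `select_for_expansion` is hypothetical, and it assumes a reading is expanded only when one of its whitespace-separated tags exactly matches a requested tag; with no requested tags, every reading is simply copied.

```python
def select_for_expansion(readings, expand_tags):
    """Pair each reading with a flag saying whether it should be expanded.

    Assumption: a reading qualifies when any of its whitespace-separated
    tags is in expand_tags. With an empty expand_tags nothing qualifies,
    so the caller copies every reading through unchanged.
    """
    wanted = set(expand_tags)
    return [(r, bool(wanted & set(r.split()))) for r in readings]


readings = [
    '\t"23" Num Arab Sg Nom <W:0.0>',
    '\t"23" Num Sem/ID <W:0.0>',
]
for reading, expand in select_for_expansion(readings, ["Arab"]):
    print("expand" if expand else "copy  ", reading.strip())
```

With `["Arab"]` only the first reading is flagged; with an empty tag list both are flagged as plain copies.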

Pushed a few more debugging changes; it seems we need hfstol transducers for lookup_fd:

echo 'Man vuoras: 23' | hfst-tokenise -g ~/github/giellalt/lang-smj/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g ~/github/giellalt/lang-smj/tools/tokenisers/mwe-dis.bin | cg-mwesplit | src/divvun-normaliser -a ~/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol -n ~/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol -g ~/github/giellalt/lang-smj/src/generator-gt-norm.hfstol --tags Arab -v
libdivvun: ERROR: HfstException: Exception: NotTransducerStreamException: transducer type not recognised in file: HfstInputStream.cc on line: 1088
Read /home/flammie/github/giellalt/lang-smj/tools/tts/transcriptor-gt-desc.hfstol, /home/flammie/github/giellalt/lang-smj/src/generator-gt-norm.hfstol, /home/flammie/github/giellalt/lang-smj/src/analyser-gt-desc.hfstol
"<Man>"
	"Man" N Prop Sem/Plc Sg Nom <W:0.0>
	"Man" N Prop Sem/Sur Sg Nom <W:0.0>
	"man" Adv <W:0.0>
	"mij" Pron Interr Sg Gen <W:0.0>
	"mij" Pron Interr Sg Ill Attr <W:0.0>
	"mij" Pron Interr Sg Ine Attr <W:0.0>
	"mij" Pron Rel Sg Gen <W:0.0>
	"mij" Pron Rel Sg Ill Attr <W:0.0>
	"mij" Pron Rel Sg Ine Attr <W:0.0>
: 
"<vuoras>"
	"vuoras" A Attr <W:0.0>
	"vuoras" A Sg Nom <W:0.0>
	"vuoras" Err/Orth A Attr <W:0.0>
	"vuoras" Err/Orth A Sg Nom <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Err/Orth Sg3 <W:0.0>
	"vuorrat" Ex/V IV Der/st V Ind Prs Sg3 <W:0.0>
"<:>"
	":" CLB <W:0.0>
: 
"<23>"
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" A Arab Ord Attr CLBfinal <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ela Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Gen <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ill Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Ine Attr <W:0.0>
	"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon
		"23" Num Arab Sg Nom <W:0.0>
	"23" Num Sem/ID <W:0.0>
:\n

Nice progress 🙂

@unhammer are there any CG syntax restrictions on the transcribed string, "guaktalåkgålmmå"phon in the test case above? We modelled it after the divvun-cgspell output, but that one has only one letter after the actual string. Just asking to avoid major changes later 🙂

"guaktalåkgålmmå"phon is a valid CG tag, though it is not considered a textual tag - not that I think that matters for you. The rule is that if it starts with " then include anything to next " and from there include to next whitespace. This avoids much unnecessary escaping.

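The quoting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread, not the vislcg3 parser; the function name `split_cg_tags` is made up for the example, and it deliberately ignores any escaping that real CG streams may use.

```python
def split_cg_tags(reading: str) -> list[str]:
    """Split one CG reading line into tags.

    Rule (as stated in the thread): a tag that starts with '"' runs to the
    next '"' and from there to the next whitespace; any other tag runs to
    the next whitespace.
    """
    tags = []
    i, n = 0, len(reading)
    while i < n:
        if reading[i].isspace():
            i += 1
            continue
        start = i
        if reading[i] == '"':
            close = reading.find('"', i + 1)
            if close != -1:
                i = close + 1  # jump past the closing quote
        # consume the suffix (or the whole unquoted tag) up to whitespace
        while i < n and not reading[i].isspace():
            i += 1
        tags.append(reading[start:i])
    return tags


print(split_cg_tags('\t"guaktalåkgålmmå" <W:0.0> "guaktalåkgålmmå"phon'))
```

Under this rule whitespace inside the quotes stays part of the tag, so a string such as `"a b"phon` comes out as a single tag without any escaping.
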
A case we haven't considered: dynamic compounds, ie cohorts with sub-readings. There are two considerations:

  • we create subreadings out of the original - the normalized reading is the main reading, the original is stored in a subreading
  • in dynamic compounds, we may want to normalize each part separately, as in:
echo 1800-lågon | ./tools/tts/modes/smj-txt2ipa.mode 
"<1800-lågon>"
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/Hyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"lågos" N Sem/Dummytag Ess <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
	"låhko" N Sem/Amount Sg Ine <W:0.0> @HNOUN #1->0 "1800-lɔkon"phon
		"1800" Num Cmp/OblHyph Cmp <W:0.0> #1->0 "1800-lɔkon"phon
:\n

If we could normalize 1800- independently of the rest of the compound, we would solve a lot of corner cases.

Perhaps the best solution would be to not change the basic cohort structure at all, i.e. that we do NOT add the original lemma as a subreading. Instead, I suggest that we store the original in a tag string along the lines of the "abc"phon string, for example "1800-"orig or "1800-"olemma. The main purpose of retaining the original lemma is debugging, and changing the cohort structure seems to cost too much.
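
As a rough illustration of this proposal, the following Python sketch rewrites a single reading line in place: the lemma is swapped for the normalized form and the original is appended as a quoted tag. The function name `replace_lemma`, the default suffix `orig` (the tag name is still undecided above) and the placeholder `NORMALISED` are all assumptions made for the example, not output of the real transducers.

```python
import re

# A reading line: optional indentation, a quoted lemma, then the other tags.
READING = re.compile(r'^(\s*)"([^"]*)"(.*)$')


def replace_lemma(reading: str, new_lemma: str, suffix: str = "orig") -> str:
    """Swap in new_lemma and keep the old lemma as a '"old"suffix' tag.

    The reading stays a single line at the same indentation, so the
    cohort structure (main readings and subreadings) is left untouched.
    """
    m = READING.match(reading.rstrip("\n"))
    if m is None:
        return reading  # not a reading line (e.g. a wordform): copy as-is
    indent, old_lemma, rest = m.groups()
    return f'{indent}"{new_lemma}"{rest.rstrip()} "{old_lemma}"{suffix}'


# "NORMALISED" is a placeholder for whatever the normaliser would generate.
print(replace_lemma('\t\t"1800" Num Cmp/Hyph Cmp <W:0.0>', "NORMALISED"))
```

Because the indentation is preserved, a subreading of a dynamic compound can be handled on its own line, independently of the main reading.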

@flammie could you have a look at this? I added the new tasks to the task list in the initial comment.