Reading a full forms lexicon
arademaker opened this issue · 7 comments
The words command produces all pairs of upper/lower words. Do we have any command to read a file with those pairs and produce an FST from the pairs?
You can use read spaced-text for that; however, the format required is a little different. You need to separate symbols with spaces, and the input and output of each pair go on separate lines, with newlines in between. Example:
c a t
g a t o
d o g
p e r r o
produces a transducer that maps cat to gato and dog to perro.
Thank you, surely that can help us to have a morphological analyzer out of our full-forms Portuguese Lexicon at https://github.com/LR-POR/MorphoBr/. But, of course, such a transducer is not the perfect solution since it does not capture the rules of the morphology nor the position classes and the respective morphemes.
a l e t o l o g i n h a s
a l e t o l o g i a +N +DIM +F +PL
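For reference, the conversion from a dictionary entry to the spaced-text pair above could be sketched roughly as follows. This is only a sketch under assumptions: it supposes each entry is a surface form and an analysis separated by whitespace, that the lemma itself contains no "+", and that each +TAG should stay a single multi-character symbol as in the example. The names to_spaced_pair and convert are hypothetical, and the blank line written between pairs is my reading of "newlines in between".

```python
def to_spaced_pair(entry):
    """Turn 'surface<whitespace>lemma+TAG+TAG' into two spaced-text lines."""
    surface, analysis = entry.split()
    # Assumption: the lemma contains no '+', so everything after the
    # first '+' is a sequence of tags.
    lemma, *tags = analysis.split("+")
    first = " ".join(surface)                              # one symbol per character
    second = " ".join(list(lemma) + ["+" + t for t in tags])  # tags kept whole
    return first + "\n" + second


def convert(lines):
    """Convert an iterable of entries, skipping empty lines."""
    pairs = [to_spaced_pair(line) for line in lines if line.strip()]
    # Assumption: pairs are separated by a blank line.
    return "\n\n".join(pairs) + "\n"


entries = ["a\ta+N+M+SG", "as\ta+N+M+PL"]
print(convert(entries), end="")
```

Keeping each tag as one symbol matters because splitting "+N" into "+" and "N" would produce a different symbol sequence from the one shown in the example.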
Hi @mhulden,
foma[0]: read spaced-text all.foma
Stack full!
I got a stack full error while reading a file with 8,027,574 lines. Any alternative? Can I increase the stack size? The file was created according to the above instructions:
% head all.foma
a
a +N +M +SG
a s
a +N +M +PL
a z i n h o
a +N +DIM +M +SG
I was able to compile the spaced-text files
% ll -h *.sp
-rw-r--r-- 1 ar staff 32M Mar 20 16:25 adjectives.sp
-rw-r--r-- 1 ar staff 1.4M Mar 20 16:25 adverbs.sp
-rw-r--r-- 1 ar staff 31M Mar 20 16:25 nouns.sp
-rw-r--r-- 1 ar staff 150M Mar 20 16:25 verbs.sp
with the foma script
% cat compile-m.foma
!Copyright (C) 2023 Alexandre Rademaker
read spaced-text nouns.sp
define nouns ;
clear stack
read spaced-text verbs.sp
define verbs ;
clear stack
read spaced-text adjectives.sp
define adjs ;
clear stack
read spaced-text adverbs.sp
define advs ;
clear stack
save defined morphobr.bin
after changing the value at https://github.com/mhulden/foma/blob/master/foma/int_stack.c#L22 to 5097152. Does it make sense?
The only strange behaviour I got is that adjectives are not considered:
% echo "fracota" | flookup -a -i morphobr.bin
fracota fracote+N+F+SG
ar@tenis morpho-br % rg fracota
nouns/nouns-f.dict
16878:fracota fracote+N+F+SG
16879:fracotas fracote+N+F+PL
16880:fracotazinha fracote+N+DIM+F+SG
16881:fracotazinhas fracote+N+DIM+F+PL
adjectives/adjectives-f.dict
16046:fracota fracote+A+F+SG
16047:fracotas fracote+A+F+PL
16048:fracotazinha fracote+A+DIM+F+SG
16049:fracotazinhas fracote+A+DIM+F+PL
Any idea?
Consider doing this instead of save defined:
regex nouns | verbs | adjs | advs;
save stack morphbr.bin
(save defined saves several FSTs and flookup only loads one. With the above, you should get a single FST on the stack and save that.)
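As a toy illustration of what the union expresses (this is not foma, and it says nothing about how flookup reads a file), each lexicon can be modelled as a relation from a surface form to a set of analyses. The helper names lexicon and union are hypothetical, and the entries are taken from the rg output earlier in the thread.

```python
from collections import defaultdict


def lexicon(pairs):
    """Build a relation: surface form -> set of analyses."""
    rel = defaultdict(set)
    for surface, analysis in pairs:
        rel[surface].add(analysis)
    return rel


def union(*lexicons):
    """Merge relations so a lookup sees the analyses from every lexicon."""
    merged = defaultdict(set)
    for lex in lexicons:
        for surface, analyses in lex.items():
            merged[surface] |= analyses
    return merged


nouns = lexicon([("fracota", "fracote+N+F+SG")])
adjs = lexicon([("fracota", "fracote+A+F+SG")])

# Looking up in one relation only gives one reading; the union gives both.
combined = union(nouns, adjs)
print(sorted(nouns["fracota"]))
print(sorted(combined["fracota"]))
```

In this model, looking a word up in only one of the relations is analogous to flookup loading only one of the saved FSTs, while the union plays the role of nouns | verbs | adjs | advs.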
Thanks, it worked. The strange thing is that I had tested it with a word that is ambiguous between noun and verb, and that worked. The problem may be that, without this explicit combination of the FSTs by disjunction, we ended up with an FST with multiple starting states, and the flookup tool tried only one?! But I was using the -a flag!
Anyway, the explicit disjunction to combine the FSTs worked fine!