dkpro/dkpro-jwktl

JWKTL enters an infinite loop when parsing a translation for the German personal pronoun "er"

Issue closed · 1 comment

Dear Friends,

Today I tried to parse the latest dump of the German Wiktionary. Shortly after starting, the process enters an infinite loop when trying to parse a translation on the page with id 802. The term is the German personal pronoun "er", and the offending entry is the translation into Japanese: "*{{ja}}:; [1] {{Üt|ja|彼|かれ, kare}}; あの人 [あのひと]; あの方 [あのかた]; 彼奴 [あいつ]; ''[[w:Japanisches Schriftsystem#Rōmaji|Rōmaji]]:''aitsu; anohito; anokata; kare".

I managed to find the place in the source code where the parsing gets stuck. It is in the class "DETranslationHandler", around lines 213-218:

    matcher = NEXT_TRANSLATION_PATTERN.matcher(remainingText);
    if (matcher.find())
        remainingText = matcher.group(2) + matcher.group(3);
    else
        remainingText = null;

For this entry, remainingText is identical to its value in the previous iteration, so the loop never terminates.

As a workaround, I compare remainingText with its value from the previous iteration and end the do-while loop when the two are equal. I'm not sure this is the correct fix, though (probably not), since I don't completely understand how the parsing works.
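The workaround described above can be sketched roughly as follows. This is only an illustration, not the fix that was committed to JWKTL: the class name, the consume method, and in particular the regular expression are hypothetical stand-ins, since the real NEXT_TRANSLATION_PATTERN is not shown in this issue. The stand-in pattern was chosen only because it still matches once no separator is left, which reproduces the "no progress" situation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProgressGuardSketch {

    // Hypothetical stand-in for NEXT_TRANSLATION_PATTERN (the real pattern is
    // not quoted in the issue). It still matches when no ';' is left, in which
    // case group(2) + group(3) is the unchanged input.
    static final Pattern NEXT_PATTERN = Pattern.compile("^([^;]*;)?(\\s*)(.*)$");

    // Returns every intermediate remainingText value, so the progress of the
    // loop can be inspected.
    static List<String> consume(String text) {
        List<String> seen = new ArrayList<>();
        String remainingText = text;
        do {
            String previousText = remainingText;

            Matcher matcher = NEXT_PATTERN.matcher(remainingText);
            if (matcher.find())
                remainingText = matcher.group(2) + matcher.group(3);
            else
                remainingText = null;

            // Guard: if this iteration consumed nothing, stop instead of
            // looping forever on the same text.
            if (remainingText != null && remainingText.equals(previousText))
                remainingText = null;

            if (remainingText != null)
                seen.add(remainingText);
        } while (remainingText != null);
        return seen;
    }

    public static void main(String[] args) {
        // Without the guard, this call would never return.
        System.out.println(consume("a; b; c"));
    }
}
```

A guard like this only hides the symptom: the unparsed rest of the line is silently dropped. Adjusting the pattern so that every successful match consumes at least one character would address the cause, but that requires knowing the real pattern.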

Maybe someone can fix this bug properly.

I also have another question. I'm interested in writing the classes and components needed to parse the Spanish, Portuguese, French and Italian Wiktionaries.

Is there a guide somewhere on how to implement these classes, or do I have to read the existing classes for German and English and work out how the process works?

Thank you for this cool project.

Regards.

Good catch! I've just committed a fix and could successfully parse the latest dump.

It would be very nice to have parsers for other Wiktionaries. We could definitely integrate them here (see the contributing guide for details). Unfortunately, there's no good documentation. A rough overview of the parsing architecture is here: https://dkpro.github.io/dkpro-jwktl/documentation/add-parsers/ (hope this helps). The oldest parts of the code are over ten years old, so be prepared to find lots of legacy code when reading the existing parsers. The good news is that it is possible to develop a new parser without digging too deep into the existing ones.