Duden dictionaries conversion issue with missing space characters
wymmij opened this issue · 9 comments
First things first - great tool!
This issue is awkward to express because with only the outcome of conversions, there's little for me to go on.
Nevertheless, if you'll bear with me...
When converting a particular Duden dictionary (Das Herkunftswörterbuch), an appreciable number of words in the entries are not separated by a space, even though the same (original, unconverted) dictionary, loaded into Duden Bibliothek 5, displays the spaces correctly.
Though no obvious pattern leaps out, some words tend to be more commonly affected: 'ist', for example, often finds itself suffixed onto the previous word, so the conversion produces 'esist' instead of 'es ist'. Other commonly affected words are short prepositions such as 'aus', or other short but frequent words such as 'auch'. Just to add to the mystery, the words often affected aren't always affected either!
I'm assuming that the issue lies with some strange Duden encoding that your parser doesn't know about. I had thought it might be something like a non-breaking space character, but I've failed to prove that using the Duden Bibliothek 5 software, since it happily line-breaks entries on affected words.
Besides, there is another issue that may be related to a non-breaking space: the '~' (tilde) character, which in the DSL format stands for the entry's headword, proliferates all over this conversion. It typically occurs in places where you might expect a non-breaking space, for example 'z.~B', which then renders in GoldenDict, if say the headword is 'fordern', as 'z.fordernB'. In this particular case it was trivial to deal with, because a simple regexp could eliminate all the tildes.
The missing spaces also occur in other contexts that are at least more easily handled by a few carefully constructed regexps, such as after some full stops. But, to restate the main issue, regexps only go so far when no space occurs inside something like 'Zusammensetzungenhaben'.
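To illustrate, a clean-up along these lines (a rough sketch in Python; the function name and the exact regexps are illustrative, not what I actually ran) handles the tilde and full-stop cases, while the glued-together words remain out of reach:

```python
import re

def cleanup_dsl(text):
    # Drop stray tildes (DSL would otherwise expand them to the headword).
    text = text.replace('~', '')
    # Insert a missing space after a full stop that is directly followed
    # by a letter (e.g. 'Jh.bezeugten' -> 'Jh. bezeugten').
    text = re.sub(r'\.(?=[A-Za-zÄÖÜäöüß])', '. ', text)
    return text
```

Something like 'Zusammensetzungenhaben' passes through such a clean-up untouched, which is exactly the limit I mean.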
Of course, unless you immediately know what the problem is, I guess you'd ideally need access to the dictionary affected, which may pose an issue, so I just thought I'd register this issue nevertheless in the hope that there might be an easy solution.
Thank you for such a detailed description.
I found the dictionary in question on my PC and sure enough the problems you described are very apparent. It turns out that I misunderstood the meaning of new lines inside articles. Usually an explicit '\\' is used for breaking a line and the normal '\n' should be ignored when converting to DSL. For example:
Anglizismen-Wörterbuch, begr. v. Broder Carstensen. Fortgef. v. Ulrich Busse. 3 Bde. Berlin 1993‒1996.\\
\\
Battisti, Carlo / Alessio, Giovanni: Dizionario etimologico italiano. 5 Bde. Neuausgabe Florenz 1975.\\
\\
Birkhan, Helmut: Etymologie des Deutschen. Bern 1985.\\
\\
But when '\n' is used without any '\\' nearby, it actually needs to be treated as a whitespace:
@1mogeln @0@C%ID=16314
\\
@8\\
@9
(ugs. für:) @0»dem Glück ein bisschen nachhelfen;
kleine betrügerische Kniffe anwenden«: Die Herkunft des erst seit dem 18.~Jh. bezeugten Verbs
ist nicht sicher geklärt. Vielleicht handelt es sich um eine Nebenform
von mdal. \F{_80 80 40}maucheln \F{80 80 40_}@0»heimlich
oder hinterlistig handeln, betrügen« (vgl. \S{meucheln;:003682289}).@0
The distribution of these '\n' breaks seems random and not all articles have them.
As for the tilde -- you are absolutely right. It is indeed used as a non-breaking space that simply needs to be converted to Unicode.
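Sketched in Python (illustrative only; this is not the actual lsd2dsl code, and the function name is made up), the two rules amount to:

```python
import re

def normalize_duden_text(raw):
    # An explicit '\\' in the markup is a real line break; split on it first.
    segments = raw.split('\\\\')
    # Within a segment, a bare '\n' is just source formatting: fold any
    # run of whitespace around it into a single space.
    segments = [re.sub(r'\s*\n\s*', ' ', s).strip() for s in segments]
    # '~' is Duden's non-breaking space; map it to U+00A0.
    return '\n'.join(segments).replace('~', '\u00a0')
```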
I refactored the newline handling and fixed the tilde. The result looks promising.
I published a new Windows build (in case you are not compiling lsd2dsl yourself) and will make a proper hotfix a bit later, once I've done more testing.
Thanks for taking a look at this so quickly.
And the fixes do the job too; great work!
A couple of further issues relating still to Duden dictionaries, or at least to this particular dictionary. I'll go with the stranger of the two first:
Sieg
Sieger (Sieg)
siegen (Sieg)
siegreich (Sieg)
[b]Sieg: [/b][br]
[br]
Das gemeingerm. Substantiv mhd. [c black]sic, sige, [/c]ahd. [c black]sigi, sigu, [/c]got. [c black]sigis, [/c]aengl. [c black]sige, [/c]schwed. [c black]seger [/c]geht auf die idg. Wurzel [c black]*seg̑h- [/c]»festhalten, im Kampf überwältigen; Sieg« zurück, vgl. z. B. aind. [c black]sáhatē [/c]»er bewältigt, vermag, erträgt« (mit dem Substantiv [c black]sáhas- [/c]»Gewalt, Sieg«) und griech. [c black]échein (íschein) [/c]»halten, besitzen, haben« (vgl. den Artikel [ref]hektisch[/ref]). ‒ Abl.: [b]siegen [/b]»den Sieg davontragen« (mhd. [c black]sigen, [/c]ähnlich ahd. [c black]ubarsiginōn, -sigirōn[/c]), dazu [b]Sieger [/b]»jemand, der den Sieg errungen hat« (16. Jh.; rhein. im 13. Jh. [c black]segere[/c]). [c blue]Zus.: [b]siegreich [/b]»den Sieg errungen habend; oft siegend, erfolgreich« (mhd. [c black]sigerīche[/c]).[/c]
Okay, sadly you have to scroll all the way to the right there, but right at the end of the entry there is: [c blue]Zus.: [b]siegreich [/b]»den Sieg errungen habend; oft siegend, erfolgreich« (mhd. [c black]sigerīche[/c]).[/c]
which will be displayed entirely in blue (except for those regions with some other colour). Now this isn't formatting that displays at all when the same dictionary is viewed in its original form in the Duden Bibliothek software, so I'm assuming that some un-rendered Duden markup is being converted into these 'blue' sections. It doesn't seem entirely random either: it is usually a logical unit of the entry, such as, in this example, the 'Zusammensetzungen' section, or the 'Ableitungen' section.
The second issue relates to the conversion of the cross-references, and for ease-of-illustration here is an example of an entry which is just a cross-reference:
sielen
[b]sielen [/b][br]
[br]
[ref]Sauferei (saufen)[/ref] (saufen).
But this same entry in the original form is rendered in Duden Bibliothek as something like:
sielen
↑saufen
with the ‘↑’ representing the hyperlink to the 'saufen' entry. So the first problem with the converted version is that the cross-reference points to a non-existent entry, since it adds a word that is part of the intended link destination. The choice of that word isn't random: it seems to be the word that appears at the top of the list of all the headings pointing to the same entry. On top of that, there is the unnecessary reduplication of the headword of the cross-reference in parentheses. I say unnecessary because, as far as I can see, whenever the cross-reference destination and the actual cross-referenced word (typically either a derivation or a compound) differ, this is already noted in the original Duden format.
One other issue was to do with compiling from source. I'm on Linux (Arch Linux), and the source code wouldn't initially compile because of the use of minizip-ng. Not ultimately a big deal, because I could obtain minizip-ng easily enough (in the AUR as minizip-git), and though that required removing the core repository version of minizip from my machine, I'm assuming this will be fine further down the line thanks to the compatibility layer.
Finally, just out of pure curiosity, is there a way to use lsd2dsl in such a way that I could get at the original Duden markup before it's converted to DSL, just as in the examples you included above?
Oops. Sorry for the closing and then reopening of the issue! I didn't intend to close it in the first place, but as inexperienced as I am with doing this kind of thing, I clicked "Close with comment" rather than just "Comment", thinking that it meant something like 'stop editing the comment and (presumably) post'.
Nevertheless, I'm happy to close the issue if that's the done thing at this stage? After all, I suppose the list of potential quirks that Duden markup is still yet to reveal could very well remain open for some time!
rhein. im 13.~Jh. \F{_80 80 40}segere\F{80 80 40_}). \F{~0000FF}Zus.: @4siegreich @0»den Sieg
errungen habend; oft siegend, erfolgreich« (mhd. \F{_80 80 40}sigerīche\F{80 80 40_}).\F{0000FF~}@0
Well, apparently the tag \F{~0000FF} doesn't mean blue, even though it looks pretty close to \F{_00 00 ff}, which does. I wonder what it's used for, then. I see that sigerīche too wasn't meant to be black. I probably need to find a better way of mapping RGB to the limited set of DSL colors.
I think I know what's wrong with the reference. In this case the link, while correct, wasn't resolved by GoldenDict because the parentheses in the heading weren't escaped and so they were treated as DSL variant headings, which they are not. The extra text is supposed to be the "display text" for a link. Duden has this feature, unlike DSL. But I could probably remove it when it's clearly unnecessary.
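For illustration, the escaping could be sketched like this (a hypothetical helper, not the actual lsd2dsl code):

```python
def escape_dsl_heading(heading):
    # In DSL, unescaped parentheses in a headword mark an optional part,
    # which effectively creates variant headings. Escaping them with a
    # backslash keeps them as literal characters.
    out = []
    for ch in heading:
        if ch in '()':
            out.append('\\')
        out.append(ch)
    return ''.join(out)
```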
About building lsd2dsl. I actually had the opposite issue with minizip on Fedora. At some point the old version got replaced with minizip-ng, without changing the package name, while the old version got pushed into another package (without a corresponding mingw32 build, to make things even more annoying). That's why I switched to minizip-ng in the last commit.
And yes, you can extract Duden markup. Though you'll need a debug build (CMAKE_BUILD_TYPE=Debug) for this, as it has a few additional switches enabled. You can call lsd2dsl like this:
lsd2dsl --bof path-to-bof --idx path-to-idx --duden-utf --out some-dir-path
BOF and IDX files together form an archive. There are several of them in a typical dictionary, and each contains either text or binary data. For example, my version of the Herkunftswoerterbuch dictionary contains its articles in the du7.bof/du7.idx archive. The text itself is not Unicode but a Duden-specific encoding, which is why I used --duden-utf to convert it to Unicode. The result is a single "decoded" file containing the whole archive.
Also, I don't see any reason to close this issue. While it's true that Duden has plenty of quirks and a perfect conversion might be a long way off, I still think that fixing at least the most glaring issues is worth it. So I appreciate your reporting them. Thanks.
I will try to fix the issues you found in a few days.
I've made a few changes to reference handling, there should be less duplicate text in parentheses next to an article reference. Also, the altogether broken references should now be fixed.
The unexpected text color will go away, though unless Duden text color maps exactly to one of the standard colors, the text will remain black for now.
I have a follow-up question on the Duden markup files that your decoder produces, although first let me just thank you for pointing out its use to me, as it certainly satisfied my curiosity!
So, what I'd like to know is whether there's a simple way to re-compile back from this decoded Duden markup to the BOF and IDX archive bundles that the Duden bibliothek software deals with? For ease, I was thinking of the raw Duden markup, rather than the format converted to Unicode.
I'm trying to understand the Duden markup better because I'm trying to convert it to XDXF. Obviously, due to the nature of XDXF, this couldn't possibly be an automated tool that blindly converts between the formats, but rather a set of principles and approaches for analysing a file in a dictionary format with predominantly visual formatting (such as Duden's), which then serves as a basis for writing ad-hoc code to produce a format with logical formatting (such as XDXF).
My aim is motivated by the fact that the dictionaries the likes of Duden produce often have a wealth of information pertaining to usage, syntax, semantics, register and so on that isn't visually emphasized or differentiated well enough to make comprehension as quick as it could be. Of course, when you use a bilingual dictionary you have to be prepared for a little rummaging; there is rarely a one-to-one mapping between words from different languages, so using a bilingual dictionary is as much about clarifying to oneself what you actually mean by a word in the source language as it is about finding a word in the target language. Even so, getting ‘in and out’ of a dictionary lookup ought to be a smooth and relatively quick process. I'll stop here before this becomes inappropriately ranty, as this has long been a gripe of mine with producers of dictionary software: an extremely valuable and detailed source of language information that is so often hideously presented, as entries that are little more than a densely packed and immediately off-putting ‘wall of text’.
Anyway, I want to understand the Duden markup better so that I can make use of as much of the information as is useful and relevant to converting to XDXF, and it would be helpful to be able to experimentally poke at the markup to see what effect it has, as a means of filling out the detailed picture of what the Duden markup actually is. I think I understand its most salient features now, and perhaps wanting the complete picture has gone well beyond the point of diminishing returns; perhaps it's as much about wanting to understand it for its own sake. I don't know.
I know that what I'm asking about would in effect be a de-facto method for editing Duden dictionaries and that this isn't what your tool is about, but I nevertheless wonder what else would need to be done to ‘complete the circle’ and recompile back from the decoded format. Any pointers would be appreciated!
Well, making an IDX/BOF archive is easy: BOF consists of deflated blocks and IDX stores their offsets. The headings (HIC) though are quite a bit more complicated, and I haven't done enough reverse engineering to be able to recreate them. The headings sometimes contain some information that lsd2dsl loses along the way.
Anyway, here's a python script to create IDX/BOF:
import zlib
import struct

def deflate(data):
    # Raw deflate stream (negative wbits); level 0 stores blocks uncompressed.
    obj = zlib.compressobj(
        0,
        zlib.DEFLATED,
        -zlib.MAX_WBITS,
        zlib.DEF_MEM_LEVEL,
        0)
    return obj.compress(data) + obj.flush()

def inflate(data):
    obj = zlib.decompressobj(-zlib.MAX_WBITS)
    return obj.decompress(data) + obj.flush()

input_path = '/tmp/decoded.dump'
idx_path = '/tmp/test.idx'
bof_path = '/tmp/test.bof'
block_size = 8192

input = open(input_path, 'rb').read()
input_len = len(input)
bof = bytearray()
idx = bytearray()
pos = 0
while True:
    block = input[pos:pos+block_size]
    if len(block) == 0:
        break
    pos += block_size
    # one dword per block: the offset of its deflated data in the BOF
    idx += struct.pack('<I', len(bof))
    bof += deflate(block)
# the end offset (written twice), then the total decoded length
idx += struct.pack('<I', len(bof))
idx += struct.pack('<I', len(bof))
idx += struct.pack('<I', input_len)
open(idx_path, 'wb').write(idx)
open(bof_path, 'wb').write(bof)
I used zero compression to work around Duden's use of custom compression tables. I explained them in a blog post, if you are interested.
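For completeness, here is a sketch of the reverse operation, reading such an archive back with the same kind of inflate as in the script (the read_archive helper is hypothetical and assumes exactly the IDX layout written above: one start offset per block, then the end offset twice, then the decoded length):

```python
import struct
import zlib

def inflate(data):
    # Raw deflate stream, matching the writer.
    obj = zlib.decompressobj(-zlib.MAX_WBITS)
    return obj.decompress(data) + obj.flush()

def read_archive(idx, bof):
    # idx/bof are the raw bytes of the .idx/.bof files. The last two
    # dwords of the IDX are a repeated end offset and the decoded length;
    # everything before that is the offset table (end offset included once).
    offsets = [o for (o,) in struct.iter_unpack('<I', idx[:-8])]
    decoded_len = struct.unpack('<I', idx[-4:])[0]
    out = bytearray()
    for start, end in zip(offsets, offsets[1:]):
        out += inflate(bof[start:end])
    assert len(out) == decoded_len
    return bytes(out)
```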
Please note that articles are referenced from HIC by absolute offsets into the decoded file, so changing the length of an article will destroy all the articles that follow it.
Perfect! That works superbly. Thank you so much!
And since I only want to poke at Duden's formatting for exploratory purposes, it's no problem that the absolute offsets from the HIC will no longer be correct for the articles that follow any length-altering edit. Although, would it be naive to think that the offsets of each article following an edit couldn't just be updated? I mean, if I know I've added 21 bytes to an article, couldn't I just add 21 to each offset greater than that of the edited article? I suppose the cross-references would have to be updated too.
Glad to hear the script helped.
Sure, you could update the offsets. It's just that the HIC is a binary format and as such is awkward to edit. Good thing there's no reason to do that.
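Purely to illustrate the arithmetic from the question, assuming an idealized flat table of little-endian uint32 offsets, which the real HIC is not (it interleaves its offsets with other fields), the update would look like:

```python
import struct

def shift_offsets(table, edit_offset, delta):
    # table: bytes of a flat array of little-endian uint32 offsets.
    # Every offset past the edit point moves by delta bytes.
    out = bytearray()
    for (off,) in struct.iter_unpack('<I', table):
        if off > edit_offset:
            off += delta
        out += struct.pack('<I', off)
    return bytes(out)
```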