eb4j/dsl4j

Dictionary loading hangs forever

Closed this issue · 21 comments

plotn commented

Hi!
I am trying to do the following:

DslDictionary dslDictionary = DslDictionary.loadDictionary(
        new File("c:\\github\\JsoupExperiments\\_tmp\\Apresyan\\En-Ru_Apresyan.dsl"));

The dictionary file is here: https://drive.google.com/file/d/1YEVOBanuRBX2R3wUwP0tqlDAWiQfGv3l/view?usp=sharing

I think the problem is here:

DslDictionaryLoader.class:

private static List loadEntriesFromDslFile(Path path, boolean isDictzip, byte[] eol, Charset charset) throws IOException
.....
while(!isSpaceOrTab((InputStream)is, charset)) {
next += eolSearcher.search((InputStream)is);
}

After loading 117416 entries, the code gets stuck on the "next +=" line, and next keeps decreasing into NEGATIVE values, e.g. -2376230.
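
A counter that keeps shrinking is the kind of symptom a scan loop shows when a search helper returns a negative sentinel and the caller adds it without checking. Whether that is what actually happens inside dsl4j is not confirmed in this thread. The sketch below only illustrates the general guard. `searchEol` and `offsetAfterLastEol` are hypothetical names and are not part of dsl4j.

```java
import java.nio.charset.StandardCharsets;

final class EolScanSketch {
    /**
     * Hypothetical helper: number of bytes from 'from' through the next '\n',
     * or -1 if no line terminator is left.
     */
    static int searchEol(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == '\n') {
                return i - from + 1;
            }
        }
        return -1;
    }

    /** Advances past every terminated line and stops cleanly when none is left. */
    static int offsetAfterLastEol(byte[] data) {
        int next = 0;
        while (true) {
            int step = searchEol(data, next);
            if (step < 0) {
                // Adding the -1 sentinel would corrupt the offset,
                // and the loop would never end.
                break;
            }
            next += step;
        }
        return next;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncd\nef".getBytes(StandardCharsets.US_ASCII);
        System.out.println(offsetAfterLastEol(data));
    }
}
```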

Thank you for the report.
I made a big change on the main branch.

The new main HEAD now loads dictionaries correctly, such as https://sourceforge.net/projects/goldendict/files/dictionary%20content/1.1/enruen-content-1.1.tar.bz2

Please check it.

Your data has strange lines at the end of the file, as follows:

Условные обозначения
        [m1][com][c navy]♦[/c] (ромб) — вводит фразеологическую зону словарной статьи[/m]
        [m1][c navy]||[/c] (параллельные линии) — в статьях числительных вводит модели составных числительных (типа [i]forty-one[/i])[/m]
        [m1][c navy]=[/c] (знак равенства) — отсылает к другим словарным статьям[/m]
        [m1][c navy]≈[/c] (знак приблизительного равенства) — используется при отсутствии полного лексического или стилистического соответствия между входом и переводом[/m]
        [m1][c navy]|[/c] (вертикальная черта) — отделяет повторяющуюся часть транскрипции, заменяемую дефисом.[/m]
        [m1][c navy]([/c]...[c navy])[/c] (круглые скобки) — могут обозначать факультативную часть входа или перевода[/m]
        [m1][c navy]/[/c]...[c navy]/[/c] (косые скобки) — синонимы: синонимическая замена предшествующего слова или слов[/m]
        [m1][c navy]\[[/c]...[c navy]\][/c] (квадратные скобки) — сочетаемостные варианты: одновременная замена предшествующего слова или слов во входе и в переводе[/m]
        [m1]...[c navy](-)[/c]... (дефис в скобках) — показывает, что возможно как написание через дефис, так и раздельное (не слитное!) написание: [i]xxx-yyy[/i] и [i]xxx yyy[/i][/com][/m]
        [m1] [/m]
        [m1][*][p]см. тж[/p] <<Содержание>>[/*][/m]

{{ The End }}

{{ Техническая часть }}

About {(}Dictionary{)}
About{ (Dictionary)}
_About {(}Dictionary{)}
_About{ (Dictionary)}
_О словаре
О словаре
**
        Version: 10.4

{{ Техническая часть }}

plotn commented

Hi! Thank you!
I would like to raise one more thing. I am developing an app called KnownReader. It is an ebook reading app (https://github.com/plotn/coolreader), and I want to integrate an "offline dictionary" feature into it. I have already done this for StarDict dictionaries, and now I want to add DSL dictionaries. However, my integration is not "native": I load the dictionary data into an SQLite database and then search in it. This is very convenient for me, and I think it is efficient in both speed (the database has indexes) and memory consumption.

Your library seems to be very mature, and I would like to use it. I have not tried it in my app yet, but I hope it will work: I target a fairly old API level and an old Java version, because e-ink devices often run Android 4.4. I do have some thoughts about it, though.

  1. I'd rather not load all dictionary entries into memory, as happens now. I would prefer an event such as 'onNewEntry', so that I could save entries into the database during the "initial scan". I understand that an entry does not contain the article data, and that is good.
  2. Then, in use, I will look up entries in the database myself (it will be fast), and I need a mechanism to extract the article data from the dictionary file for a specified entry (this could be slower, but that is acceptable).

Or, alternatively (or in addition):

  1. The same, but with an 'OnNewEntryExt' event that also contains the article data.

Is this possible now, or will it be in the future?


dsl4j does NOT load all data into memory. It loads only a location index into memory.
You can see a method that stores the index in a file using Protocol Buffers at:

https://github.com/eb4j/dsl4j/blob/main/src/main/java/io/github/eb4j/dsl/DslDictionaryLoader.java#L146

You may be able to modify it to store the index in a database.
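
The idea of a location index with a per-entry callback can be sketched as follows. This is not dsl4j code. `EntryListener`, `Location`, `scan`, `readArticle`, and `writeTemp` are hypothetical names, and the sketch assumes a simplified UTF-8 file with no header lines, where a headword starts in column 0 and article lines are indented with a space or tab. The listener stands in for a database insert. The article text is read from the file only when requested.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

final class OffsetIndexSketch {
    /** Byte range of one article body inside the dictionary file. */
    static final class Location {
        final long offset;
        final int length;
        Location(long offset, int length) { this.offset = offset; this.length = length; }
    }

    /** Hypothetical callback; a real app might insert a row into SQLite here. */
    interface EntryListener {
        void onNewEntry(String headword, Location location);
    }

    /** Streams the file once and reports each headword with the range of its body. */
    static void scan(Path file, EntryListener listener) {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            String headword = null;
            long pos = 0;        // offset of the next unread byte
            long lineStart = 0;  // offset where the current line began
            long bodyStart = 0;  // offset where the current article body began
            int b;
            do {
                b = in.read();
                if (b != -1) {
                    pos++;
                    if (b != '\n') { line.write(b); continue; }
                }
                byte[] bytes = line.toByteArray();
                boolean isHeadword = bytes.length > 0 && bytes[0] != ' ' && bytes[0] != '\t';
                if (isHeadword) {
                    if (headword != null) {
                        listener.onNewEntry(headword, new Location(bodyStart, (int) (lineStart - bodyStart)));
                    }
                    headword = new String(bytes, StandardCharsets.UTF_8).trim();
                    bodyStart = pos;
                }
                line.reset();
                lineStart = pos;
            } while (b != -1);
            if (headword != null) {
                listener.onNewEntry(headword, new Location(bodyStart, (int) (pos - bodyStart)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one article body on demand by seeking to its stored location. */
    static String readArticle(Path file, Location loc) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[loc.length];
            raf.seek(loc.offset);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: writes sample text to a temporary file. */
    static Path writeTemp(String content) {
        try {
            Path file = Files.createTempFile("dsl-sketch", ".txt");
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTemp("cat\n\tfeline\ndog\n\tcanine\n");
        Map<String, Location> index = new LinkedHashMap<>();
        scan(file, index::put);
        System.out.println(readArticle(file, index.get("dog")).trim());
    }
}
```

Only the headword and two numbers per entry need to be kept, so the same records could be written to a database table instead of a map.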

plotn commented

Thank you! Is there an example of using a previously saved index?

Your data has strange lines at the end of the file, as follows:


{{ The End }}

{{ Техническая часть }}


{{ Техническая часть }}

These match the definition of comments in the DSL specification:
https://documentation.help/ABBYY-Lingvo8/Comments.htm

So these lines seem to be comments.
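
For illustration, text enclosed in double braces can be removed before parsing with a small helper like the one below. This is a minimal sketch and not the code that dsl4j uses. `stripComments` is a hypothetical name, and treating a comment as possibly spanning several lines is an assumption of the sketch.

```java
import java.util.regex.Pattern;

final class DslCommentSketch {
    // Non-greedy match from "{{" to the nearest "}}"; DOTALL lets it cross line breaks.
    private static final Pattern COMMENT =
            Pattern.compile("\\{\\{.*?\\}\\}", Pattern.DOTALL);

    /** Returns the text with every {{ ... }} span removed. */
    static String stripComments(String text) {
        return COMMENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripComments("headword {{ The End }}").trim());
    }
}
```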

plotn commented

Okay, I will check everything. I will get back to you when I have finished the integration, or earlier if questions come up. Thank you.

@plotn I've updated the current main HEAD to support comments. Could you try it?

plotn commented

@miurahr, how should I try it? I have the following in my pom:

      <dependency>
          <groupId>io.github.eb4j</groupId>
          <artifactId>dsl4j</artifactId>
          <version>0.4.5</version>
      </dependency>

Will your fixes (the commits on main) be applied automatically, so that my app picks them up, or will it keep using a "cached" version?
Or do you need to bump the version to 0.4.6 on Maven Central?

I wanted to build your library and link it as a static file in my Android app, but there were some errors, which I asked about in Discussions. Maybe I should link all of your dependencies as static libraries too?

For now the test app still hangs, but I suppose it is using the old version.

You can clone the repository, build a snapshot version, and publish it to your local Maven repository with ./gradlew publishToMavenLocal.
You can get the version string with ./gradlew printVersion

plotn commented

Thanks! That was new to me. Now it works very well. The first dictionary load takes about 2 min 20 sec.
The second load, with the index file, is almost instant.

What about an optional partial lookup using a "starts with" algorithm? I mean that when I search for "cat", I'd also like to get "cataclysm" and "catacomb". (I've found lookupPredictive; is that the one?)
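
A "starts with" lookup of this kind can be sketched over any sorted set of headwords. The example below is not how dsl4j implements its predictive lookup. `lookupByPrefix` is a hypothetical helper that relies only on the standard `NavigableMap` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class PrefixLookupSketch {
    /** Returns all keys that start with the prefix, in sorted order. */
    static List<String> lookupByPrefix(NavigableMap<String, ?> index, String prefix) {
        List<String> hits = new ArrayList<>();
        // Jump to the first key >= prefix, then walk forward while keys still match.
        for (String key : index.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            hits.add(key);
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> index = new TreeMap<>();
        index.put("car", 1);
        index.put("cat", 2);
        index.put("cataclysm", 3);
        index.put("catacomb", 4);
        index.put("dog", 5);
        System.out.println(lookupByPrefix(index, "cat"));
    }
}
```

Because the keys are sorted, the walk stops at the first key that no longer matches, so the cost depends on the number of hits and not on the size of the dictionary.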

The plain visitor misbehaves in some cases (note the leftover [m0] and [lang id=...] tags):

cat -> [m0]Ⅰ [kæt] n

    1. кот, кошка
  1. зоол. кошка домашняя ([lang id=1142]Felis domesticus)
  2. зоол. животное семейства кошачьих
    [lang id=1033]wild cat — дикая кошка (Felis sylvestris)
  1. разг. сварливая {{}}или{{}} недоброжелательная женщина; сплетница, язва
    [lang id=1033]old cat — старая ведьма
    [lang id=1033]don't be a cat! — не злословь!

plotn commented

I tried to link the library into an Android project and got errors. It can be linked via
implementation 'io.github.eb4j:dsl4j:0.4.5'
but not via local Maven:
implementation 'io.github.eb4j:dsl4j:0.4.5-31-78c49a1bb9-SNAPSHOT'
After I specified mavenLocal() in Gradle, it works.

Could you release 0.4.6 to Maven Central? I'd like to include it in my book reading app.

v0.5.0 is out.

plotn commented

I tested 0.5.0. That dictionary is OK, but the following dictionaries fail:

https://drive.google.com/file/d/14Crq8ywyBdfC1YnDjsv7gZOu70ZH_4cd/view?usp=sharing
https://drive.google.com/file/d/1TYDfxr_j0b3A_h99kmqbul1iFKxiGhfq/view?usp=sharing

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 761 out of bounds for length 385

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 942 out of bounds for length 477

plotn commented

The stack trace is:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 942 out of bounds for length 477
at org.dict.zip.DictZipHeader.getPosition(DictZipHeader.java:398)
at org.dict.zip.DictZipInputStream.seek(DictZipInputStream.java:271)
at org.dict.zip.DictZipInputStream.reset(DictZipInputStream.java:154)
at io.github.eb4j.dsl.impl.EntriesLoaderImpl.skipSpaceTabs(EntriesLoaderImpl.java:270)
at io.github.eb4j.dsl.impl.EntriesLoaderImpl.load(EntriesLoaderImpl.java:106)
at io.github.eb4j.dsl.DslDictionaryLoader.load(DslDictionaryLoader.java:98)
at io.github.eb4j.dsl.DslDictionary.loadDictionary(DslDictionary.java:148)

plotn commented

The command was:
DslDictionary dslDictionary = DslDictionary.loadDictionary(
// new File("c:\\github\\JsoupExperiments\\_tmp\\test4.dsl"));
// Paths.get("c:\\github\\JsoupExperiments\\_tmp\\Apresyan\\En-Ru_Apresyan.dsl"),
// Paths.get("c:\\github\\JsoupExperiments\\_tmp\\Apresyan\\En-Ru_Apresyan.dsl.idx")
// Paths.get("c:\\github\\JsoupExperiments\\_tmp\\mueller\\Mueller (En-Ru)_new.dsl.dz"),
// Paths.get("c:\\github\\JsoupExperiments\\_tmp\\mueller\\Mueller (En-Ru)_new.dsl.idx")
Paths.get("c:\\github\\JsoupExperiments\\_tmp\\smirnitsky\\Ru-En-Smirnitsky.dsl.dz"),
Paths.get("c:\\github\\JsoupExperiments\\_tmp\\smirnitsky\\Ru-En-Smirnitsky.dsl.idx")
);

plotn commented

When I unpack the .dz file and try again, I get the following:
Exception in thread "main" java.lang.NullPointerException
at io.github.eb4j.dsl.index.DslIndex$Builder.setDictionaryName(DslIndex.java:2123)
at io.github.eb4j.dsl.DslDictionaryLoader.buildIndexFile(DslDictionaryLoader.java:175)
at io.github.eb4j.dsl.DslDictionaryLoader.load(DslDictionaryLoader.java:102)
at io.github.eb4j.dsl.DslDictionary.loadDictionary(DslDictionary.java:148)

Please open a separate issue for that.