rdoeffinger/Dictionary

Add more detailed steps to update the dictionaries

Add software requirements and steps to update the dictionaries.
This should help potential additional maintainers.

I managed to get WiktionarySplitter.sh to start processing with the following steps.
(I'm dumping this here for now in case it helps someone)

Using an Ubuntu 18.04 VM
# clone Dictionary
# clone DictionaryPC
$ apt install openjdk-11-jdk

$ ./compile.sh 
ICU4J needs to be installed
--> apt search ICU4J
--> apt install libicu4j-49-java

$ ./compile.sh 
Junit needs to be installed
---> sudo apt install junit

$ ./compile.sh 
Xerces needs to be installed
--> sudo apt install libxerces2-java


$ ./compile.sh 
commons-lang needs to be installed
---> apt install libcommons-lang3-java

$ ./compile.sh 
commons-compress needs to be installed
--> apt install libcommons-compress-java

$ ./compile.sh 
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: ../Dictionary/Util/src/com/hughes/util/CachingList.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

---> change compile.sh to have:
javac -g -Xlint:deprecation -Xlint:unchecked ../Dictionary/Util/src/com/hughes/util/*.java ../Dictionary/Util/src/com/hughes/util/raf/*.java ../Dictionary/src/com/hughes/android/dictionary/DictionaryInfo.java ../Dictionary/src/com/hughes/android/dictionary/engine/*.java ../Dictionary/src/com/hughes/android/dictionary/C.java src/com/hughes/util/*.java src/com/hughes/android/dictionary/*.java src/com/hughes/android/dictionary/*/*.java src/com/hughes/android/dictionary/*/*/*.java -classpath "$ICU4J:$JUNIT:$XERCES:$COMMONS:$COMMONS_COMPRESS"

$ ./compile.sh 
../Dictionary/Util/src/com/hughes/util/CachingList.java:32: warning: [unchecked] unchecked cast
        chunked = useChunked ? (ChunkedList<T>)list : null;
                                               ^
  required: ChunkedList<T>
  found:    List<T>
  where T is a type-variable:
    T extends Object declared in class CachingList
src/com/hughes/util/MapUtil.java:39: warning: [deprecation] newInstance() in Class has been deprecated
                map.put(key, valueClass.newInstance());
                                       ^
  where T is a type-variable:
    T extends Object declared in class Class
src/com/hughes/android/dictionary/parser/wiktionary/WholeSectionToHtmlParser.java:399: warning: [deprecation] StringEscapeUtils in org.apache.commons.lang3 has been deprecated
        final String htmlEscaped = StringEscapeUtils.escapeHtml3(plainText);
                                   ^
3 warnings


# Note: there are 2048m of RAM in the VM used to test this
$ ./WiktionarySplitter.sh 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.base/java.io.PipedInputStream.initPipe(PipedInputStream.java:161)
	at java.base/java.io.PipedInputStream.<init>(PipedInputStream.java:125)
	at com.hughes.android.dictionary.engine.WriteBuffer.<init>(WriteBuffer.java:28)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:89)
	at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
---> add '-Xmx2048m' to java command in ./WiktionarySplitter.sh
$ ./WiktionarySplitter.sh 
./WiktionarySplitter.sh: line 15:  7924 Killed                  "$JAVA" -Xverify:none -Xmx2048m -classpath src:../Util/src/:../Dictionary/src/:"$ICU4J":"$XERCES":"$COMMONS_COMPRESS" com.hughes.android.dictionary.engine.WiktionarySplitter "$@"
---> increase amount of ram in the VM to 4096m
---> add '-Xmx3072m' to java command in ./WiktionarySplitter.sh
$ ./WiktionarySplitter.sh

--> hitting issue #81
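
The trial and error above (a 2048m heap in a 2048m VM gets killed, probably by the kernel running out of memory, while a 3072m heap in a 4096m VM gets the splitter started) suggests keeping the Java heap some distance below physical RAM. A minimal sketch of that rule, assuming Linux's /proc/meminfo; the helper name heap_mb_for and the 1024 MB headroom are my own assumptions, not something the scripts in the repo do:

```shell
#!/bin/sh
# heap_mb_for is a hypothetical helper, not part of DictionaryPC.
# Input: total RAM in kB (the unit /proc/meminfo uses for MemTotal).
# Output: a heap size in MB that leaves 1024 MB for the OS, never below 512.
heap_mb_for() {
    heap=$(( $1 / 1024 - 1024 ))
    if [ "$heap" -lt 512 ]; then
        heap=512
    fi
    echo "$heap"
}

# Only Linux exposes /proc/meminfo; skip quietly elsewhere.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "suggested Java flag: -Xmx$(heap_mb_for "$total_kb")m"
fi
```

For a VM with exactly 4096 MB this gives 3072, the value that worked above; a real VM reports slightly less than its nominal RAM, so the printed number will be a little lower.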

I don't think changing the compile.sh step makes sense; it just makes those warnings even more annoying, since at this point I either don't intend to fix them or can't.
As to the memory change: I don't think you can run this process with less than 8 GB of RAM anyway, which might be the reason I never needed to change the memory allocation for Java (Java only being able to use a fixed amount of RAM is still a really bad joke anyway).
I'll change it to use the same value as run.sh (and I ought to look into whether it couldn't/shouldn't re-use run.sh anyway).

It would be helpful to indicate the requirements for development, maybe in the README.md, or in a new CONTRIBUTING.md.

Have things like:

Hardware requirements:

  • 8 GB RAM minimum (maybe 5 is enough?)
  • XX GB disk minimum

Software requirements:

  • OS: Linux:

    • tested to work: Ubuntu XXX(16.04, 18.04...)
  • Packages:

    • For Ubuntu 18.04
apt install \
  openjdk-11-jdk \
  libicu4j-49-java \
  junit \
  libxerces2-java \
  libcommons-lang3-java \
  libcommons-compress-java
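
A requirements section could also come with a small pre-flight check, so that a missing package is reported once instead of one ./compile.sh failure at a time. A rough sketch, assuming a Debian/Ubuntu system with dpkg-query; the helper names is_installed and missing_packages are made up for illustration, and the package list is just the one from this thread for Ubuntu 18.04:

```shell
#!/bin/sh
# Packages named in this thread for Ubuntu 18.04 (other releases may differ).
PKGS="openjdk-11-jdk libicu4j-49-java junit libxerces2-java libcommons-lang3-java libcommons-compress-java"

# Hypothetical helper: succeeds if dpkg reports the package as installed.
is_installed() {
    dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q "install ok installed"
}

# Hypothetical helper: prints every argument that is not installed, one per line.
missing_packages() {
    for p in "$@"; do
        is_installed "$p" || printf '%s\n' "$p"
    done
}

# $PKGS is left unquoted on purpose so it splits into one argument per package.
missing=$(missing_packages $PKGS)
if [ -n "$missing" ]; then
    echo "missing packages:"
    echo "$missing"
else
    echo "all listed packages are installed"
fi
```

The output can be passed straight to apt install; on a system without dpkg-query every package is simply reported as missing.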

Maybe we could also make it easier with docker, have a Dockerfile to specify the build environment and use it to build:

# Add Dockerfile
$ mkdir docker
$ echo 'FROM ubuntu:18.04

RUN apt-get update \
    && apt-get install -y \
         openjdk-11-jdk \
         libicu4j-49-java \
         junit \
         libxerces2-java \
         libcommons-lang3-java \
         libcommons-compress-java' \
> docker/Dockerfile

# create the development environment
$ docker build -t dictionary_build_env --file=docker/Dockerfile docker

# build inside the development environment
$ docker run -it --rm \
     --volume $(pwd):/workspace \
     dictionary_build_env \
       bash -c \
         'cd /workspace/DictionaryPC/ && ./compile.sh'

Note: I had to add -encoding UTF-8 -Xlint:deprecation to compile.sh for it to work.

Sorry that I don't really have time to help much on this, except helping with specific issues.
Does docker by default pull an image that is not configured with a UTF-8 locale?
I can reproduce with "LANG=en_US ./compile.sh" but not with "LANG=en_US.UTF-8 ./compile.sh".
I guess I can add it, but I don't really intend to support Linux setups stuck in the 1990s...
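
To tell whether a given environment would hit this, the locale's character map can be checked before compiling. A small sketch, assuming that javac (at least in JDK 11) takes its default source encoding from the locale when -encoding is not given; the helper name is_utf8_charmap is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given charmap name denotes UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# "locale charmap" prints the character encoding of the current locale,
# e.g. UTF-8, ISO-8859-1 or ANSI_X3.4-1968 (plain ASCII).
charmap=$(locale charmap 2>/dev/null)
if is_utf8_charmap "$charmap"; then
    echo "locale charmap is $charmap"
else
    echo "locale charmap is '$charmap', not UTF-8"
    echo "try LANG=C.UTF-8 ./compile.sh or add -encoding UTF-8 to javac"
fi
```

For the docker case, passing --env LANG=C.UTF-8 to docker run would be one way to try this without touching compile.sh; whether the ubuntu:18.04 image provides that locale is an assumption that should be verified.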

Here's the Java code compiled to a native Linux binary.
The latest git code is changed to use that binary if it exists.
I have not tested compatibility with older Linux distributions etc., or tested the scripts all that well, but if anyone wants to help, I think that's likely the best approach available for better usability.
I can't generate a Windows binary so far, though in theory it should be possible. But it should be possible to run the Linux binary under WSL.
DictionaryPC.zip