Add more detailed steps to update the dictionaries
Add software requirements and steps to update the dictionaries.
This should help potential extra maintainers
I managed to get WiktionarySplitter.sh to start processing with the following steps.
(I'm dumping this here for now in case it helps someone)
Using an Ubuntu 18.04 VM
# clone Dictionary
# clone DictionaryPC
$ apt install openjdk-11-jdk
$ ./compile.sh
ICU4J needs to be installed
--> apt search ICU4J
--> apt install libicu4j-49-java
$ ./compile.sh
Junit needs to be installed
---> sudo apt install junit
$ ./compile.sh
Xerces needs to be installed
--> sudo apt install libxerces2-java
$ ./compile.sh
commons-lang needs to be installed
---> apt install libcommons-lang3-java
$ ./compile.sh
commons-compress needs to be installed
--> apt install libcommons-compress-java
$ ./compile.sh
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: ../Dictionary/Util/src/com/hughes/util/CachingList.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
---> change compile.sh to have:
javac -g -Xlint:deprecation -Xlint:unchecked ../Dictionary/Util/src/com/hughes/util/*.java ../Dictionary/Util/src/com/hughes/util/raf/*.java ../Dictionary/src/com/hughes/android/dictionary/DictionaryInfo.java ../Dictionary/src/com/hughes/android/dictionary/engine/*.java ../Dictionary/src/com/hughes/android/dictionary/C.java src/com/hughes/util/*.java src/com/hughes/android/dictionary/*.java src/com/hughes/android/dictionary/*/*.java src/com/hughes/android/dictionary/*/*/*.java -classpath "$ICU4J:$JUNIT:$XERCES:$COMMONS:$COMMONS_COMPRESS"
$ ./compile.sh
../Dictionary/Util/src/com/hughes/util/CachingList.java:32: warning: [unchecked] unchecked cast
chunked = useChunked ? (ChunkedList<T>)list : null;
^
required: ChunkedList<T>
found: List<T>
where T is a type-variable:
T extends Object declared in class CachingList
src/com/hughes/util/MapUtil.java:39: warning: [deprecation] newInstance() in Class has been deprecated
map.put(key, valueClass.newInstance());
^
where T is a type-variable:
T extends Object declared in class Class
src/com/hughes/android/dictionary/parser/wiktionary/WholeSectionToHtmlParser.java:399: warning: [deprecation] StringEscapeUtils in org.apache.commons.lang3 has been deprecated
final String htmlEscaped = StringEscapeUtils.escapeHtml3(plainText);
^
3 warnings
# Note: there are 2048m of RAM in the VM used to test this
$ ./WiktionarySplitter.sh
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.PipedInputStream.initPipe(PipedInputStream.java:161)
at java.base/java.io.PipedInputStream.<init>(PipedInputStream.java:125)
at com.hughes.android.dictionary.engine.WriteBuffer.<init>(WriteBuffer.java:28)
at com.hughes.android.dictionary.engine.WiktionarySplitter.go(WiktionarySplitter.java:89)
at com.hughes.android.dictionary.engine.WiktionarySplitter.main(WiktionarySplitter.java:60)
---> add '-Xmx2048m' to java command in ./WiktionarySplitter.sh
$ ./WiktionarySplitter.sh
./WiktionarySplitter.sh: line 15: 7924 Killed "$JAVA" -Xverify:none -Xmx2048m -classpath src:../Util/src/:../Dictionary/src/:"$ICU4J":"$XERCES":"$COMMONS_COMPRESS" com.hughes.android.dictionary.engine.WiktionarySplitter "$@"
---> increase amount of ram in the VM to 4096m
---> add '-Xmx3072m' to java command in ./WiktionarySplitter.sh
$ ./WiktionarySplitter.sh
--> hitting issue #81
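The two failures above differ: the first is the JVM hitting its own heap limit, while the plain "Killed" most likely comes from the kernel's OOM killer, since -Xmx2048m asked for all of the VM's 2048m. A minimal sketch of picking a heap size that leaves headroom follows. The function names and the 1024 MB default headroom are assumptions, chosen only because they reproduce the 4096m VM / -Xmx3072m pair used above.

```shell
# Suggest a -Xmx value in MB: total RAM minus headroom left for the OS
# and the JVM's non-heap memory.
# $1: total RAM in MB
# $2: headroom in MB (assumed default: 1024, not a project-defined value)
suggest_heap_mb() {
  total_mb=$1
  headroom_mb=${2:-1024}
  heap_mb=$((total_mb - headroom_mb))
  # Never suggest less than a small floor (256 MB, also an assumption).
  if [ "$heap_mb" -lt 256 ]; then
    heap_mb=256
  fi
  echo "$heap_mb"
}

# Read total RAM in MB from /proc/meminfo (Linux only).
total_ram_mb() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}
```

On Linux, the java command in WiktionarySplitter.sh could then be given `-Xmx"$(suggest_heap_mb "$(total_ram_mb)")"m` instead of a hard-coded value.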
I don't think changing the compile.sh step makes sense. It just makes those warnings even more annoying, since at this point I either don't intend to fix them or can't.
As to the memory change: I don't think you can run this process with less than 8 GB of RAM anyway, which might be the reason I never needed to change the memory allocation for Java (Java only being able to use a fixed amount of RAM still is a really bad joke anyway).
I'll change it to use the same value as run.sh (and I ought to look into whether it could or should just re-use run.sh anyway).
It would be helpful to indicate the requirements for development, maybe in the README.md, or in a new CONTRIBUTING.md.
Have things like:
Hardware requirements:
- 8 GB RAM minimum (maybe 5 is enough?)
- XX GB disk minimum
Software requirements:
- OS: Linux
  - tested to work: Ubuntu XXX (16.04, 18.04...)
- Packages:
  - For Ubuntu 18.04:
    apt install \
      openjdk-11-jdk \
      libicu4j-49-java \
      junit \
      libxerces2-java \
      libcommons-lang3-java \
      libcommons-compress-java
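The package list above could also be checked before running compile.sh, rather than discovering each missing jar one failed build at a time. This is a minimal sketch: the package names are the Ubuntu 18.04 ones from this thread, and the function names are made up for illustration.

```shell
# Ubuntu 18.04 package names taken from the list above.
REQUIRED_PACKAGES="openjdk-11-jdk libicu4j-49-java junit libxerces2-java libcommons-lang3-java libcommons-compress-java"

# Print every required package that the given checker reports as missing.
# $1: name of a command or function that returns 0 if its argument is installed.
missing_packages() {
  checker=$1
  for pkg in $REQUIRED_PACKAGES; do
    "$checker" "$pkg" || echo "$pkg"
  done
}

# Checker backed by dpkg-query (Debian/Ubuntu only).
dpkg_installed() {
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'ok installed'
}
```

Running `missing_packages dpkg_installed` prints one package per line, and the output can be passed to `sudo apt install`. Taking the checker as a parameter keeps the loop usable on other distributions with a different query command.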
Maybe we could also make it easier with docker, have a Dockerfile to specify the build environment and use it to build:
# Add Dockerfile
$ mkdir docker
$ echo 'FROM ubuntu:18.04
RUN apt-get update \
&& apt-get install -y \
openjdk-11-jdk \
libicu4j-49-java \
junit \
libxerces2-java \
libcommons-lang3-java \
libcommons-compress-java' \
> docker/Dockerfile
# create the development environment
$ docker build -t dictionary_build_env --file=docker/Dockerfile docker
# build inside the development environment
$ docker run -it --rm \
    --volume "$(pwd)":/workspace \
    dictionary_build_env \
    bash -c \
    'cd /workspace/DictionaryPC/ && ./compile.sh'
Note: I had to add -encoding UTF-8 -Xlint:deprecation to compile.sh for it to work.
Sorry that I don't really have time to help much on this, except helping with specific issues.
Does docker by default pull an image that is not configured with a UTF-8 locale?
I can reproduce it with "LANG=en_US ./compile.sh" but not with "LANG=en_US.UTF-8 ./compile.sh".
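The locale dependence described above can be detected up front. The sketch below only inspects the standard locale environment variables; the function names are hypothetical, and it does not verify that the named locale is actually installed on the system.

```shell
# Return 0 if the locale name declares a UTF-8 codeset,
# e.g. en_US.UTF-8 or C.utf8; return 1 for names such as en_US or C.
is_utf8_locale() {
  case "$1" in
    *.UTF-8*|*.utf-8*|*.UTF8*|*.utf8*) return 0 ;;
    *) return 1 ;;
  esac
}

# Locale that governs character encoding: LC_ALL overrides LC_CTYPE,
# which overrides LANG; with none set, the default is "C".
effective_ctype() {
  echo "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}
```

A wrapper could call `is_utf8_locale "$(effective_ctype)"` and print a warning before invoking compile.sh. Passing `-encoding UTF-8` to javac, as mentioned earlier in this thread, avoids depending on the locale at all.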
I guess I can add it, but I don't really intend to support Linux setups stuck in the 1990s...
Here's the Java code compiled to a native Linux binary.
The latest git code is changed to use that binary if it exists.
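For readers unfamiliar with the pattern, "use the binary if it exists" can look roughly like the sketch below. It is not the code from the repository: `pick_launcher` and the binary path passed to it are hypothetical, since the thread does not name the binary.

```shell
# Print the command to launch: the native binary when it exists and is
# executable, otherwise plain "java" as the fallback.
# $1: path to the native binary (hypothetical; not named in this thread)
pick_launcher() {
  native_bin=$1
  if [ -x "$native_bin" ]; then
    echo "$native_bin"
  else
    echo "java"
  fi
}
```

A script would call `pick_launcher` once and then branch on the result, adding the -classpath and -Xmx arguments only in the "java" case, since a native binary does not take JVM options in the same way.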
I have not tested compatibility with older Linux distributions etc., or tested the scripts all that well, but if anyone wants to help, I think that's likely the best approach available for better usability.
I can't generate a Windows binary so far, though in theory it should be possible. But it should be possible to run the Linux binary under WSL on Windows.
DictionaryPC.zip