How to index CoNLL-U files?
fishfree opened this issue · 9 comments
Both the doc here and this issue seem to say that CoNLL-U files are supported. However, when I run java -cp "blacklab-3.0.1.jar:lib" nl.inl.blacklab.tools.IndexTool create testcorpus test.conllu conll-u, it shows:
Creating new index in testcorpus/ from ./test.conllu (using format conll-u)
16:22:25.792 [main] WARN nl.inl.blacklab.indexers.config.InputFormatReader - Name 'pos_getal-n' is not a valid XML element name; sanitized to 'pos_getal_n' in format file $BLACKLAB_JAR/formats/folia.blf.yaml
Cannot create new index in testcorpus with format conll-u: format not found
Please specify a correct format on the command line.
I noticed there is a docker-compose.conll-u.yml in this repo, but I didn't find the usage in the doc.
Sorry, the docker-compose.conll-u.yml was an old test file that shouldn't have been committed. I've removed it.
The development branch (and the upcoming version 4.0) can index CoNLL-U files. This was added while implementing dependency relations.
Older versions of BlackLab, such as 3.0.1, do not support dependency relations and don't come with built-in support for the CoNLL-U format.
@jan-niestadt Thank you, Jan. Could you tell me how to compile the blacklab-4.*.jar from the development branch? I tried:
git clone https://github.com/INL/BlackLab
cd BlackLab/build-tools
mvn install -DskipTests
cd target
java -cp "build-tools-4.0.0-SNAPSHOT.jar" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u
It fails with this error:
Error: Could not find or load main class nl.inl.blacklab.tools.IndexTool
Caused by: java.lang.ClassNotFoundException: nl.inl.blacklab.tools.IndexTool
Try this:
git clone https://github.com/INL/BlackLab
cd BlackLab
mvn clean package -DskipTests
cd tools/target
java -cp "./*:lib" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u
(UPDATE: simplified IndexTool command line)
@jan-niestadt I followed your instructions, but I get the same error:
[INFO] BlackLab legacy DocIndexers ........................ SUCCESS [ 0.599 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:13 min
[INFO] Finished at: 2024-03-26T06:39:06+08:00
[INFO] ------------------------------------------------------------------------
(base) meme@ubuntugpu:~/BlackLab$ cd tools/target
(base) meme@ubuntugpu:~/BlackLab/tools/target$ java -cp "build-tools-4.0.0-SNAPSHOT.jar:lib" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u
Error: Could not find or load main class nl.inl.blacklab.tools.IndexTool
Caused by: java.lang.ClassNotFoundException: nl.inl.blacklab.tools.IndexTool
Sorry, I didn't update the jar file name. But you can run it like this without typing the jar file name at all:
java -cp "./*:lib" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u
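A note on why the quotes around the classpath matter: Java itself expands a classpath entry ending in * to all jar files in that directory, so the pattern has to reach java unexpanded. The sketch below is only an illustration, using a scratch directory and two dummy jar names (a.jar, b.jar) that are assumptions and not part of BlackLab. It uses a bare ./* pattern to make the difference visible.

```shell
# Report how many arguments a value turns into after the shell has processed it.
count_args() {
  echo "$#"
}

# Scratch directory with two dummy jar files (hypothetical names, for illustration only).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.jar" "$demo_dir/b.jar"

# Unquoted: the shell expands the glob itself, one argument per matching file.
unquoted=$(cd "$demo_dir" && count_args ./*)

# Quoted: the literal string ./* is passed through, which is what java -cp expects.
quoted=$(cd "$demo_dir" && count_args "./*")

echo "unquoted: $unquoted, quoted: $quoted"
```

With the quotes in place, java receives the pattern as a single classpath entry and resolves the jars on its own, so the exact jar file name never needs to be typed.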
@jan-niestadt Thank you! It works now. What's the exact jar file instead of the wildcard?
Using ls, you should be able to see it's called blacklab-tools-4.0.0-SNAPSHOT.jar
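If it is unclear which jar provides a class (the ClassNotFoundException earlier in this thread came from pointing -cp at a jar that does not contain IndexTool), the jars can be searched directly. This is a minimal sketch, assuming unzip is installed; the helper names class_to_entry and find_class_jar are made up for this example and are not part of BlackLab.

```shell
# Convert a fully qualified class name into its entry path inside a jar,
# e.g. a.b.C becomes a/b/C.class.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar in directory $2 (default: current directory) that contains class $1.
find_class_jar() {
  entry=$(class_to_entry "$1")
  for jar in "${2:-.}"/*.jar; do
    [ -e "$jar" ] || continue          # no jars at all: the glob stays literal, skip it
    # A jar is a zip archive, so listing its entries shows the classes it holds.
    if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
      printf '%s\n' "$jar"
    fi
  done
}
```

Run from tools/target, a call such as find_class_jar nl.inl.blacklab.tools.IndexTool . should list the jar to put on the classpath.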
@jan-niestadt BTW: Could you please update the image here to the 4.0 snapshot?
Done! (image version 4-alpha4)