INL/BlackLab

How to index CoNLL-U files?

fishfree opened this issue · 9 comments

Both the documentation here and this issue seem to say that CoNLL-U files are supported. However, when I run java -cp "blacklab-3.0.1.jar:lib" nl.inl.blacklab.tools.IndexTool create testcorpus test.conllu conll-u, it prints:

Creating new index in testcorpus/ from ./test.conllu (using format conll-u)
16:22:25.792 [main] WARN  nl.inl.blacklab.indexers.config.InputFormatReader - Name 'pos_getal-n' is not a valid XML element name; sanitized to 'pos_getal_n' in format file $BLACKLAB_JAR/formats/folia.blf.yaml
Cannot create new index in testcorpus with format conll-u: format not found
Please specify a correct format on the command line.

I noticed there is a docker-compose.conll-u.yml in this repo, but I couldn't find any instructions for it in the docs.

Sorry, the docker-compose.conll-u.yml was an old test file that shouldn't have been committed. I've removed it.

The development branch (and the upcoming version 4.0) can index CoNLL-U files. This was added while implementing dependency relations.

Older versions of BlackLab, such as 3.0.1, do not support dependency relations and do not come with built-in support for the CoNLL-U format.
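
If you want to check which formats a particular jar ships with, the warning in your log hints at where to look: it mentions $BLACKLAB_JAR/formats/folia.blf.yaml. The sketch below assumes that all built-in formats are stored the same way, as formats/<name>.blf.yaml inside the jar. That layout is inferred from the log line rather than confirmed, and the function name bundled_formats is just an illustrative helper:

```shell
# Turn a jar file listing into a list of bundled format names.
# Assumption: built-in formats live at formats/<name>.blf.yaml inside the jar,
# as the folia.blf.yaml path in the warning suggests.
bundled_formats() {
    # Reads a file listing on stdin and prints one format name per line.
    grep -o 'formats/[^/ ]*\.blf\.yaml' \
        | sed -e 's|^formats/||' -e 's|\.blf\.yaml$||' \
        | sort -u
}

# Usage (assumes unzip is installed and the jar is in the current directory):
#   unzip -l blacklab-3.0.1.jar | bundled_formats
```

If conll-u is missing from the output, that jar has no built-in definition under that name, which would match the "format not found" message.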

@jan-niestadt Thank you, Jan. Could you tell me how to compile the blacklab-4.*.jar from the development branch? I tried:

git clone https://github.com/INL/BlackLab
cd BlackLab/build-tools
mvn install  -DskipTests
cd target
java -cp "build-tools-4.0.0-SNAPSHOT.jar" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u

It shows errors:

Error: Could not find or load main class nl.inl.blacklab.tools.IndexTool
Caused by: java.lang.ClassNotFoundException: nl.inl.blacklab.tools.IndexTool

Try this:

git clone https://github.com/INL/BlackLab
cd BlackLab
mvn clean package  -DskipTests
cd tools/target
java -cp "./*:lib" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u

(UPDATE: simplified IndexTool command line)

@jan-niestadt I followed your instructions, but I get the same error:

[INFO] BlackLab legacy DocIndexers ........................ SUCCESS [  0.599 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:13 min
[INFO] Finished at: 2024-03-26T06:39:06+08:00
[INFO] ------------------------------------------------------------------------
(base) meme@ubuntugpu:~/BlackLab$ cd tools/target
(base) meme@ubuntugpu:~/BlackLab/tools/target$ java -cp "build-tools-4.0.0-SNAPSHOT.jar:lib" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u
Error: Could not find or load main class nl.inl.blacklab.tools.IndexTool
Caused by: java.lang.ClassNotFoundException: nl.inl.blacklab.tools.IndexTool

Sorry, I didn't update the jar file name. But you can run it like this without typing the jar file name at all:

java -cp "./*:lib" nl.inl.blacklab.tools.IndexTool create testcorpus ~/test.conllu conll-u
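
To try the command above without a real corpus, a tiny hand-written file is enough. The sketch below writes one sentence in the standard ten-column CoNLL-U layout. The helper name write_sample_conllu and the sentence itself are made up for illustration, and I have not verified this exact file against BlackLab's conll-u format, so treat it as a starting point:

```shell
# Write a one-sentence CoNLL-U file to the given path (default: test.conllu).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC, tab-separated.
# "_" marks an unspecified value; a blank line ends the sentence.
write_sample_conllu() {
    out=${1:-test.conllu}
    {
        printf '# sent_id = 1\n'
        printf '# text = Dogs bark.\n'
        printf '1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n'
        printf '2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n'
        printf '3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n'
        printf '\n'
    } > "$out"
}

# Usage:
#   write_sample_conllu ~/test.conllu
```

Using printf with \t avoids editors silently turning the tabs into spaces, which would break the column layout.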

@jan-niestadt Thank you! It works now. What is the exact jar file name to use instead of the wildcard?

Using ls, you should be able to see it's called blacklab-tools-4.0.0-SNAPSHOT.jar
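
If you ever hit the ClassNotFoundException again, note that Java silently ignores classpath entries that don't exist, so a wrong jar name gives exactly that error. A rough way to see which jar actually contains a class is to search the jar listings. This is a minimal sketch: the helper names class_to_entry and jars_with_class are hypothetical, and it assumes unzip is installed:

```shell
# Convert a fully qualified class name to its entry path inside a jar.
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar in a directory (default: current) that contains the class.
# Assumes unzip is available; `jar tf` could be used instead.
jars_with_class() {
    entry=$(class_to_entry "$1")
    for jar in "${2:-.}"/*.jar; do
        [ -f "$jar" ] || continue   # skip the unexpanded glob when no jars exist
        if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
            printf '%s\n' "$jar"
        fi
    done
}

# Usage, from tools/target:
#   jars_with_class nl.inl.blacklab.tools.IndexTool .
```

Whatever jar this prints is the one to put on the classpath in place of the wildcard.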

@jan-niestadt BTW: Could you please update the image here to the 4.0 snapshot?

Done! (image version 4-alpha4)