Attempting to index the XML version of the Brown Corpus
I've managed to get through some of the "Getting Started" section, although the installation and setup of Apache Tomcat was bafflingly complex and some of the commands in the Blacklab tutorial were not correct. Now I am stuck here: The following command seems to index the Brown file, which is in the ../tmp directory, and successfully creates a test-index directory.
$ java -cp "blacklab-3.0.1.jar:lib" nl.inl.blacklab.tools.IndexTool create ./data/blacklab-corpora/test-index ../tmp/ tei-p5
Creating new index in ./data/blacklab-corpora/test-index/ from ../tmp/ (using format tei-p5)
08:14:01.529 [main] WARN nl.inl.blacklab.indexers.config.InputFormatReader - Name 'pos_getal-n' is not a valid XML element name; sanitized to 'pos_getal_n' in format file $BLACKLAB_JAR/formats/folia.blf.yaml
254 docs (24 MB, 515.0k tokens); avg. 51.5k tok/s (2.4 MB/s); currently 51.5k tok/s (2.4 MB/s); 10 sec elapsed
500 docs (46 MB, 1.0M tokens); avg. 57.4k tok/s (2.7 MB/s); currently 55.6k tok/s (2.6 MB/s); 17 sec elapsed
Done. Elapsed time: 17 seconds
However, the blacklab server cannot see it.
I have put blacklab-server.yaml in Tomcat's webapps folder, in the same folder as blacklab-3.0.1.jar, and in an etc/blacklab directory inside the blacklab-core-3.0.1 directory.
In blacklab-server.yaml, I have tried both the relative path to /data/blacklab-corpora from the directory where blacklab-3.0.1.jar is located and the absolute path from the root of the file system. Neither seems to work.
What am I doing wrong?
Sorry to hear you're having a hard time getting up and running. I'd like to use your feedback to improve others' experience in the future.
However, the blacklab server cannot see it.
The suggested location for blacklab-server.yaml is /etc/blacklab/. But I guess this location wasn't accessible to you? In that case, see here for other options; you probably want to pass the environment variable $BLACKLAB_CONFIG_DIR to Tomcat, set to the full path of the directory where you've placed blacklab-server.yaml. I guess if /etc/blacklab/ isn't workable for some users, I should include other options in the tutorial.
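For reference, one common way to hand an environment variable to Tomcat is a bin/setenv.sh file, which catalina.sh sources at startup if it exists. The following is only a sketch: the temporary directory stands in for your real Tomcat directory, and /home/me/blacklab-config is a made-up path. If your Tomcat is started by systemd or a distribution package, the variable may need to go in the service configuration instead.

```shell
#!/bin/sh
set -eu

# Stand-in for the Tomcat directory; on a real system this would be
# $CATALINA_BASE (or $CATALINA_HOME). The temp dir is for illustration only.
TOMCAT_DIR="$(mktemp -d)"
mkdir -p "$TOMCAT_DIR/bin"

# Hypothetical location; use the directory that actually holds
# blacklab-server.yaml on your machine.
CONFIG_DIR="/home/me/blacklab-config"

# catalina.sh sources bin/setenv.sh at startup if the file exists,
# so the export below ends up in Tomcat's environment.
cat > "$TOMCAT_DIR/bin/setenv.sh" <<EOF
export BLACKLAB_CONFIG_DIR="$CONFIG_DIR"
EOF

# Show what was written.
cat "$TOMCAT_DIR/bin/setenv.sh"
```

Tomcat has to be restarted before a new or changed setenv.sh takes effect.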
In blacklab-server.yaml, there should be a key indexLocations pointing to the directory containing your corpora. So that's the parent directory of where your Brown corpus was indexed. If the Brown corpus index is in /data/blacklab/brown, your config file should contain this:
indexLocations:
- /data/blacklab
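To make the parent/child relationship concrete, here is a small sketch that builds that layout in a throwaway directory. The temporary root is an assumption for illustration, and the empty brown directory merely stands in for a real index; nothing here touches BlackLab itself.

```shell
#!/bin/sh
set -eu

# Throwaway root standing in for the filesystem root (illustration only).
ROOT="$(mktemp -d)"

# The index lives one level below the directory listed in indexLocations.
CORPORA_DIR="$ROOT/data/blacklab"
mkdir -p "$CORPORA_DIR/brown"

# Write a config that points at the parent directory, not at the index.
CONFIG_DIR="$ROOT/etc/blacklab"
mkdir -p "$CONFIG_DIR"
cat > "$CONFIG_DIR/blacklab-server.yaml" <<EOF
---
configVersion: 2
indexLocations:
- $CORPORA_DIR
EOF

# List the subdirectories of the indexLocations entry; each one is
# where an individual corpus index would sit.
for dir in "$CORPORA_DIR"/*/; do
  echo "corpus candidate: $(basename "$dir")"
done
```

With this layout, adding a second corpus is just a matter of indexing into another subdirectory of the same parent.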
I agree that BlackLab Server's error message could be clearer. I will see if I can make sure it tells you whether (a) it could not find blacklab-server.yaml (plus a link to the docs) or (b) it did find blacklab-server.yaml but couldn't find any valid corpus directories.
the installation and setup of Apache Tomcat was bafflingly complex
Other than the above, what sort of problems did you run into with Tomcat? Are you on Windows or Linux? Do you have any recommendations how to make the instructions clearer?
some of the commands in the Blacklab tutorial were not correct
Sorry about that. It might be that they were correct at one point, but something changed and I forgot to update them. If you let me know the ones that didn't work for you (other than the above), I will correct them.
By the way, instead of the tei-p5 format, it's better to index the Brown corpus with the tei-p5-legacy format. This is indicated on the page, but I should make it clearer. If you use tei-p5, it will work, but part-of-speech annotations aren't indexed.
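As a sketch, that would be the same invocation you ran, with only the format name changed. The jar name, classpath, and paths below are copied from your command and assumed to be unchanged; the script only prints the command if the jar is not in the current directory.

```shell
#!/bin/sh
set -eu

# Values taken from the original command; only FORMAT differs.
JAR="blacklab-3.0.1.jar"
INDEX_DIR="./data/blacklab-corpora/test-index"
INPUT_DIR="../tmp/"
FORMAT="tei-p5-legacy"

# Assemble the command as positional parameters so it can be run or printed.
set -- java -cp "$JAR:lib" nl.inl.blacklab.tools.IndexTool create \
  "$INDEX_DIR" "$INPUT_DIR" "$FORMAT"

if [ -f "$JAR" ]; then
  # Jar is present: run the indexer.
  "$@"
else
  # Jar not found here: show the command instead of running it.
  echo "dry run: $*"
fi
```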
Please let me know if this solves your problems. Suggestions to improve the documentation are appreciated.
Thanks for your quick response. I have it working for now. The issue seems to have been that initially, I wanted all the files to be in a specific directory, so I didn't create a new data directory in the root and a new blacklab directory in /etc. After not having success with the .yaml in my desired directory, I copied it to the webapps directory in Tomcat, with no change in the server response. Finally, I created the new directories in the root, ran the indexing command as above, sudo mv'd the test-index directory thus created to /data/blacklab-corpora and used the following as the .yaml file:
---
configVersion: 2
# Where BlackLab can find corpora
indexLocations:
- /data/blacklab-corpora/test-index
I think the key issue is that before, my .yaml was pointing to .../data/blacklab-corpora, not .../data/blacklab-corpora/test-index. Finally, it took me a while to realize that I needed to delete the other .yaml files I had used unsuccessfully and put in .../tomcat/webapps.
It might be helpful if the tutorial contains a step-by-step walkthrough for the example Brown Corpus, including the exact commands needed to index it and the exact .yaml file needed for Blacklab to be able to find it.
For Tomcat, the documentation assumes familiarity with shell scripting, with daemons, services, systemctl, .conf files, etc., so it is a bit heavy going and takes a while to get right.
Thanks again for your help, and I'm sure I'll post more questions as I continue testing and continue to make dumb mistakes 😄
Good to hear it works!
indexLocations should usually be set to /data/blacklab-corpora, at least if you might want to have multiple (versions of) corpora in the same BlackLab Server instance. Does that not work?
(/data/blacklab-corpora is just a suggestion of course. You can place your indexes wherever you want, as long as you refer to the directory from the config file and the directory can be read by Tomcat.)
I've made some changes to the Getting Started page, adding specific commands for setting up BlackLab Server. Hopefully that makes things a bit clearer.
I will look at improving the error messages soon, see #444.
It appears that /data/blacklab-corpora also works, but the user (maybe?) needs to stop the service and restart it for this change to go into effect. I did this in the "Tomcat Web Application Manager".