INL/BlackLab

Issue with Index Metadata Not Updating After Document Removal

Retr0327 opened this issue · 3 comments

After removing a document from the Lucene index with java -cp "/jars/blacklab/WEB-INF/lib/*" nl.inl.blacklab.tools.IndexTool delete INDEX_DIR FILTER_QUERY, I noticed that the index metadata is not updated to reflect the change, so it no longer matches the actual contents of the index. Is there a way to update indexmetadata.yaml after removing documents?

Thanks for reading my question!

I assume you mean things like the document count, token count and field values? You're correct that these don't update automatically in BlackLab v3. For version 4, currently in development, we've added a new index type that integrates all the external files into the Lucene index, and these values are now determined dynamically when the index is opened. If you're on the dev branch and use IndexTool to create a new index, you can pass --index-type integrated to get the new index type. With this, you shouldn't encounter any problems with metadata not updating when documents are deleted. If you do, please let us know so we can improve it.
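As a rough sketch of what that could look like on the command line: the small helper below only prints an IndexTool create command with --index-type integrated, so it can be reviewed before running. The classpath is copied from the delete command in the question. The helper name indextool_create_cmd, the example paths and the format name are hypothetical, and both the "create INDEX_DIR INPUT_FILES FORMAT" argument order and the placement of the option before the positional arguments are assumptions; check IndexTool's own usage output on your version.

```shell
#!/bin/sh
# Sketch only: prints an IndexTool "create" command that requests the
# integrated index type; it does not run Java itself.
# Assumptions: classpath taken from the delete command in the question;
# argument order "create INDEX_DIR INPUT_FILES FORMAT"; option placed
# before the positional arguments.

BLACKLAB_CP='/jars/blacklab/WEB-INF/lib/*'

# indextool_create_cmd INDEX_DIR INPUT_FILES FORMAT  (hypothetical helper)
# Prints the full command line so it can be inspected first.
indextool_create_cmd() {
  if [ "$#" -ne 3 ]; then
    echo "usage: indextool_create_cmd INDEX_DIR INPUT_FILES FORMAT" >&2
    return 1
  fi
  printf 'java -cp "%s" nl.inl.blacklab.tools.IndexTool create --index-type integrated "%s" "%s" "%s"\n' \
    "$BLACKLAB_CP" "$1" "$2" "$3"
}

# Hypothetical paths and format name, for illustration only.
indextool_create_cmd /data/index/my-corpus /data/input tei
```

Note that this creates a new index; going by the explanation above, an existing v3 index would need to be re-indexed to get the integrated type.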

The integrated index will eventually be the default index type. We're already using it in-house and it works well for us.

The dev branch should be reasonably stable, and I'm actually planning to publish a 4.0-beta release next week.
I hope this solution works for you!

A snapshot version of v4 (with support for the integrated index format) is available here now: https://oss.sonatype.org/content/repositories/snapshots/nl/inl/blacklab/blacklab/4.0.0-SNAPSHOT/

You can enable snapshot repositories in your Maven settings.xml file; see https://stackoverflow.com/a/7717234

I'll still look into publishing a "real" beta version soon as well.

Closing this as it is solved by switching to the new integrated index type as explained above.