stanfordnlp/CoreNLP

Add Automatic-Module-Name to MANIFEST.MF

NightOwl888 opened this issue · 22 comments

The Stanford CoreNLP JAR files from Maven that I have examined thus far lack a JDK9 Automatic-Module-Name entry in their MANIFEST.MF files and are thus undiscoverable as unique modules by tooling that expects module names complying with this specification. One such tool is ikvm-maven, which follows the JDK9 spec to provide unique names for packages on Maven.

As mentioned at https://dev.java/learn/modules/automatic-module/ and in many other places, JAR files that are published publicly should ideally include an Automatic-Module-Name entry so that tooling that requires it can operate properly.

In this particular case, for example, tooling is unable to locate the Automatic-Module-Name entry and thus falls back to the name-inference rules described at https://docs.oracle.com/javase/9/docs/api/java/lang/module/ModuleFinder.html. For the JARs published to Maven under the 'models' classifier, the file names (stanford-opennlp-4.5.5-models.jar, etc.) would be inferred as the module name "stanford.corenlp", which is the same module name inferred for the core library (stanford-corenlp-4.5.5.jar), thus causing a duplicate name. This could be resolved by including an explicit (and unique) entry in the MANIFEST.MF of each of your Maven packages.
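
For illustration, here is a minimal sketch of that fallback, written from the naming rules in the ModuleFinder documentation linked above (strip ".jar", cut at the first hyphen-plus-digits version marker, turn non-alphanumerics into dots). The class and method names are made up for this example, and this is not the code the JDK or IKVM actually runs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the documented name-derivation rules.
class ModuleNameInference {
    // Version marker: a hyphen followed by digits, then a dot or the end of the name.
    private static final Pattern VERSION = Pattern.compile("-(\\d+(\\.|$))");

    static String inferModuleName(String jarFileName) {
        // Drop the ".jar" suffix.
        String name = jarFileName.endsWith(".jar")
                ? jarFileName.substring(0, jarFileName.length() - 4)
                : jarFileName;
        // Everything from the first version marker onward is treated as the version,
        // so a trailing classifier such as "-models" is discarded along with it.
        Matcher m = VERSION.matcher(name);
        if (m.find()) {
            name = name.substring(0, m.start());
        }
        return name.replaceAll("[^A-Za-z0-9]", ".")  // non-alphanumerics become dots
                   .replaceAll("\\.{2,}", ".")       // collapse repeated dots
                   .replaceAll("^\\.|\\.$", "");     // trim a leading or trailing dot
    }

    public static void main(String[] args) {
        System.out.println(inferModuleName("stanford-corenlp-4.5.5.jar"));
        System.out.println(inferModuleName("stanford-corenlp-4.5.5-models.jar"));
    }
}
```

Both calls yield the same name because the classifier sits after the version number, which is why the core and models JARs collide.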

Without being able to resolve unique names, IKVM is unable to load resources from the models packages because the name collision means that no binary output that includes the packaged resources will be generated during compilation. In other words, without a patch for these packages to add Automatic-Module-Name our only choice is to provide all of the paths of the resource files through properties.

Reference: ikvmnet/ikvm-maven#51

It's probably fine to do this, but would you mind giving us a concrete explanation of what you need?

For IKVM, the only requirements are that the Automatic-Module-Name is present in each .jar file and that each .jar file has a globally unique Automatic-Module-Name.

Classifier is a Maven-specific field, so it isn't considered when generating a name, as not every .jar will be sourced from Maven. As a result, the IKVM equivalent of the below configuration will generate 2 .NET assemblies with the same name. It ends up excluding the assembly with the models in it, so we have to download them separately and specify them using properties.

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
    <classifier>models</classifier>
</dependency>

But, there appears to be more to Automatic-Module-Name than that: http://branchandbound.net/blog/java/2017/12/automatic-module-name/. There are many things to consider when choosing a name and sanity checks that need to be done, so that is probably something best left up to you.

IKVM doesn't require a module-info.java file, but that is something you may want to consider doing, as well.

Okay, this all seems pretty straightforward in terms of naming it, probably edu.stanford.nlp.corenlp

The question I have is then, do the model jar files also need unique names?

We'll look into it. BTW, we're CoreNLP not OpenNLP. :)

Yes, that is the entire issue. The model package names are all exactly the same (and the same as the main package). So, each .jar should get a unique Automatic-Module-Name to distinguish them.

Ugh. Sorry, I mostly copied the response I was given when bringing this up to the IKVM project. I didn't realize they were referring to the wrong library :). But the solution to the issue is still the same.

Got it. So, something like

edu.stanford.nlp.corenlp
edu.stanford.nlp.corenlp.english_models
edu.stanford.nlp.corenlp.english_extra_models
edu.stanford.nlp.corenlp.spanish_extra_models
etc etc
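
As a rough way to sanity-check names like these, the sketch below writes two manifest-only JARs whose file names would otherwise collide, gives each its own Automatic-Module-Name, and asks the JDK's ModuleFinder what it sees. The mapping of the "models" JAR to edu.stanford.nlp.corenlp.english_models is an assumption made for the example, as are the class and method names; the real release may assign names differently.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.lang.module.ModuleFinder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical demo class, not part of CoreNLP or IKVM.
class UniqueModuleNames {
    // Writes a JAR containing only a manifest with the given Automatic-Module-Name.
    static void writeJar(Path jar, String moduleName) throws IOException {
        Manifest mf = new Manifest();
        Attributes main = mf.getMainAttributes();
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(new Attributes.Name("Automatic-Module-Name"), moduleName);
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out, mf)) {
            // The constructor already wrote META-INF/MANIFEST.MF; nothing else to add.
        }
    }

    // Returns the module names ModuleFinder reports for a directory holding both JARs.
    static Set<String> resolveNames() throws IOException {
        Path dir = Files.createTempDirectory("corenlp-modules");
        writeJar(dir.resolve("stanford-corenlp-4.5.6.jar"),
                 "edu.stanford.nlp.corenlp");
        writeJar(dir.resolve("stanford-corenlp-4.5.6-models.jar"),
                 "edu.stanford.nlp.corenlp.english_models"); // assumed mapping
        Set<String> names = new TreeSet<>();
        ModuleFinder.of(dir).findAll()
                    .forEach(ref -> names.add(ref.descriptor().name()));
        return names;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(resolveNames());
    }
}
```

Without the manifest entries, both files would fall back to the same inferred name, and ModuleFinder would be expected to reject the directory with a FindException instead of returning two modules.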

One of the suggestions on the page you linked was that the jar files not have anything in the root namespace. We currently have StanfordCoreNLP.properties at the top level, but this is not a class file. Do you know if this properties file will violate the root namespace constraint?

I wasn't sure, so I asked ChatGPT and Bard and both seem to agree that .properties files are not subject to the same restrictions as .class files. I wasn't able to get a URL to confirm this, but Bard cited part of the specification.

I apologize, I'm unable to provide live links within the ChatGPT environment. However, I can provide you with the full and accurate text reference for Section 4.7.33 "Resource Files" from the Java Platform Module System (JPMS) Specification:

Section 4.7.33 "Resource Files"

Definition:

A resource file is a file that is not a class file or a source file.

Location:

Resource files may be located in the top-level directory of a module or in any subdirectory of a package directory.

Visibility:

Resource files in the top-level directory of a module are encapsulated by the module. They are not visible to other modules unless the module exports the package containing the resource file.

Accessing resource files:

Resource files can be accessed using the getResource() and getResourceAsStream() methods of the ClassLoader and Class classes.

Oddly, Google searches for Section 4.7.33 "Resource Files" or even Java Platform Module System (JPMS) Specification return no official results.
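
Since no citation turned up, one alternative is to check empirically. The sketch below builds a throwaway JAR with a top-level StanfordCoreNLP.properties entry plus an Automatic-Module-Name and asks the JDK's ModuleFinder to describe it. The class name, the module name, and the file contents are placeholders, and this only shows how the JDK treats such a JAR; it says nothing about IKVM or other tooling.

```java
import java.io.IOException;
import java.lang.module.ModuleDescriptor;
import java.lang.module.ModuleFinder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical check, not part of CoreNLP.
class RootResourceCheck {
    static ModuleDescriptor describe() throws IOException {
        Path jar = Files.createTempDirectory("root-resource")
                        .resolve("stanford-corenlp-4.5.6.jar");
        Manifest mf = new Manifest();
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        mf.getMainAttributes().put(new Attributes.Name("Automatic-Module-Name"),
                                   "edu.stanford.nlp.corenlp"); // name proposed in this thread
        try (JarOutputStream jos = new JarOutputStream(Files.newOutputStream(jar), mf)) {
            // A non-class resource placed at the root of the JAR.
            jos.putNextEntry(new JarEntry("StanfordCoreNLP.properties"));
            jos.write("example.key=value\n".getBytes(StandardCharsets.UTF_8)); // placeholder content
            jos.closeEntry();
        }
        // Would throw FindException if the JAR could not be treated as a module.
        return ModuleFinder.of(jar).findAll().iterator().next().descriptor();
    }

    public static void main(String[] args) throws IOException {
        ModuleDescriptor d = describe();
        System.out.println(d.name() + " automatic=" + d.isAutomatic()
                + " packages=" + d.packages());
    }
}
```

The ModuleFinder documentation derives an automatic module's packages from .class entries only, so the root-level properties file should not contribute a package or trigger the error reserved for top-level .class files.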

https://www.youtube.com/watch?v=_n5E7feJHw0

Well at any rate, I put model names in each of the jar files except the long neglected caseless models file, and I'll put together a new release ASAP. Does it need to be on Maven Central for this to be of benefit to you?

Yes, we will need it on Maven Central for MavenReference to pick it up.

According to Sonatype, there is now a release, 4.5.6, which has the module identifiers you want.

It's not uploaded to Maven Central yet, though. Perhaps that process takes a while

It now shows up on Maven. Please let us know if it solves your problem or if there's something else you needed

Thanks for the quick turnaround.

The issue with generating identical binaries is now resolved (ikvmnet/ikvm-maven#51 (comment)). Unfortunately, that alone wasn't enough to fix the issue because the IKVM ClassLoader doesn't seem to be able to find resources in separate assemblies. But, that is entirely an IKVM problem so we can consider this closed.

If there's something else we can do to help, there will always be another CoreNLP release in the future.

Well, probably not always. But I don't foresee it ending in 2024, at least.

@AngledLuffa

This issue is finally solved, and CoreNLP 4.5.6 is now easily available to the .NET community using the latest ikvm.maven.

I meant 4.5.6

@AngledLuffa

Are you aware of rules to infer e.g. indirect object based on POS and dependency parsing?

I suggest opening a new thread for asking about issues other than the manifest files