pkiraly/metadata-qa-api

Help with multilinguality

Opened this issue · 5 comments

Hi @pkiraly

this feature is super interesting! But

  • what does it do exactly? I got some idea from your docs, but not sure. It guesses the language of a field value?
  • how does it work?
  • can we get it into the schema language?

Cheers

Hi @mielvds,

it calculates some multilinguality metrics based on the language tags available in JSON or XML, such as

dc:subject: [ "library"@en, "bibliotheek"@nl ]

This is a multilingual field value with two languages.

The API calculates metrics at field level and at record level.

Field level metrics:

  • Number of tagged literals
  • Number of distinct language tags
  • Number of tagged literals per language tag

Record level metrics:

  • Number of tagged literals
  • Number of distinct language tags
  • Number of tagged literals per language tag
  • Average number of languages per property for which there is at least one language-tagged literal
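The field-level metrics above can be sketched as follows. This is a minimal illustration, not the actual metadata-qa-api code: the `TaggedLiteral` record and the method names are placeholders invented for this example.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the field-level multilinguality metrics for a list
// of language-tagged literals such as [ "library"@en, "bibliotheek"@nl ].
public class MultilingualityMetrics {

  // A tagged literal: a value plus an optional language tag (null if untagged).
  public record TaggedLiteral(String value, String languageTag) {}

  // Number of tagged literals: literals that carry a language tag.
  public static long taggedLiterals(List<TaggedLiteral> field) {
    return field.stream().filter(l -> l.languageTag() != null).count();
  }

  // Number of distinct language tags in the field.
  public static long distinctLanguageTags(List<TaggedLiteral> field) {
    return field.stream()
        .map(TaggedLiteral::languageTag)
        .filter(Objects::nonNull)
        .distinct()
        .count();
  }

  // Number of tagged literals per language tag.
  public static Map<String, Long> literalsPerTag(List<TaggedLiteral> field) {
    return field.stream()
        .filter(l -> l.languageTag() != null)
        .collect(Collectors.groupingBy(
            TaggedLiteral::languageTag, Collectors.counting()));
  }

  public static void main(String[] args) {
    // dc:subject: [ "library"@en, "bibliotheek"@nl ]
    List<TaggedLiteral> dcSubject = List.of(
        new TaggedLiteral("library", "en"),
        new TaggedLiteral("bibliotheek", "nl"));
    System.out.println(taggedLiterals(dcSubject));        // 2
    System.out.println(distinctLanguageTags(dcSubject));  // 2
    System.out.println(literalsPerTag(dcSubject));        // counts per tag
  }
}
```

The record-level variants would apply the same counting over all fields of a record instead of a single field.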

Hmm, OK, then I can't use it right now: there are no language tags in our data.

Basically, I'm looking for language detection, because we have mixed-language fields and I want to figure out the distribution.

I experimented with language detection once, and the code contains a dependency on a library in that area. I stopped at some point because language detection usually did not work well on the very short texts typical of metadata records, but we can start playing with it again.

    <dependency>
      <groupId>com.optimaize.languagedetector</groupId>
      <artifactId>language-detector</artifactId>
      <version>0.6</version>
    </dependency>

It seems this library has not been developed since 2016 (https://github.com/optimaize/language-detector).

I've done something similar in Python; I can look into implementing it here as well.

A LanguageDetectionCalculator implementing https://github.com/pkiraly/metadata-qa-api/blob/a104aa3457ff68ffb997615654d77f5f70de7167/src/main/java/de/gwdg/metadataqa/api/interfaces/Calculator.java? It could also be an extension of the current LanguageCalculator, for example new LanguageCalculator(schema, true), where true means languages are detected rather than extracted from tags.
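The proposed calculator could look roughly like this. Everything here is a placeholder: the class name, the toy stopword-based `detect` method (standing in for a real library such as optimaize/language-detector), and the stopword lists are all invented for illustration, and this does not implement the real Calculator interface from the repository.

```java
import java.util.*;

// Hypothetical sketch of the LanguageDetectionCalculator idea: guess a
// language per field value instead of reading it from a language tag.
public class LanguageDetectionSketch {

  // Toy detector: scores languages by counting a few characteristic words.
  // A real implementation would delegate to a language-detection library.
  static final Map<String, Set<String>> STOPWORDS = Map.of(
      "en", Set.of("the", "and", "of", "library"),
      "nl", Set.of("de", "het", "een", "bibliotheek"));

  public static String detect(String text) {
    Map<String, Integer> scores = new HashMap<>();
    for (String token : text.toLowerCase().split("\\W+")) {
      STOPWORDS.forEach((lang, words) -> {
        if (words.contains(token)) scores.merge(lang, 1, Integer::sum);
      });
    }
    return scores.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse("und"); // BCP 47 tag for "undetermined"
  }

  public static void main(String[] args) {
    System.out.println(detect("the national library of the Netherlands")); // en
    System.out.println(detect("de nationale bibliotheek"));                // nl
  }
}
```

With something like this in place, the existing multilinguality metrics could be fed detected languages instead of explicit tags, which matches the new LanguageCalculator(schema, true) idea.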

Sounds promising. You might take a look at this discussion, which gives a comparison of some language-detection libraries: optimaize/language-detector#107. I am not an expert in this, so there might be other relevant libraries.