pkiraly/metadata-qa-api

Help with multilinguality

Opened this issue · 5 comments

Hi @pkiraly

this feature is super interesting! But

  • what does it do exactly? I got some idea from your docs, but not sure. It guesses the language of a field value?
  • how does it work?
  • can we get it into the schema language?

Cheers

Hi @mielvds,

it calculates some multilinguality metrics based on the language tags available in JSON or XML, such as

dc:subject: [ "library"@en, "bibliotheek"@nl ]

This is a multilingual field value with two languages.

The API calculates metrics at field level and at record level.

Field level metrics:

  • Number of tagged literals
  • Number of distinct language tags
  • Number of tagged literals per language tag

Record level metrics:

  • Number of tagged literals
  • Number of distinct language tags
  • Number of tagged literals per language tag
  • Average number of languages per property for which there is at least one language-tagged literal
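The field-level metrics above can be sketched as follows. This is a minimal illustration, not the actual metadata-qa-api code: the `TaggedLiteral` record and the method names are placeholders invented for this example.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the field-level multilinguality metrics for a list
// of language-tagged literals such as [ "library"@en, "bibliotheek"@nl ].
public class MultilingualityMetrics {

  // A tagged literal: a value plus an optional language tag (null if untagged).
  public record TaggedLiteral(String value, String languageTag) {}

  // Number of tagged literals: literals that carry a language tag.
  public static long taggedLiterals(List<TaggedLiteral> field) {
    return field.stream().filter(l -> l.languageTag() != null).count();
  }

  // Number of distinct language tags in the field.
  public static long distinctLanguageTags(List<TaggedLiteral> field) {
    return field.stream()
        .map(TaggedLiteral::languageTag)
        .filter(Objects::nonNull)
        .distinct()
        .count();
  }

  // Number of tagged literals per language tag.
  public static Map<String, Long> literalsPerTag(List<TaggedLiteral> field) {
    return field.stream()
        .filter(l -> l.languageTag() != null)
        .collect(Collectors.groupingBy(
            TaggedLiteral::languageTag, Collectors.counting()));
  }

  public static void main(String[] args) {
    // dc:subject: [ "library"@en, "bibliotheek"@nl ]
    List<TaggedLiteral> dcSubject = List.of(
        new TaggedLiteral("library", "en"),
        new TaggedLiteral("bibliotheek", "nl"));
    System.out.println(taggedLiterals(dcSubject));        // 2
    System.out.println(distinctLanguageTags(dcSubject));  // 2
    System.out.println(literalsPerTag(dcSubject));        // counts per tag
  }
}
```

The record-level variants would apply the same counting over all fields of a record instead of a single field.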

Hmm, OK, then I can't use it right now: there are no language tags in our data.

Basically, I'm looking for language detection, because we have mixed-language fields and I want to figure out the distribution.

I experimented with language detection once, and the code contains a dependency on a library in that area. I stopped at some point because language detection usually did not work well on the very short texts typical of metadata records, but we can start playing with it again.

    <dependency>
      <groupId>com.optimaize.languagedetector</groupId>
      <artifactId>language-detector</artifactId>
      <version>0.6</version>
    </dependency>

It seems this library has not been developed since 2016 (https://github.com/optimaize/language-detector).

I've done something similar in Python; I can look into implementing it here as well.

A LanguageDetectionCalculator implementing https://github.com/pkiraly/metadata-qa-api/blob/a104aa3457ff68ffb997615654d77f5f70de7167/src/main/java/de/gwdg/metadataqa/api/interfaces/Calculator.java? It could also be an extension of the current LanguageCalculator, for example new LanguageCalculator(schema, true), where true means languages are detected rather than extracted from tags.
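The proposed calculator could look roughly like this. Everything here is a placeholder: the class name, the toy stopword-based `detect` method (standing in for a real library such as optimaize/language-detector), and the stopword lists are all invented for illustration, and this does not implement the real Calculator interface from the repository.

```java
import java.util.*;

// Hypothetical sketch of the LanguageDetectionCalculator idea: guess a
// language per field value instead of reading it from a language tag.
public class LanguageDetectionSketch {

  // Toy detector: scores languages by counting a few characteristic words.
  // A real implementation would delegate to a language-detection library.
  static final Map<String, Set<String>> STOPWORDS = Map.of(
      "en", Set.of("the", "and", "of", "library"),
      "nl", Set.of("de", "het", "een", "bibliotheek"));

  public static String detect(String text) {
    Map<String, Integer> scores = new HashMap<>();
    for (String token : text.toLowerCase().split("\\W+")) {
      STOPWORDS.forEach((lang, words) -> {
        if (words.contains(token)) scores.merge(lang, 1, Integer::sum);
      });
    }
    return scores.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse("und"); // BCP 47 tag for "undetermined"
  }

  public static void main(String[] args) {
    System.out.println(detect("the national library of the Netherlands")); // en
    System.out.println(detect("de nationale bibliotheek"));                // nl
  }
}
```

With something like this in place, the existing multilinguality metrics could be fed detected languages instead of explicit tags, which matches the new LanguageCalculator(schema, true) idea.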

Sounds promising. You might take a look at this discussion, which gives a comparison of some language-detection libraries: optimaize/language-detector#107. I am not an expert in this, so there might be other relevant libraries.