inukshuk/anystyle

References in Danish (or other languages?)

fnatsger opened this issue · 2 comments

When referencing in Danish, we use "I" instead of "In". This causes the "I" to show up in titles etc. in the exported references, which you then have to manually remove.
A solution might be to include more languages in the model, to add a label that designates these translated markers, or to offer an option to delete them.

The model already includes different languages and we're happy to add more, since the default model aims for versatility. (Obviously, if you know your data set is guaranteed to be monolingual, it may well be beneficial to use a custom model.)

Ideally, we should add a handful of Danish references featuring the "I" to core.xml. If its usage is similar to the "In" of English references, the "I" will typically sit at the start of editor or container-title tags, so adding samples to the training set will help the model use "I" as a good marker for both of these.

And then we should also make the names and title normalizers aware of this fact. Since "I " could easily occur at the start of titles or names in other languages, this is a little more tricky. If you could post some real-world examples, maybe we can come up with some good rules (e.g., in the editors tag we could strip it only in combination with common ways to designate editors; in container titles we could look for similar syntactic patterns).

In general, with the normalizers we don't need to worry too much either way, because they can be tweaked at runtime.

Here is an example:

Harvey, P. (2018). 3. Infrastructures in and out of Time: The Promise of Roads in Contemporary Peru. I N. Anand, A. Gupta, & H. Appel (Red.), Promise of Infrastructure (s. 80–101). Duke University Press.

Here, "N. Anand" is imported as "I.N. Anand".
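
The rule suggested earlier for the editors tag (strip the "I" only when it appears together with a common editor designation) could be sketched roughly as follows. This is plain Ruby and not AnyStyle's normalizer API: the method name `strip_danish_in` and the list of editor designations are hypothetical, and the sketch assumes it is handed the raw text of the editor segment before any name parsing takes place.

```ruby
# Hypothetical helper, not part of AnyStyle.
# Removes a leading Danish "I " from the raw text of an editor segment,
# but only when the segment also carries an editor designation.
def strip_danish_in(editors)
  # Assumed, incomplete list of designations: (Red.), (Ed.), (Eds.), (Hrsg.)
  editor_marker = /\((?:Red|Eds?|Hrsg)\.?\)/i

  # Without an editor designation, leave the text untouched.
  return editors unless editors.match?(editor_marker)

  # Strip "I" only when it is followed by whitespace and an uppercase letter,
  # so an initial such as "I." (followed by a period) is kept.
  editors.sub(/\AI\s+(?=\p{Lu})/, '')
end

puts strip_danish_in('I N. Anand, A. Gupta, & H. Appel (Red.)')
# prints "N. Anand, A. Gupta, & H. Appel (Red.)"
```

A rule like this still cannot tell the Danish "I" apart from a name that genuinely begins with a standalone "I", so it would only complement, not replace, adding Danish samples to the training set.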