inukshuk/anystyle

Better documentation on creating training sets

samuelhaysom opened this issue · 14 comments

Sorry if this isn't the best place to put this but I am a bit confused about how the parser goes from the example training data to being able to extract "type" fields (e.g. article-journal, book, chapter, thesis). Currently the default model is quite accurately getting this info from the citations I need to parse (Google Scholar MLA citations) but I would like to improve it (currently it is failing to extract "journal" and "type" records for some bioRxiv papers).

However, when I look at the provided example training sets (e.g. core.xml), I cannot find anywhere where the type is explicitly specified in training the model (can't see any <type> records), so I am confused about how retraining the model would improve it for my use case.

I think part of my confusion is not really understanding how anystyle uses the training data to build the model it parses with. Would it be possible to add some more info to the documentation to make this clearer, particularly as to how the parser can extract information that it doesn't seem to have been trained to recognise?

"type" is not directly parsed from the training data, since there is usually nowhere in a citation that says "this is a journal article".

Rather it is inferred once the parsing is done based on the presence and content of fields that are found. The rules can be found here: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/normalizer/type.rb

They are fairly simple and in my experience effective. It should help explain why your particular data are being assigned the wrong type, but feel free to post an example, it might help improve the assignment. If your example is failing to parse "journal" then that will explain why it is failing to identify it as a journal article since the logic is "if in a journal, it's a journal article".

To your point on the documentation: it is up to @inukshuk, but my guess is that we would want to keep the type normaliser internal and fix bugs where the type assignment is wrong.

If you know Ruby, it isn't very difficult at all to subclass the parser and replace the type normaliser, but that is probably something really only for users who have specific needs (e.g. a new document type) and are willing to get into the internals.

Ok, thanks for the info. This would be a good thing to point towards in the documentation. I don't think you would need to explain what the under-the-hood rules are for coming up with the type, but I think it would definitely be helpful to have a section explaining what each output field ("title", "journal", "authors" etc.) is and whether it comes from the citation or from interpretation of the results found for that citation.

Some examples of citations not being parsed correctly are:

Kamrad, Stephan, et al. "A natural variant of the sole pyruvate kinase of fission yeast lowers glycolytic flux triggering increased respiration and oxidative-stress resistance but decreased growth." bioRxiv (2019): 770768.

Caydasi, Ayse Koca, et al. "SWR1 Chromatin Remodeling Complex Prevents Mitotic Slippage during Spindle Position Checkpoint Arrest." bioRxiv (2019): 749440.

Looking at the resulting .json from parsing these, it looks like the parser is failing to get the fields needed to determine the type, which explains why no type is being returned.

[
  {
    "author": [
      {
        "family": "Kamrad",
        "given": "Stephan"
      },
      {
        "others": true
      }
    ],
    "title": [
      "A natural variant of the sole pyruvate kinase of fission yeast lowers glycolytic flux triggering increased respiration and oxidative-stress resistance but decreased growth"
    ],
    "note": [
      "bioRxiv (2019): 770768."
    ],
    "type": null
  },
  {
    "author": [
      {
        "family": "Caydasi",
        "given": "Ayse Koca"
      },
      {
        "others": true
      }
    ],
    "title": [
      "SWR1 Chromatin Remodeling Complex Prevents Mitotic Slippage during Spindle Position Checkpoint Arrest"
    ],
    "note": [
      "bioRxiv (2019): 749440."
    ],
    "type": null
  }
]

In order to train a new model which wouldn't fail for these references, would you recommend just training a new parser using the existing training data along with some examples like these?

Thank you for the failing examples. Yes, if the string "bioRxiv" were correctly labelled as the journal (and not, as here, as a note) in the first place, the problem with type assignment would resolve itself.

I would try training a model with the base set plus a few marked-up examples of bioRxiv references showing that it should be understood as a journal. If, using that model, you still find your set is being mis-parsed, please come back.
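For illustration, a marked-up bioRxiv reference could be generated along these lines. This is only a sketch: it assumes the tag names follow those used in core.xml (author, title, journal, date, doi), the `sequence_xml` and `xml_escape` helpers are hypothetical, and tagging the trailing number as `doi` is an assumption that may need adjusting.

```ruby
# Escape the characters that are special in XML text nodes.
def xml_escape(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Build one <sequence> element from [label, text] pairs (hypothetical helper).
def sequence_xml(segments)
  body = segments.map do |label, text|
    "    <#{label}>#{xml_escape(text)}</#{label}>"
  end
  "  <sequence>\n#{body.join("\n")}\n  </sequence>"
end

# One of the failing references, segmented by hand.
# The label for the trailing number is an assumption.
biorxiv = [
  ['author', 'Caydasi, Ayse Koca, et al.'],
  ['title', '"SWR1 Chromatin Remodeling Complex Prevents Mitotic Slippage ' \
            'during Spindle Position Checkpoint Arrest."'],
  ['journal', 'bioRxiv'],
  ['date', '(2019):'],
  ['doi', '749440.']
]

dataset = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" \
          "<dataset>\n#{sequence_xml(biorxiv)}\n</dataset>\n"
puts dataset
```

A handful of sequences like this, appended to a copy of the base set, would be the "few marked-up examples" meant above.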

As a hint to how the parser works and why the standard model doesn't work here: it considers the location of capital letters as a feature. Almost all journal identifiers start with caps (whether abbreviations, short forms or full names: BJS, Br. J. Soc., British Journal of Sociology). So there is a strong bias against identifying a string that starts with a lower-case letter as a journal name.
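To make the capitalisation point concrete, here is a rough sketch of what such a feature could look like. It is not AnyStyle's actual feature code; the `caps_feature` function and its category names are made up for illustration.

```ruby
# Classify a token by the case pattern of its letters (hypothetical feature).
def caps_feature(token)
  letters = token.scan(/\p{L}/)
  return :none if letters.empty?
  return :upper if letters.all? { |c| c.match?(/\p{Lu}/) }
  return :lower if letters.all? { |c| c.match?(/\p{Ll}/) }
  return :initial if letters.first.match?(/\p{Lu}/)

  :mixed
end

%w[BJS Br. British bioRxiv].each do |token|
  puts "#{token}: #{caps_feature(token)}"
end
```

Under this toy feature, "BJS" is `upper`, "Br." and "British" are `initial`, while "bioRxiv" falls into `mixed`, a pattern a model would rarely have seen on a journal token.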

We can consider adding the examples to the base set and revising the docs as you suggest.

PS - can you give us a hint how the final number should be interpreted here (e.g. volume, issue, page)?

Caydasi, Ayse Koca, et al. "SWR1 Chromatin Remodeling Complex Prevents Mitotic Slippage during Spindle Position Checkpoint Arrest." bioRxiv (2019): 749440.

Thanks for your helpful comments, I think I understand how everything works a lot better now. I will try retraining the parser as you suggest and see if that improves my results.

Looking at the bioRxiv pages for the two examples it looks like the final number is the final section of the DOI identifier for each article. Looking at the specification for MLA citations I think this is playing the role of a Location (https://www.bibme.org/mla) although I'm not sure why it isn't the full doi in that case.

Kamrad, Stephan, et al. "A natural variant of the sole pyruvate kinase of fission yeast lowers glycolytic flux triggering increased respiration and oxidative-stress resistance but decreased growth." bioRxiv (2019): 770768.

doi: https://doi.org/10.1101/770768

Caydasi, Ayse Koca, et al. "SWR1 Chromatin Remodeling Complex Prevents Mitotic Slippage during Spindle Position Checkpoint Arrest." bioRxiv (2019): 749440.

doi: https://doi.org/10.1101/749440

That's interesting! I'd definitely train this using the doi label. Do you know if it is common practice to omit the DOI prefix? In that case it would be good to add a few unmarked and prefix-less DOIs to the core training set; eventually a journal-or-publisher/DOI-prefix dictionary could be added to the normalizer to automatically add the missing prefix if it can be resolved.

But I can't believe that this is sound practice. I'm sure journals have switched publishers/owners in the past so I would not be surprised if DOI prefixes can change.
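The dictionary idea could look roughly like this. A sketch only: the `DOI_PREFIXES` table and the `complete_doi` helper are hypothetical and not part of AnyStyle's normalizers, and the single entry is taken from the two bioRxiv DOIs quoted above.

```ruby
# Hypothetical journal-to-DOI-prefix table; 10.1101 is the prefix seen in
# the two bioRxiv DOIs quoted in this thread.
DOI_PREFIXES = { 'biorxiv' => '10.1101' }.freeze

# Prepend the prefix when the parsed value is a bare suffix and the journal
# is known; otherwise return the value with only the trailing period removed.
def complete_doi(journal, doi)
  suffix = doi.to_s.strip.sub(/\.\z/, '')
  return suffix if suffix.empty? || suffix.start_with?('10.')

  prefix = DOI_PREFIXES[journal.to_s.strip.downcase]
  prefix ? "#{prefix}/#{suffix}" : suffix
end

puts complete_doi('bioRxiv', '770768.') # prints 10.1101/770768
```

If prefixes do change over time, a static table like this would go stale, which is the concern raised above.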

I have no idea. To be honest, I'm not that familiar with the MLA citation style, but I am using it because I found that, of the citation styles I could get out of Google Scholar, anystyle was best at parsing MLA out of the box. This might be something unique to how Google Scholar extracts citations for its records, which I imagine is done by some sort of sophisticated web scraper. I guess that scraper could be making a mistake in the case of bioRxiv pages, but I'm not sure.

I could have a look through some of the other citations I've got from other journals and see if they also have a DOI Prefix if you want to see if it is a larger problem or just for bioRxiv.

The dictionary idea has made me think it might be good to have a list somewhere of which journals are known to be preprint servers. Then you could have a new type to distinguish between preprint articles and journal articles (as preprint articles and journal articles differ based on peer review).

Yes, we should consider either adding a type for preprints or adding eprint fields. Judging by the use of archivePrefix there I suppose it is a common practice and that at least these eprint archives do have stable DOI prefixes.

In any case, we should definitely add something like this, but we'll have to figure out what the best practices are to make sure further processing with Zotero or BibTeX tools works smoothly.

I've found another issue with classification of Google Scholar MLA citations for dissertations. Most of these have the form:

Romila, Catalina-Andreea. High-Throughput Chronological Lifespan Screening of the Fission Yeast Deletion Library Using Barcode Sequencing. Diss. UCL (University College London), 2019.

Ledesma Fernandez, M. E. Functional Genomics Characterisation of the Genetic Pathways that Control Kinetochore Function. Diss. (UCL) University College London, 2016.

These are being variously classified as chapter, book, thesis and article-journal.

Looking at lib/anystyle/normalizer/type.rb, I think if the Diss. bit is segmented as a note, the paper should be getting assigned a thesis "type" but this doesn't always seem to be happening. Can you confirm whether I've got that right and if so any potential reasons why the classification wouldn't be very accurate?

Are you parsing this on anystyle.io or using the default model? Just asking because if you're aiming for consistent results it's best to use the default model (or your own) because the public model on the website is obviously subject to change and not very consistent.

As for the default model (and in general) I'd label the Diss. as genre -- this would be consistent with the conventions used in our core training set. For an example, see this sequence here. Labeling this as note should also work as far as the type classifier is concerned, but I think using genre for dissertations and theses is more indicative since note is used for a range of other stuff that's difficult to generalize.

If references similar to those above get classified as chapters they must have a container-title; they should not be assigned book if either note or genre is present. In any case, the classification is purely deterministic given a certain set of labels. It could definitely be improved, but for those references above I think it should be easy to get type thesis consistently if the segmentation is sufficiently good.
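As a rough illustration of what "purely deterministic" means here, the rules mentioned in this thread can be written as a small function. This is a simplified sketch, not the code in lib/anystyle/normalizer/type.rb; the `guess_type` name, the rule order, the thesis pattern and the use of `publisher` as the trigger for book are all assumptions.

```ruby
# Guess a type from the labels found in a parsed reference.
# Hypothetical and simplified; the real rules live in
# lib/anystyle/normalizer/type.rb and may differ in order and detail.
def guess_type(item)
  hints = [item[:genre], item[:note]].flatten.compact.join(' ')

  # "If in a journal, it's a journal article."
  return 'article-journal' if item.key?(:journal)
  # A genre or note such as "Diss." points to a thesis (assumed pattern).
  return 'thesis' if hints.match?(/\b(diss|thesis)/i)
  # Chapters must have a container-title.
  return 'chapter' if item.key?(:'container-title')
  # Never book when a note or genre is present (publisher is an assumed trigger).
  return 'book' if item.key?(:publisher) && hints.empty?

  nil
end

# The bioRxiv result from earlier in the thread: no journal, only a note.
p guess_type(title: ['SWR1 ...'], note: ['bioRxiv (2019): 749440.'])
# The same reference once "bioRxiv" is segmented as the journal.
p guess_type(title: ['SWR1 ...'], journal: ['bioRxiv'])
# A dissertation with "Diss." labelled as genre.
p guess_type(title: ['High-Throughput ...'], genre: ['Diss.'])
```

The same labels always give the same type, so inconsistent types point to inconsistent segmentation rather than to the classifier.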

Hi @inukshuk @samuelhaysom @a-fent I would appreciate help, because I've read the documentation and dozens of comments, but I haven't figured out how to train my own model.
I believed that I had to provide each reference both in its original plain-text form (as input to parse) and in XML form with the labels (as I want to get in the output after parsing). However, I cannot find where to put the plain-text file data.txt in the command anystyle train data.xml my-model.mod

You don't need to provide the original references for training: the original input can be re-constructed from the XML.
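Each labelled sequence already contains the full reference text, just with tags around every segment, so dropping the tags gives back the input. A minimal sketch, assuming the core.xml-style layout (a `dataset` of `sequence` elements); the `plain_references` helper is hypothetical and uses a regex rather than a real XML parser.

```ruby
# Recover plain-text references from labelled training XML by stripping
# the segment tags (hypothetical helper; regex-based for brevity).
def plain_references(xml)
  xml.scan(%r{<sequence>(.*?)</sequence>}m).map do |(body)|
    text = body.gsub(/<[^>]+>/, ' ')
    # Undo the basic XML escapes; &amp; goes last to avoid double unescaping.
    text = text.gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;', '&')
    text.split.join(' ')
  end
end

xml = <<~XML
  <dataset>
    <sequence>
      <author>Caydasi, Ayse Koca, et al.</author>
      <title>"SWR1 Chromatin Remodeling Complex Prevents Mitotic Slippage during Spindle Position Checkpoint Arrest."</title>
      <journal>bioRxiv</journal>
      <date>(2019):</date>
      <doi>749440.</doi>
    </sequence>
  </dataset>
XML

puts plain_references(xml)
```

So the only file to pass to the train command is the labelled XML.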

Hello
I'd appreciate your help
Where should I put my custom mappings in the dictionary file?
My main goal is parsing references in Arabic. I trained my model using the anystyle train command and it's working as expected. However, there are some things that could be solved using a dictionary. I understood the concept, but the documentation is a bit high level and I didn't understand where I should put the dictionary. Is there an existing one somewhere that I should update?

Thanks in advance