yuzhimanhua/MATCH

Using another label hierarchy as metadata to predict other labels

I want to use MATCH to do multi-label text classification on scientific papers using a hierarchical biomimicry label taxonomy I have. Is there a way to use the MeSH labels and MAG fields of study as metadata to improve predictions?

Our metadata refers to author, venue, and reference information of each paper. Do you have those fields in your own collection?

Following your question, one possibility is to use our trained classifier to predict MAG or MeSH labels for your own papers and treat those predicted labels as metadata. However, I'm not sure whether this would improve performance.

Yes, our collection will include author, venue, and reference information for each paper. Many of the papers in our labelled data set will likely already have MAG and MeSH labels, and I would like to use those because I believe they may improve performance. For example, one label in our biomimicry taxonomy is "protect from temperature". If a paper has the MAG label "thermal resistance", that should increase the classifier's confidence in assigning that label.

So would the following change work for adding MAG and MeSH labels as metadata?

transform_data.py

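# Prefix each MAG/MeSH ID so that label tokens stay distinct from ordinary words.
mag = ' '.join(['MAG_'+x for x in data['mag']])
mesh = ' '.join(['MESH_'+x for x in data['mesh']])
# venue, author, and reference are built earlier in the script, as before.
text = mag + ' ' + mesh + ' ' + venue + ' ' + author + ' ' + reference + ' ' + data['text']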

BIOM.json

{
  "paper": "020-134-448-948-932",
  "mag": [
    "102602991", "311688"
  ],
  "mesh": [
    "D048429", "D000431"
  ],
  "venue": "Current biology",
  "author": [
    "2305659199", "2275630009", "2294310593", "1706693917", "2152058803"
  ],
  "reference": [
    "020-720-960-216-820", "052-873-952-181-099", "000-849-951-902-070"
  ],
  "text": "microtubule assembly dynamics at the nanoscale background the labile nature of microtubules is critical for establishing cellular morphology and motility yet the molecular basis of assembly remains unclear here we use optical tweezers to track microtubule polymerization against microfabricated barriers permitting unprecedented spatial resolution",
  "label": [
    "change_size_or_color", "move", "physically_assemble/disassemble", "maintain_ecological_community"
  ]
}
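As a minimal sketch of how one such record could be flattened into a single input line: the `MAG_` and `MESH_` prefixes follow the snippet above, while the `VENUE_`, `AUTHOR_`, and `REFP_` prefixes, the helper name `record_to_text`, and the use of `.get()` for papers that lack MAG or MeSH labels are assumptions made for illustration, not code taken from the repository. The record is an abbreviated copy of the one above.

```python
import json

def record_to_text(data):
    """Flatten one paper record into a whitespace-separated line (hypothetical helper)."""
    # .get() with an empty default covers papers without MAG or MeSH labels.
    mag = ['MAG_' + x for x in data.get('mag', [])]
    mesh = ['MESH_' + x for x in data.get('mesh', [])]
    # Assumed prefixes for the existing metadata fields.
    venue = ['VENUE_' + data['venue'].replace(' ', '_')] if data.get('venue') else []
    author = ['AUTHOR_' + x for x in data.get('author', [])]
    reference = ['REFP_' + x for x in data.get('reference', [])]
    # Metadata tokens come first, followed by the paper text.
    return ' '.join(mag + mesh + venue + author + reference + [data['text']])

# Abbreviated version of the BIOM.json record shown above.
record = json.loads('''{
  "paper": "020-134-448-948-932",
  "mag": ["102602991", "311688"],
  "mesh": ["D048429", "D000431"],
  "venue": "Current biology",
  "author": ["2305659199"],
  "reference": ["020-720-960-216-820"],
  "text": "microtubule assembly dynamics at the nanoscale"
}''')
line = record_to_text(record)
print(line)
```

Because every field is optional except `text`, a paper with no MAG or MeSH labels simply yields a line without those tokens.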

Yes, this should work.

If you would also like to include the MAG and MeSH labels in the metadata-aware embedding pre-training, you may need to modify joint/Preprocess.py and joint/run.sh slightly as well. The pre-training step, however, does not change the performance significantly (according to Figure 4 in our paper), so you can skip that change.

We will consider modifying the code so that users can define the metadata fields themselves. Thank you for raising this issue.
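One way user-defined metadata fields could look is a mapping from JSON field name to token prefix, roughly as sketched here. The `METADATA_PREFIXES` mapping, the `build_text` helper, and all prefix strings other than `MAG_` and `MESH_` are hypothetical and not part of the current code.

```python
# Hypothetical configuration: JSON field name -> token prefix, in output order.
METADATA_PREFIXES = {
    'mag': 'MAG_',
    'mesh': 'MESH_',
    'venue': 'VENUE_',
    'author': 'AUTHOR_',
    'reference': 'REFP_',
}

def build_text(data, prefixes=METADATA_PREFIXES):
    """Prepend prefixed metadata tokens for every configured field to the text."""
    tokens = []
    for field, prefix in prefixes.items():
        value = data.get(field, [])
        # A field may hold one string (venue) or a list of IDs (author).
        values = [value] if isinstance(value, str) else value
        # Spaces inside a value would split it into several tokens, so replace them.
        tokens.extend(prefix + v.replace(' ', '_') for v in values)
    return ' '.join(tokens + [data['text']])

paper = {'mesh': ['D048429'], 'venue': 'Current biology', 'text': 'microtubule assembly'}
full = build_text(paper)
# Restricting the mapping drops the other fields without touching the function.
mesh_only = build_text(paper, {'mesh': 'MESH_'})
print(full)
print(mesh_only)
```

With a mapping like this, adding or removing a metadata field is a one-line configuration change rather than an edit to the concatenation logic.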