OscarPDR/labman_ud

Show related publications

Opened this issue · 6 comments

It would be interesting to list 3-5 related publications, based on the tags.

A simple implementation might not be too difficult or inefficient. Basically, in 2 queries (assuming that we already have the list of tag_ids of the current publication):

from collections import defaultdict

# Query 1: group tag ids by publication identifier
# (values_list yields tuples; values() would yield dicts)
tags_by_publication_id = defaultdict(set)
for publication_id, tag_id in PublicationTag.objects.values_list("publication_id", "tag_id"):
    tags_by_publication_id[publication_id].add(tag_id)

current_tag_ids = set(THE_TAG_IDS_OF_THE_CURRENT_PUBLICATION)

# NOTE: the current publication is in this mapping too, so it will rank
# first unless it is skipped in the loop below.
sorted_publications = []  # (publication_id, value)
for publication_id, set_of_tags in tags_by_publication_id.items():
    same_tags = set_of_tags.intersection(current_tag_ids)
    # This could be something more advanced:
    # a tag that appears twice might (or might not) have a higher value
    # than one that appears 10 times
    value = len(same_tags)
    sorted_publications.append((publication_id, value))

if sorted_publications:
    # Highest number of shared tags first
    sorted_publications.sort(key=lambda pair: pair[1], reverse=True)

    target_publication_ids = [publication_id for publication_id, value in sorted_publications[:5]]

    # Query 2: retrieve data from target_publication_ids
    publications = Publication.objects.filter(id__in=target_publication_ids)
else:
    publications = []

Does anybody think the algorithm should be something more complex (e.g., also counting authors, or assigning different values to the different tags)?
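As one possible answer to the "different values for different tags" question, here is a minimal sketch that gives rare tags more weight than common ones, using log(N / usage count). The weighting scheme, the function names and the toy ids are all assumptions made for illustration; nothing like this exists in the code above.

```python
import math
from collections import Counter


def tag_weights(tags_by_publication_id):
    """Weight each tag by log(N / number of publications that use it)."""
    total = len(tags_by_publication_id)
    usage = Counter(tag for tags in tags_by_publication_id.values() for tag in tags)
    return {tag: math.log(total / float(count)) for tag, count in usage.items()}


def weighted_overlap(current_tag_ids, other_tag_ids, weights):
    """Sum the weights of the tags that both publications share."""
    return sum(weights.get(tag, 0.0) for tag in current_tag_ids & other_tag_ids)


# Toy data (hypothetical ids): tag 1 is on every publication, tag 2 is rarer.
tags_by_publication_id = {10: {1, 2}, 11: {1, 2}, 12: {1, 3}, 13: {1}}
weights = tag_weights(tags_by_publication_id)
current = tags_by_publication_id[10]
scores = {
    pub_id: weighted_overlap(current, tags, weights)
    for pub_id, tags in tags_by_publication_id.items()
    if pub_id != 10  # skip the current publication itself
}
```

With this weighting, a tag that is attached to every publication contributes nothing, so sharing only that tag no longer makes two publications look related.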

We can put the authors and tags in a set and compute the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index) between the papers. It's also easy to implement.
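For reference, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|: the number of shared elements divided by the number of distinct elements overall. A minimal sketch follows; the function name is an assumption, and returning 0.0 for two empty sets is a convention chosen here to avoid a division by zero.

```python
def jaccard_index(a, b):
    """Return |a & b| / |a | b| as a float between 0.0 and 1.0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty: treat them as unrelated (a convention).
        return 0.0
    return len(a & b) / float(len(union))
```

Unlike a plain count of shared tags, this penalises publications that carry many tags, since every extra non-shared tag grows the union.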

If I understand it, the difference is that instead of doing:

     same_tags = set_of_tags.intersection(current_tag_ids)
     # This could be something more advanced:
     # One tag that appears twice might (or might not) have a higher value
     # than one that appears 10 times
     value = len(same_tags)

We do:

     intersection = set_of_tags.intersection(current_tag_ids)
     union = set_of_tags.union(current_tag_ids)
     value = len(intersection) / float(len(union))

Is that right?

Yes, but to also take the authors into account, we can add their IDs to the set (used_set = set(author_ids) | set(tag_ids))
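One caveat when mixing both kinds of ID in a single set: author IDs and tag IDs come from different tables, so author 3 and tag 3 would be treated as the same element. A small sketch of one way around that, prefixing each ID with its kind; the helper name and the toy IDs are assumptions for illustration.

```python
def feature_set(tag_ids, author_ids):
    """Build one set in which tag ids and author ids cannot collide."""
    return {("tag", t) for t in tag_ids} | {("author", a) for a in author_ids}


# Toy publications (hypothetical ids).
pub_a = feature_set(tag_ids=[3, 7], author_ids=[1, 3])
pub_b = feature_set(tag_ids=[3], author_ids=[7])

# Mixing raw ids would report two shared elements (3 and 7) ...
naive_shared = ({3, 7} | {1, 3}) & ({3} | {7})
# ... whereas only tag 3 is really shared.
shared = pub_a & pub_b
value = len(shared) / float(len(pub_a | pub_b))
```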

I'm thinking that I might implement several variants and expose them through query parameters (e.g., publications/<publication_slug>/?related=withauthors&related_method=jaccard), and not show related papers by default (only when those parameters are given). Then we can evaluate all this with real publications and see which options work better, in order to tune it.
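A rough sketch of how those query parameters could select a scoring method. Only the values withauthors and jaccard come from the example URL; the "overlap" method name, the fallback to it, and the function names are assumptions. The params argument is any dict-like object with a get method, such as request.GET in a Django view.

```python
def overlap_count(a, b):
    """Number of shared elements."""
    return float(len(a & b))


def jaccard(a, b):
    """Shared elements divided by distinct elements overall."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


# Hypothetical registry of the available scoring methods.
RELATED_METHODS = {"overlap": overlap_count, "jaccard": jaccard}


def parse_related_options(params):
    """Return the chosen options, or None when related papers are not requested."""
    related = params.get("related")
    if related is None:
        return None  # nothing is shown unless the parameter is present
    method_name = params.get("related_method", "overlap")
    if method_name not in RELATED_METHODS:
        method_name = "overlap"  # unknown values fall back to the simple count
    return {
        "with_authors": related == "withauthors",
        "method": method_name,
        "score": RELATED_METHODS[method_name],
    }
```

Keeping the methods in a dictionary means a new variant can be compared against the others by adding one entry, without touching the view logic.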

We can do something similar to what we did for the related persons in #78.