Show related publications

Question

Opened this issue 10 years ago · 6 comments

It would be interesting to list 3-5 related publications, based on the tags.

A simple implementation should be neither difficult nor inefficient. It takes two queries (assuming that we already have the list of tag_ids of the current publication):

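from collections import defaultdict

# Group tags by publication identifier.
# values_list() yields (publication_id, tag_id) tuples; values() would yield dicts.
tags_by_publication_id = defaultdict(set)
for publication_id, tag_id in PublicationTag.objects.values_list("publication_id", "tag_id"):
    tags_by_publication_id[publication_id].add(tag_id)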

current_tag_ids = set(THE_TAG_IDS_OF_THE_CURRENT_PUBLICATION)

sorted_publications = []  # (publication_id, value)
for publication_id, set_of_tags in tags_by_publication_id.items():
    same_tags = set_of_tags.intersection(current_tag_ids)
    # This could be something more advanced:
    # a tag that appears twice might (or might not) have a higher value
    # than one that appears 10 times
    value = len(same_tags)
    sorted_publications.append((publication_id, value))

if sorted_publications:
    # Highest number of shared tags first
    sorted_publications.sort(key=lambda item: item[1], reverse=True)

    target_publication_ids = [publication_id for publication_id, value in sorted_publications[:5]]

    # Retrieve data from target_publication_ids
    publications = Publication.objects.filter(id__in=target_publication_ids)
else:
    publications = []
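
The same ranking can be tried without a database. This is a minimal sketch, assuming the tag sets are already loaded into a plain dict; `related_by_shared_tags` and its parameters are hypothetical names, not part of the project. Unlike the snippet above, it skips the current publication (which would otherwise match itself with the highest score) and drops publications that share no tag.

```python
import heapq


def related_by_shared_tags(tags_by_publication_id, current_id, limit=5):
    """Return up to `limit` publication ids ranked by number of shared tags."""
    current_tag_ids = tags_by_publication_id.get(current_id, set())
    scores = []
    for publication_id, tag_ids in tags_by_publication_id.items():
        if publication_id == current_id:
            continue  # do not recommend the publication to itself
        shared = len(tag_ids & current_tag_ids)
        if shared:
            scores.append((shared, publication_id))
    # nlargest avoids sorting the whole list when only a few items are needed
    best = heapq.nlargest(limit, scores)
    return [publication_id for _, publication_id in best]


# Hypothetical data: publication id -> set of tag ids
example = {
    1: {10, 11, 12},
    2: {10, 11},
    3: {12},
    4: {99},
}
print(related_by_shared_tags(example, current_id=1))  # [2, 3]
```

The returned ids could then be passed to `Publication.objects.filter(id__in=...)` as in the snippet above.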

Answer 1 · 2014-04-16T07:34:26.000Z

Does anybody think the algorithm should be something more complex (e.g., also counting authors, or assigning different values to different tags)?
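
One possible way to assign different values to different tags is to weight rare tags more than common ones. This is a minimal sketch, assuming an inverse-document-frequency style weight; the weighting is only an illustration and was not decided in this issue, and `tag_weights` and `weighted_overlap` are hypothetical names.

```python
import math
from collections import Counter


def tag_weights(tags_by_publication_id):
    """Weight each tag by log(N / n_tag), so rare tags score higher."""
    total = len(tags_by_publication_id)
    counts = Counter(
        tag_id
        for tag_ids in tags_by_publication_id.values()
        for tag_id in tag_ids
    )
    return {tag_id: math.log(total / count) for tag_id, count in counts.items()}


def weighted_overlap(tags_a, tags_b, weights):
    """Sum the weights of the tags shared by both publications."""
    return sum(weights.get(tag_id, 0.0) for tag_id in tags_a & tags_b)


# Hypothetical data: publication id -> set of tag ids
example = {1: {10, 11}, 2: {10, 11}, 3: {10}, 4: {10, 12}}
weights = tag_weights(example)
# Tag 10 is on every publication, so its weight is 0 and only tag 11 counts
print(weighted_overlap(example[1], example[2], weights))
print(weighted_overlap(example[1], example[3], weights))
```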

Answer 2 · 2014-04-16T07:41:22.000Z

We can put the authors and tags in a set and compute the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index) between the papers. It's also easy to implement.
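
For reference, the Jaccard index of two sets A and B is |A ∩ B| / |A ∪ B|, a value between 0 and 1. A minimal sketch follows; the function name `jaccard` and the choice to return 0.0 for two empty sets are assumptions.

```python
def jaccard(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared out of 4 distinct -> 0.5
print(jaccard({1, 2}, {1, 2}))        # identical sets -> 1.0
print(jaccard({1}, {2}))              # nothing shared -> 0.0
```

On Python 2, `/` between two integers truncates, so the division would need `float(...)` or `from __future__ import division`.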

Answer 3 · 2014-04-16T08:08:21.000Z

If I understand it, the difference is that instead of doing:

     same_tags = set_of_tags.intersection(current_tag_ids)
     # This could be something more advanced:
     # One tag that appears twice might (or might not) have a higher value
     # than one that appears 10 times
     value = len(same_tags)

We do:

     intersection = set_of_tags.intersection(current_tag_ids)
     union = set_of_tags.union(current_tag_ids)
     value = len(intersection) / float(len(union))

Is that right?

Answer 4 · 2014-04-16T09:13:19.000Z

Yes, but to also take the authors into account we can add their IDs to the set (used_set = author_ids | tag_ids)
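
One caveat when mixing both kinds of IDs in one set: an author and a tag can share the same numeric ID. This is a minimal sketch that labels each ID with its kind before computing the Jaccard index; `feature_set` and the `"author"` / `"tag"` labels are hypothetical.

```python
def feature_set(author_ids, tag_ids):
    """Merge author and tag ids into one set, keeping the two kinds apart."""
    return {("author", a) for a in author_ids} | {("tag", t) for t in tag_ids}


def jaccard(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Author 7 and tag 7 are different things and must not count as a match
paper_a = feature_set(author_ids={7}, tag_ids={1, 2})
paper_b = feature_set(author_ids={3}, tag_ids={7, 2})
print(jaccard(paper_a, paper_b))  # only ("tag", 2) is shared, out of 5 items
```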

Answer 5 · 2014-04-16T10:05:32.000Z

I'm thinking that maybe I'll implement several variants and expose them as query parameters (e.g., publications/<publication_slug>/?related=withauthors&related_method=jaccard), and not show related papers by default (only when those parameters are given). Then we can evaluate the variants on real publications and see which options work better before tuning.
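
A minimal sketch of how such options could be dispatched, assuming the parameter names from the example URL; the `METHODS` table, the `overlap` method name, and `pick_method` are hypothetical. It returns None when no known `related_method` is given, so related papers stay hidden by default.

```python
from urllib.parse import parse_qs


def overlap(set_a, set_b):
    """Number of shared items."""
    return len(set_a & set_b)


def jaccard(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical registry of the similarity methods under evaluation
METHODS = {"overlap": overlap, "jaccard": jaccard}


def pick_method(query_string):
    """Return (similarity function, use_authors), or None if not requested."""
    params = parse_qs(query_string)
    name = params.get("related_method", [None])[0]
    if name not in METHODS:
        return None
    use_authors = params.get("related", [""])[0] == "withauthors"
    return METHODS[name], use_authors


print(pick_method("related=withauthors&related_method=jaccard"))
print(pick_method(""))  # None: nothing requested, so nothing is shown
```

In a Django view the same values would come from `request.GET` rather than from a raw query string.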

Answer 6 · 2014-06-20T16:31:21.000Z

We can do something similar to what we did for the related persons in #78.