ahmia/ahmia-site

Adjust results ordering based on popularity (backlinks)

chamalis opened this issue · 2 comments

Include a popularity metric that has some influence on the ordering of results.

This will use the backlinks from onion addresses. If link farms turn out to be inflating the scores, we could add a detection mechanism to reduce the influence of those spam websites.

We have to find an appropriate formula to combine that rating with the Elasticsearch ordering score that is already applied.

Then compare against the current results to find out whether the ordering has actually improved.

I have implemented an efficient method for page popularity (PagePop), using sparse matrices and based on this function. It produces a normalized probability vector, say v, where v(i) is the probability that a person randomly clicking on links will arrive at website i. Thus the i that maximizes v(i) is the most popular website.

This is the most well-known approach to popularity metrics, where multiple outbound links from one page to another page are treated as a single link.
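To make the idea concrete, here is a minimal, dependency-free sketch of this kind of random-surfer score. It is not the implementation mentioned above: the function name `page_pop`, the damping factor of 0.85, the handling of dangling nodes, and the choice to ignore self-links are all assumptions made for illustration. Links are stored as sets of targets per source, which is sparse in the sense that only existing links are kept, and it also collapses repeated links between the same pair into one.

```python
def page_pop(links, damping=0.85, tol=1e-10, max_iter=200):
    """Return {node: probability} for an iterable of (source, target) links.

    Hypothetical helper: damping, tol and max_iter are illustrative defaults.
    """
    out = {}
    nodes = set()
    for src, dst in links:
        nodes.update((src, dst))
        if src != dst:  # assumption: self-links carry no popularity
            out.setdefault(src, set()).add(dst)  # the set collapses duplicates

    nodes = sorted(nodes)
    n = len(nodes)
    if n == 0:
        return {}

    v = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(max_iter):
        # Nodes without outbound links spread their mass uniformly.
        dangling = sum(v[node] for node in nodes if node not in out)
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {node: base for node in nodes}
        for src, targets in out.items():
            share = damping * v[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        delta = sum(abs(nxt[node] - v[node]) for node in nodes)
        v = nxt
        if delta < tol:
            break
    return v


links = [
    ("a.onion", "c.onion"),
    ("a.onion", "c.onion"),  # repeated link, counted once
    ("b.onion", "c.onion"),
    ("c.onion", "a.onion"),
]
scores = page_pop(links)
most_popular = max(scores, key=scores.get)
```

The returned values sum to 1, so they can be read as the probability vector v described above, and `most_popular` is its argmax.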

I plan to run PagePop on the whole index once per day, as well as on each query's results (only a few pages, so it shouldn't add significant overhead). Those scores should then be combined with the 'score' attribute of each document (hit) returned by ES's search() query. A lot of testing is needed to tune that scoring formula (function) appropriately.
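One possible shape for that combination, purely as a starting point for the testing: min-max normalize both scores within the result set and take a weighted sum. Everything here is an assumption rather than a settled formula, including the function name `rerank`, the `weight` of 0.3, and the `domain` field inside `_source`. The hits are plain dicts shaped like the entries of an Elasticsearch response's `hits.hits` list, where the relevance score is stored under `_score`.

```python
def rerank(hits, popularity, weight=0.3):
    """Reorder hits by a blend of text relevance and popularity.

    hits:       list of dicts with "_score" and "_source" -> {"domain": ...}
                ("domain" is a hypothetical field name)
    popularity: dict mapping domain -> popularity score
    weight:     share of the final score given to popularity (0..1)
    """
    def normalise(values):
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]  # no spread, so no influence
        return [(x - lo) / (hi - lo) for x in values]

    if not hits:
        return []

    text = normalise([h["_score"] for h in hits])
    pop = normalise([popularity.get(h["_source"]["domain"], 0.0) for h in hits])
    combined = [(1.0 - weight) * t + weight * p for t, p in zip(text, pop)]

    order = sorted(range(len(hits)), key=lambda i: combined[i], reverse=True)
    return [hits[i] for i in order]


hits = [
    {"_score": 2.0, "_source": {"domain": "a.onion"}},
    {"_score": 1.9, "_source": {"domain": "c.onion"}},
    {"_score": 1.0, "_source": {"domain": "b.onion"}},
]
popularity = {"a.onion": 0.10, "c.onion": 0.60, "b.onion": 0.05}

reranked = [h["_source"]["domain"] for h in rerank(hits, popularity)]
unchanged = [h["_source"]["domain"] for h in rerank(hits, popularity, weight=0.0)]
```

With `weight=0.0` the original Elasticsearch order is preserved, which gives a baseline to compare against while tuning the weight.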

If there are any thoughts on this, please let me know, since it's still R&D :)

Current results are not great. To improve them, I would suggest moving in these directions:

  • Try to count backlinks from the www (clearnet) as well. That needs research; I am not sure it is feasible (e.g. through DuckDuckGo).
  • Change the algorithm to compute a score for each page, instead of for each domain.
  • Calculate (at runtime) a third score with lower influence: the PagePopScore of each result, considering only the backlinks that come from the other result pages.
  • Detect links between mirror webpages (I can't think of an efficient way to do that), as well as search engine spam links (hidden links), and ignore them.
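For the third direction, the first step is restricting the link graph to the pages in the result set. A rough sketch follows, with a plain backlink count standing in for the PagePopScore; the function name `result_backlinks` and the shape of the inputs are assumptions for illustration.

```python
from collections import Counter


def result_backlinks(links, result_urls):
    """Count backlinks each result receives from the other results only.

    links:       iterable of (source_url, target_url) pairs
    result_urls: list of URLs returned for the current query
    """
    wanted = set(result_urls)
    # Keep only links whose both ends are results; the set drops repeated links.
    edges = {
        (src, dst)
        for src, dst in links
        if src in wanted and dst in wanted and src != dst
    }
    counts = Counter(dst for _, dst in edges)
    return {url: counts.get(url, 0) for url in result_urls}


links = [("a", "b"), ("a", "b"), ("c", "b"), ("x", "b"), ("b", "a")]
local_counts = result_backlinks(links, ["a", "b", "c"])
```

The filtered `edges` set is also what a full popularity computation would take as input if the count proves too crude.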