Extraction of page summaries
Closed this issue · 0 comments
dnmilne commented
We can deal with all page links, category links, redirects, etc. without large memory requirements by treating extraction as a graph resolution problem. In each map step, handle one node in the graph (a page) and emit that node again, plus any information that needs to be communicated to adjacent nodes. Then, in each reduce step, collapse all of the partial information about each node into a complete picture of that node.
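
A rough sketch of one such map/reduce round, simulated in memory in plain Python rather than on a real cluster. Everything named here (`PAGES`, `map_page`, `reduce_page`, `run_round`, the record fields and message kinds) is made up for illustration and is not taken from the project's code. It only handles page links and redirects, and assumes each page record already lists its outgoing links and redirect target.

```python
from collections import defaultdict

# Hypothetical toy input: page title -> what was parsed from that page alone.
PAGES = {
    "Cat": {"links": ["Mammal"], "redirect_to": None},
    "Kitty": {"links": [], "redirect_to": "Cat"},
    "Mammal": {"links": ["Cat"], "redirect_to": None},
}


def map_page(title, page):
    """Handle one node: re-emit it, plus messages keyed by adjacent nodes."""
    # Re-emit the node itself so the reducer still sees its own data.
    yield title, ("node", page)
    # Tell the redirect target that this page redirects to it.
    if page["redirect_to"] is not None:
        yield page["redirect_to"], ("redirect_in", title)
    # Tell each link target that this page links to it.
    for target in page["links"]:
        yield target, ("link_in", title)


def reduce_page(title, values):
    """Collapse all partial information about one node into one summary."""
    summary = {
        "title": title,
        "exists": False,  # stays False if only messages arrived (dangling link)
        "redirect_to": None,
        "links_out": [],
        "links_in": [],
        "redirects_in": [],
    }
    for kind, payload in values:
        if kind == "node":
            summary["exists"] = True
            summary["redirect_to"] = payload["redirect_to"]
            summary["links_out"] = sorted(payload["links"])
        elif kind == "link_in":
            summary["links_in"].append(payload)
        elif kind == "redirect_in":
            summary["redirects_in"].append(payload)
    summary["links_in"].sort()
    summary["redirects_in"].sort()
    return summary


def run_round(pages):
    """Simulate the shuffle: group mapper output by key, then reduce each key."""
    grouped = defaultdict(list)
    for title, page in pages.items():
        for key, value in map_page(title, page):
            grouped[key].append(value)
    return {key: reduce_page(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    for title, summary in sorted(run_round(PAGES).items()):
        print(title, summary)
```

Each mapper call only needs one page in memory, and each reducer call only needs the messages for one page, which is where the low memory footprint comes from. A single round only passes information one hop, so things like chains of redirects would presumably need further rounds over the reducer output.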