EarthNLP/ClimateScholar

Graph Summarization

Opened this issue · 2 comments

Hevia commented

Implement graph summarization method similar to: https://github.com/mswellhao/PacSum

Required Tasks:

  1. Tokenize each document by sentence, and create Sentence nodes that connect to a Document node
  2. Add functionality to SentenceGraph to support sentence/node mapping
  3. Add previous/next sentence relations for the sentences in a document
  4. Create sentence similarity relations between sentences whose similarity meets a threshold (it may or may not be worth saving all edge weights)
  5. Research which augmentations would make this method suitable for multi-document summarization (MDS)
  6. Implement the PACSUM extractor algorithm (this might be worth implementing directly in Neo4j rather than computing at the API level)
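
Tasks 1, 3 and 4 above can be prototyped in memory before deciding how they map onto Neo4j. The following is a minimal sketch only: the function names, the `doc_id:position` node ids, the regex sentence splitter, and the bag-of-words cosine similarity are all assumptions made for illustration (PacSum itself uses other sentence representations), and the default threshold of 0.2 is arbitrary.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_sentence_graph(doc_id, text, threshold=0.2):
    """Build sentence nodes, next-sentence edges and thresholded similarity edges."""
    sentences = split_sentences(text)
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]

    # Task 1: one node per sentence, tied to its document by doc id.
    nodes = [
        {"id": f"{doc_id}:{i}", "doc": doc_id, "position": i, "text": s}
        for i, s in enumerate(sentences)
    ]

    # Task 3: previous/next relations, stored once as (earlier, later) pairs.
    next_edges = [
        (nodes[i]["id"], nodes[i + 1]["id"]) for i in range(len(nodes) - 1)
    ]

    # Task 4: keep a similarity edge only if it meets the threshold.
    similar_edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            weight = cosine(bags[i], bags[j])
            if weight >= threshold:
                similar_edges.append((nodes[i]["id"], nodes[j]["id"], weight))

    return {"nodes": nodes, "next": next_edges, "similar": similar_edges}
```

Storing only the edges above the threshold keeps the graph small; saving all edge weights instead would allow the threshold to be changed later without recomputing similarities.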

Helpful links
PACSUM extractor code: https://github.com/mswellhao/PacSum/blob/master/code/extractor.py

Hevia commented

So ideally we write this using Cypher + APOC: https://github.com/neo4j-contrib/neo4j-apoc-procedures

Looks like the two functions we need to copy are:

It would be worth writing some pseudocode here, since that will help narrow down the Cypher required.

Hevia commented

Looks like this is also important: https://github.com/mswellhao/PacSum/blob/67cc8ad370eac160ede997b7c32eb74907728bf8/code/extractor.py#L107

Algorithm:

Inputs: A list of sentence nodes, beta, lambda1, lambda2

  1. Get the minimum and maximum edge weights
  2. Use those values plus the provided beta value to compute the minimum edge threshold
  3. Compute the forward and backward scores for each node (after playing with the code, I have a better idea of how/why this works)
  4. Multiply each node's forward and backward scores by their respective lambdas and add them together. Append the result to a list along with the associated node
  5. PACSUM randomly shuffles the list to avoid any bias, sorts it by score in descending order, and extracts the top K sentences from the shuffled/sorted list
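
The five steps above can be sketched in plain Python as follows. This is an illustration of the steps as written in this comment, not a copy of the linked extractor code: the function name `pacsum_select`, the `weights` matrix layout, the `k` and `seed` parameters, the linear threshold formula `min + beta * (max - min)`, and the choice to look only at pairs with `i < j` are assumptions. The sign convention for the two lambdas is left to the caller and should be checked against extractor.py.

```python
import random


def pacsum_select(weights, beta, lambda1, lambda2, k, seed=None):
    """Return the indices of the top-k sentences.

    weights[i][j] is the similarity between sentences i and j; only the
    pairs with i < j (sentence i precedes sentence j) are used here.
    """
    n = len(weights)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return list(range(n))[:k]

    # Steps 1-2: min/max edge weight and the beta-controlled threshold.
    values = [weights[i][j] for i, j in pairs]
    lo, hi = min(values), max(values)
    threshold = lo + beta * (hi - lo)

    # Step 3: accumulate the thresholded weights in both directions.
    forward = [0.0] * n   # similarity to earlier sentences
    backward = [0.0] * n  # similarity to later sentences
    for i, j in pairs:
        shifted = weights[i][j] - threshold
        if shifted > 0:
            forward[j] += shifted
            backward[i] += shifted

    # Step 4: weighted sum per node, kept together with the node index.
    scored = [
        (node, lambda1 * forward[node] + lambda2 * backward[node])
        for node in range(n)
    ]

    # Step 5: shuffle so ties are not biased by position, then sort and cut.
    rng = random.Random(seed)
    rng.shuffle(scored)
    scored.sort(key=lambda item: item[1], reverse=True)
    return [node for node, _ in scored[:k]]
```

Because Python's sort is stable, shuffling first means that nodes with equal scores end up in random order rather than document order, which is the point of the shuffle in step 5.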

This will be relatively easy to implement in Python; my main concern is grabbing all the sentence nodes from the associated documents using Cypher.
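
One possible shape for that lookup is sketched below. The schema is hypothetical: the `Document` and `Sentence` labels, the `HAS_SENTENCE` relationship, and the `id`, `position` and `text` properties are placeholders for whatever the graph ends up using. The helper only assumes a session object whose `run(query, **params)` yields records indexable by key, which is how the official Neo4j Python driver's sessions behave.

```python
# Hypothetical schema: (:Document {id})-[:HAS_SENTENCE]->(:Sentence {position, text})
SENTENCES_QUERY = """
MATCH (d:Document)-[:HAS_SENTENCE]->(s:Sentence)
WHERE d.id IN $doc_ids
RETURN d.id AS doc_id, s.position AS position, s.text AS text
ORDER BY doc_id, position
"""


def fetch_sentences(session, doc_ids):
    """Return {doc_id: [sentence text, in document order]} for the given ids."""
    grouped = {}
    # Keyword arguments are passed through as Cypher query parameters.
    for record in session.run(SENTENCES_QUERY, doc_ids=list(doc_ids)):
        grouped.setdefault(record["doc_id"], []).append(record["text"])
    return grouped
```

Fetching all sentences for a set of documents in a single ordered query avoids one round trip per document; the similarity weights could then be computed in Python or read from stored relations with a second query.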