Graph Summarization
Implement graph summarization method similar to: https://github.com/mswellhao/PacSum
Required Tasks:
- Tokenize each document by sentence, and create Sentence nodes that connect to a Document node
- Add functionality to SentenceGraph to support sentence/node mapping
- Add previous/next sentence relations for sentences in a document
- Create similarity relations between sentences whose similarity meets a threshold (may or may not be worth saving all edge weights)
- Research what augmentations would make this method suitable for multi-document summarization (MDS)
- Implement the PACSUM extractor algorithm (this might be worth implementing in raw Neo4j as opposed to computing at the API level)
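To make the first few tasks concrete, here is a rough in-memory sketch of the graph shape they describe. Everything named here is hypothetical: the function `build_sentence_graph`, the relation names `HAS_SENTENCE`, `NEXT` and `SIMILAR_TO`, the regex sentence splitter, and the Jaccard token overlap used as a stand-in similarity measure. None of it comes from PacSum or from the existing SentenceGraph code.

```python
import re


def build_sentence_graph(doc_id, text, threshold=0.2):
    """Return (nodes, edges) for one document.

    Edges are (source_id, relation, target_id, weight) tuples.
    All names and the similarity measure are illustrative assumptions.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # One node per sentence; "position" keeps the sentence/node mapping.
    nodes = [
        {"id": f"{doc_id}:{i}", "position": i, "text": s}
        for i, s in enumerate(sentences)
    ]

    edges = []
    # Connect every sentence node to its document node.
    for node in nodes:
        edges.append((doc_id, "HAS_SENTENCE", node["id"], None))

    # Previous/next relation between consecutive sentences.
    for prev, nxt in zip(nodes, nodes[1:]):
        edges.append((prev["id"], "NEXT", nxt["id"], None))

    # Similarity relations, kept only when the score meets the threshold.
    # Jaccard overlap is a placeholder for a real similarity model.
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            union = tokens[i] | tokens[j]
            score = len(tokens[i] & tokens[j]) / len(union) if union else 0.0
            if score >= threshold:
                edges.append((nodes[i]["id"], "SIMILAR_TO", nodes[j]["id"], score))

    return nodes, edges
```

Storing only the edges above the threshold keeps the graph small, but the extractor needs the minimum and maximum edge weights, so it may turn out that all weights have to be saved after all.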
Helpful links
PACSUM extractor code: https://github.com/mswellhao/PacSum/blob/master/code/extractor.py
So ideally we write this using Cypher + APOC: https://github.com/neo4j-contrib/neo4j-apoc-procedures
Looks like the two functions we need to copy are:
- https://github.com/mswellhao/PacSum/blob/67cc8ad370eac160ede997b7c32eb74907728bf8/code/extractor.py#L25
- https://github.com/mswellhao/PacSum/blob/67cc8ad370eac160ede997b7c32eb74907728bf8/code/extractor.py#L86
It will be worth writing some pseudocode here, since that will help narrow down the Cypher required.
Looks like this is also important: https://github.com/mswellhao/PacSum/blob/67cc8ad370eac160ede997b7c32eb74907728bf8/code/extractor.py#L107
Algorithm:
Inputs: A list of sentence nodes, beta, lambda1, lambda2
- Get the minimum and maximum edge weights
- Use those values plus the provided beta value to compute the minimum edge threshold (in the linked code this looks like min + beta * (max - min))
- We then compute the forward and backward scores (after playing with the code, I have a better idea of how/why this works)
- Add each node's forward and backward scores together (multiply each respective score by its lambda beforehand). Append this result to a list along with the associated node
- PACSUM randomly shuffles the list to avoid any bias, sorts it by score from highest to lowest, and then extracts the top K sentences from the shuffled/sorted list
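The steps above can be sketched in plain Python as follows. This is pseudocode-grade, not a port of extractor.py: the function name `pacsum_select`, the dense similarity matrix input, the `seed` parameter, and the threshold formula `min + beta * (max - min)` are assumptions. The direction convention (an edge between sentences i < j adds to the forward score of j and the backward score of i) is also an assumption, and the sketch does not reproduce any sign adjustment the reference code may apply to the forward scores, so check the linked lines before relying on it.

```python
import random


def pacsum_select(sim, beta, lambda1, lambda2, k, seed=None):
    """Pick k sentence indices from a symmetric similarity matrix `sim`.

    `sim[i][j]` is the edge weight between sentence i and sentence j,
    with sentences indexed in document order.
    """
    n = len(sim)
    # Edge weights for every unordered sentence pair (upper triangle only).
    weights = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    if not weights:
        return list(range(min(k, n)))

    # Minimum edge threshold derived from the min/max weights and beta (assumed form).
    lo, hi = min(weights), max(weights)
    threshold = lo + beta * (hi - lo)

    forward = [0.0] * n   # weight arriving from earlier sentences
    backward = [0.0] * n  # weight shared with later sentences
    for i in range(n):
        for j in range(i + 1, n):
            shifted = sim[i][j] - threshold
            if shifted > 0:  # edges at or below the threshold are ignored
                forward[j] += shifted
                backward[i] += shifted

    # Weighted sum of both scores, paired with the node index.
    scored = [
        (node, lambda1 * forward[node] + lambda2 * backward[node])
        for node in range(n)
    ]

    # Shuffle first so ties are not broken by document position, then sort.
    rng = random.Random(seed)
    rng.shuffle(scored)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [node for node, _ in scored[:k]]
```

For example, with `sim = [[0, 0.9, 0.1], [0.9, 0, 0.5], [0.1, 0.5, 0]]`, `beta=0.0`, `lambda1=0.0` and `lambda2=1.0`, only the backward scores count, so the first two sentences are selected. Because the double loop only needs pairwise weights, the same logic could later be driven by rows returned from a Cypher query instead of a matrix.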
This will be relatively easy to implement in Python. My concern is grabbing all the sentence nodes from the associated documents using Cypher.