/news_summarization

Module for automatic summarization of text documents. 新闻自动文本摘要模块。

Primary LanguagePython

Automatic text summarizer for news

Simple library for extracting summary from Deepmind news dataset or plain texts. The package also contains simple evaluation framework for text summaries. Inspired by:

Installation

python setup.py install

Usage

from newssum.parsers import PlaintextParser
from newssum.summarizers import CoreRank

TEXT = "Thomas appeared in 15 games (14 starts) for Cleveland this season, averaging 14.7 points, 4.5 assists and 2.1 rebounds in 27.1 minutes. The two-time NBA All-Star (2015-17) owns career averages of 19.0 points (.441 FG%), 5.1 assists, 2.6 rebounds and 1.0 steals in 456 career games (323 starts). In 2016-17, Thomas earned All-NBA Second Team honors when he averaged a career-high 28.9 points (.463 FG%) per game."

if __name__ == "__main__":
    parser = PlaintextParser(TEXT)
    cr_summarizer = CoreRank(parser)
    summary = cr_summarizer.get_best_sents(w_threshold=25)
    print(summary)

InfoRank Features Introduction

1. Surface Features

Surface features are based on structure of documents or sentences.

Name Description
Position 1/sentence no.
Doc_First Whether it is the first sentence of a document
Para_First Whether it is the first sentence of a paragraph
Length The number of words in a sentence
Quote The number of quoted words in a sentence

2. Content Features

Name Description
Position 1/sentence no.
Doc_First Whether it is the first sentence of a document
Para_First Whether it is the first sentence of a paragraph
Length The number of words in a sentence
Quote The number of quoted words in a sentence

3. Relevance Features

Relevance features are incorporated to exploit inter-sentence relationships.

Name Description
Position 1/sentence no.
Doc_First Whether it is the first sentence of a document
Para_First Whether it is the first sentence of a paragraph
Length The number of words in a sentence
Quote The number of quoted words in a sentence

CoreRank Notes

Since the CoreRank algorithm need get core number for each vertex considering the weight of each edge and networkX itself doesn't take it into account. The networkX source code need to be modified.

The modified file had been place at $news_summarization_INSTALLATION_HOME/newssum/models/core.py, you don't need to modify source code of networkX which may cause running error when using networkX for other jobs.

def core_number(G, weight=None):
    """Return the core number for each vertex.

    A k-core is a maximal subgraph that contains nodes of degree k or more.

    The core number of a node is the largest value k of a k-core containing
    that node.

    Parameters
    ----------
    G : NetworkX graph
       A graph or directed graph

    Returns
    -------
    core_number : dictionary
       A dictionary keyed by node to the core number.

    Raises
    ------
    NetworkXError
        The k-core is not defined for graphs with self loops or parallel edges.

    Notes
    -----
    Not implemented for graphs with parallel edges or self loops.

    For directed graphs the node degree is defined to be the
    in-degree + out-degree.

    References
    ----------
    .. [1] An O(m) Algorithm for Cores Decomposition of Networks
       Vladimir Batagelj and Matjaz Zaversnik, 2003.
       http://arxiv.org/abs/cs.DS/0310049
    """
    if G.is_multigraph():
        raise nx.NetworkXError(
                'MultiGraph and MultiDiGraph types not supported.')

    if G.number_of_selfloops()>0:
        raise nx.NetworkXError(
                'Input graph has self loops; the core number is not defined.',
                'Consider using G.remove_edges_from(G.selfloop_edges()).')

    if G.is_directed():
        import itertools
        def neighbors(v):
            return itertools.chain.from_iterable([G.predecessors_iter(v),
                                                  G.successors_iter(v)])
    else:
        neighbors=G.neighbors_iter
    # modifed start
    degrees=G.degree(weight=weight)
    if weight:
        for k in degrees:
            degrees[k] = int(degrees[k])
    # modifed end

    # sort nodes by degree
    nodes=sorted(degrees,key=degrees.get)
    bin_boundaries=[0]
    curr_degree=0
    for i,v in enumerate(nodes):
        if degrees[v]>curr_degree:
            bin_boundaries.extend([i]*(degrees[v]-curr_degree))
            curr_degree=degrees[v]
    node_pos = dict((v,pos) for pos,v in enumerate(nodes))
    # initial guesses for core is degree
    core=degrees
    nbrs=dict((v,set(neighbors(v))) for v in G)
    for v in nodes:
        for u in nbrs[v]:
            if core[u] > core[v]:
                nbrs[u].remove(v)
                pos=node_pos[u]
                bin_start=bin_boundaries[core[u]]
                node_pos[u]=bin_start
                node_pos[nodes[bin_start]]=pos
                nodes[bin_start],nodes[pos]=nodes[pos],nodes[bin_start]
                bin_boundaries[core[u]]+=1
                core[u]-=1
    return core

Complete Project Structure

├───.idea
├───build
├───data               <- The original, immutable data dump.
├───dist
├───external
├───figures            <- Figures saved by notebooks and scripts.
├───newssum            <- Python package with source code.
│   ├───evaluation
│   ├───feature_extraction
│   ├───models
│   ├───parsers
│   ├───summarizers
├───newssum.egg-info
├───notebooks
├───output             <- Processed data, models, logs, etc.
├───tests              <- Tests for Python package.
├── README.md          <- README with info of the project.
├── server.py          <- Simple server for online demo.
└── setup.py           <- Install and distribute module.