- Many real-world applications produce text data every day: articles, news, sports commentary, movie subtitles, scientific research, and so on. Summarization extracts the key points from a large text document.
- Generating intelligent and accurate summaries of long text is a popular area of research in Natural Language Processing.
- There are two main methods for text summarization:
- Extractive Text summarization
- Abstractive Text summarization
- This project performs extractive text summarization on a text document using the TextRank algorithm. TextRank is an unsupervised algorithm that ranks sentences by their similarity to one another, so the top-ranked sentences can be used as the summary.
- TextRank is a graph-based algorithm for Natural Language Processing that can be used for keyword and sentence extraction. It works along the same lines as the PageRank algorithm used by Google to rank web pages.
-
Text Cleaning : First tokenize the text document into sentences. We could tokenize into words as well, but summarization needs a weight per sentence, so the sentence is the unit we work with.
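
The sentence tokenization step can be sketched as follows. This is a minimal illustration that uses only a regular expression from the standard library; the function name `split_sentences` and the sample text are assumptions made for this example, and a real project would more likely use a dedicated tokenizer from an NLP library.

```python
import re


def split_sentences(text):
    """Split text into sentences with a naive punctuation rule."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Abbreviations such as "Dr." will be split incorrectly by this rule.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Hypothetical sample document used only for illustration.
document = (
    "TextRank is a graph-based algorithm. It ranks sentences by similarity! "
    "Does it need labeled data? No, it is unsupervised."
)

sentences = split_sentences(document)
for s in sentences:
    print(s)
```

Each element of `sentences` becomes one node in the graph built in the later steps.
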
-
Sentence Vectorization : In order to run the TextRank algorithm, we first convert the sentences into numbers while keeping as much of their information as possible. To do this, we vectorize each sentence into a vector that represents the information stored in that sentence.
- Document Term Matrix : a matrix in which each row is a sentence and each column is a word. The value at a given row and column is the frequency of that column's word in that row's sentence.
- Since a raw frequency can be arbitrarily large, the Document Term Matrix is normalized.
- First create the Document Term Matrix, then normalize it with TF-IDF, which reflects how important each word is in the vectorized sentence. Depending on its importance, each word's weight is normalized to a value between 0 and 1.
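
The vectorization steps above can be sketched in plain Python. This is an illustrative sketch, not the project's actual code: the helper names (`tokenize`, `document_term_matrix`, `tfidf`), the sample sentences, and the specific smoothed IDF formula with L2 row normalization are all assumptions chosen for this example. TF-IDF has several variants, and a library vectorizer would normally be used instead.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def document_term_matrix(sentences):
    """Return (vocab, matrix): rows are sentences, columns are words."""
    tokenized = [tokenize(s) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in sentences]
    for i, words in enumerate(tokenized):
        for word, count in Counter(words).items():
            matrix[i][index[word]] = count
    return vocab, matrix


def tfidf(matrix):
    """Weight raw counts by a smoothed IDF, then L2-normalize each row."""
    n = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many sentences each word appears.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # Assumed variant: idf = ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    result = []
    for row in matrix:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # After L2 normalization every weight lies between 0 and 1.
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Hypothetical sample sentences used only for illustration.
sample = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]
vocab, counts = document_term_matrix(sample)
weights = tfidf(counts)
print(vocab)
print(counts[0])
```

Words that occur in many sentences, such as "the" here, get a lower IDF than words that occur in only one sentence, so they contribute less to each sentence vector relative to their raw count.
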
-
Graph of Sentences : Generate a graph where each node is a sentence and each edge carries the similarity between the two sentences it connects. The similarity between two sentences is essentially a measure of the words they have in common, so simply take the dot product of the two sentence vectors.
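
A small sketch of the graph construction, stored as an adjacency matrix of pairwise dot products. The function names and the toy vectors are assumptions for this example. The vectors are binary on purpose, so that the dot product equals the number of shared words; with TF-IDF vectors the same code yields a weighted overlap instead. Leaving the diagonal at zero (no self-loops) is also a choice made here, not something the text prescribes.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_graph(vectors):
    """Build a symmetric adjacency matrix of pairwise dot products."""
    n = len(vectors)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = float(dot(vectors[i], vectors[j]))
            # The graph is undirected, so both directions share one weight.
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


# Toy binary vectors over the vocabulary [cat, dog, sat, mat, log].
vectors = [
    [1, 0, 1, 1, 0],  # "cat sat mat"
    [1, 0, 1, 0, 1],  # "cat sat log"
    [0, 1, 0, 0, 1],  # "dog log"
]
graph = similarity_graph(vectors)
for row in graph:
    print(row)
```

Here the first two sentences share two words ("cat", "sat"), so their edge weight is 2, while the first and third share none and are effectively unconnected.
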
-
Apply PageRank Algorithm :