nltk

Shows the parse tree for 10 sentences.
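A minimal sketch of how a parse tree can be produced with NLTK. The toy grammar and the example sentence are assumptions made for illustration; they are not the 10 sentences or the grammar used in this repository.

```python
import nltk

# Hypothetical toy grammar, just large enough to cover one example sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)

# Pre-tokenized by whitespace so no extra NLTK data download is needed.
tokens = "the dog chased the cat".split()

# parse() yields every tree the grammar licenses for the token sequence.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

Every token must be covered by the grammar, otherwise the parser raises an error, so a real run over arbitrary sentences needs a broader grammar or a pre-trained parser.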

Downloads several webpages and performs the following steps on them:

Tokenize
Lowercase
Remove stop-words
Plot the frequency distribution of the words in the different documents
Calculate the tf-idf for each document in your corpus
For each document, show the top 10 words with the highest tf-idf values
Use cosine similarity to find the most similar documents in your corpus. HINT: create a matrix of similarities where each document is compared to every other document. Give the top 5 most similar document pairs in your corpus. You should see that documents that talk about the same topic are more similar to each other than to documents that talk about different topics.
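The steps above can be sketched end to end in plain Python. This is an illustrative outline under stated assumptions, not the code in this repository: the three short strings in `DOCS` stand in for downloaded webpages, the regex tokenizer and the tiny `STOP_WORDS` set stand in for NLTK's tokenizer and stop-word corpus, and tf-idf is taken as (count / document length) * log(N / df), which is one common variant. Plotting is left out; `Counter` plays the role of a frequency distribution.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical stand-in corpus; the project downloads webpages instead.
DOCS = {
    "cats_1": "The cat sat on the mat. Cats purr when they are happy.",
    "cats_2": "A happy cat will purr and sleep on a warm mat.",
    "stocks": "The stock market fell as investors sold shares of the bank.",
}

# Tiny illustrative stop-word list (NLTK ships a much fuller one).
STOP_WORDS = {"the", "a", "on", "when", "they", "are", "will", "and", "as", "of"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(tokenized):
    """Return {doc: {term: tf * idf}} with tf = count / len, idf = log(N / df)."""
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)  # per-document frequency distribution
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


tokenized = {name: preprocess(text) for name, text in DOCS.items()}
weights = tf_idf(tokenized)

# Top 10 terms per document by tf-idf weight.
for name, vec in weights.items():
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(name, [term for term, _ in top])

# Compare every document with every other one; keep the 5 most similar pairs.
pairs = sorted(
    ((cosine(weights[a], weights[b]), a, b) for a, b in combinations(weights, 2)),
    reverse=True,
)
for score, a, b in pairs[:5]:
    print(f"{a} ~ {b}: {score:.3f}")
```

With this idf definition, a term that occurs in every document gets weight 0, so it contributes nothing to the similarity. In the toy corpus the two cat documents share weighted terms and rank first, while neither shares any term with the stock-market document.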

Kevnlan/nltk