Primary Language: Python

nltk

Shows the parse tree for each of 10 sentences.

Downloads several web pages and performs the following steps on them:

  1. Tokenize
  2. Lowercase
  3. Remove stop-words
  4. Plot the frequency distribution of the words in the different documents
  5. Calculate the tf-idf for each document in your corpus
  6. For each document, show the top 10 words with the highest tf-idf values
  7. Use cosine similarity to find the most similar documents in your corpus. Hint: create a matrix of similarities where each document is compared to every other document. Give the top 5 most similar document pairs in your corpus. You should see that documents that talk about the same topic are more similar to each other than to documents that talk about different topics.
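
The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it uses only the Python standard library so that it runs without downloads, whereas the project itself presumably relies on NLTK (for example `nltk.word_tokenize`, `nltk.corpus.stopwords` and `nltk.FreqDist`). The stop-word list, the sample `pages`, and all function names are assumptions made for the example. Downloading the pages and plotting (step 4) are left out, and the tf-idf formula is the plain unsmoothed variant, tf × log(N / df).

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real run would likely use a fuller
# one such as nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on",
              "are", "for", "as", "when"}


def preprocess(text):
    """Steps 1-3: tokenize, lowercase, and remove stop-words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Step 5: return one {word: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # step 4 would plot these raw counts
        total = len(doc) or 1
        vectors.append({w: (c / total) * math.log(n / df[w])
                        for w, c in counts.items()})
    return vectors


def top_words(vector, k=10):
    """Step 6: the k words with the highest tf-idf weight."""
    return sorted(vector, key=vector.get, reverse=True)[:k]


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_pairs(vectors, k=5):
    """Step 7: build the full similarity matrix, then rank distinct pairs."""
    n = len(vectors)
    matrix = [[cosine(vectors[i], vectors[j]) for j in range(n)]
              for i in range(n)]
    pairs = [(matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs, reverse=True)[:k]


# Made-up stand-ins for the downloaded web pages.
pages = [
    "The cat sat on the mat and the cat slept.",
    "A cat and a kitten are playing on the mat.",
    "Stock markets fell as investors sold shares.",
    "Investors bought shares when stock markets rose.",
]
docs = [preprocess(p) for p in pages]
vectors = tf_idf(docs)
top = [top_words(v) for v in vectors]
pairs = most_similar_pairs(vectors)

for score, i, j in pairs:
    print(f"doc {i} vs doc {j}: {score:.3f}")
```

With these sample pages, the two finance texts and the two cat texts should rank as the most similar pairs, while cross-topic pairs share no words and score zero.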