Show the parse tree for 10 sentences.
Download several different webpages and perform the following on their text:
- Tokenize
- Lowercase
- Remove stop-words
- Plot the frequency distribution of the words in the different documents
- Calculate the tf-idf for each document in your corpus
- For each document, show the top 10 words with the highest tf-idf values
- Use cosine similarity to find the most similar documents in your corpus. Hint: create a similarity matrix in which each document is compared to every other document. Give the 5 most similar pairs of documents in your corpus. You should see that documents about the same topic are more similar to each other than to documents about different topics.
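
The text-processing steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference solution: the four placeholder documents stand in for downloaded webpages, the stop-word list is a tiny hand-picked assumption, the helper names (`preprocess`, `tf_idf`, `cosine`, `top_terms`, `most_similar_pairs`) are hypothetical, and tf-idf is computed with the plain `tf * log(N / df)` weighting, which is only one of several common variants. Downloading pages and plotting the frequency distribution are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_terms(vector, k=10):
    """The k terms with the highest tf-idf weight in one document."""
    return sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:k]

def most_similar_pairs(matrix, k=5):
    """The k highest-scoring (score, i, j) pairs with i < j."""
    n = len(matrix)
    pairs = [(matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs, reverse=True)[:k]

# Placeholder documents standing in for the text of downloaded webpages.
docs = [
    "The cat sat on the mat and the cat purred.",
    "A cat and a kitten are playing on the mat.",
    "Stock markets fell as investors sold shares.",
    "Investors bought shares when stock markets rose.",
]

tokens = [preprocess(d) for d in docs]
vectors = tf_idf(tokens)

# Similarity matrix: every document compared to every other document.
matrix = [[cosine(u, v) for v in vectors] for u in vectors]

for i, vec in enumerate(vectors):
    print(f"doc {i} top terms:", [term for term, _ in top_terms(vec)])
for score, i, j in most_similar_pairs(matrix):
    print(f"doc {i} ~ doc {j}: {score:.3f}")
```

With these placeholder documents, the two cat documents and the two stock-market documents should rank as the most similar pairs, while cross-topic pairs share no terms and score zero. The word counts in `Counter(tokens[i])` are also the data a frequency-distribution plot would draw from.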