I have often felt that the search functions on news sites aren’t adequate, especially for a journalist who is looking for data or information about a certain topic. For this project, I decided to create a search function for the CNN website that, apart from showing the links that match a query, will produce the following results:
- List the total number of results of the query
- List the top authors who wrote about that query
- List the top five most common words that appear in articles about the query
The program will take the following input from the user:
- Query (limited to one word for now)
- Sections of the website in which the user wants to search for the query, e.g. search for articles about Trump in the entertainment section only
- The year of the query
- The month of the query
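As an illustration, the four inputs could be gathered into one small structure before the search runs. This is only a minimal sketch, not the project's actual code: the names `SearchRequest` and `parse_request` are hypothetical, and the validation rules (exactly one word, month between 1 and 12) are assumptions based on the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRequest:
    """Hypothetical container for the four user inputs."""
    query: str    # a single word, per the current limitation
    section: str  # e.g. "entertainment"
    year: int
    month: int    # 1 to 12


def parse_request(query: str, section: str, year: str, month: str) -> SearchRequest:
    """Validate raw input strings and build a SearchRequest."""
    words = query.split()
    if len(words) != 1:
        raise ValueError("query must be exactly one word")
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError("month must be between 1 and 12")
    return SearchRequest(
        query=words[0].lower(),
        section=section.strip().lower(),
        year=int(year),
        month=month_number,
    )
```

For example, `parse_request("Trump", "Entertainment", "2017", "3")` would give a request for "trump" in the "entertainment" section for March 2017, while a two-word query raises a `ValueError`.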
The algorithm of the program:
- Takes in the inputs from the user
- Looks in CNN’s sitemap for the articles published in the year and month provided by the user
- Takes the query from the user and converts it into a regular expression
- Looks through all the articles within the section provided by the user and matches the regular expression against their links, keeping every article that has the query word in its link
- Puts all the articles into a list ‘finalist’
- If ‘finalist’ is empty, asks the user to change their query
- Takes each link and extracts the text and authors from the article
- Tokenizes the article text
- Puts the authors in a list
- Computes a frequency distribution of the tokens from all the articles
- Computes a frequency distribution of the authors list to get the top authors
- Prints the results
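
The matching and counting steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the program itself: it skips the sitemap download and article scraping, all function names are hypothetical, the URLs in the demo are made up, the section is assumed to appear as a path segment in each link, and a basic regular-expression tokenizer with `collections.Counter` stands in for whatever tokenizer and frequency-distribution tools the project uses.

```python
import re
from collections import Counter


def build_pattern(query):
    """Turn a one-word query into a case-insensitive regular expression."""
    return re.compile(re.escape(query), re.IGNORECASE)


def filter_links(links, section, pattern):
    """Keep links in the given section whose URL contains the query word.

    Assumes the section shows up as a path segment, e.g. /entertainment/.
    """
    section_marker = f"/{section.lower()}/"
    return [
        link for link in links
        if section_marker in link.lower() and pattern.search(link)
    ]


def tokenize(text):
    """Split text into lowercase word tokens (a simple stand-in tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(articles, top_n=5):
    """Count words and authors across (text, authors) pairs.

    Returns the top_n most common words and the top_n most common authors.
    """
    word_counts = Counter()
    author_counts = Counter()
    for text, authors in articles:
        word_counts.update(tokenize(text))
        author_counts.update(authors)
    return word_counts.most_common(top_n), author_counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up links standing in for entries read from the sitemap.
    links = [
        "https://example.com/2017/03/01/entertainment/trump-oscars-joke/index.html",
        "https://example.com/2017/03/02/politics/trump-speech/index.html",
    ]
    finalist = filter_links(links, "entertainment", build_pattern("Trump"))
    if not finalist:
        print("No articles found. Please try a different query.")
    else:
        print("Total results:", len(finalist))
```

Note that matching the bare query word also matches longer words that contain it (a query of "trump" would match "trumpet"); wrapping the pattern in `\b` word boundaries is one way to tighten this if needed.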