I'm working as a research assistant at American University this semester and this is a project I've been working on.
The purpose of this project is to find a number of keywords in hundreds of thousands of articles across a number of websites. From there I need to count how many articles each keyword appeared in, and record the date of each relevant article.
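To give a sense of the bookkeeping involved, here is a minimal sketch of counting keyword appearances and recording article dates. It assumes articles arrive as (date, text) pairs and that matching is a case-insensitive whole-word search; the keyword list and articles below are placeholders, not data from the project.

```python
import csv
import re
from collections import defaultdict
from datetime import date

# Placeholder keywords and articles; the real project pulls these from the news sites.
KEYWORDS = ["election", "economy"]
articles = [
    (date(2019, 10, 1), "Coverage of the upcoming election and the economy..."),
    (date(2019, 10, 2), "An unrelated piece about local sports."),
]

# keyword -> dates of the articles it appeared in
appearances = defaultdict(list)

for published, text in articles:
    for keyword in KEYWORDS:
        # Case-insensitive whole-word match, so "election" does not match "elections".
        if re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE):
            appearances[keyword].append(published)

# One row per keyword: how many articles it appeared in, plus the dates.
with open("keyword_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["keyword", "article_count", "dates"])
    for keyword in KEYWORDS:
        dates = appearances[keyword]
        writer.writerow([keyword, len(dates), ";".join(d.isoformat() for d in dates)])
```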
- csv
- datetime
- re
- collections
- naftemporiki_links
- in_links
- requests
- bs4
- twilio.rest
Usage is simple: run main.py and you will be prompted to enter which news website you want output for.
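For reference, the site-selection prompt might be wired up roughly like this. This is a guess at the structure rather than the actual contents of main.py, and the scraper functions are placeholders standing in for the per-site modules listed above (naftemporiki_links, in_links).

```python
# Hypothetical sketch of the site-selection prompt; main.py may be organized differently.
# The scraper functions are placeholders for the per-site modules in the repo.

def scrape_naftemporiki():
    print("scraping naftemporiki...")  # placeholder for the real scraping logic

def scrape_in():
    print("scraping in...")  # placeholder for the real scraping logic

SITES = {
    "naftemporiki": scrape_naftemporiki,
    "in": scrape_in,
}

choice = input("Which news website do you want output for? ({}): ".format(", ".join(SITES)))
scraper = SITES.get(choice.strip().lower())
if scraper is None:
    print("Unknown site:", choice)
else:
    scraper()
```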
- Build in more websites. This should be relatively simple and will be started on Oct. 22, after my midterm exams.
- Automate the output so that every website is scraped by running main.py.
- Build in parallel processing. The processes being run are not computationally complex or intensive, but they take a long time because of how much information is pulled from websites and how slow those requests are. Running the requests concurrently would let pages be fetched in unison while still staying well within the available processing power (see the first sketch after this list).
- Update version history.
- Change the nested loop in `check_for_keywords` to increase efficiency (see the second sketch after this list).
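On the parallel-processing item above: since the bottleneck is network I/O rather than CPU, a thread pool over the existing requests calls would likely be enough. The sketch below is a minimal illustration and not code from this repo; the URLs are placeholders and the per-article work is reduced to pulling the page title with bs4.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup

# Placeholder URLs; the real article links would come from the *_links modules.
urls = [
    "https://example.com/article/1",
    "https://example.com/article/2",
    "https://example.com/article/3",
]

def fetch_title(url):
    """Download one article and return (url, title); network errors give a None title."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return url, None
    soup = BeautifulSoup(response.text, "html.parser")
    return url, soup.title.string if soup.title else None

# A small thread pool keeps several HTTP requests in flight at once, which helps
# because each request spends most of its time waiting on the network, not the CPU.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_title, url) for url in urls]
    for future in as_completed(futures):
        url, title = future.result()
        print(url, "->", title)
```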
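On the `check_for_keywords` item: one common way to avoid scanning each article once per keyword is to compile all keywords into a single alternation regex and scan the text once. This is a generic sketch of that idea, not the repo's actual `check_for_keywords` implementation, and the keyword list is a placeholder.

```python
import re

KEYWORDS = ["election", "economy", "parliament"]  # placeholder keyword list

# One compiled pattern that matches any keyword, instead of one pass over the text per keyword.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def check_for_keywords(text):
    """Return the set of keywords found, using a single pass over the text."""
    # Signature and return type are assumptions; the repo's version may differ.
    return {match.lower() for match in PATTERN.findall(text)}

print(check_for_keywords("The parliament debated the economy before the election."))
# -> {'parliament', 'economy', 'election'}
```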