The main entry point of this project is NewsCrawlerMain.py. After setting up the environment and filling in the corresponding parameters in the config folder, run the following shell command and enter your DICE password when prompted, so that the crawler keeps running for a long period.
longjob -28day -c ./job.sh
config/
    NewsConfig.py: Twitter API credentials, path settings, etc. Obtain your own Twitter API credentials and enter them here.
    developer_config.py: Twitter API class
data/
    TwitterNewsAgencies.csv: the Twitter accounts chosen for crawling
handler/
    get_articles.py: crawls articles using the extracted URLs
    get_tweets_timeline.py: job that fetches tweet timelines
    save_to_solr.py: saves articles and tweets to Solr
job.sh: script that runs the job
NewsCrawlerMain.py: main entry point of this project
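
To illustrate how the files listed above might fit together, here is a minimal, self-contained sketch of one crawl pass: read the account list, fetch each timeline, extract URLs from the tweet text, fetch the articles, and store everything. It is not the project's actual code. Every function name, the `screen_name` CSV column, and the sample data are assumptions made for illustration, and the Twitter, crawling, and Solr steps are replaced by in-memory placeholders so the sketch runs without credentials or a server.

```python
import csv
import io
import re

# Hypothetical stand-in for data/TwitterNewsAgencies.csv.
# The real file's column names are not documented; "screen_name" is assumed.
SAMPLE_ACCOUNTS_CSV = "screen_name\nexample_news_a\nexample_news_b\n"

# Simple pattern for http(s) links in tweet text.
URL_PATTERN = re.compile(r"https?://\S+")


def load_accounts(csv_text):
    """Return the account names listed in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["screen_name"] for row in reader]


def extract_urls(tweet_text):
    """Find links in a tweet and strip trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(tweet_text)]


def fetch_timeline(account):
    """Placeholder for the timeline step (get_tweets_timeline.py).

    Returns one canned tweet instead of calling the Twitter API.
    """
    text = f"Story from {account}: https://example.com/{account}/1."
    return [{"account": account, "text": text}]


def fetch_article(url):
    """Placeholder for the crawling step (get_articles.py); no network access."""
    return {"url": url, "body": ""}


def save_documents(store, docs):
    """Placeholder for the storage step (save_to_solr.py); uses a plain list."""
    store.extend(docs)


def run_once(csv_text):
    """Run one pass over all accounts and return the stored documents."""
    store = []
    for account in load_accounts(csv_text):
        tweets = fetch_timeline(account)
        save_documents(store, tweets)
        urls = [url for tweet in tweets for url in extract_urls(tweet["text"])]
        save_documents(store, [fetch_article(url) for url in urls])
    return store


if __name__ == "__main__":
    for doc in run_once(SAMPLE_ACCOUNTS_CSV):
        print(doc)
```

Keeping each stage behind its own function mirrors the split into separate handler files, so a placeholder could be swapped for a real API, crawler, or Solr client without changing the loop in `run_once`.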