A simple scraper using Python and the newspaper library
Step-wise:
- When we enter "python3 src/news_scrapper.py --root_dir scraper/Articles --source_list scraper/news_source.txt", Python executes the file src/news_scrapper.py.
- --root_dir and --source_list are passed as arguments specifying the storage location and the news source list respectively (a sketch of how these arguments might be parsed follows the dependency list below).
- The application then executes using the dependencies listed below:
- newspaper3k==0.2.8
- jsonpickle==1.1.0
- simplejson==3.16.0
- pymongo==3.9.0
- dnspython==1.16.0
- environs==6.0.0
- cryptography==2.7
- pylint==2.3.0
- git-pylint-commit-hook==2.5.1
- pycodestyle==2.5.0
- mypy==0.730
- mockito==1.1.1
- flake8==3.7.6
- pep8-naming==0.5.0
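As a rough illustration, the two command-line arguments described in the step-wise overview could be parsed with argparse along these lines. This is a sketch only; the argument names come from the command shown above, but the parser code may differ from what src/news_scrapper.py actually does.

```python
# Sketch: wiring up --root_dir and --source_list with argparse.
# The actual implementation in src/news_scrapper.py may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Scrape news sources and store articles as JSON files.")
    parser.add_argument("--root_dir", required=True,
                        help="directory under which the dated output folder is created")
    parser.add_argument("--source_list", required=True,
                        help="path to the text file listing news source URLs")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.root_dir, args.source_list)
```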
We want to build a small newsfeed management system which will accept a simple text file as input containing news sites such as:
http://slate.com
https://www.reuters.com/places/india
...
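Such a file could be read into a plain list of site URLs roughly as follows. This is a sketch under the assumption that the file contains one URL per line; the actual reader in the app may handle the file differently.

```python
# Sketch: load the list of news-site URLs from the source file.
# Assumes one URL per line; blank lines are skipped.
def load_sources(source_list_path):
    with open(source_list_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Example:
# load_sources("scraper/news_source.txt")
# -> ["http://slate.com", "https://www.reuters.com/places/india", ...]
```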
The system will crawl through all of these sites, extract individual news articles and store them as JSON files/documents.
Functionality
- Download the news feeds from the different sources in parallel. This feature is natively provided by the library used for scraping in this app (see the download sketch after this list).
- Store all news feeds as JSON files on the local file system. For example, if the app is run on 2020-09-16, it creates a folder named 2020-09-16 and places all JSON files there. The file name format is "<source>_<timestamp>.json", for example reuters_2020-09-16T12.05.32.json (see the file-naming sketch after this list).
- Dump all news feeds to MongoDB as JSON documents. This app uses a free cloud-hosted version of MongoDB (MongoDB Atlas or mLab).
- Check for duplicate feeds by the combination of (title, publish_date). If a matching document is already present, the insertion to MongoDB is skipped (see the duplicate-check sketch after this list).
- Create a summary file (summary.txt) in the output folder, e.g. 2019-10-08. The file summarizes the count of all downloaded articles from all sources.
- Create an error log file (error_logs.txt) in the output folder, e.g. 2019-10-08. The log file contains details of all news feeds that errored out during parsing/building. It stores the stack trace, which can be used for debugging.
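The parallel download is a built-in feature of newspaper3k via its news_pool. A minimal sketch of how it can be used is shown below; the source URLs and thread count are illustrative, not necessarily the app's actual configuration.

```python
# Sketch: build newspaper sources and download their articles in parallel
# using newspaper3k's news_pool. URLs and thread count are illustrative.
import newspaper
from newspaper import news_pool

source_urls = ["http://slate.com", "https://www.reuters.com/places/india"]
papers = [newspaper.build(url, memoize_articles=False) for url in source_urls]

news_pool.set(papers, threads_per_source=2)  # start download threads per source
news_pool.join()                             # block until all downloads finish

for paper in papers:
    for article in paper.articles:
        article.parse()  # extract title, text, publish_date, etc.
```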
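For the JSON output, the dated folder and the <source>_<timestamp>.json naming could be produced roughly as sketched below; the helper name and the exact fields written are assumptions, only the folder and file-name pattern come from the description above.

```python
# Sketch: write one article as a JSON file under <root_dir>/<YYYY-MM-DD>/.
# The helper name and serialized fields are assumptions for illustration.
import json
import os
from datetime import datetime

def write_article_json(root_dir, source_name, article_dict):
    day_folder = os.path.join(root_dir, datetime.now().strftime("%Y-%m-%d"))
    os.makedirs(day_folder, exist_ok=True)
    timestamp = datetime.now().strftime("%Y-%m-%dT%H.%M.%S")
    file_path = os.path.join(day_folder, f"{source_name}_{timestamp}.json")
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(article_dict, f, ensure_ascii=False, indent=2, default=str)
    return file_path
```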
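The duplicate check against MongoDB could look roughly like the sketch below using pymongo; the connection string, database name and collection name are placeholders, not the app's real configuration.

```python
# Sketch: skip inserting a feed into MongoDB if a document with the same
# (title, publish_date) already exists. Names and URI are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["newsfeed"]["articles"]

def insert_if_new(doc):
    duplicate = collection.find_one(
        {"title": doc["title"], "publish_date": doc["publish_date"]}
    )
    if duplicate is None:
        collection.insert_one(doc)
        return True
    return False  # duplicate found, insertion skipped
```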
Make sure Python 3.7 is installed and added to PATH.
python --version
or
python3.7 --version
Then create a virtual environment for the project.
virtualenv -p python venv
. venv/bin/activate
or
virtualenv -p python3.7 venv
. venv/bin/activate
or (using the built-in venv module, e.g. on Windows)
python3 -m venv venv
venv\Scripts\activate
Then, on all platforms, install the Python dependencies:
pip install -r requirements-dev.txt
Note: there is a separate requirements.txt file that excludes all but the dependencies required for deployment. Any production dependencies should be added to both files.
To run the lint checks:
sh scripts/lint.sh
Run the scraper by invoking the Python program:
python3 src/news_scrapper.py --root_dir <output-directory> --source_list <path-to-source-file>
For example:
python3 src/news_scrapper.py --root_dir F:/Scraper/Articles --source_list F:/Scraper/news_source.txt
Make sure the file news_source.txt exists in the specified location.
A sample news_source.txt has been provided in the repo. All the output files (JSON files, summary.txt and error_logs.txt) will be generated in the specified output directory.
To make things easier, the above command is wrapped by a shell script news_feed.sh.
The script contains the above Python command with a default output directory and a relative path to the source file present in the repo.
We can run the script as-is, or edit it to provide custom paths.
To do this, open the script in a text editor and replace the values of root_dir and source_list with the paths where we want the input and output to be.
sh scripts/news_feed.sh
The script will run for some time (on the order of minutes) depending on the number of sources provided and the total number of articles present on those sources on the day the app is run.
For functionality testing, it is advised to provide a single source containing a small number of articles. http://slate.com is one preferred input; it takes about 3 minutes to scrape all of its articles on any given day.
For testing multiple sources,
http://slate.com
https://www.reuters.com/places/india
are two example values; scraping these two sources combined takes about 8 minutes.
Once the script execution completes, the output can be found on the local file system and in the MongoDB cluster.
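To spot-check the MongoDB side of the output, a quick pymongo query can be run; as in the earlier sketch, the connection string, database and collection names below are placeholders rather than the app's actual configuration.

```python
# Sketch: count and preview documents in the (placeholder) collection.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["newsfeed"]["articles"]

print("total documents:", collection.count_documents({}))
for doc in collection.find().limit(5):
    print(doc.get("title"), doc.get("publish_date"))
```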