This scraper can extract text and media from any website; just pass the start URL as a command-line argument. The extracted data is written to a file in XML format. The path to that file is also a command-line argument and defaults to data.xml in the current directory. You can also set the number of pages to scrape (defaults to 20).
- Python 3.6 or higher.
- git clone https://github.com/vsmetansky/extractor.git
- cd extractor
- pip3 install .
- extractor [-h] [-f FILE_NAME] [-n PAGE_NUM] url
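
For readers who want to see how the options in the usage line fit together, here is a minimal sketch of an `argparse` parser with the same shape. It is an illustration only, not the project's actual code: the function name `build_parser`, the attribute names `file_name` and `page_num`, and the help strings are assumptions. Only the flags and the defaults (data.xml, 20) come from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring: extractor [-h] [-f FILE_NAME] [-n PAGE_NUM] url.

    Hypothetical helper for illustration; the real project may be organized differently.
    """
    parser = argparse.ArgumentParser(
        prog="extractor",
        description="Extract text and media from a website into an XML file.",
    )
    # Required positional argument: the URL where scraping starts.
    parser.add_argument("url", help="start URL of the website to scrape")
    # Optional output path; defaults to data.xml in the current directory.
    parser.add_argument(
        "-f",
        dest="file_name",
        metavar="FILE_NAME",
        default="data.xml",
        help="path to the XML output file (default: data.xml)",
    )
    # Optional page limit; defaults to 20 pages.
    parser.add_argument(
        "-n",
        dest="page_num",
        metavar="PAGE_NUM",
        type=int,
        default=20,
        help="number of pages to scrape (default: 20)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-n", "5", "https://example.com"])
print(args.url, args.file_name, args.page_num)
# prints: https://example.com data.xml 5
```

In this sketch, omitting `-f` and `-n` falls back to the documented defaults, so `extractor https://example.com` would scrape up to 20 pages into data.xml.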