This custom routine scrapes a single link. I have scraped the story title, story date, and the main story from https://www.anandabazar.com/sport/bwf-world-championships-final-pv-sindhu-vs-nozomi-okuhara-dgtl-1.1036258. Right now, in order to scrape URLs with this tool, users need a bit of scripting knowledge, because you have to identify the story title and body segments yourself. I will try to make the tool more flexible, so that human intervention is kept to a minimum.
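"Identifying the title and body segments" amounts to knowing which HTML elements on the page hold each field. As a minimal stdlib-only sketch of that idea (not the actual spider — the real selectors live in `global_scrape.py`, and the tag choices and sample markup below are assumptions for illustration):

```python
from html.parser import HTMLParser

class StorySegmentParser(HTMLParser):
    """Collects <h1> text as the story title and <p> text as the story body.

    The tag choices here are illustrative assumptions; the real selectors
    for anandabazar.com pages live in global_scrape.py.
    """
    def __init__(self):
        super().__init__()
        self.title = ""
        self.body = []
        self._current = None  # tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current == "h1":
            self.title += text
        elif self._current == "p" and text:
            self.body.append(text)

# Hypothetical stand-in for a downloaded article page.
sample = "<html><body><h1>Final preview</h1><p>Sindhu meets Okuhara.</p></body></html>"
parser = StorySegmentParser()
parser.feed(sample)
print(parser.title)           # the <h1> segment
print(" ".join(parser.body))  # the <p> segments
```

In Scrapy itself the same step is done with CSS or XPath selectors on the response; the point is simply that each site needs its own mapping from page elements to fields.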
- Install Scrapy: https://docs.scrapy.org/en/latest/intro/install.html
- Clone the repo.
- The file you are looking for is `global_scrape.py` inside `language_crawl/language_crawl/spiders/`. Please go through the file; it should be straightforward.
- You can see my scraped result data in `abp_scrap.csv`.
- To replicate my results, open your terminal, go to the `language_crawl/language_crawl` directory, and run `scrapy crawl my_global_scraper -o abp_scrap.csv`. But before that, please go through the article once.
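The `-o` flag makes Scrapy write each scraped item as a CSV row. A quick sketch of inspecting such output with the stdlib — note the column names `title`, `date`, and `story` are assumptions here; check the header row of your own `abp_scrap.csv`:

```python
import csv
import io

# Hypothetical stand-in for the contents of abp_scrap.csv;
# in practice you would use open("abp_scrap.csv", newline="").
sample_csv = "title,date,story\nFinal preview,2019-08-25,Sindhu meets Okuhara.\n"

# DictReader maps each row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    print(row["title"], "|", row["date"])
```

This is handy for a quick sanity check that the title and body segments were extracted as expected before scaling the spider up to more URLs.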