We use Scrapy to crawl data from dantri, vietnamnet, ...
In Git Bash or a Linux terminal, run bash pipe_crawl_vnn.bash and bash pipe_crawl_dantri.bash from the src/crawl_paper folder, as shown below.
After these commands finish, a new folder src/crawl_paper/raw_data contains the raw dataset.
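For reference, the full crawl sequence (script names and paths as given above):

```bash
cd src/crawl_paper           # both pipelines must be run from this folder
bash pipe_crawl_vnn.bash     # crawl vietnamnet
bash pipe_crawl_dantri.bash  # crawl dantri
# Output is written to src/crawl_paper/raw_data
```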
Articles from dantri cover 18 categories.
Articles from vietnamnet cover 14 categories.
Each article is saved as a JSON file with four features: url, title, abstract, and html_content.
Note that none of the collected data has been preprocessed; a quick way to inspect a raw file is sketched below.
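A minimal inspection sketch, assuming the raw_data layout above; the exact file naming inside raw_data is not specified here, so we just grab the first JSON file found:

```python
import json
from pathlib import Path

# Directory from the pipeline above; file naming is illustrative.
raw_dir = Path("src/crawl_paper/raw_data")
sample = next(raw_dir.rglob("*.json"))  # pick any crawled article

with open(sample, encoding="utf-8") as f:
    article = json.load(f)

# Each article carries exactly these four features.
print(article["url"])
print(article["title"])
print(article["abstract"])
print(article["html_content"][:200])  # raw, unpreprocessed HTML
```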