Text-Summarization-for-Vietnamese-Newspapers

1. Data Collection

We use Scrapy to crawl articles from Vietnamese news sites such as dantri and vietnamnet.

From Git Bash or a Linux terminal, run `bash pipe_crawl_vnn.bash` and `bash pipe_crawl_dantri.bash` (in the folder src/crawl_paper). After these commands finish, a new folder src/crawl_paper/raw_data contains the raw dataset.

Articles from dantri span 18 categories.

Articles from vietnamnet span 14 categories.

Each article is saved as a JSON file with 4 fields: url, title, abstract, and html_content.
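A small sketch of reading and writing one such article file, assuming the four fields above. The helper names, file path, and placeholder values are illustrative, not taken from the repository.

```python
import json
from pathlib import Path

# The four fields stored per article.
REQUIRED_FIELDS = {"url", "title", "abstract", "html_content"}


def save_article(article: dict, path: Path) -> None:
    """Save one crawled article as a single JSON file."""
    missing = REQUIRED_FIELDS - article.keys()
    if missing:
        raise ValueError(f"article is missing fields: {missing}")
    # ensure_ascii=False keeps Vietnamese text readable in the file.
    path.write_text(json.dumps(article, ensure_ascii=False), encoding="utf-8")


def load_article(path: Path) -> dict:
    """Load one article JSON file back into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))


# Example with placeholder values (not real crawled data).
article = {
    "url": "https://dantri.com.vn/some-article.htm",
    "title": "Tiêu đề bài báo",
    "abstract": "Tóm tắt ngắn của bài báo.",
    "html_content": "<html><body>...</body></html>",
}
out = Path("example_article.json")
save_article(article, out)
assert set(load_article(out)) == REQUIRED_FIELDS
```

Storing one article per file keeps the raw dataset easy to inspect and to reprocess selectively later.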

Note that the collected data has not been preprocessed at this stage.

2. Preprocessing