Data crawled from several major Vietnamese newspapers: Dantri, Tuoitre, Thanhnien, Vietnamnet, Vnexpress, Vtv.
All text is tokenized using VnCoreNLP.
Each line contains a list of labels, followed by the corresponding title.
All labels start with the "__label__" prefix.
Word segmentation is applied to each label, and spaces are replaced with "-". Example:
__label__cà_phê-Trung_Nguyên __label__Buôn_Ma_Thuột __label__Đặng_Lê_Nguyên_Vũ __label__Giấc-mơ __label__Trung_Nguyên Hành_trình ông chủ Trung_Nguyên mang giấc mơ từ quê nghèo ra thế_giới
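The record format described above can be read with a few lines of Python. This is only a minimal sketch: the function names `parse_line` and `label_to_words` are hypothetical, and it assumes labels always come first on the line and are separated by whitespace, as in the example.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_line(line: str) -> Tuple[List[str], str]:
    """Split one record into (labels, title).

    Assumes every leading whitespace-separated token that starts with
    "__label__" is a label, and everything after them is the title.
    """
    tokens = line.split()
    labels = []
    i = 0
    while i < len(tokens) and tokens[i].startswith(LABEL_PREFIX):
        labels.append(tokens[i][len(LABEL_PREFIX):])
        i += 1
    title = " ".join(tokens[i:])
    return labels, title


def label_to_words(label: str) -> List[str]:
    """Recover the segmented words of a label.

    "-" stands for the original spaces between words; "_" joins the
    syllables of a single word, as produced by word segmentation.
    """
    return label.split("-")
```

For the example line, `parse_line` would return five labels and the title starting with "Hành_trình", and `label_to_words("cà_phê-Trung_Nguyên")` would give the two words "cà_phê" and "Trung_Nguyên".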
- a large number of records are unusable: their titles have bad or unrelated tags
- some tags are too long: they contain several smaller tags but are treated as one single tag
- no category field in the original crawled data file (possibly bad crawling)
- many tags are just copies of the titles, sometimes with the extra string "- VnExpress Đời sống" appended
- some tags are just copies of the titles, randomly split into smaller strings, with "- VnExpress Đời sống" appended to the last one; these smaller strings then become the tags for the title
- some titles are broken: very short, make no sense, contain half a word or a single character, ...
- titles might be cut off after the dash symbol, e.g. máy bay F-22 => máy bay F (possibly bad crawling)
- contains some English news
- some titles have irrelevant, unrelated marker tags, e.g. "vnn", "vietnamnet doc bao", ...
- one or a few(?) records have a tag which is just a URL
- many tags are conjoined (e.g. "thương mại" is written as "thươngmại")
- many titles have no tags
- some titles are just 'Noname'
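
Several of the issues listed above can be caught with simple heuristics. The sketch below is one possible filter, not the cleaning actually used for this dataset. It assumes raw (unsegmented) records given as a title string and a list of tag strings; the function name `clean_record` and the thresholds `MAX_TAG_WORDS` and `MIN_TITLE_CHARS` are hypothetical choices. It does not try to repair conjoined tags or titles cut off after a dash, which would need a dictionary or the original pages.

```python
import re
from typing import List, Optional, Tuple

URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)
SITE_SUFFIX = "vnexpress đời sống"            # extra string seen in bad tags
NOISE_TAGS = {"vnn", "vietnamnet doc bao"}    # marker tags from the list above
MAX_TAG_WORDS = 6      # assumption: longer tags are merged or copied text
MIN_TITLE_CHARS = 10   # assumption: shorter titles are broken


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so tags and titles can be compared."""
    return " ".join(text.lower().split())


def clean_record(title: str, tags: List[str]) -> Optional[Tuple[str, List[str]]]:
    """Return (title, kept_tags), or None if the record looks unusable."""
    title = title.strip()
    if title == "Noname" or len(title) < MIN_TITLE_CHARS:
        return None
    norm_title = normalize(title)
    kept = []
    for tag in tags:
        tag = tag.strip()
        norm = normalize(tag)
        if not norm:
            continue
        if URL_RE.match(tag):              # tag is just a URL
            continue
        if norm in NOISE_TAGS:             # irrelevant marker tag
            continue
        if SITE_SUFFIX in norm:            # title fragment plus site string
            continue
        if norm == norm_title:             # tag is a copy of the title
            continue
        if len(norm.split()) > MAX_TAG_WORDS:
            continue
        kept.append(tag)
    if not kept:                           # title left with no tags
        return None
    return title, kept
```

Records for which `clean_record` returns None would simply be dropped before word segmentation and label formatting.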