[TOC]
- 595,037 documents
- article : id | url | title | kicker | author | published_date | contents | type | source
- contents
- (None)
- kicker : a section header indicating the publication category; the article is irrelevant if the kicker is one of "Opinion", "Letters to the Editor", "The Post's View"
- title
- image : image URL and full caption
- byline : by + author(s)
- paragraph : plain (text) | html (with HTML tags < ... >)
- (author_info)
- type : 'article' / 'blog'
- source : 'The Washington Post'
- contents
- remove irrelevant articles according to kicker
- remove ['type'], ['source'] from article
- remove [byline], [title], [author_info], [image] from contents if they exist
- remove empty content from contents
- remove HTML code from contents
- group contents into an article (plain text)
- 571,963 documents remain
- article : id | url | title | kicker | author | date | contents (long string)
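
The cleaning steps above can be sketched roughly as follows. This is only an illustration, not the original script: it assumes each entry in `contents` is a dict with `type` and `content` keys, and it uses a simple regex in place of a real HTML parser. The helper names `strip_html` and `clean_article` are hypothetical.

```python
import re
from html import unescape

# Kickers listed in the notes as marking irrelevant articles.
IRRELEVANT_KICKERS = {"Opinion", "Letters to the Editor", "The Post's View"}
# Content item types removed from `contents`.
DROPPED_TYPES = {"byline", "title", "author_info", "image"}

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Drop HTML tags, decode entities, collapse whitespace (regex stand-in for a parser)."""
    return re.sub(r"\s+", " ", unescape(TAG_RE.sub(" ", text))).strip()


def clean_article(article):
    """Return a cleaned copy of the article, or None if its kicker marks it irrelevant."""
    kicker = article.get("kicker")
    paragraphs = []
    for item in article.get("contents") or []:
        if not item:  # skip None entries
            continue
        kind, content = item.get("type"), item.get("content")
        if kind == "kicker":  # assumed layout: the kicker is stored as a content item
            kicker = content
        elif kind in DROPPED_TYPES:
            continue
        elif isinstance(content, str):
            text = strip_html(content)
            if text:  # skip empty content
                paragraphs.append(text)
    if kicker in IRRELEVANT_KICKERS:
        return None
    # 'type' and 'source' are intentionally left out of the result.
    return {
        "id": article.get("id"),
        "url": article.get("url"),
        "title": article.get("title"),
        "kicker": kicker,
        "author": article.get("author"),
        "date": article.get("published_date"),
        "contents": "\n".join(paragraphs),  # one long plain-text string
    }
```

Articles for which `clean_article` returns `None` are the ones dropped by the kicker filter.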
- insert : id + other parts
- topics : BeautifulSoup --> id + num
- source article : id in ES
- search : title and 10 keywords (Rake), each as a separate query
- sort : score normalization --> weighted (weights from Rake) sum
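
The sort step above might look like the sketch below. The notes do not say which normalization is used or how the title query is weighted, so this assumes each query's scores are divided by that query's top score and that the caller supplies every weight. The function names and the optional `source_id` exclusion are hypothetical.

```python
from collections import defaultdict


def normalize(hits):
    """Scale one query's scores into [0, 1] by dividing by its top score (assumed scheme)."""
    top = max(hits.values(), default=0.0)
    if top <= 0:
        return {doc_id: 0.0 for doc_id in hits}
    return {doc_id: score / top for doc_id, score in hits.items()}


def fuse(weighted_hits, source_id=None):
    """Combine per-query results into one ranking.

    weighted_hits: list of (weight, {doc_id: raw_score}) pairs, one per query
    (the title query plus one per keyword, weighted by its Rake score).
    source_id: optional id of the source article, left out of the ranking.
    """
    totals = defaultdict(float)
    for weight, hits in weighted_hits:
        for doc_id, score in normalize(hits).items():
            if doc_id != source_id:
                totals[doc_id] += weight * score
    # Highest total first; ties broken by id so the order is deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalizing per query keeps a long title query with large raw scores from dominating the short keyword queries before the weights are applied.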
nearly irrelevant
- relevant : 2018 relevance judgments, labels : 0-16
- irrelevant : add 10,000 random samples, labels : -1
- id --> text
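
A minimal sketch of how the labeled set could be assembled, under a few assumptions: the judgments are available as `(topic, doc_id, label)` tuples, `texts` maps each article id to its cleaned plain text, and negatives are drawn only from unjudged articles. The notes do not say how negatives are paired with topics, so the sketch leaves their topic as `None`. The function name `build_examples` is hypothetical.

```python
import random


def build_examples(qrels, texts, n_negative=10000, seed=0):
    """Merge judged documents with randomly sampled negatives and attach text.

    qrels: list of (topic, doc_id, label) tuples from the relevance judgments.
    texts: dict mapping doc_id to plain text (the id --> text lookup).
    """
    judged_ids = {doc_id for _, doc_id, _ in qrels}
    # Judged documents keep their original label; ids without text are skipped.
    examples = [
        {"topic": topic, "id": doc_id, "label": label, "text": texts[doc_id]}
        for topic, doc_id, label in qrels
        if doc_id in texts
    ]
    # Sample negatives from articles that were never judged (assumed policy).
    pool = sorted(doc_id for doc_id in texts if doc_id not in judged_ids)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    for doc_id in rng.sample(pool, min(n_negative, len(pool))):
        examples.append(
            {"topic": None, "id": doc_id, "label": -1, "text": texts[doc_id]}
        )
    return examples
```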