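Introduction

News event mining. This project collects publicly available news data, clusters articles that describe the same event, and generates information about each event.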

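Collaboration notes

Please keep the four files DataLoader.py, Text2Vector.py, Cluster.py, and EventExtractor.py as concise as possible. Do not implement concrete algorithms in them; implement those elsewhere, then import and call them from these files. For example, EventExtractor.py extracts event information from the clustering results. It currently provides a ToyExtractor whose implementation lives in the Extractor folder; EventExtractor.py only calls into it.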
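
The "thin entry file" convention above can be sketched as follows. This is only an illustration: the `extract` method, the `extract_events` function, the module path in the commented import, and the behavior of the toy algorithm are all assumptions, not the repository's actual code. The class is defined inline so the sketch runs on its own; in the repository it would live under the Extractor folder.

```python
# Hypothetical stand-in for a class that would live in the Extractor folder.
class ToyExtractor:
    """Placeholder algorithm: takes the first article's title as the event title."""

    def extract(self, cluster):
        # cluster: list of dicts, each with at least "news_id" and "title"
        return {
            "title": cluster[0]["title"],
            "newsid": " ".join(str(news["news_id"]) for news in cluster),
        }


# --- what EventExtractor.py itself might contain (assumed) ---
# from Extractor.ToyExtractor import ToyExtractor   # assumed module path

def extract_events(clusters, extractor=None):
    """Delegate every cluster to the extractor; no algorithm logic lives here."""
    extractor = extractor or ToyExtractor()
    return {label: extractor.extract(news) for label, news in clusters.items()}
```

Swapping in a different algorithm then only means passing another object with an `extract` method; the entry file itself stays unchanged.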

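Database

There are currently two tables: the raw news table (news) stores the original news articles, and the event table (event) stores the event information produced by cluster analysis.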

Raw news table (news)

Table structure:

Field	Type	Null	Key	Extra
news_id	int(10) unsigned	NO	PRI	auto_increment
source	varchar(1000)	YES
author	varchar(1000)	YES
title	varchar(1000)	YES
queryKeyWord	varchar(100)	YES
description	varchar(2000)	YES
url	varchar(1000)	YES
urlToImage	varchar(1000)	YES
publishedAt	datetime	YES
content	text	YES

Field descriptions:

Field	Description	Example
news_id	Auto-incremented primary key	106511
source	The display name of the source this article came from	"The New York Times"
author	The author of the article	"Michael Levenson"
title	The headline or title of the article
queryKeyWord	Keywords or phrases to search for in the article title and body	"Donald Trump"
description	A description or snippet from the article
url	The direct URL to the article
urlToImage	The URL to a relevant image for the article
publishedAt	The date and time the article was published	2019-12-17 11:26:36
content	The unformatted content of the article. This is truncated to 260 chars for Developer plan users
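
As a rough illustration of the news table, the sketch below creates an equivalent table and inserts one row using the example values listed above. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in for the MySQL database so that it runs without a server; the `insert_news` helper is hypothetical, and the SQLite `INTEGER PRIMARY KEY AUTOINCREMENT` column only approximates `int(10) unsigned auto_increment`.

```python
import sqlite3

# Assumption: SQLite replaces MySQL here purely so the example is runnable.
# Column names follow the table structure above.
NEWS_DDL = """
CREATE TABLE news (
    news_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    source       VARCHAR(1000),
    author       VARCHAR(1000),
    title        VARCHAR(1000),
    queryKeyWord VARCHAR(100),
    description  VARCHAR(2000),
    url          VARCHAR(1000),
    urlToImage   VARCHAR(1000),
    publishedAt  DATETIME,
    content      TEXT
)
"""

NEWS_COLUMNS = ("source", "author", "title", "queryKeyWord", "description",
                "url", "urlToImage", "publishedAt", "content")


def insert_news(conn, row):
    """Insert one article (a dict); missing columns are stored as NULL."""
    placeholders = ", ".join("?" for _ in NEWS_COLUMNS)
    cur = conn.execute(
        f"INSERT INTO news ({', '.join(NEWS_COLUMNS)}) VALUES ({placeholders})",
        tuple(row.get(col) for col in NEWS_COLUMNS),
    )
    return cur.lastrowid  # the generated news_id


conn = sqlite3.connect(":memory:")
conn.execute(NEWS_DDL)
news_id = insert_news(conn, {
    "source": "The New York Times",
    "author": "Michael Levenson",
    "queryKeyWord": "Donald Trump",
    "publishedAt": "2019-12-17 11:26:36",
})
```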

Event table (event)

Table structure:

Field	Type	Null	Key
label	varchar(20)	NO	PRI
newsid	varchar(2000)	NO
title	varchar(1000)	YES
keyWord	varchar(100)	YES
time	datetime	YES
abstract	varchar(2000)	YES
content	text	YES

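Field descriptions:

Field	Description	Example
label	Cluster label / event ID
newsid	The news_id values of the articles in this event, separated by spaces	"106511 106522"
title	Event title
keyWord	Event keywords; multiple keywords are separated by |	"keyword1 | keyword2"
time	Time the event occurred	2019-12-17 11:26:36
abstract	Event summary
content	Detailed description of the event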
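
The newsid and keyWord columns pack lists into single strings. A minimal sketch of how those values might be encoded and decoded is shown below; the function names are hypothetical, and the spaces around `|` follow the example in the table rather than any confirmed convention.

```python
def encode_event_fields(news_ids, keywords):
    """Pack lists into the string forms described for newsid and keyWord."""
    newsid = " ".join(str(news_id) for news_id in news_ids)
    key_word = " | ".join(keywords)
    return newsid, key_word


def decode_event_fields(newsid, key_word):
    """Unpack the stored strings back into Python lists."""
    news_ids = [int(token) for token in newsid.split()]
    # Strip whitespace so "a|b" and "a | b" decode to the same keywords.
    keywords = [part.strip() for part in key_word.split("|") if part.strip()]
    return news_ids, keywords
```

Because newsid is a varchar(2000), a very large cluster could exceed the column length, so a caller may want to check the encoded string's length before writing it.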
