For now, the Sogou document collection is used.
- wordDistribution: key = term, value = number of texts that contain the term
- wordFrequency: key = text name k1, value = HashMap<String, Double> (key = term, value = occurrence count of the term in text k1)
- per-category counts: key = term, value = HashMap<String, Double> (key = category, value = occurrence count of the term in texts of that category)
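A minimal Java sketch of the three maps above as class fields (the class name CorpusStats and the field name categoryFrequency are assumptions; only wordDistribution and wordFrequency are named later in these notes):

```java
import java.util.HashMap;

public class CorpusStats {
    // term -> number of texts containing the term
    HashMap<String, Double> wordDistribution = new HashMap<>();
    // text name -> (term -> occurrence count of the term in that text)
    HashMap<String, HashMap<String, Double>> wordFrequency = new HashMap<>();
    // term -> (category -> occurrence count of the term in texts of that category);
    // the field name is hypothetical, the notes do not name this map
    HashMap<String, HashMap<String, Double>> categoryFrequency = new HashMap<>();
}
```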
### TF-IDF values of a single file
HashMap<String, Double> calculate(HashMap<String, Double> d0, HashMap<String, Double> v1, int sum)
d0 = wordDistribution
v1 = one value of wordFrequency (the term counts of a single text)
sum = number of files
Returns the TF-IDF values of this file.
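A sketch of calculate under the common weighting tf(term) * log(sum / df(term)); the TF normalisation and the log base are assumptions, and the project's actual formula may differ:

```java
import java.util.HashMap;
import java.util.Map;

// Intended as a method of the TF_IDF class described below.
static HashMap<String, Double> calculate(HashMap<String, Double> d0,  // term -> number of texts containing it
                                         HashMap<String, Double> v1,  // term -> count in this file
                                         int sum) {                   // number of files
    HashMap<String, Double> tfidf = new HashMap<>();
    double totalTerms = 0;                         // occurrences in this file, used to normalise TF
    for (double c : v1.values()) totalTerms += c;
    for (Map.Entry<String, Double> e : v1.entrySet()) {
        double tf = e.getValue() / totalTerms;
        double df = d0.getOrDefault(e.getKey(), 1.0);  // fall back to 1 to avoid dividing by zero
        tfidf.put(e.getKey(), tf * Math.log(sum / df));
    }
    return tfidf;
}
```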
TF_IDF(HashMap<String, Double> wordDistribution, HashMap<String, HashMap<String, Double>> wordFrequency)
this.wordDistribution = wordDistribution
this.wordFrequency = wordFrequency
for (each text) {
    deduplicate the terms and keep the top 20
}
return tfidfHashMap
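A sketch of how the pieces above could fit together in the TF_IDF class; run() and the intermediate names are illustrative, the "deduplicate, keep the top 20" step is read as keeping the 20 highest-scoring distinct terms of each text, and calculate repeats the formula sketched earlier:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TF_IDF {
    private final HashMap<String, Double> wordDistribution;
    private final HashMap<String, HashMap<String, Double>> wordFrequency;

    TF_IDF(HashMap<String, Double> wordDistribution,
           HashMap<String, HashMap<String, Double>> wordFrequency) {
        this.wordDistribution = wordDistribution;
        this.wordFrequency = wordFrequency;
    }

    // TF-IDF for one text; same formula as the calculate sketch above
    private static HashMap<String, Double> calculate(HashMap<String, Double> d0,
                                                     HashMap<String, Double> v1, int sum) {
        HashMap<String, Double> tfidf = new HashMap<>();
        double total = 0;
        for (double c : v1.values()) total += c;
        for (Map.Entry<String, Double> e : v1.entrySet()) {
            double df = d0.getOrDefault(e.getKey(), 1.0);
            tfidf.put(e.getKey(), (e.getValue() / total) * Math.log(sum / df));
        }
        return tfidf;
    }

    // text name -> (term -> TF-IDF), restricted to that text's 20 highest-scoring terms
    HashMap<String, HashMap<String, Double>> run() {
        HashMap<String, HashMap<String, Double>> tfidfHashMap = new HashMap<>();
        int sum = wordFrequency.size();  // number of files
        for (Map.Entry<String, HashMap<String, Double>> text : wordFrequency.entrySet()) {
            HashMap<String, Double> scores = calculate(wordDistribution, text.getValue(), sum);
            // sort by score and keep the 20 best terms of this text
            LinkedHashMap<String, Double> top20 = scores.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(20)
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                              (a, b) -> a, LinkedHashMap::new));
            tfidfHashMap.put(text.getKey(), top20);
        }
        return tfidfHashMap;
    }
}
```

With the two maps built from the corpus, new TF_IDF(wordDistribution, wordFrequency).run() would produce the per-text keyword map returned here as tfidfHashMap.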