Is the TFIDF function right?
adrlil opened this issue · 2 comments
adrlil commented
I can't see any TF part in TfIdf's candidate_weighting function. Am I wrong?
```python
import logging
import math

from pke.utils import load_document_frequency_file


def candidate_weighting(self, df=None):
    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')
    # initialize the number of documents as --NB_DOC-- + 1 (current)
    N = 1 + df.get('--NB_DOC--', 0)
    # loop through the candidates
    for k, v in self.candidates.items():
        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)
        # compute the idf score
        idf = math.log(N / candidate_df, 2)
        # add the idf score to the weights container
        self.weights[k] = len(v.surface_forms) * idf
```
boudinfl commented
Hi @adrlil
`v.surface_forms` is a list of the surface forms of the keyphrase candidate, so `len(v.surface_forms)` is the number of times that the candidate appears in the document, which is the TF. So `len(v.surface_forms) * idf` is TF * IDF.
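To make the TF part explicit, here is a minimal standalone sketch (not pke's actual code; the `tfidf_weights` helper, the toy `df` dictionary, and the candidate names are made up for illustration) that reproduces the same weighting:

```python
import math

def tfidf_weights(candidates, df):
    """candidates maps a candidate to its list of surface forms;
    df holds document frequencies plus the '--NB_DOC--' count."""
    N = 1 + df.get('--NB_DOC--', 0)           # total documents, +1 for the current one
    weights = {}
    for cand, surface_forms in candidates.items():
        tf = len(surface_forms)               # occurrences in the current document (TF)
        idf = math.log(N / (1 + df.get(cand, 0)), 2)
        weights[cand] = tf * idf              # TF * IDF
    return weights

# Toy example: 'neural network' occurs 3 times here and in 10 of 100 indexed documents.
df = {'--NB_DOC--': 100, 'neural network': 10}
candidates = {'neural network': ['neural network', 'neural networks', 'neural network']}
print(tfidf_weights(candidates, df))          # {'neural network': 3 * log2(101/11) ≈ 9.6}
```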
adrlil commented
Thanks for your answer!