Is the TFIDF function right?

Question

adrlil opened this issue 3 years ago · 2 comments

I can't see any TF (term frequency) part in the tfidf's candidate_weighting function. Am I wrong?

def candidate_weighting(self, df=None):

    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # initialize the number of documents as --NB_DOC-- + 1 (current)
    N = 1 + df.get('--NB_DOC--', 0)

    # loop through the candidates
    for k, v in self.candidates.items():

        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)

        # compute the idf score
        idf = math.log(N / candidate_df, 2)

        # add the idf score to the weights container
        self.weights[k] = len(v.surface_forms) * idf

Answer 1 · 2022-02-22T15:39:13.000Z

Hi @adrlil

v.surface_forms is a list of the surface forms of the keyphrase candidate, with one entry per occurrence, so len(v.surface_forms) is the number of times the candidate appears in the document, i.e. the term frequency (TF).

So, len(v.surface_forms) * idf is TF * idf.
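To make this concrete, here is a minimal standalone sketch of the same computation. It is not pke's code: `Candidate`, `tfidf_weights` and the toy data are hypothetical stand-ins, and it assumes surface_forms holds one entry per occurrence of the candidate, as described above.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a candidate object; only surface_forms is modelled.
Candidate = namedtuple("Candidate", ["surface_forms"])


def tfidf_weights(candidates, df):
    """Mirror the weighting from the question: TF * IDF per candidate."""
    # number of documents: --NB_DOC-- + 1 (the current document)
    n_docs = 1 + df.get("--NB_DOC--", 0)
    weights = {}
    for key, cand in candidates.items():
        # TF: one surface form is stored per occurrence in the document
        tf = len(cand.surface_forms)
        # document frequency, +1 for the current document
        candidate_df = 1 + df.get(key, 0)
        # IDF in base 2, as in the original function
        idf = math.log(n_docs / candidate_df, 2)
        weights[key] = tf * idf
    return weights


# Toy data (made up for illustration only).
candidates = {
    "neural network": Candidate([
        ["neural", "network"],
        ["Neural", "networks"],
        ["neural", "network"],
    ]),
    "graph": Candidate([["graph"]]),
}
df = {"--NB_DOC--": 7, "neural network": 1, "graph": 3}

weights = tfidf_weights(candidates, df)
for key, weight in weights.items():
    print(key, round(weight, 3))
```

With this toy data, "neural network" has TF = 3 and IDF = log2(8 / 2) = 2, so its weight is about 6, while "graph" has TF = 1 and IDF = log2(8 / 4) = 1, so its weight is about 1.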

f.

Answer 2 · 2022-02-25T12:51:09.000Z

Thanks for your answer!