boudinfl/pke

Is the TFIDF function right?

adrlil opened this issue · 2 comments

I can't see any TF term in the tfidf model's candidate_weighting function. Am I wrong?

def candidate_weighting(self, df=None):

    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # initialize the number of documents as --NB_DOC-- + 1 (current)
    N = 1 + df.get('--NB_DOC--', 0)

    # loop through the candidates
    for k, v in self.candidates.items():

        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)

        # compute the idf score
        idf = math.log(N / candidate_df, 2)

        # add the idf score to the weights container
        self.weights[k] = len(v.surface_forms) * idf

Hi @adrlil

v.surface_forms is a list of the surface forms of the keyphrase candidate, with one entry per occurrence. So len(v.surface_forms) is the number of times the candidate appears in the document, which is the TF.

So, len(v.surface_forms) * idf is TF * idf.
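
To make the TF part explicit, here is a minimal standalone sketch of the same computation. It is not pke's code: the function name tfidf_weights, the plain dict of candidates, and the sample counts are all made up for illustration, and it assumes Python 3 (true division).

```python
import math


def tfidf_weights(candidates, df):
    """Weight each candidate by TF * IDF (hypothetical helper, not part of pke).

    candidates: dict mapping a candidate key to its list of surface forms,
                one entry per occurrence in the document.
    df: dict of document frequencies, with the collection size stored
        under the '--NB_DOC--' key.
    """
    # number of documents in the collection, plus the current one
    n = 1 + df.get('--NB_DOC--', 0)

    weights = {}
    for key, surface_forms in candidates.items():
        # TF: one surface form is stored per occurrence
        tf = len(surface_forms)

        # document frequency, plus one for the current document
        candidate_df = 1 + df.get(key, 0)

        # IDF in base 2, as in candidate_weighting
        idf = math.log(n / candidate_df, 2)

        weights[key] = tf * idf
    return weights


# made-up example: one candidate occurs three times, the other once
candidates = {
    'keyphrase extraction': [
        ['keyphrase', 'extraction'],
        ['Keyphrase', 'extraction'],
        ['keyphrase', 'extractions'],
    ],
    'document': [['document']],
}
df = {'--NB_DOC--': 7, 'keyphrase extraction': 1, 'document': 3}

weights = tfidf_weights(candidates, df)
```

With these made-up counts, N is 8, so 'keyphrase extraction' gets 3 * log2(8 / 2) = 6 and 'document' gets 1 * log2(8 / 4) = 1. The factor of 3 is the TF, and it comes only from the length of the surface form list.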

f.

Thanks for your answer!