Problem in KP-Miner candidate weighting

Question

Closed this issue 4 years ago · 4 comments

In the KP-Miner method, when candidate_weighting is computed, all candidates are multiplied by a fixed number (the boosting factor B). Shouldn't this factor be calculated for unigram phrases only, and df computed for all candidates?

Answer 1 · 2020-04-03T13:43:54.000Z

Hi, are you saying this because of this sentence in the paper (2.2, p. 191)?

So, a boosting factor is needed for compound terms in order to balance this bias towards single terms.

Are you saying that, because the "boosting factor is needed for compound terms", it should only be applied to compound terms and not to single words?

Answer 2 · 2020-04-09T13:09:36.000Z

Yes, exactly, because in your KP-Miner implementation the boosting factor (B) is multiplied into every candidate phrase, so it has no meaningful effect on the candidate weighting:

    # compute the boosting factor
    B = min(N_d / (P_d * alpha), sigma)

    # loop through the candidates
    for k, v in self.candidates.items():

        # get candidate document frequency
        candidate_df = 1

        # get the df for unigram only
        if len(v.lexical_form) == 1:
            candidate_df += df.get(k, 0)

        # compute the idf score
        idf = math.log(N / candidate_df, 2)

        self.weights[k] = len(v.surface_forms) * B * idf
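
To illustrate the point, here is a minimal sketch with made-up numbers (the candidate names, the tf * idf scores, and the value of B are all hypothetical). A factor shared by every candidate rescales the weights but cannot change their order, whereas boosting only multi-word candidates can:

```python
# Hypothetical tf * idf scores for three candidates (illustrative numbers only).
base = {"keyphrase extraction": 6.0, "graph": 9.0, "ranking model": 4.5}
B = 2.3  # assumed boosting factor, identical for every candidate


def rank(weights):
    """Return the candidates sorted by decreasing weight."""
    return sorted(weights, key=weights.get, reverse=True)


# B applied to every candidate, as in the snippet above
uniform = {k: w * B for k, w in base.items()}

# B applied to multi-word candidates only
compound_only = {k: (w * B if " " in k else w) for k, w in base.items()}

print(rank(base) == rank(uniform))  # the ranking is unchanged by a uniform factor
print(rank(compound_only))          # the multi-word candidates move ahead of "graph"
```

Since keyphrases are selected by rank, only the second variant can affect which candidates are returned.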

Answer 3 · 2020-04-28T22:32:11.000Z

Yes, I agree, but the article also states that:

the following equation is used to calculate the weight of candidate keyphrases whether single or compound: w_ij = tf_ij * idf * B_i * P_f

This contradicts the previous statement:

So, a boosting factor is needed for compound terms in order to balance this bias towards single terms.

I evaluated the current implementation ("actual" in the results) and the following modified implementation.

if len(v.lexical_form) == 1:
    self.weights[k] = len(v.surface_forms) * idf
else:
    self.weights[k] = len(v.surface_forms) * B * idf
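
For readers who want to try the modified weighting outside the library, below is a rough standalone sketch. It is not the library code: the function name `kpminer_weights`, the `(n_words, tf)` input format, the default values of `alpha` and `sigma`, the way `N_d` and `P_d` are counted, and the guard for documents without multi-word candidates are all assumptions made for illustration.

```python
import math


def kpminer_weights(candidates, df, n_docs, alpha=2.3, sigma=3.0):
    """Weight candidates, boosting multi-word candidates only.

    candidates: {phrase: (n_words, tf)}  (hypothetical input format)
    df:         {phrase: document frequency}
    n_docs:     number of documents N used for the idf
    """
    # Assumption: N_d counts all candidate occurrences,
    # P_d counts occurrences of multi-word candidates.
    n_d = sum(tf for _, tf in candidates.values())
    p_d = sum(tf for n_words, tf in candidates.values() if n_words > 1)

    # Boosting factor; the fallback for p_d == 0 is an assumption of this sketch.
    boost = min(n_d / (p_d * alpha), sigma) if p_d else sigma

    weights = {}
    for phrase, (n_words, tf) in candidates.items():
        # df is looked up for unigrams only, as in the snippet above
        candidate_df = 1 + (df.get(phrase, 0) if n_words == 1 else 0)
        idf = math.log(n_docs / candidate_df, 2)
        # single words get tf * idf, compound terms get tf * B * idf
        weights[phrase] = tf * idf if n_words == 1 else tf * boost * idf
    return weights


# Toy input: one unigram and one bigram
w = kpminer_weights(
    {"graph": (1, 4), "keyphrase extraction": (2, 2)},
    df={"graph": 3},
    n_docs=8,
)
print(w)
```

In this toy input the unigram keeps its plain tf * idf score, while the bigram is additionally multiplied by B = min(6 / (2 * 2.3), 3.0).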

The evaluation is performed on the SemEval-2010 test set (100 documents) against the combined (reader + author) reference. Every keyphrase is stemmed for evaluation.

Method    P@15  R@15  F@15
actual    21.1  22.0  21.4
modified  23.3  24.0  23.4
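
The evaluation script itself is not shown in this thread. As a rough sketch of how P@k, R@k and F@k are commonly computed, assuming exact matching on already-stemmed keyphrases and macro-averaging over documents (both are assumptions, and the toy documents are invented):

```python
def prf_at_k(predicted, reference, k=15):
    """Precision, recall and F1 of the top-k predictions for one document."""
    top = predicted[:k]
    hits = len(set(top) & set(reference))
    p = hits / len(top) if top else 0.0
    r = hits / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Two hypothetical documents: (ranked predictions, reference keyphrases)
docs = [
    (["a", "b", "c", "d"], ["a", "b"]),
    (["x", "y", "z", "w"], ["x", "q", "r", "s"]),
]

scores = [prf_at_k(pred, ref, k=4) for pred, ref in docs]

# Macro-average: mean of the per-document scores
macro_p, macro_r, macro_f = (sum(col) / len(scores) for col in zip(*scores))
print(macro_p, macro_r, macro_f)
```

Under macro-averaging, the averaged F score need not equal the harmonic mean of the averaged precision and recall.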

The modified implementation (not applying the boosting factor to single-word keyphrases) yields better results.
I'll make a commit to change that. Thanks for your input.

Answer 4 · 2020-04-29T07:59:42.000Z

Fixed in #128