boudinfl/pke

Enriching current DF with additional documents instead of replacing it

MeTaNoV opened this issue · 3 comments

compute_document_frequency generates a new DF from scratch, but is it possible to enrich an existing DF with additional documents instead? (if that makes sense...)

ygorg commented

Hi, this is possible with a new function that combines compute_document_frequency and load_document_frequency_file.
I give such a function below (please note that it is untested, but it conveys the general idea).

Though I don't know what your use case is. The goal of this library is to make keyphrase extraction methods presented in scientific papers available for comparison and evaluation purposes.
Merging document frequencies results in a new dataset, so it is no longer comparable to previous works (unless the goal is to create a new method). If your purpose is to get new words into the DF and refine the weighting scheme, be mindful of which documents you use (for example, that they come from the same domain, if that matters for your application).

Code
import glob
import gzip
import logging
import os
import sys
from collections import defaultdict

from pke.base import LoadFile
from pke.utils import load_document_frequency_file


def enrich_document_frequency(
        input_dir, output_file, previous_df_file, extension='xml',
        language='en', normalization="stemming", stoplist=None,
        delimiter='\t', n=3, max_length=None, encoding=None):
    """Compute the n-gram document frequencies from a set of input
    documents and add them to a previously computed document
    frequency file. An extra row is added to the output file
    specifying the number of documents from which the document
    frequencies were computed (--NB_DOC-- tab XXX). The output file
    is compressed using gzip.

    Args:
        input_dir (str): the input directory.
        output_file (str): the output file.
        previous_df_file (str): the document frequency file to enrich.
        extension (str): file extension for input documents, defaults
            to xml.
        language (str): language of the input documents (used for
            computing the n-stem or n-lemma forms), defaults to 'en'
            (english).
        normalization (str): word normalization method, defaults to
            'stemming'. Other possible values are 'lemmatization' or
            'None' for using word surface forms instead of
            stems/lemmas.
        stoplist (list): the stop words for filtering n-grams,
            defaults to None.
        delimiter (str): the delimiter between n-grams and document
            frequencies, defaults to tabulation (\t).
        n (int): the size of the n-grams, defaults to 3.
        max_length (int): maximum length in characters of the input
            documents, defaults to None.
        encoding (str): encoding of files in input_dir, defaults to
            None.
    """

    # load the previous document frequencies; wrap them in a
    # defaultdict so that n-grams unseen so far start at zero
    frequencies = defaultdict(int, load_document_frequency_file(
        previous_df_file, delimiter=delimiter))

    # initialize the number of documents
    nb_documents = frequencies['--NB_DOC--']

    # remove this entry from frequencies so it is not written
    # twice at the end
    del frequencies['--NB_DOC--']

    # loop through the documents
    for input_file in glob.iglob(input_dir + os.sep + '*.' + extension):

        # initialize load file object
        doc = LoadFile()

        # read the input file
        doc.load_document(
            input=input_file, language=language,
            normalization=normalization, max_length=max_length,
            encoding=encoding)

        # candidate selection
        doc.ngram_selection(n=n)

        # filter candidates containing punctuation marks
        doc.candidate_filtering(stoplist=stoplist)

        # loop through the candidates
        for lexical_form in doc.candidates:
            frequencies[lexical_form] += 1

        nb_documents += 1

        if nb_documents % 1000 == 0:
            logging.info("{} docs, memory used: {} mb".format(
                nb_documents,
                sys.getsizeof(frequencies) / 1024 / 1024))

    # create directories from path if they do not exist
    if os.path.dirname(output_file):
        os.makedirs(os.path.dirname(output_file), exist_ok=True)

    # dump the df container
    with gzip.open(output_file, 'wt', encoding='utf-8') as f:

        # add the number of documents as a special token
        first_line = '--NB_DOC--' + delimiter + str(nb_documents)
        f.write(first_line + '\n')

        for ngram in frequencies:
            line = ngram + delimiter + str(frequencies[ngram])
            f.write(line + '\n')
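For reference, a hypothetical usage sketch (the file paths below are placeholders): the enriched file loads like any other DF file and can then be passed to a DF-based model such as TfIdf.

import pke
from pke.utils import load_document_frequency_file

# enrich an existing DF file with the documents in new_docs/
# (all paths here are example placeholders)
enrich_document_frequency(
    input_dir='new_docs/', output_file='enriched_df.tsv.gz',
    previous_df_file='df.tsv.gz', extension='xml')

# the enriched file loads like any other DF file...
df = load_document_frequency_file(input_file='enriched_df.tsv.gz')

# ...and plugs into a DF-based model such as TfIdf
extractor = pke.unsupervised.TfIdf()
extractor.load_document(input='some_document.xml', language='en')
extractor.candidate_selection(n=3)
extractor.candidate_weighting(df=df)
keyphrases = extractor.get_n_best(n=10)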

MeTaNoV commented

Thanks @ygorg for your quick reply. It is indeed to get new words into the DF and to refine the weighting. And yes, we will do that with our customers' content, each customer having their own domain.

ygorg commented

I'm closing this issue, please reopen it if you encounter a problem.