Enriching current DF with additional documents instead of replacing it
MeTaNoV opened this issue · 3 comments
compute_document_frequency generates a new DF, but is it possible to enrich the current DF with additional documents? (if it makes sense...)
Hi, this is possible by writing a new function that combines compute_document_frequency and load_document_frequency_file.
I give such a function below (please note that it is untested, but it conveys the general idea).
Though, may I ask what your use case is? The goal of this library is to make keyphrase extraction methods presented in scientific papers available for comparison and evaluation purposes.
Merging document frequencies would amount to creating a new dataset, so the results would no longer be comparable to previous works (unless the goal is to create a new method). If your purpose is to get new words into the DF and refine the weighting scheme, please be mindful of which documents you use (for example, that they come from the same domain, if that is important for your application).
Code
import glob
import gzip
import logging
import os
import sys
from collections import defaultdict

from pke.base import LoadFile
from pke.utils import load_document_frequency_file


def enrich_document_frequency(
        input_dir, output_file, previous_df_file, extension='xml',
        language='en', normalization="stemming", stoplist=None,
        delimiter='\t', n=3, max_length=None, encoding=None):
    """Compute the n-gram document frequencies from a set of input
    documents, and add them to a previously computed document
    frequency file. An extra row is added to the output file for
    specifying the number of documents from which the document
    frequencies were computed (--NB_DOC-- tab XXX). The output file
    is compressed using gzip.

    Args:
        input_dir (str): the input directory.
        output_file (str): the output file.
        previous_df_file (str): the previously computed document
            frequency file to enrich.
        extension (str): file extension for input documents, defaults
            to xml.
        language (str): language of the input documents (used for
            computing the n-stem or n-lemma forms), defaults to 'en'
            (english).
        normalization (str): word normalization method, defaults to
            'stemming'. Other possible values are 'lemmatization' or
            'None' for using word surface forms instead of
            stems/lemmas.
        stoplist (list): the stop words for filtering n-grams,
            defaults to None.
        delimiter (str): the delimiter between n-grams and document
            frequencies, defaults to tabulation (\t).
        n (int): the size of the n-grams, defaults to 3.
        max_length (int): maximum length in characters of the input
            documents, defaults to None.
        encoding (str): encoding of files in input_dir, defaults to
            None.
    """
    # document frequency container, seeded with the previous counts;
    # a defaultdict so that n-grams absent from the previous file
    # start at 0 instead of raising a KeyError
    frequencies = defaultdict(int, load_document_frequency_file(
        previous_df_file, delimiter=delimiter))

    # initialize the number of documents from the previous count
    nb_documents = frequencies['--NB_DOC--']

    # remove this entry from frequencies so it is not written
    # twice at the end
    del frequencies['--NB_DOC--']

    # loop through the documents
    for input_file in glob.iglob(input_dir + os.sep + '*.' + extension):

        # initialize load file object
        doc = LoadFile()

        # read the input file
        doc.load_document(input=input_file,
                          language=language,
                          normalization=normalization,
                          max_length=max_length,
                          encoding=encoding)

        # candidate selection
        doc.ngram_selection(n=n)

        # filter candidates containing punctuation marks
        doc.candidate_filtering(stoplist=stoplist)

        # loop through candidates
        for lexical_form in doc.candidates:
            frequencies[lexical_form] += 1

        nb_documents += 1
        if nb_documents % 1000 == 0:
            logging.info("{} docs, memory used: {} mb".format(
                nb_documents,
                sys.getsizeof(frequencies) / 1024 / 1024))

    # create directories from path if they do not exist
    if os.path.dirname(output_file):
        os.makedirs(os.path.dirname(output_file), exist_ok=True)

    # dump the df container
    with gzip.open(output_file, 'wt', encoding='utf-8') as f:

        # add the number of documents as a special token
        first_line = '--NB_DOC--' + delimiter + str(nb_documents)
        f.write(first_line + '\n')

        for ngram in frequencies:
            line = ngram + delimiter + str(frequencies[ngram])
            f.write(line + '\n')
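For reference, here is a rough usage sketch (also untested). All paths and file names are made up for illustration; the weighting part mirrors the library's usual TfIdf workflow, where the loaded DF dictionary is passed to candidate_weighting.

import pke

# hypothetical paths: enrich an existing DF file with new documents
enrich_document_frequency(input_dir='new_docs/',
                          output_file='df_enriched.tsv.gz',
                          previous_df_file='df.tsv.gz',
                          extension='txt',
                          language='en',
                          normalization='stemming')

# load the enriched counts and use them for keyphrase weighting
df = pke.load_document_frequency_file('df_enriched.tsv.gz')
extractor = pke.unsupervised.TfIdf()
extractor.load_document(input='some_document.txt', language='en')
extractor.candidate_selection(n=3)
extractor.candidate_weighting(df=df)
keyphrases = extractor.get_n_best(n=10)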
Thanks @ygorg for your quick reply. It is indeed to get new words into the DF and refine the weighting. And yes, we will do that with our customers' content, each customer having their own domain.
I'm closing this issue; please reopen it if you encounter a problem.