nicolay-r/AREkit

Pipelines -- Batching sentences in document parser [ARElight backlog]

nicolay-r opened this issue · 1 comments

This originates from the NER application (nicolay-r/ARElight#118).
The snippet below shows that we apply the text processing pipeline to each sentence separately (text_parser.run).
To improve document processing performance, the pipeline needs to accept a list of sentences instead of a single sentence, i.e. to support batching.

parsed_sentences = [text_parser.run(input_data=DocumentParser.__get_sent(doc, sent_ind).Text,
                                    params_dict=DocumentParser.__create_ppl_params(doc=doc, sent_ind=sent_ind),
                                    parent_ctx=parent_ppl_ctx)
                    for sent_ind in range(doc.SentencesCount)]
return ParsedDocument(doc_id=doc.ID,
                      parsed_sentences=parsed_sentences)
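
For comparison, here is a rough sketch of what the batched variant of the list comprehension above could look like. It is self-contained and does not use AREkit: `parse_sentences_batched` and the `run_batch` callable are hypothetical names introduced only for illustration, standing in for a pipeline entry point that accepts a list of sentence texts plus their per-sentence params and returns one result per sentence.

```python
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical signature (an assumption, not an AREkit API):
# takes a batch of texts and their params dicts, returns one result per text.
RunBatch = Callable[[Sequence[str], Sequence[Dict[str, Any]]], List[Any]]


def parse_sentences_batched(sentences: Sequence[str],
                            params_dicts: Sequence[Dict[str, Any]],
                            run_batch: RunBatch,
                            batch_size: int = 32) -> List[Any]:
    """Apply `run_batch` to the sentences chunk by chunk, keeping sentence order."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    if len(sentences) != len(params_dicts):
        raise ValueError("expected one params dict per sentence")

    parsed: List[Any] = []
    for start in range(0, len(sentences), batch_size):
        stop = start + batch_size
        batch = sentences[start:stop]
        results = run_batch(batch, params_dicts[start:stop])
        # Guard against a pipeline that drops or merges sentences.
        if len(results) != len(batch):
            raise ValueError("run_batch must return one result per sentence")
        parsed.extend(results)
    return parsed


if __name__ == "__main__":
    # Toy stand-in for the pipeline: whitespace tokenization of each sentence.
    def toy_run_batch(texts, params):
        return [t.split() for t in texts]

    sents = ["a b", "c d e", "f"]
    params = [{"sent_ind": i} for i in range(len(sents))]
    print(parse_sentences_batched(sents, params, toy_run_batch, batch_size=2))
```

The per-sentence params are passed alongside the texts so that anything currently built by `__create_ppl_params` could still be supplied for every sentence. Whether the real pipeline would take them as a parallel list or in some other form is left open here.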

Proposal for the pipeline core refactoring:

[image]