Pipelines -- Batching sentences in document parser [ARElight backlog]

This originates from the NER application (nicolay-r/ARElight#118).
The snippet below shows that the text processing pipeline is currently applied to each sentence separately (text_parser.run).
To improve document processing performance, the pipeline needs to accept a list of sentences instead of a single sentence. In other words, it needs to support batching.
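To make the idea concrete, here is a minimal sketch of what batching means in this context. It is not AREkit code: `iter_batches`, `run_batched`, and `toy_tokenize_batch` are hypothetical names, and the toy pipeline only splits on whitespace. The point is that the pipeline is called once per batch of sentences rather than once per sentence.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batched(sentences: Sequence[str],
                run_batch: Callable[[List[str]], List[List[str]]],
                batch_size: int) -> List[List[str]]:
    """Apply `run_batch` to whole batches and keep the results in sentence order."""
    results: List[List[str]] = []
    for batch in iter_batches(sentences, batch_size):
        output = run_batch(batch)
        # A batched pipeline must return exactly one result per input sentence.
        if len(output) != len(batch):
            raise ValueError("pipeline returned a different number of results")
        results.extend(output)
    return results


def toy_tokenize_batch(batch: List[str]) -> List[List[str]]:
    """Hypothetical stand-in for a batched pipeline: whitespace tokenization."""
    return [sentence.split() for sentence in batch]


sentences = ["Alice met Bob .", "They talked .", "Then they left ."]
parsed = run_batched(sentences, toy_tokenize_batch, batch_size=2)
```

With `batch_size=2`, the toy pipeline is invoked twice (two sentences, then one) instead of three times. For a model-based NER step, this is where the speed-up would be expected to come from.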

AREkit/arekit/common/docs/parser.py

Lines 19 to 25 in 4c577cb

    
    parsed_sentences = [text_parser.run(input_data=DocumentParser.__get_sent(doc, sent_ind).Text,
                                        params_dict=DocumentParser.__create_ppl_params(doc=doc, sent_ind=sent_ind),
                                        parent_ctx=parent_ppl_ctx)
                        for sent_ind in range(doc.SentencesCount)]

    return ParsedDocument(doc_id=doc.ID,
                          parsed_sentences=parsed_sentences)

❌ These parameters could be removed:

AREkit/arekit/common/docs/parser.py

Lines 31 to 32 in 4c577cb

    "s_ind": sent_ind,                                     # sentence index. (as Metadata)
    "doc_id": doc.ID,                                      # document index. (as Metadata)

The following is actually required, since it refers to the related parameter in the context:

AREkit/arekit/contrib/source/brat/entities/parser.py

Line 10 in 4c577cb

KEY = "sentence"
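Taken together, the points above suggest roughly the following shape for a batched document parser. This is only a sketch under stated assumptions, not the actual AREkit API: `Sentence`, `Document`, `ParsedDocument`, `parse_document`, and `whitespace_runner` are simplified stand-ins, and the batched runner signature is hypothetical. It assumes that the per-sentence parameters shrink to the single "sentence" entry once "s_ind" and "doc_id" are dropped.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SENTENCE_KEY = "sentence"  # mirrors KEY = "sentence" quoted above


@dataclass
class Sentence:  # hypothetical stand-in for a document sentence
    Text: str


@dataclass
class Document:  # hypothetical stand-in for the input document
    ID: int
    sentences: List[Sentence]


@dataclass
class ParsedDocument:  # simplified stand-in, same keyword arguments as in the issue
    doc_id: int
    parsed_sentences: List[List[str]]


# Assumed batched signature: all texts plus one params dict per sentence.
BatchRunner = Callable[[List[str], List[Dict[str, Sentence]]], List[List[str]]]


def parse_document(doc: Document, run_batch: BatchRunner) -> ParsedDocument:
    """Collect every sentence first, then call the pipeline once for the whole list."""
    input_data = [sentence.Text for sentence in doc.sentences]
    # Only the "sentence" entry is kept; "s_ind" and "doc_id" are omitted.
    params = [{SENTENCE_KEY: sentence} for sentence in doc.sentences]
    parsed = run_batch(input_data, params)
    return ParsedDocument(doc_id=doc.ID, parsed_sentences=parsed)


def whitespace_runner(texts: List[str],
                      params: List[Dict[str, Sentence]]) -> List[List[str]]:
    """Toy batched pipeline used only to exercise the sketch."""
    if len(texts) != len(params):
        raise ValueError("expected one params dict per sentence")
    return [text.split() for text in texts]


doc = Document(ID=7, sentences=[Sentence("Alice met Bob ."), Sentence("They talked .")])
result = parse_document(doc, whitespace_runner)
```

Whether the parameters should be a list of dicts or a single dict holding a list of sentences is an open design choice. The list-of-dicts form is used here only because it stays closest to the current per-sentence call.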

Proposal for the pipeline core refactoring:

	parsed_sentences = [text_parser.run(input_data=DocumentParser.__get_sent(doc, sent_ind).Text,
	                                    params_dict=DocumentParser.__create_ppl_params(doc=doc, sent_ind=sent_ind),
	                                    parent_ctx=parent_ppl_ctx)
	                    for sent_ind in range(doc.SentencesCount)]

	return ParsedDocument(doc_id=doc.ID,
	                      parsed_sentences=parsed_sentences)

	"s_ind": sent_ind,    # sentence index. (as Metadata)
	"doc_id": doc.ID,     # document index. (as Metadata)