A Compare-and-contrast Multistage Pipeline for Uncovering Financial Signals in Financial Reports

This is a temporary anonymous repository for double-blind review.

We release our FINAL (FINancial-ALpha) dataset, including the pseudo-labeled training data and the human-annotated labels.


FINAL_v1.0 Dataset

The data used in this paper include one pseudo-labeled training dataset and two sets of evaluation data.

The parsed dataset is based on the Software Repository for Accounting and Finance, whose source contents are officially released by SEC/EDGAR.

  • Data definition and statistics

| Split | Type | Description | Number of Pairs |
|-------|------|-------------|-----------------|
| Train | Revised | Pseudo-label | 30,000 |
| Eval | Revised $\mathcal{T}^{\alpha}_1$ | Human annotation | 200 |
| Eval | Mismatched $\mathcal{T}^{\alpha}_2$ | Human annotation | 200 |
  • Data example. We use the 'jsonl' format: each line in a file is one instance, which is parsed into a 'dict' object (a minimal loading sketch follows the example below).

| Key | Contents | Description |
|-----|----------|-------------|
| sentA | raw text (string) | The reference segment in a report. |
| sentB | raw text (string) | The target segment in a report. |
| wordsA | a list of strings | Split tokens of sentA. |
| wordsB | a list of strings | Split tokens of sentB. |
| words | a list of strings | Split tokens of sentA and sentB, separated by `<tag>`. |
| labels | a list of labels (binary) | Human annotation: the final binary label is based on the agreement of the annotators. |
| probs | a list of labels (float) | Human annotation: the final fine-grained label is the average of the annotated binary labels. |
| keywordsA | a list of strings | The annotated tokens of sentA. |
| keywordsB | a list of strings | The annotated tokens of sentB. |
{
    "sentA": "Net loss for fiscal year 2014 was $836 thousand ...", 
    "sentB": "Net income for fiscal year 2015 was $364 thousand ...",
    "type": 1, 
    "words": ["<tag1>", "Net", "loss", "for", "fiscal", "year", "2014", "was", "$836", "thousand", ..., ".", "<tag2>", "Net", "income", "for", "fiscal", "year", "2015", "was", "$364", "thousand", ..., ".", "<tag3>"], 
    "wordsA": ["Net", "loss", "for", "fiscal", "year", "2014", "was", "$836", "thousand", ..., "."], 
    "wordsB": ["Net", "income", "for", "fiscal", "year", "2015", "was", "$364", "thousand", ..., "."], 
    "keywordsA": [], 
    "keywordsB": ["Net", "income", "$364", "thousand", "increase", "of", "$1.2", "million"], 
    "labels": [-1, 0, 0, 0, , -1, 1, 1, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 3, 3, 0, -1], 
    "probs": [-1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.3333333333333333, 0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.3333333333333333, 1.0, 1.0, 0.0, -1.0]
}
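
Below is a minimal loading sketch in Python (not part of the released code). The file name is a placeholder, and the sketch assumes that `labels` and `probs` align one-to-one with `words`.

```python
import json

def load_instances(path):
    """Yield one dict per non-empty line of a FINAL jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Placeholder file name; use the actual split released in this repository.
for instance in load_instances("final_v1.0_eval_revised.jsonl"):
    # `words` holds the tagged concatenation of sentA and sentB;
    # `labels`/`probs` are assumed to align with `words` token by token.
    for token, label, prob in zip(instance["words"],
                                  instance["labels"],
                                  instance["probs"]):
        if label > 0:  # positive labels mark highlighted tokens in the example above
            print(token, label, prob)
    break  # inspect only the first instance
```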

Financial signal highlighting

Formally, we tackle the financial signal highlighting task. At the document level, we adopt a multistage pipeline.

| Phase | Description | Summary |
|-------|-------------|---------|
| S_0 | Document segmentation | Use Cross-seg BERT to segment the document (i.e., aggregate sentences into segments). |
| S_1 | Relation recognition | Use ROUGE and SBERT cosine scores to identify the relationship of each segment pair (see the sketch after this table). |
| S_2 & S_2+ | In-domain/out-domain fine-tuning | Two-stage domain-adaptive training using the out-of-domain e-SNLI dataset and pseudo-labeled pairs with "revised" relations. |
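
The following is a minimal sketch of the S_1 scoring step, assuming the lexical score is ROUGE-L and the semantic score is SBERT cosine similarity. The SBERT checkpoint name and the thresholds are illustrative placeholders, not the values used in the paper.

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def relation_scores(seg_a: str, seg_b: str) -> dict:
    """Return a lexical (ROUGE-L) and a semantic (SBERT cosine) score for one segment pair."""
    lexical = rouge.score(seg_a, seg_b)["rougeL"].fmeasure
    emb = sbert.encode([seg_a, seg_b], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return {"rougeL": lexical, "sbert_cos": semantic}

def coarse_relation(scores: dict, lex_th: float = 0.5, sem_th: float = 0.7) -> str:
    """Illustrative rule only: high overlap and similarity -> 'revised', low similarity -> 'mismatched'."""
    if scores["rougeL"] >= lex_th and scores["sbert_cos"] >= sem_th:
        return "revised"
    if scores["sbert_cos"] < sem_th:
        return "mismatched"
    return "other"

pair_scores = relation_scores(
    "Net loss for fiscal year 2014 was $836 thousand.",
    "Net income for fiscal year 2015 was $364 thousand.",
)
print(pair_scores, coarse_relation(pair_scores))
```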
  1. Document Segmentation

TBD

  2. Segments Alignment

TBD

  3. Sentence Highlighting: see highlighting for details; a minimal inference sketch is given below.
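
As a reference point only, here is a minimal sketch of token-level highlighting inference, assuming the highlighter is a standard Transformer token-classification model. The checkpoint name is a placeholder; the released highlighting code may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "bert-base-uncased"  # placeholder; not the fine-tuned highlighter
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

def highlight(sent_a: str, sent_b: str):
    """Return (token, highlight_probability) pairs for a sentA/sentB pair."""
    enc = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits            # shape: (1, seq_len, 2)
    probs = logits.softmax(dim=-1)[0, :, 1]     # probability of the "highlight" class
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, probs.tolist()))

for token, p in highlight(
    "Net loss for fiscal year 2014 was $836 thousand.",
    "Net income for fiscal year 2015 was $364 thousand.",
):
    print(f"{token}\t{p:.3f}")
```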