Unstructured-IO/pipeline-sec-filings

Improve Narrative Section Extraction

cragwolfe opened this issue · 0 comments

Right now, get_section_narrative uses a SECSection regex to identify the section in the table of contents (TOC), and that generally works pretty well. The matching TOC title text is then used to look for the section in the document content. At that step, however, 10-Ks and 10-Qs do not stick with the original regex: a more lenient match condition, match_10k_toc_title_to_section, is ultimately used instead. The better approach is likely to reuse the original matching regex for the content match as well.

The lenient post-TOC match is why the EHC test fails for the BUSINESS section, and may be the reason for other failures as well.
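To illustrate the kind of failure described above, here is a minimal, self-contained sketch. It does not use the repository's code: BUSINESS_PATTERN, lenient_match, strict_match, and find_section_start are hypothetical stand-ins, the lenient rule shown (a case-insensitive prefix match) is only an assumption about how a looser condition might behave, and the sample lines are invented.

```python
import re

# Hypothetical stand-in for a SECSection-style pattern (not the repo's actual regex).
BUSINESS_PATTERN = re.compile(r"^(item\s+1\.?\s*)?business\.?$", re.IGNORECASE)


def lenient_match(toc_title: str, line: str) -> bool:
    """Assumed lenient rule: the line merely starts with the TOC title."""
    return line.strip().lower().startswith(toc_title.strip().lower())


def strict_match(pattern: re.Pattern, line: str) -> bool:
    """Strict rule: the whole line must satisfy the original section regex."""
    return pattern.match(line.strip()) is not None


def find_section_start(lines, is_match, start=0):
    """Return the index of the first line at or after `start` that matches."""
    for i in range(start, len(lines)):
        if is_match(lines[i]):
            return i
    return None


# Invented sample filing text, one element per line.
lines = [
    "Table of Contents",
    "Business",                                        # TOC entry
    "Risk Factors",
    "Business combinations are discussed in Note 2.",  # not a heading
    "Item 1. Business",                                # real section heading
    "The company operates in two segments.",
]

# Step 1: locate the TOC entry with the section regex.
toc_index = find_section_start(lines, lambda l: strict_match(BUSINESS_PATTERN, l))
toc_title = lines[toc_index]

# Step 2a: lenient post-TOC match stops at the first line starting with the title.
lenient_start = find_section_start(
    lines, lambda l: lenient_match(toc_title, l), start=toc_index + 1
)

# Step 2b: reusing the original regex skips the prose line and finds the heading.
strict_start = find_section_start(
    lines, lambda l: strict_match(BUSINESS_PATTERN, l), start=toc_index + 1
)

print(lenient_start, strict_start)
```

In this toy input the lenient match lands on the prose sentence that happens to begin with the TOC title, while reusing the regex lands on the actual heading. Whether the EHC failure has this exact shape would need to be confirmed against the real filing.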

Definition of Done

  • Section extraction logic is updated so that fewer tests are marked as xfail, in particular the EHC case mentioned above.