Missing hierarchical information of section heads
jacklxc opened this issue · 2 comments
The current released version labels each paragraph with only its immediate (lowest-level) section head. For example, if the "Related Work" section contains any sub-sections or paragraph heads, the paragraphs under them no longer carry the "Related Work" label, so it becomes hard to extract the entire "Related Work" section by string matching on the section heads.
Unfortunately this is the case in the current released version. In the updated s2orc-doc2json utility (which we use to create S2ORC JSON), we now preserve hierarchical section headers when possible (see here).
For future S2ORC releases, this will be standard. If you really need nested section headers right now, you could use s2orc-doc2json to reprocess the papers that interest you. I know that's not the most satisfying answer, but hopefully it provides some interim options.
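As a rough illustration of why the hierarchy helps, here is a minimal sketch of pulling out a whole top-level section once each paragraph carries its full header path. The `section_path` field, the `paragraphs_under` helper, and the sample records are all hypothetical and only meant for illustration; the actual s2orc-doc2json output may represent nested headers differently, so check the JSON it produces before adapting this.

```python
from typing import Any, Dict, List


def paragraphs_under(paragraphs: List[Dict[str, Any]], top_level: str) -> List[Dict[str, Any]]:
    """Return the paragraphs whose outermost section header matches `top_level`.

    Assumes each paragraph has a hypothetical `section_path` list ordered from
    the outermost header to the innermost one.
    """
    target = top_level.strip().lower()
    selected = []
    for para in paragraphs:
        path = para.get("section_path") or []
        # Compare only the outermost header, so sub-sections are included.
        if path and path[0].strip().lower() == target:
            selected.append(para)
    return selected


# Made-up records standing in for a reprocessed paper's body text.
body_text = [
    {"text": "Intro paragraph.", "section_path": ["Introduction"]},
    {"text": "Overview of prior work.", "section_path": ["Related Work"]},
    {"text": "Work on parsing.", "section_path": ["Related Work", "Document parsing"]},
    {"text": "Our approach.", "section_path": ["Method"]},
]

related = paragraphs_under(body_text, "Related Work")
print([p["text"] for p in related])
```

With only the lowest-level head available, the third paragraph would be labeled "Document parsing" alone, and a string match on "Related Work" would miss it.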
Thank you, Lucy.