appledora/mwparserfromhtml

Split plaintext by sections and paragraphs

appledora opened this issue · 0 comments

In GitLab by @geohci on Aug 25, 2022, 19:48

Splitting on sections is easy, but we'll also want to identify all the different HTML elements that indicate a new paragraph (i.e., a new line in the plaintext) so that we can return a more structured plaintext result. This will include the <p> tags but also list items and likely other types of HTML nodes. This will provide better support for people who, e.g., only want the first paragraph of the article or want to break it into chunks for input into language models.
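
A minimal sketch of the idea, using only Python's standard-library `html.parser` rather than this library's own API. The names `ParagraphSplitter` and `split_plaintext` are hypothetical, and the tag sets are assumptions: the issue only names <p> and list items, and the sketch assumes sections are wrapped in <section> elements. Nested sections are not handled specially.

```python
from html.parser import HTMLParser

# Assumption: tags that start a new paragraph. Only <p> and <li> come from
# the issue text; the others are illustrative guesses.
PARAGRAPH_TAGS = {"p", "li", "dd", "dt", "blockquote"}
# Assumption: each section is wrapped in a <section> element.
SECTION_TAGS = {"section"}


class ParagraphSplitter(HTMLParser):
    """Collects text into sections, each a list of paragraph strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = [[]]
        self._buffer = []

    def _flush(self):
        # Join buffered text, collapse whitespace, and store it if non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.sections[-1].append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_TAGS:
            self._flush()
            if self.sections[-1]:
                self.sections.append([])  # start a new section
        elif tag in PARAGRAPH_TAGS:
            self._flush()  # a new paragraph begins here

    def handle_endtag(self, tag):
        if tag in PARAGRAPH_TAGS or tag in SECTION_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_plaintext(html):
    """Return a list of sections, each a list of paragraph strings."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a paragraph tag
    return [section for section in parser.sections if section]
```

With a result shaped like this, the first paragraph of an article would be `split_plaintext(html)[0][0]`, and each inner list could be passed along as chunks for a language model.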