200ok-ch/org-parser

Parse semantic blocks where appropriate

schoettl opened this issue · 0 comments

In #7, we concluded that parsing semantic blocks (instead of purely line-based parsing) is desirable, but only if it can be done cleanly in EBNF/instaparse.

Here is a list of some semantic blocks that would need changes in EBNF:

  • property drawers
  • drawers
  • blocks (#+BEGIN_xxx)
  • dynamic blocks (#+BEGIN:)
  • tables
  • fixed-width areas (: sample code)
  • footnotes (can span multiple lines)
  • text paragraphs (maybe?)
  • … (?)

The following elements cannot be parsed as semantic elements:

  • bullet lists #34
  • bullet list items #34

Some of them are already defined in EBNF but not yet "activated".


Quoting from #11:

In this branch, I am working on the higher-level syntax as described in https://orgmode.org/worg/dev/org-syntax.html

Specifically, I want to check whether we can move away from line-based parsing towards more semantic blocks, called "elements". The Org mode parser used for export is named accordingly: org-element.el.

The spec says that most elements of the syntax are not context-free, and that the categories for these elements are:

“Greater elements”, “elements”, and “objects”

Greater elements are e.g. #+BEGIN_EXAMPLE blocks. Some of these blocks contain raw text (EXAMPLE, SRC, COMMENT, ...), others can contain formatted text (CENTER, QUOTE, ...). Hence, it's better to parse in a context-aware way: keep the multi-line content of an EXAMPLE block as raw text, but parse the content of a CENTER block as formatted text.
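To illustrate what "context-aware" means here, a rough Python sketch (not the project's instaparse grammar, just the idea). The function name `parse_blocks` and the result shape are made up for this example, and the two name sets contain only the blocks mentioned above, so they are incomplete. Blocks with any other name are tagged "unknown", and an unclosed block simply falls back to plain lines, which is an assumption of this sketch.

```python
import re

# Block names taken from the examples in this comment; both sets are incomplete.
RAW_BLOCKS = {"EXAMPLE", "SRC", "COMMENT"}
FORMATTED_BLOCKS = {"CENTER", "QUOTE"}

BEGIN_RE = re.compile(r"^\s*#\+BEGIN_(\S+)", re.IGNORECASE)


def parse_blocks(text):
    """Group #+BEGIN_x ... #+END_x runs into units; other lines stay line-based."""
    lines = text.splitlines()
    result = []
    i = 0
    while i < len(lines):
        begin = BEGIN_RE.match(lines[i])
        if begin:
            name = begin.group(1).upper()
            end_re = re.compile(
                r"^\s*#\+END_" + re.escape(name) + r"\s*$", re.IGNORECASE
            )
            # Look ahead for the matching end line of this block.
            end = next(
                (j for j in range(i + 1, len(lines)) if end_re.match(lines[j])),
                None,
            )
            if end is not None:
                if name in RAW_BLOCKS:
                    mode = "raw"          # content must not be parsed further
                elif name in FORMATTED_BLOCKS:
                    mode = "formatted"    # content may contain markup
                else:
                    mode = "unknown"      # not covered by this sketch
                result.append(
                    {
                        "type": "block",
                        "name": name,
                        "mode": mode,
                        "content": lines[i + 1:end],
                    }
                )
                i = end + 1
                continue
        # No block start here, or no matching end line: keep the plain line.
        result.append({"type": "line", "text": lines[i]})
        i += 1
    return result
```

With this, `*not bold*` inside an EXAMPLE block ends up in a "raw" unit, while the same text inside a CENTER block ends up in a "formatted" unit that a later pass could parse for markup.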

Also, paragraphs, multi-line footnote definitions, lists, tables, and property drawers may be better parsed as units rather than line by line.


Parsing semantic blocks can later be enabled by changing EBNF:

- <line> = (headline / drawer-begin-line / drawer-end-line / … / content-line) eol
+ <line> = (headline / drawer / … / content-line) eol
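The effect of that change, sketched in Python for the drawer case (again only an illustration, not the instaparse grammar). `classify` mimics the current line-based view with drawer-begin-line / drawer-end-line / content-line tags, and `fold_drawers` shows what a single `drawer` element would look like instead. Both function names and the tuple shapes are hypothetical, and the begin-line regex is a simplification: a property line with an empty value such as `:ID:` would be mistaken for a drawer start.

```python
import re

# Simplified drawer delimiters: ":NAME:" opens a drawer, ":END:" closes it.
DRAWER_BEGIN = re.compile(r"^\s*:([\w-]+):\s*$")
DRAWER_END = re.compile(r"^\s*:END:\s*$", re.IGNORECASE)


def classify(line):
    """Line-based view: tag each line on its own, without context."""
    if DRAWER_END.match(line):
        return ("drawer-end-line", line)
    begin = DRAWER_BEGIN.match(line)
    if begin:
        return ("drawer-begin-line", begin.group(1))
    return ("content-line", line)


def fold_drawers(tagged):
    """Block-based view: fold begin ... end runs into single drawer nodes."""
    out = []
    i = 0
    while i < len(tagged):
        kind, value = tagged[i]
        if kind == "drawer-begin-line":
            # Skip over the content lines that follow the begin line.
            j = i + 1
            while j < len(tagged) and tagged[j][0] == "content-line":
                j += 1
            if j < len(tagged) and tagged[j][0] == "drawer-end-line":
                body = [text for _, text in tagged[i + 1:j]]
                out.append(("drawer", value, body))
                i = j + 1
                continue
        # Not a complete drawer: keep the line-based tag unchanged.
        out.append(tagged[i])
        i += 1
    return out
```

For example, the lines `:LOGBOOK:`, `CLOCK: x`, `:END:` give three separate tags from `classify`, and one `("drawer", "LOGBOOK", ["CLOCK: x"])` node after `fold_drawers`.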