/org-to-xml

Library to convert Emacs org-mode files to XML

Primary LanguageEmacs LispGNU General Public License v3.0GPL-3.0

org-to-xml

This project contains a library to convert Emacs org-mode files to XML: om-to-xml.

The goal is a complete and accurate translation of the internal org-mode data structures to XML. This produces XML that isn’t especially pretty, but the assumption is that downstream XML processing tools can be used to transform it.

What’s new?

04 August 2020

  • Renamed default branch to main.
  • Updated dependency (the om.el package was renamed org-ml)

om-to-xml

This projects started began with org-to-xml, but development stalled when I realized that I didn’t really have a solid understanding of the underlying Org data structures and what I really needed to write first was a library that provided a consistent API that I could use in the conversion. (This is not a criticism of the Org developers; there are still corners of Emacs lisp that I don’t understand very well.)

Lo and behold! Nate Dwarshuis has written just such a library! The om-to-xml library is a rewrite of my XML conversion on top of that library.

It does a more complete and accurate job and has a few extensibility mechanisms that org-to-xml did not:

  • Whitespace handling
  • Attribute handling
  • Custom element processing

How to use it

Install the library somewhere. I use straight.el, but you can just download the library and put it on your load path any way you like.

Run m-x om-to-xml in a buffer where you’re viewing an Org mode file. An XML representation will be constructed and saved. If the Org mode file is notes.org, the XML file will be saved in notes.xml.

How it works

For the curious, here’s how it works.

Consider, an org file:

#+TITLE: Some Title
#+AUTHOR: Norman Walsh
#+DATE: 2019-02-19

A paragraph with <markup> in it. This isn’t intended to be meaningful
or useful.

* First level heading
  :PROPERTIES:
  :CUSTOM_ID: first
  :END:

** TODO This is an example TODO item.
   DEADLINE: <2019-02-26 Tue +1w>
   :PROPERTIES:
   :CREATED:  [2019-02-19 Tue 06:39]
   :SRC:      [[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]
   :END:

See [[https://orgmode.org/][org-mode]] for more information about ~org-mode~.
  1. First, it’s parsed by om, swaths of which I’ve elided:
    ((keyword (:key "TITLE" :value "Some Title"))
     (keyword (:key "AUTHOR" :value "Norman Walsh"))
     (keyword (:key "DATE" :value "2019-02-19"))
     (paragraph
      (:parent (section (:post-affiliated 64) #1))
      #("A paragraph with <markup> in it. This isn’t intended to be meaningful
    or useful.
    " 0 81 (:parent #1)))
     (headline
      (:raw-value "First level heading" :level 1 :CUSTOM_ID "first"
       :title (#("First level heading" 0 19 (:parent #1))))
     (section
      (:parent #1)
      (property-drawer
       (:parent #2)
       (node-property (:key "CUSTOM_ID" :value "first" :parent #3))))
     (headline
      (:raw-value "This is an example TODO item." :level 2
       :todo-keyword #("TODO" 0 4 (fontified t face org-todo))
       :todo-type todo
       :deadline (timestamp
                  (:type active :raw-value "<2019-02-26 Tue +1w>"
                   :year-start 2019 :month-start 2 :day-start 26
                   :year-end 2019 :month-end 2 :day-end 26
                   :repeater-type cumulate :repeater-value 1
                   :repeater-unit week))
       :SRC "[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]"
       :CREATED "[2019-02-19 Tue 06:39]"
       :title (#("This is an example TODO item." 0 29 (:parent #2)))
       :parent #1)
      (section
       (:parent #2)
       (planning
        (:closed nil
         :deadline (timestamp
                    (:type active :raw-value "<2019-02-26 Tue +1w>"
                           :year-start 2019 :month-start 2 :day-start 26
                           :year-end 2019 :month-end 2 :day-end 26
                           :repeater-type cumulate :repeater-value 1
                           :repeater-unit week)) :parent #3))
       (property-drawer
        (:parent #3)
        (node-property (:key "CREATED" :value "[2019-02-19 Tue 06:39]" :parent #4))
        (node-property (:key "SRC" :value "[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]" :parent #4)))
       (paragraph
        (:parent #3)
        #("See " 0 4 (:parent #4))
        (link (:type "https" :path "//orgmode.org/" :format bracket
               :raw-link "https://orgmode.org/" :parent #4)
              #("org-mode" 0 8 (:parent #5)))
        #("for more information about " 0 27 (:parent #4))
        (code (:value "org-mode" :parent #4)) #(".
    " 0 2 (:parent #4)))))))
        
  2. A buffer is created to store the XML, and the om data is traversed emiting XML elements for each node. The node properties become attributes, except for ignored properties and properties that come from a properties drawer which are always ignored.
  3. If a post-processing function has been provided, it is run.
  4. Save the file.
    <?xml version="1.0"?>
    <!-- Converted from org-mode to XML by om-to-xml version 0.0.1 -->
    <!-- See https://github.com/ndw/org-to-xml -->
    <document xmlns="https://nwalsh.com/ns/org-to-xml">
    <keyword key="TITLE" value="Some Title"/>
    <keyword key="AUTHOR" value="Norman Walsh"/>
    <keyword key="DATE" value="2019-02-19"/>
    
    <paragraph>A paragraph with &lt;markup&gt; in it. This isn’t intended to be meaningful
    or useful.
    </paragraph>
    <headline raw-value="First level heading" level="1"><title>First level heading</title>
    <section>
    <property-drawer>
    <node-property key="CUSTOM_ID" value="first"/>
    </property-drawer>
    </section>
    
    <headline raw-value="This is an example TODO item." level="2" todo-keyword="TODO" todo-type="todo" SRC="[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]" CREATED="[2019-02-19 Tue 06:39]"><deadline><timestamp type="active" raw-value="&lt;2019-02-26 Tue +1w&gt;" year-start="2019" month-start="2" day-start="26" year-end="2019" month-end="2" day-end="26" repeater-type="cumulate" repeater-value="1" repeater-unit="week"/></deadline><title>This is an example TODO item.</title>
    <section><planning/>
    <property-drawer>
    <node-property key="CREATED" value="[2019-02-19 Tue 06:39]"/>
    <node-property key="SRC" value="[[file:/projects/emacs/org-to-xml/README.md::For%20the%20curious,%20here%E2%80%99s%20how%20it%20works.]]"/>
    </property-drawer>
    
    <paragraph>See <link type="https" path="//orgmode.org/" format="bracket" raw-link="https://orgmode.org/">org-mode</link>
    for more information about <code value="org-mode"/>.
    </paragraph>
    </section>
    </headline>
    </headline></document>
        

It’s been twenty years since I tried to do anything much more interesting than a keybinding in elisp. I expect the code, especially the tree walking, is embarrassingly crude. Suggestions for improvement, or simply pointers to the bits of the elisp manual I should read again, most humbly solicited.

I also confess, I’m completely winging it on current function naming/namspacing conventions.

Pros and Cons

There are two obvious ways to approach the problem of converting .org files to .xml.

  1. Use the ox framework.
  2. Do it the hard way.

My goal in this project is to have a complete dump of the org structures in XML. That rules out the ox framework. The ox framework is definitely the place to start if you want to convert from an unknown org file and extract the information that you know about. But it flattens structures like the property drawer so that it’s impossible to extract everything with fidelity, even the things you don’t know about.

So this code attempts to do it the hard way. But I’m also lying when I say I want a complete dump of the org structures. I want a dump of the meaningful structures. One person’s meaning is another person’s pointless cruft, however.

Examples of structures I don’t consider meaningful:

  • The pre-blank and post-blank properties that the org data structures use to encode spaces in some circumstances.
  • Leading blanks in code blocks.
  • Leading spaces in paragraphs.

It’s likely that this list will grow as I learn more about the Org data strutures. Unless I give up on this project altogether, of course.

org-to-xml (obsolete)

You can still find org-to-xml.el in this repository’s history (at tag 0.0.5, for example). I’ve removed it from the master branch because I really think it’s a dead end and it caused confusion for at least some users.