Interface with pandoc
rossant opened this issue · 10 comments
Hi, thanks for taking over this project!
I'm currently writing a pandoc-compatible Python library for converting markup documents. It implements a language-independent AST that is almost the same as pandoc's. For the markdown parser I'm considering using CommonMark-py (I'd rather use CommonMark than another Markdown parser). What I need to do is convert the CommonMark-py AST to the pandoc AST. I have a couple of questions in this respect:
- It seems that Spaces are not detected like in pandoc? See e.g. the pandoc AST for hello world:
echo "hello world" | pandoc -f markdown -t json | python -m json.tool
[
    {
        "unMeta": {}
    },
    [
        {
            "t": "Para",
            "c": [
                {
                    "t": "Str",
                    "c": "hello"
                },
                {
                    "t": "Space",
                    "c": []
                },
                {
                    "t": "Str",
                    "c": "world"
                }
            ]
        }
    ]
]
To be compared with CommonMark-py (stripped-down output):
{
    "t": "Document",
    "children": [
        {
            "inline_content": [
                {
                    "c": "hello world",
                    "t": "Str"
                }
            ],
            "t": "Paragraph",
            "strings": [
                "hello world"
            ]
        }
    ]
}
Is it intended for CommonMark not to explicitly parse spaces?
- It seems like the internal tree representation in CommonMark is still changing (which I totally understand, as this project seems to be a work-in-progress, as is my library). Is there a plan to make this representation stable and public in the future? If not, what would be the most sensible way for me to take the CommonMark AST and convert it to the pandoc AST? You can have a look at how I've done it so far. It only works on the latest pip version, not master, as it seems like node.children has disappeared.
I don't know if this helps or makes it more difficult, but as of version 0.6.0 of this library, the AST has gone through a radical change, and now has the same format as the latest commonmark.js.
Sibling nodes are now chained in a doubly-linked list: instead of an array of children, there are nxt and prv references on each node.
Also, instead of a strings array, you get the node's string via node.literal.
You can see the new node structure here:
https://github.com/rtfd/CommonMark-py/blob/master/CommonMark/node.py#L59
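To illustrate what traversing such a structure looks like, here is a minimal sketch. The Node class below is a hypothetical stand-in written for this example, not the library's own class, and first_child is an assumed attribute name; only nxt, prv and literal are taken from the description above.

```python
class Node:
    """Hypothetical stand-in for an AST node; NOT the real CommonMark-py class."""

    def __init__(self, t, literal=None):
        self.t = t                # node type, e.g. "Paragraph" or "Str"
        self.literal = literal    # string content, if any
        self.first_child = None   # assumed attribute name
        self.nxt = None           # next sibling
        self.prv = None           # previous sibling

    def append_child(self, child):
        """Link child in as the last child of this node."""
        if self.first_child is None:
            self.first_child = child
            return
        last = self.first_child
        while last.nxt is not None:
            last = last.nxt
        last.nxt = child
        child.prv = last


def iter_children(node):
    """Yield the direct children of node by following nxt references."""
    child = node.first_child
    while child is not None:
        yield child
        child = child.nxt


# Build a tiny tree equivalent to a document with one paragraph.
doc = Node("Document")
para = Node("Paragraph")
doc.append_child(para)
para.append_child(Node("Str", "hello world"))
```

Code that used to loop over node.children could loop over something like iter_children(node) instead, provided the real attribute names match these assumptions.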
This internal AST representation is available to the public at the moment, but it's not stable, and I expect it to change in future versions, probably until version 1.0.
It's true there is no Space node type in CommonMark-py.
@jgm might be able to answer your question more thoroughly, because I think he designed the AST for both pandoc and CommonMark.
@rossant @nikolas I echo Cyrille's thanks for taking over this project. Thanks Cyrille for the ongoing work with IPython/Jupyter. I'm working with Project Jupyter at Cal Poly, and we have been looking at ways to make our Sphinx docs more visually appealing with Markdown and .ipynb sources as well as build more cleanly in Sphinx. If I can help with this effort or others related to improving display of notebooks, please let me know. Thanks.
@willingc if you have time to update recommonmark to use the new layout, that would be awesome =) (see readthedocs/recommonmark#24)
@lu-zero I will give it a look. I was actually running into this issue yesterday.
@willingc ipymd already lets you convert between .ipynb <-> .md. The project I'm working on will supersede it and will allow more conversion options. It will be compatible with pandoc, so you'll be able to do .ipynb <-> .md <-> .rst. If I can find a good ReST parser in pure Python, you won't even need to have pandoc installed.
@nikolas thanks, I'll have a look. I think it should work fine. No problem to update my code when the internal CommonMark AST changes.
For Spaces I think I should be able to parse Str elements and create them manually.
@rossant That would be awesome. I played around with ipymd a few weeks ago. It's on my radar :-)
@rossant It might be worth looking at https://github.com/rtfd/recommonmark -- which does the mapping of the Commonmark-Py AST into the docutils (RST) AST. It's using the old version of commonmark, and hasn't been updated to 0.6 yet.
I'm definitely interested in this though, as I'd like to get more support for other markup languages into Sphinx itself. recommonmark is our approach to using Commonmark, but I'd love to support asciidoc and other markup languages through a common AST.
@nikolas In my Haskell bindings (cmark-hs), I convert between the cmark AST and the pandoc AST. You might find this helpful as a guide: https://github.com/jgm/cmark-hs/blob/master/CMark.hsc
The conversion is bidirectional: see toNode and fromNode.
I chose not to represent Spaces explicitly in the AST because stopping at every space slows down parsing significantly. But getting this back is just a matter of splitting a string on spaces and inserting Space elements.
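A minimal sketch of that splitting step, assuming the pandoc JSON inline format shown at the top of this thread ({"t": "Str", ...} and {"t": "Space", "c": []}). The function name split_spaces is made up for this example, and it only treats runs of plain space characters as separators.

```python
import re


def split_spaces(text):
    """Turn one string into a list of pandoc-style Str and Space inlines."""
    inlines = []
    # The capturing group makes re.split keep the runs of spaces in its
    # result, so we can tell where each Space element belongs.
    for token in re.split(r"( +)", text):
        if not token:
            continue  # empty pieces appear when text starts or ends with spaces
        if token.isspace():
            # A run of several spaces collapses into a single Space element.
            inlines.append({"t": "Space", "c": []})
        else:
            inlines.append({"t": "Str", "c": token})
    return inlines


inlines = split_spaces("hello world")
```

For the "hello world" paragraph above, this yields Str "hello", a Space, and Str "world", matching the pandoc output in the first comment.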