nchase/memetica

document serialization

"just suck it back in, man"

OK, seriously: there's a decolumnize branch where I just added a module that does the reverse of the columnize module.

in an ideal world, pandoc is installed on the deploy machine, and we can send the output of the deserialization pipeline (read-file | pandoc-convert-to-markdown | de-columnize) back to wherever these files are being stored.
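a rough sketch of the "is pandoc installed on the deploy machine" check plus the HTML-to-markdown step, assuming Node's built-in `child_process` module. `hasPandoc` and `htmlToMarkdown` are hypothetical names, not functions that exist in this repo, and the pandoc flags simply mirror the commands used in this issue:

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: true when a `pandoc` binary can be started from PATH.
function hasPandoc() {
  const result = spawnSync('pandoc', ['--version'], { encoding: 'utf8' });
  return !result.error && result.status === 0;
}

// Hypothetical helper: pipes HTML through pandoc's stdin and returns markdown.
// The flags are the same ones used in the commands in this issue.
function htmlToMarkdown(html) {
  const result = spawnSync(
    'pandoc',
    ['--from=html', '--to=markdown', '--atx-headers'],
    { input: html, encoding: 'utf8' }
  );
  if (result.error || result.status !== 0) {
    const reason = result.error ? result.error.message : result.stderr;
    throw new Error(`pandoc failed: ${reason}`);
  }
  return result.stdout;
}
```

checking `hasPandoc()` up front would let the deploy fail loudly instead of silently storing unconverted files.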

it would maybe be better if the steps inside of columnize could just be reversed to accomplish de-columnize. exactly how similar are they? investigate.
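one way to picture "just reverse the steps": if each columnize step came paired with an inverse, de-columnize would be the same list run backwards. this is only a sketch under that assumption. the two steps, the `|||` column-break marker, and the `column` class are made up for illustration and are not what columnize.js actually does:

```javascript
// Hypothetical steps: each pairs a forward transform with its inverse.
const steps = [
  {
    // Turn an assumed "|||" column-break line into a close/open div pair.
    forward: (s) => s.split('\n|||\n').join('\n</div>\n<div class="column">\n'),
    inverse: (s) => s.split('\n</div>\n<div class="column">\n').join('\n|||\n'),
  },
  {
    // Wrap the whole document in an outer column container.
    forward: (s) => `<div class="column">\n${s}\n</div>`,
    inverse: (s) => s.replace(/^<div class="column">\n/, '').replace(/\n<\/div>$/, ''),
  },
];

// Forward pass applies the steps in order.
const columnize = (text) => steps.reduce((acc, step) => step.forward(acc), text);

// Reverse pass applies the inverses in the opposite order.
const decolumnize = (text) => steps.reduceRight((acc, step) => step.inverse(acc), text);
```

if the real steps turn out not to be cleanly invertible, a separate decolumnize module (as on the branch) is still the safer route.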

basic implementation works.

transformed a short (~700 word) document between the serialized and deserialized formats, back and forth, repeatedly. there's a small amount of loss in conversion. have yet to determine whether or not it's acceptable. here's an example command that compares the initial output vs the desiccated, then reconstituted output:

diff \
  <(
    # baseline: markdown -> columnized markdown -> HTML
    cat src/demo.md |
      node columnize.js |
      pandoc --from=markdown --to=html
  ) \
  <(
    # round trip: HTML -> markdown -> decolumnize, then rebuild the HTML
    cat src/demo.md |
      node columnize.js |
      pandoc --from=markdown --to=html |
      pandoc --from=html --to=markdown --atx-headers |
      node decolumnize.js |
      node columnize.js |
      pandoc --from=markdown --to=html
  )

and here's the diff:

6c6
< <p>I want you to you look me in my eyes; I haven't slept a peaceful night in more than a seventeen years. I am incapable of any kind of human connection. I am constantly in danger of drifting into total mental oblivion. These eyes, they looked upon the earth and saw an inconsequential particle in an incomprehensible, infinite universe. You think the jets have a shot this season? I walked on the fucking moon. Thanks for the drink. <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a></p>
---
> <p>I want you to you look me in my eyes; I haven't slept a peaceful night in more than a seventeen years. I am incapable of any kind of human connection. I am constantly in danger of drifting into total mental oblivion. These eyes, they looked upon the earth and saw an inconsequential particle in an incomprehensible, infinite universe. You think the jets have a shot this season? I walked on the fucking moon. Thanks for the drink. <span id="fnref1">[^1^](#fn1)</span></p>
8,11c8,12
< <p><img class="img" src="http://cloud.ahfr.org/22e5b4d0a9872511881f.jpg" /></p>
< <p class="u-center">
< So you're probably asking yourself, &quot;what the fuck does some writer with the right stuff have to do with The God MC?&quot;
< </p>
---
> <div class="figure">
> <img src="http://cloud.ahfr.org/22e5b4d0a9872511881f.jpg" alt="" />
> 
> </div>
> <p>So you're probably asking yourself, &quot;what the fuck does some writer with the right stuff have to do with The God MC?&quot;</p>
17c18,21
< <p><img class="img img--right" src="http://cloud.ahfr.org/f24929a0407db76c9747.jpg" /></p>
---
> <div class="figure">
> <img src="http://cloud.ahfr.org/f24929a0407db76c9747.jpg" alt="" />
> 
> </div>
23c27,30
< <p><img class="img img--small" src="http://cloud.ahfr.org/fa1a767d0c98ea5d33f4.jpg" /></p>
---
> <div class="figure">
> <img src="http://cloud.ahfr.org/fa1a767d0c98ea5d33f4.jpg" alt="" />
> 
> </div>
34d40
< <!-- We're missing the people we need in key places in order to get ahead and truly dominate, but that's OK, because our competitors don't have them either, so no one really gets ahead, and the customers end up with the same middling products over and over. -->
37,38c43,47
< <ol>
< <li id="fn1"><p>Source: <a href="http://gabesaidwereintomovements.blogspot.com/" class="uri">http://gabesaidwereintomovements.blogspot.com/</a>]<a href="#fnref1">↩</a></p></li>
---
> <ol style="list-style-type: decimal">
> <li><div id="fn1">
> 
> </div>
> Source: <a href="http://gabesaidwereintomovements.blogspot.com/" class="uri">http://gabesaidwereintomovements.blogspot.com/</a>]<a href="#fnref1">↩</a></li>

so that's a fairly small diff. overall the document looks very good, significantly better than expected.
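since the document was round-tripped repeatedly, it's worth knowing whether the loss happens once and then settles, or keeps compounding. a minimal sketch of that check, assuming the whole round trip can be wrapped in one function. `passesUntilStable` and `lossyRoundTrip` are hypothetical names, and the stand-in round trip only mimics the dropped HTML comment from the diff above:

```javascript
// Applies `roundTrip` repeatedly and returns how many passes changed the
// document before it stopped changing, or null if it never settles.
function passesUntilStable(doc, roundTrip, maxPasses = 10) {
  let current = doc;
  for (let pass = 1; pass <= maxPasses; pass += 1) {
    const next = roundTrip(current);
    if (next === current) return pass - 1;
    current = next;
  }
  return null;
}

// Hypothetical stand-in for the real pipeline: drops HTML comments,
// like the comment that disappears in the diff.
const lossyRoundTrip = (html) => html.replace(/<!--[\s\S]*?-->\n?/g, '');

const changedPasses = passesUntilStable(
  '<p>a</p>\n<!-- note -->\n<p>b</p>',
  lossyRoundTrip
);
console.log(changedPasses);
```

a result of 1 would mean the loss is a one-time cost; null would mean every pass keeps eroding the document.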

big takeaways here:

  • some superscript tags (from footnotes in the source document) are getting lost
  • paragraphs that wrap <img> tags are getting converted to <div class="figure">

it's entirely possible that this can all be handled with flags to pandoc that I haven't read about. that's the next thing to investigate when time permits.

on the first note above: the footnotes from the source document getting lost when the document is serialized is [potentially] this issue: https://groups.google.com/forum/#!topic/pandoc-discuss/fBLsIk4DRKo – looks like it's tricky to fix. long-term solution might be to serialize the footnotes separately and fold them back in upon reconstitution. if this turns out to be something that pandoc hasn't solved, that's probably the approach I'd take.
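a minimal sketch of the "serialize the footnotes separately" idea, assuming single-line pandoc-style definitions of the form `[^1]: text` in the source markdown. `extractFootnotes` and `restoreFootnotes` are hypothetical names, and this does nothing about the inline `[^1]` references, which would still need to survive the round trip on their own:

```javascript
// Matches a single-line footnote definition such as "[^1]: Source: ...".
const FOOTNOTE_DEF = /^\[\^([^\]]+)\]:\s*(.*)$/;

// Splits markdown into the body and an ordered list of footnote definitions.
function extractFootnotes(markdown) {
  const footnotes = [];
  const bodyLines = [];
  for (const line of markdown.split('\n')) {
    const match = line.match(FOOTNOTE_DEF);
    if (match) {
      footnotes.push({ id: match[1], text: match[2] });
    } else {
      bodyLines.push(line);
    }
  }
  return { body: bodyLines.join('\n').trimEnd(), footnotes };
}

// Appends the stored definitions back onto the reconstituted body.
function restoreFootnotes(body, footnotes) {
  if (footnotes.length === 0) return body;
  const defs = footnotes.map(({ id, text }) => `[^${id}]: ${text}`);
  return `${body}\n\n${defs.join('\n')}`;
}
```

the `footnotes` array could be stored next to the serialized document (as JSON, say) and folded back in after decolumnize runs.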

on the second note above: reconstituting with the implicit_figures extension disabled (i.e. using markdown-implicit_figures as the format) keeps images from getting wrapped in <div class="figure">. but actually, that wrapping is something that might be useful, so noting it here. e.g.

diff \
  <(
    # baseline: markdown -> columnized markdown -> HTML
    cat src/demo.md |
      node columnize.js |
      pandoc --from=markdown --to=html
  ) \
  <(
    # round trip with implicit_figures disabled on both markdown conversions
    cat src/demo.md |
      node columnize.js |
      pandoc --from=markdown --to=html |
      pandoc --from=html --to=markdown-implicit_figures --atx-headers |
      node decolumnize.js |
      node columnize.js |
      pandoc --from=markdown-implicit_figures --to=html
  )