Make AST package available for generic data manipulation

Hi,

I'm the author of yq (github.com/mikefarah/yq) which is a popular command line tool for manipulating data files (yml, json, xml, properties). I'd love to add TOML support (mikefarah/yq#1364) - and this libary seems like it'd be a best fit...if the internal AST package was made public.

That way I could (attempt) to preserve comments, map key order etc.

Is this something you would consider? Happy to help if I can...

I'd love to support this. That kind of use case is exactly why I designed it with an intermediate AST as opposed to building structures straight from the parser.

The main reason interal/ast is not public at the moment is that I haven't felt confident that its interface was stable enough to meet the backward compatibility guarantees as the rest of the go-toml package. Practically speaking it hasn't moved much recently, but still. If you're ok with me exposing it and trying to not break things on that API (but with no guarantees), I'm happy to!

Sounds like a plan!

…

________________________________ From: Thomas Pelletier ***@***.***> Sent: Friday, October 7, 2022 11:20:54 PM To: pelletier/go-toml ***@***.***> Cc: Mike Farah ***@***.***>; Author ***@***.***> Subject: Re: [pelletier/go-toml] Make AST package available for generic data manipulation (Issue #822) I'd love to support this. That kind of use case is exactly why I designed it with an intermediate AST as opposed to building structures straight from the parser. The main reason interal/ast is not public at the moment is that I haven't felt confident that its interface was stable enough to meet the backward compatibility guarantees as the rest of the go-toml package. Practically speaking it hasn't moved much recently, but still. If you're ok with me exposing it and trying to not break things on that API (but with no guarantees), I'm happy to! — Reply to this email directly, view it on GitHub<#822 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAIZHNM2626LZSVQ7YYVPG3WCAISNANCNFSM6AAAAAAQ7IHVQQ>. You are receiving this because you authored the thread.Message ID: ***@***.***>

Diving deeper into this - my code is becoming a little unwieldy - it would help if there was a peek function that lets me see what the next expression is without consuming it. Is this possible to add? The issue I'm having is while building a KV map, I need to call NextExpression - eventually that will return something that's not in the KV map - and it's a bit of a wrangle to ensure I then process the 'currentExp' and not skip over it to the NextExp...

Also - can you make the comments available on your Node struct?

Final question (sorry) - how do I write back out as a toml file using this lib?

The issue I'm having is while building a KV map, I need to call NextExpression - eventually that will return something that's not in the KV map - and it's a bit of a wrangle to ensure I then process the 'currentExp' and not skip over it to the NextExp...

The way I've been doing it in the unmarshaler is to "stash" the expression, so that my next call to NextExpression returns the last read exception and toggles a flag to resume parsing with the call after that:

go-toml/unmarshaler.go

Line 568 in 58a592b

d.stashExpr()

We could probably create a some tooling or abstraction over that. Happy to take ideas of what an interface over that would look like.

Can you make the comments available on your Node struct?

It's possible, but at the moment the parser skips over them (because unmarshaler wouldn't need it). I'm unclear how to represent them though.

how do I write back out as a toml file using this lib?

Easiest today is to create a structure that represents the document you want to emit, and use toml.Marshal.

Ideally it'd be something like parser.Peek() and it would return a *Node just like the Expression() function. Not a big deal - I've kinda of worked around the issue - but it would simplify the code if it existed.

Regarding the comments - the way the go-yaml parser handles them is by having 3 string fields on its Node struct: HeadComment, LineComment and FootComment.

So something like

# hi
a: cat # things
# blah

the a (key) node has the headcomment 'hi' and foot comment 'blah', and 'cat' the value node has a line comment 'things'. Would something like that work in the toml parser?

Thinking about being able to roundtrip a file with comments - I'd need some sort of way of re-encoding them 🤔 Would it be possible to expose an interface that encoded the Node interface you have at the moment?

I'm also very interested this as I'm currently improve the encoding/decoding support in dasel.

Good work here @mikefarah , thanks for taking it on. I too like the way go-yaml's nodes are exposed.

Sorry it took me so long to respond -- life gets in the way.

Regarding storing comments: we can probably do something similar as go-yaml, though we would need to clarify a few rules for cases like:

a = 1
# is this comment a's foot or b's head?
b = 2

c = [ # what does this comment belong to?
# how about this one?
]

but we may just wing it as we implement it.

I'm not 100% sure about storing the comments directly in the Node struct, as this potentially would be quite the overhead for the unmarshaller that doesn't care about comments at all. Would need to be benchmarked. I thought about a separate structure provided by the Parser that maps from a Node to its comments, maybe using their unsafe address or a way to enumerate them.

It may be possible to add the support of encoding into the existing Node. On top of my mind, I think a few things need to be considered:

Do whitespaces need to be preserved? If so, how would they be tracked?
Currently, the Parser/Node implementation assumes a Node is fully processed before NextExpression() is called. This is because the Parser reuses the memory allocated for the previous expression. I think as a result there would need a sort of copy step to preserve the nodes. Probably not that big a deal though.
An encoder would need to encode the whole document at onces, instead of just a given node, because it needs to keep track of things like indentation.

Overall I think it's doable, and can be made good enough, and eventually we can probably get a fair amount of encoding code reused between this and the marshaler.

@mikefarah @TomWright I might have time this coming weekend to take a stab at having the parser keep comments and expose them on the AST (no promises 😇 ). If you have any thoughts about the above (especially about the actual usefulness to preserve whitespace), it'd welcome it!

Thanks @pelletier, I've definitely had comments raised with dasel that users would like whitespace to be preserved 👍

Sweet! Whitespace would be awesome. Not sure how it could be stored - perhaps similarly to HeadComment/FootComment, you could potentially have a Leading/Trailing whitespace string.

In terms of writing back to toml - how would I go about doing that using the library? Once I've parsed the nodes from toml and the ast.Node package - they get converted to another data structure before being processed. So I need some sort of mechanism to convert it back to ast.Node (I guess..?) and pass it to some sort of encoder....

Sweet! Whitespace would be awesome. Not sure how it could be stored - perhaps similarly to HeadComment/FootComment, you could potentially have a Leading/Trailing whitespace string.

I'm going to play around with it and submit a PR to get your feedback. My goal is to be able to retain that information (comments and whitespaces) in a way that doesn't impact the performance of the unmarshaler.

In terms of writing back to toml - how would I go about doing that using the library? Once I've parsed the nodes from toml and the ast.Node package - they get converted to another data structure before being processed. So I need some sort of mechanism to convert it back to ast.Node (I guess..?) and pass it to some sort of encoder....

Today there is no mechanism to encode ast.Node directly. The only way I see would be to build an object (maybe out of map[string]interface{}s?) that you would then encode with toml.Marshal / toml.Encoder. Being able to serialize a set of ast.Node is something I'd like to work on after preserving the comments and whitespace (but if you or anyone reading this wants to tackle it in parallel I'd be thrilled to review a PR :)).

I follow a similar pattern of unmarshaling into a custom structure, and then marshaling back into the original structure based on the target file format. If we could provide a MarshalTOML func or something on our types that would be ideal. The only issue with a standard map is that it isn't ordered, and wouldn't provide a way for us to re-marshal while maintaining comments + whitespace.

Thanks @pelletier for tagging #860!
I would like to join @mikefarah and @TomWright on the wish list of those who would like to see a Marshal function that takes into account the parsed AST with the comments when encoding the modified data and writing them back into the TOML file. :-)