clj-commons/clj-yaml

How to efficiently return full yaml file with updated values?

mleveck opened this issue · 23 comments

If there is a better forum for this question please let me know. Also I'm relatively new to Clojure, so apologies in advance.

Minimal example yaml file (the actual files I'm dealing with are larger):

apiVersion: v1
clusters:
- cluster:
  server: http://myserver
  name: myname

when parsed gives back:

user=> (yaml/parse-string (slurp (io/file "/tmp/example.yaml")) :load-all)
#ordered/map ([:apiVersion "v1"] [:clusters (#ordered/map ([:cluster nil] [:server "http://myserver"] [:name "myname"]))])

I'd like to get a copy of that map with the cluster name changed from my "myname" to "myname2". But...

The type of clusters is a lazy seq

user=> (type (:clusters (yaml/parse-string (slurp (io/file "/tmp/example.yaml")) :load-all)))
clojure.lang.LazySeq

This seems to prevent using assoc-in or update-in, as a lazy seq isn't an an associative data structure. I can obviously navigate into to data structure threading it through a series of keyword and calls like first. But then I end up returning just the inner most map and not the full file.

I feel like what I want to do can't be uncommon, so I must be missing something. Can you give me some guidance?

lread commented

Hi @mleveck!

So, if I understand correctly, you were hoping to do something like:

(-> "example.yaml"
    slurp
    yaml/parse-string
    (assoc-in [:clusters 0 :name] "myname2"))

But this fails because :clusters is lazy.

If you really want to use assoc-in, you could transform any lazy sequence into a vector.

(require '[clojure.walk :as walk])

(defn unlazify [x]
  (walk/postwalk #(if (= clojure.lang.LazySeq (type %))
                    (vec %)
                    %)
                 x))

(-> "example.yaml"
    slurp
    yaml/parse-string
    unlazify
    (assoc-in [:clusters 0 :name] "myname2"))
;; => {:apiVersion "v1",
;;     :clusters [{:cluster nil, :server "http://myserver", :name "myname2"}]}

But this might be naive, not sure. Perhaps others will chime in.

Another good place to ask this kind of question is on Clojurians Slack in the #beginners channel.

Thanks @lread . Very helpful. Yes, that is exactly what I was wanting to do, and pointing out Clojure.walk is very helpful. After I posted this question, I implemented something like it(and likely less efficient) by hand to get past the issue.

I just expected that someone might say "Oh, we don't use assoc-in to do this. We do it another way with this library".

If there isn't a better solution, it seems like having the initial parse optionally decode as a vec could be an ok feature. Would the project be open to a PR that adds a option to decode as a vec rather than a seq (with default behavior remaining as is)?

If the project is open to a PR like that ^ I think there would need to be a discussion of what the behavior would be if a user set it to true and :allow-recursive-keys true, as they would likely be incompatible options.

lread commented

Glad it helped!

Thank you for your kind offer, but I don't think we are interested in such a PR at this time.
(I'll wait for @borkdude to chime in with his thoughts before closing this one).

it seems like having the initial parse optionally decode as a vec could be an ok feature

This has come up so often that I think it's an appropriate addition to this library.

I propose just an option :strict-arrays true which defaults to false (subject to better naming).

Inconsistency with :allow-recursive-keys etc is just the user's responsibility, clj-yaml won't protect the user from infinite loops, etc.

The relevant bit of code would be here?
https://github.com/clj-commons/clj-yaml/blob/e847e3e9ab9b847bb21806ad378273ccf209486b/src/clojure/clj_yaml/core.clj#L237C3-L237C22

Another idea would be to let the user pass a function that gets wrapped around the (map ...) call, but I don't know if this would help checking of recursive stuff.

Most YAML you typically parse for CI etc. is not recursive. Having the simple option to support the 99% use case makes sense IMO.

lread commented

Huh, that's the benefit of having more than one maintainer!
I was not aware that folks were suffering from this that much.
@borkdude have you found folks suffering specifically when trying to use assoc-in on the returned lazy seqs? Or are there other problems these lazy-seqs cause?

As for naming, :strict-arrays feels like some lint/validation thing to me.
How do you feel about :realize-sequences? I think YAML uses sequence in its terminology, which is a nice coincidence. Does "realize" nicely convey what we are doing here?

I'm a bit on the naive side on how recursive YAML can fail.
I'll follow up with a concrete example here after I've understood.

Also I can recap our options, as I see them, if that's helpful.

I've seen the postwalk solution come up often enough (e.g. in the babashka channel, or in an issue here). Some people just expect a vector. You can find an old issue here: #29

:realize-sequences doesn't convey the meaning that you can rely an array to be turned into a vector, maybe it would be nice to have this in the name.

lread commented

Thanks @borkdude, the reverted #18 is very informative.
I now understand this has been a longstanding pain point (or at least annoyance).

naming

On naming: yes, agreed. I think "array" might be a SnakeYAML implementation detail for a YAML sequence. How about :lazy-sequences (default true) or maybe :vector-sequences (default false) or maybe :to-vectors (default false).
Hmmm... that last one is nice and brief. Watcha think?

understanding recursive YAML

I don't yet understand valid use cases for circular YAML, ex:

recursive:
  name: "A node"
  child: &ref_node
    name: "Child node"
    child: *ref_node

I'm going to pester Ingy to learn more and will follow up.

lread commented

Or... :seqs-as-vectors (default false)

Or :seq-as-vector (singular)?

(it'd be good to check names of existing options and go for some consistency, I haven't checked yet)

lread commented

(it'd be good to check names of existing options and go for some consistency, I haven't checked yet)

Yeah seems to fit well. Existing boolean options, for example, do not use question marks.

lread commented

circular structures in clj-yaml

I've had an initial peek at existing behaviour, all without any vectorization.

allow-recursive-keys - what does it do?

I did not realize the limited scenario :allow-recursive-keys covers.
It only covers, like its name actually implies, recursive keys.

If we steal a sample from our tests:

(def recursive-yaml "&A
- *A: *A")

We see parsing fails as expected if we do not specify :allow-recursive-keys:

(def r1 (yaml/parse-string recursive-yaml))
;; => Execution error (YAMLException) at org.yaml.snakeyaml.constructor.SafeConstructor/processDuplicateKeys (SafeConstructor.java:109).
;;    Recursive key for mapping is detected but it is not configured to be allowed.

And it passes parsing as expected with :allow-recursive-keys:

(def r2 (yaml/parse-string recursive-yaml :allow-recursive-keys true))
;; => #'user/r2

But if we try to evaluate the result, we get a stack overflow error:

r2
;; ....{({({({Error printing return value (StackOverflowError) at clojure.core/deref (core.clj:2337).
;;    null

other circular YAML

We can attempt to parse other arbitrary recursive YAML outside the scope of the :allow-recursive-keys scenario.

This YAML fails immediately with a stack overflow:

(def circular-yaml "
recursive:
  name: A node
  child: &ref_node
    name: Child node
    child: *ref_node")

(def c1 (yaml/parse-string circular-yaml))
;; => Execution error (StackOverflowError) at clj-yaml.core/eval9075$fn$iter$fn (core.clj:230).
;;    null

We can grab the first element from this next YAML, but it blows the stack on further attempts:

(def partial-circular-yaml "
- one
- two
- three
- four
- &x { x: *x }
- billy")

(def p1 (yaml/parse-string partial-circular-yaml))

(-> p1 first)
;; => "one"

(-> p1 second)
;; => Execution error (StackOverflowError) at clj-yaml.core/eval9075$fn$iter$fn (core.clj:230).
;;    null

What about SnakeYAML?

Can SnakeYAML load the above examples without issue? I'll go to Java to explore:

package org.example;

import org.yaml.snakeyaml.LoaderOptions;
import org.yaml.snakeyaml.Yaml;

public class Main {

    public static void yamlTest(String yamlIn, LoaderOptions opts) {
        System.out.println("\nyaml:\n" + yamlIn);
        try {
            Yaml yaml = new Yaml(opts);
            Object loaded = yaml.load(yamlIn);
            System.out.println("dump:\n" + yaml.dump(loaded));
        } catch (Throwable e) {
            System.out.println("ex: " + e.getMessage());
        };
    }

    public static void yamlTest(String yamlIn) {
        yamlTest(yamlIn, new LoaderOptions());
    }

    public static void main(String[] args) {
        // this first one should fail without allow recursive keys
        yamlTest("""
                    &A
                    - *A: *A""");

        // the rest should pass
        LoaderOptions options = new LoaderOptions();
        options.setAllowRecursiveKeys(true);
        yamlTest("""
                    &A
                    - *A: *A""",
                options);

        yamlTest("""
                recursive:
                  name: A node
                  child: &ref_node
                    name: Child node
                    child: *ref_node
                """);
        
        yamlTest("""
                - one
                - two
                - three
                - four
                - &x { x: *x }
                - billy
                """);
    }
}

Outputs:

yaml:
&A
- *A: *A
ex: Recursive key for mapping is detected but it is not configured to be allowed.

yaml:
&A
- *A: *A
dump:
&id001
- *id001: *id001


yaml:
recursive:
  name: A node
  child: &ref_node
    name: Child node
    child: *ref_node

dump:
recursive:
  name: A node
  child: &id001
    name: Child node
    child: *id001


yaml:
- one
- two
- three
- four
- &x { x: *x }
- billy

dump:
- one
- two
- three
- four
- &id001
  x: *id001
- billy

So, yes, it seems to.

Use case for circular YAML

My usage of YAML is limited to reading config files.
So I had a hard time imagining a use for circular YAML.
But Ingy helped me remember the world is bigger than config files.
For example, today on Slack someone was asking (not in the context of YAML) about modeling circular constructs like:

  • Web Maps - Site A can link to Site B which also links back to Site A
  • Paths in general - Symlinks can pull nodes from other sections of the forest of paths.

I'm guessing that most users of clj-yaml will not bump into circular YAML, but I do not know.

Initial thoughts

If my experiments above are sound, clj-yaml doesn't handle recursive YAML.
A user might be able to get an initial non-recursive element, but if there are recursive elements, the user will eventually suffer a stack overflow exception.

Since clj-yaml doesn't currently handle recursive YAML well, it seems that adding :seq-as-vector would not worsen the user experience here.

Watcha think?

I think for the circular case it's best to go back to the test that was added to support this.

Watcha think?

Yes, agreed.

lread commented

The allow-recursive-keys option was added from #13.
It seems to me it reflected new options added to SnakeYAML because they were security-related.
In the SnakeYAML source, there is a security-related comment:

  // https://en.wikipedia.org/wiki/Billion_laughs_attack
  private boolean allowRecursiveKeys = false;

Here's the relevant SnakeYAML commit.

lread commented

Notice that the clj-yaml test does not evaluate the result.
Doing so would trigger an exception.

Do you mean "realize"? What happens if you only take the first n elements, does that work? Just out of curiosity, not that I really need this feature myself. Perhaps the option got added to be able to parse recursive YAML, while not using the recursive field and be able to read the rest of the YAML. Who knows.

But anyway, adding the option still seems good to me. Or are you suggesting making a breaking change and swapping defaults?

lread commented

Do you mean "realize"? What happens if you only take the first n elements, does that work?

In the unit test YAML, no:

user=> (require '[clj-yaml.core :as yaml])
nil
user=> (def recursive-yaml "&A
- *A: *A")
#'user/recursive-yaml
user=> (first (yaml/parse-string recursive-yaml :allow-recursive-keys true))
#ordered/map ([(#ordered/map ........([(Error printing return value (StackOverflowError) at java.util.LinkedHashMap/entrySet (LinkedHashMap.java:674).
null

But see partial-circular-yaml example in my explorations above.

Just out of curiosity, not that I really need this feature myself.

Yeah, me neither. Do you know of anybody who has wanted to parse circular YAML with cl-yaml?

Perhaps the option got added to be able to parse recursive YAML, while not using the recursive field and be able to read the rest of the YAML. Who knows.

I guess that allow-recursive-keys was brought over to cl-yaml because it was a security-related feature with the thought that security-related features are important, dunno either.

I know there is a bunch of text in my exploration... but I'll restate that SnakeYAML will still load circular YAML when setAllowRecursiveKeys is false, just not the specific pattern our unit test covers. And clj-yaml doesn't seem to handle any of it.

I don't know why the SnakeYAML setAllowRecursiveKeys implementation only covers a single recursive pattern. Maybe a CVE was raised against this pattern? Dunno.

But anyway, adding the option still seems good to me. Or are you suggesting making a breaking change and swapping defaults?

I did not propose this but thought about it too.
If we were creating clj-yaml today, I think we would return vectors by default.
If we don't return vectors by default, people will still suffer the confusion and ask on Slack or need to read the docs.

But we might accidentally break existing code in ways we did not anticipate if we change the default. So, I was thinking of updating the user guide to always use :seq-to-vec true with a single comment as to why.

But really, still on the fence on this.

So, I was thinking of updating the user guide to always use :seq-to-vec true with a single comment as to why.

I'm leaning to this as well.

ieugen commented

I think I am also in a situation where I would need vectors instead of seqs.
I am loading a yaml file configuration and I am merging in values from CLI and ENV to overwrite some values.

I also don't think I need ordered collections since I don't think I need ordering, but not a big issue IMO.

The ordering is important for round-tripping. Agree on vectors being more useful format. You can use the clojure.walk solution above until this is implemented.