Add safe_load

Question

Add safe_load

postmodern opened this issue 12 years ago · 129 comments

In lieu of the recent Rails YAML RCE vulnerability, Psych should provide a safe_load equivalent method, that only loads Ruby primitives.

Answer 1 · 2013-01-13T20:21:24.000Z

Can you please be more specific about the desired behavior? Is it acceptable to load Symbols? Are subclasses of Ruby primitives considered "primitive"? If someone sets an instance variable on a string, then dumps / loads that string, should the instance variable survive?

Can you please demonstrate how to use YAML dump / load to execute arbitrary code? AFAIK it depends on objects other than YAML.

Answer 2 · 2013-01-14T02:19:10.000Z

Symbols, sure they are primitives. In order to prevent a DoS via Symbols, perhaps there should be a configuration mechanism to specify exactly what types/tags should be allowed.
Subclasses of primitives are no longer primitives. In safe mode the custom class could be ignored and the primitive class used instead.
Instance variables in primitives should not survive.

Can you please demonstrate how to use YAML dump / load to execute arbitrary code?

rails_rce.rb

AFAIK it depends on objects other than YAML.

Which is why I believe there should be a safe_load method or a configuration mechanism that restricts YAML to only loading primitives or explicitly allowed Classes.

Answer 3 · 2013-01-14T02:21:10.000Z

@postmodern does the rails_rce.rb script depend on code other than YAML? I'm confused.

Answer 4 · 2013-01-14T02:22:36.000Z

Symbols, sure they are primitives. In order to prevent a DoS via Symbols, perhaps there should be a configuration > mechanism to specify exactly what types/tags should be allowed.

Subclasses of primitives are no longer primitives. In safe mode the custom class could be ignored and the primitive > class used instead.

Instance variables in primitives should not survive.

Given these points, why not just use JSON?

Answer 5 · 2013-01-14T02:29:00.000Z

JSON does not support Symbols, does not support non-String-keyed Hashes and does not have special formatting such as:

summary: |
  bla bla bla bla bla bla
  bla bla bla bla

    indented stuff

  bla bla bla bla

Is there any reason YAML should not have a secure mode?

Answer 6 · 2013-01-14T02:33:50.000Z

@tenderlove rails_rce.rb uses !ruby/hash:MyClass and ActionController::Routing::RouteSet::NamedRouteCollection.

@charliesome's PoC used ActiveSupport::Deprecation::DeprecatedInstanceVariableProxy, ERB, Gem::Requirement and Rack::Session::Cookie::Base64::Marshal.

These exploits would not work if the YAML parser was configured to only allow primitives (aka safe_load).

Answer 7 · 2013-01-14T02:42:09.000Z

+1. Only primitives that YAML can deserialize without using ! directives should be allowed.

No symbols IMO. I can't see a use case where you would want to allow symbols but still restrict other deserialization.

Answer 8 · 2013-01-14T02:47:09.000Z

👍 on what @postmodern and @charliesome are saying. I really wish it supported a way for it to skip over everything and raise what isn't in the specs by default, and certainly like what @charliesome said about skipping over symbols too... the symbols part is kind of important to me because it would make it easier to keep a consistent configuration file to API without having to run to_s on all the keys myself before I create an indifferent Hash.

Answer 9 · 2013-01-14T02:50:51.000Z

Instead of having a big case/when statement of all the tags in to_ruby.rb, Psych could have a table of parser methods/procs. This way Psych could be configured to run in "safe mode". When an unknown tag is encountered, it could skip over it (and it's children) or raise an exception.

Answer 10 · 2013-01-14T02:52:04.000Z

There is no other option other than raise an exception so it 💣 and somebody knows they did bad.

Answer 11 · 2013-01-14T03:17:53.000Z

@postmodern so YAML didn't execute the code, but Rails did?

Is there any reason YAML should not have a secure mode?

I don't particularly care whether it does or not. We just have to be specific about what it means. Also, we need someone to write the patch. :-)

Answer 12 · 2013-01-14T03:26:30.000Z

@tenderlove YAML just happened to call #[]= on an instance of the Class we specified. ActionController::Routing::RouteSet::NamedRouteCollection just so happens to pass the key value from #[]= down to module_eval. We could have used other non-Rails Classes that somehow pass our data into instance_eval or send. CVE-2013-0156 is not really specific to Rails, other applications that accept arbitrary YAML are also vulnerable; such as chef-server which uses extlib.

Please see the Rails PoC write up for a full walk through.

Answer 13 · 2013-01-14T06:29:23.000Z

Ah, so it's not specific to YAML, but anything that will feed strings to eval. Makes sense.

Aaron Patterson
http://tenderlovemaking.com/
I'm on an iPhone so I apologize for top posting.

On Jan 13, 2013, at 7:26 PM, Postmodern notifications@github.com wrote:

@tenderlove YAML just happened to call #[]= on an instance of the Class we specified. ActionController::Routing::RouteSet::NamedRouteCollection just so happens to pass the key value from #[]= down to module_eval. We could have used other non-Rails Classes that somehow pass our data into instance_eval or send. CVE-2013-0156 is not really specific to Rails, other applications that accept arbitrary YAML are also vulnerable; such as chef-server which uses extlib.

Please refer to the Rails PoC write up for a full walk through.

�
Reply to this email directly or view it on GitHub.

Answer 14 · 2013-01-14T09:48:07.000Z

Marshal.load has this problem as well and so would any serialization mechanism that does the following with arbitrary Foo classes and arbitrary field values.

obj = Foo.allocate
obj.instance_variable_set(field, value)

Answer 15 · 2013-01-14T10:32:06.000Z

@benmmurphy it's in Marshal's nature to be like this by simple definition of what it is, YAML however, it's not.

Answer 16 · 2013-01-15T22:32:41.000Z

I think safe load can have a very simple definition: It simply loads the YAML as if no ! tags are present. The only exceptions would be the YAML standard types which are not the implicit types (e.g. omap).

Psych can add a yaml_tag method so a program can see what the tag is. This would allow a program to look at the tag and decide what to do with it if need be, or it can just ignore the tag if there is no need. Having the tag available is important though. In fact, a fundamental design principle of YAML is the ability to "round-trip" --a YAML document that is loaded and then re-emitted should be substantively the same. So we don't want to loose that information even if it is being ignored in a safe load.

Answer 17 · 2013-01-15T22:37:28.000Z

👍 on @trans tag callback idea.

Answer 18 · 2013-01-15T22:41:09.000Z

👍 @trans. The definition is easy to grok.

Answer 19 · 2013-01-21T03:08:00.000Z

I have been toying with my own ideas for a framework to do rpc/serialize objects for a while. I really don't know if this is the correct approach. Attribute assignment safety is not, and, in a language such as ruby, cannot ever be global. Indeed, some of the high-level classes (drb, tk, and the like) are the sorts of things that would make me nervous unless I examined them. (I count 82 instances of []= in /usr/local/lib/ruby/2.0.0.) On the other hand, there are a huge number of user classes for which []= is perfectly safe.

My approach is to whitelist. You can examine the existing ruby base and stdlib classes and determine if there is a problem with any of them. There are several ways to proceed from there, but most likely, the data should be held in a data structure inside Psyche. All classes not registered as safe then are unsafe. A facility would be provided to register a class as safe. (HashWithIndifferentAccess comes to mind) Presumably, Psyche might probe a previously unknown class to see if it lists itself as safe.

The next issue is attribute assignment generally. Again, if a whitelist approach is taken, (and exact matches required), it become fairly simple to proceed. Indeed, both of these facilities might well specify that some inputs are safe but not others.

With this approach, you have the full ability to do attribute assignment where it is known to be safe. And if a maintainer does not want to bother, then their classes are simply not safe.

A global switch that says, "while processing the following string, consider everything as safe" would bring back the existing behaviour--which is exactly what would be needed in some cases.

Answer 20 · 2013-01-21T03:52:25.000Z

@Student Protecting setters is your problem, not Pysch's problem. Psych is not your object-guardian. The goal is to block unserializing attributes. Example: If people don't want symbols then they come out as ":value" rather than :value, as far as []= and "protecting it"... that is your job to protect. The rest of your argument, I'm not even gonna touch, I'll leave that to other folk to debate.

Answer 21 · 2013-01-21T07:51:02.000Z

The question to me seems to be how much pain a user must endure in order to make use of a tool. If YAML is to compete with JSON, then YAML cannot require the user to audit their entire object library for []= weirdness. What's more, I doubt that I'm the only one that did not know that YAML was calling []= until last week--people won't even know about the threat. You can sit back at say nasty things about people who do weird stuff with []= or you can make a library that doesn't regularly get implicated in security problems.

Again, what I'm saying is that safety requires that unsafe actions be screened out by default. To me, that suggests an audit of Ruby's base classes and a facility for classes to register themselves as safe to whatever degree.

Answer 22 · 2013-01-21T08:28:32.000Z

@Student: Why does this have to be so complicated? @trans's approach is the way to go - deserializing the YAML as if the ! directives didn't exist ensures that only 'primitive' Ruby types are created.

Answer 23 · 2013-01-21T10:40:07.000Z

Do we really needs this ? I have a strange feeling that making safe_load will be really hard and in the end nobody will use it because it will be so limited. Lets just not use YAML as something we accept from requests/external sources and just parse. If someone wants to do something like this he should explicitly say he wants to expose himself to problems that it carries with.

In my opinion we should use only/mostly JSON and Ruby. For config files ruby is so expressive that we can have instead of databas.yml like things something that will just load .rb file. aka

MyBaddAssWebAPP::Application.database do
  production do
    host: "localhost" 
     ....
  end
end

Or something similar to this. Do we need YAML at all ?

Answer 24 · 2013-01-21T15:44:38.000Z

@JakubOboza How is it going to be limited? The fact of the matter is it already has an unserializer to move non-primates into their Objects, at that point it's a matter of having that Object tell you what kind of non-primative the Object wants to be and then you deciding if you want that non-primative, or giving you the ability to block non-primatives (like symbols) right off the bat. Since it has to have a decision engine for them to begin with (see lib/psych/visitors/to_ruby.rb) the logic for rejection is not really all that hard to add in. The callbacks might be hard, but only in the sense of how do you do it properly, not how do you implement it.

Answer 25 · 2013-01-21T16:55:54.000Z

Yeah right. So first of all do symbols get ever garbage collected ? No. So basically someone can makes series of reuqests with dozens of symbols and make your VM grow to insane size and die. Performing this DoS attack is not so hard.

So making load_safe or safe_load don't solve a problem but probably creates new ones :)

Answer 26 · 2013-01-21T16:56:28.000Z

@JakubOboza Using Ruby is the exact opposite of the solution to this issue. It would create even more security issues --insurmountable ones.

@Student Is your whitelist solution more complicated than we need? I wonder if whitelisting prior to loading is necessary when we can just safe-load plain YAML and then instantiate any nodes we need proactively. In other words, we can simply apply our own whitelist solution after loading given the plain YAML. That way Psych doesn't need to handle it and all the additional code and api it would entail.

However, I could see a nice general-purpose method for traversing arrays and hashes to do these conversions would be helpful to make that dead simple.

Answer 27 · 2013-01-21T16:59:09.000Z

Why ? if you use ruby in config files and never autoparse yaml in requests ? I gave the example just to show that we don't need yaml.

Everything you need from serialization format is list of objects thingy [ ], hash/map/object { } and string "" and i know working is not perfect but thats all and thats the stuff that you have in JSON.

Answer 28 · 2013-01-21T16:59:19.000Z

@JakubOboza I think you should take a second, recollect your thoughts and come back when you are more rational and not spewing trash. Psych is what makes the non-primitives. So what you are saying is broken and flawed logic because safe_load would prevent the symbols from ever being created and would keep them as primitive strings.

Answer 29 · 2013-01-22T00:21:52.000Z

First, an apology. I forgot that in general YAML is creating instances of objects, then making calls on these instances or setting attributes on them directly. Some of my first comments were coloured by that error. Clearly, it is safe to set attributes on a newly created instance, so there is no concern there.

However, if a class uses a singleton object, this is no longer safe, as global state is being overwritten. (I observe no occurrences of "ingleton" in my ruby 1.9.3p327 source.)

[]=, is another matter entirely, however. []= is no more a setter than a method ending in ? is a boolean probe. In fact, I would argue that it is less so. There is nothing at all wrong with having some sort of generic class that implements []= as sending messages to other classes. It just breaks some assumptions.

Now as I mentioned before, someone who wants to use gem X (which depends on gem Y) should not have to do research to figure out if it is safe to do so with YAML parsing external input. And just because I need to use a gem that needs a gem that implements an obscure class that is not safe, why should I be prevented from using YAML in the normal way for my entire application?

I audited the rails lib classes yesterday, and as of 2.0.0.preview2, they are clean. It is easy enough to create a structure to hold this data. When a unknown classes is encountered, it can be queried to determine if it has a yaml_safe method. If it returns false, then not even attributes will be set. If it returns true, then []= will be called with abandon. If it returns a symbol, that method will be called instead of []=. If it returns something that responds to call.... Otherwise, the class is registered as unknown, and attributes will be set but []= will not be called.

Of course, the global settings can also be used, and the end user can explicitly register whatever sort of bizarre rules make sense for them. Let the user decide!

With such a facility, the community has an easy way to register the external-input safety of their classes, and 99.9% of the people can proceed in ignorant bliss.

I REALLY want to be able to use YAML for parsing external input. JSON is entirely too lowest-common-denominator. XML is stilted at best. Marshall practically requires both ends to be running the same everything. DrB is for internal use only.

Answer 30 · 2013-01-22T00:39:44.000Z

This shit is bananas. Seriously though, this thread turned from useful into a bag full of "come again say what?"

Answer 31 · 2013-01-22T01:08:58.000Z

@Student You are still making this more confusing then it is. More importantly, having developers add a yaml_safe? method to their classes does nothing to ensure they are actually safe.

Getting back to your whitelist notion however, I've been thinking about this and it might have some merit. Not exactly in the way you described (which again I think is overly complex) but just as an option passed into the safe load method. e.g. something like:

    YAML.load_file(file, :safe=>true, :whitelist=>{'!foo'=>Foo})

It's not strictly necessary, as I mention in my previous post. But it would be very convenient.

Answer 32 · 2013-01-24T16:17:55.000Z

I looked at the code a bit. Am I right in concluding the heart of this matter is Visitors::ToRuby#deserialize (here)? If so the main of the solution is to create a #deserialize_safe method that removes the unwanted tag resolutions of #deserialize (some can remain like !binary and !float).

The hardest part looks like it's just getting the safe flag down to the deserialize call so it can be conditioned.

Answer 33 · 2013-01-24T16:26:40.000Z

Float and Binary are primitives so shouldn't they be excluded by default without prejudice?

Edit: My 2 pence worth is that maybe there should be a SafeDeserializers and UnsafeDeserializers, where we hold ones that are considered always safe and anything else gets put into unsafe, if you run deserialize_safe then it will exclude all UnsafeDeserializers but if you pass in say :method then it will deserialize anything that applies.

module Deserializers
  module Safe
    module_function
    def my_unsafe_deserializer
      # => Do Work
    end
  end
end

YAML.load(data).safe_deserialize # => String
YAML.load(data).safe_deserialize(:my_unsafe_deserializer) # => Object

Or maybe so the logic makes more sense:

YAML.load(data).deserialize_all # => Objects
YAML.load(data) # => String
YAML.load(data).deserialize(:my_unsafe_deserializer) # => Object

I absolutely like the idea of having to be forced to choose to deserialize everything. It makes people think about what they are doing, where we have to be honest, most programmers now days don't really think twice about what they are doing... so I always like situations that make me think twice.

Answer 34 · 2013-01-24T16:55:05.000Z

@envygeeks String, Hash and Array are primitives too, but we can't exclude them. But maybe I am misunderstanding?

As a general rule of thumb, I am thinking it is okay to accept any class for which Ruby supports a literal constructor. Not sure about !range, but we do need to support all YAML standard types and that includes Float. As for Binary, I think (correct me if I am wrong) it's really just a String, so it should be okay. I know Symbol has been mentioned above. Is the only security risk with symbols that of a memory overrun? If so, YAML's probably not the right place to be concerned with that. Rather incoming file size should be limited. (Is there a way to set a size limit on the stream parser?)

Answer 35 · 2013-01-24T17:17:37.000Z

@trans I'm saying that Floats should be allowed to be parsed into their Objects by default.

The problem with the stream size limit is that even if you limit it, if this YAML comes over the public or public but private wire... and don't ask me why it would in the case of non-internal communication but at that point I would probably Marshal the object into a memory storage but we won't even get into that. Anyways, the point is that if it's coming from the internet it's only a matter of changing the symbol each time and a new symbol is created so the stream limit is nearly useless and it will always come down to being able to either white list objects that can be unmarshaled/deserialized or blacklist (the bad option IMO).

Ranges can be easy to safely handle:

str = "1..2"
str = Range.new($1.to_i, $3.to_i, $2.length == 3) if str =~ /^(\d+)(\.{2,3})(\d+)$/

While ghetto built it works.

Answer 36 · 2013-01-24T22:58:17.000Z

Symbols are a really ugly case. The only way to prevent DOS is to check to see if a symbols is already defined before deserializing it. I have no idea how that might be done. Easier would be to have a list of symbols that are allowed for the call--but very ugly. Furthermore, in practice, if I were wanting to fling arguments for ActiveRecordFind, with its nested conditions, the list of acceptable values changes as we navigate the tree. And if you thought the list idea was ugly before, now you KNOW it is.

There is an option that I have often wanted for Array.flatten, and it strikes me as useful here as well--the ability to specify a depth limit--only deserialize one level, or two, or five.

Answer 37 · 2013-01-24T23:00:28.000Z

@Student Occam's razor. You either accept all symbols or no symbols.

Answer 38 · 2013-01-28T21:25:32.000Z

To highlight the need for this feature, arbitrary YAML deserialization strikes Rails again.

Answer 39 · 2013-01-28T21:36:53.000Z

turning YAML into JSON? Utopia

Answer 40 · 2013-01-28T21:46:15.000Z

I actually prefer it when peeps turn my JSON into YAML. 😉

Answer 41 · 2013-01-29T05:32:31.000Z

@tenderlove could you possibly look at safe_yaml and maybe talk to @dtao about merging it's behaviour into Psych?

Answer 42 · 2013-01-29T05:48:16.000Z

@tenderlove and @postmodern: FWIW, my aim w/ the safe_yaml gem was/is pretty consistent with what @trans suggested: to provide a method which "loads the YAML as if no ! tags are present." In its current form the gem actually clobbers YAML.load, but I'd be totally fine w/ switching back to having a separate YAML.safe_load method.

Answer 43 · 2013-01-29T05:50:52.000Z

@dtao Having two functions creates a cleaner API with expected results but having load accept a second argument for safe_load as well allows for pure backwards compatibility. /cc @tenderlove @postmodern @trans

Answer 44 · 2013-01-29T10:04:35.000Z

+1 for separate function. OR load('---..', safe: true)

Answer 45 · 2013-01-30T02:52:02.000Z

+1

More vulnerabilities will be reported by various gems the coming days and weeks. Basically anything that uses the YAML gem to load arbitrary user input is at risk. There are several gems out there that do this unfortunately.

Given rails security strategy that it should be secure by default and the multitude of (breaking) changes that were made in the 3.x series to achieve this, I would logically think YAML.load should not be any different and be secure by default (if it followed the same strategy).

But I'm against breaking working apps. Would the following perhaps work? :

YAML.load works default in unsafe mode.
Calling YAML.safe! will make all subsequent YAML.load calls safe by default.
YAML.load can receive an option to override the global safety setting.

The downside is that any gem might call YAML.safe! for no valid reason. They should instead use YAML.load("---..", :safe => true)

Meanwhile I would advise all ruby developers to use the safe_yaml gem from @dtao, run all your tests, and if nothing breaks, keep on using safe_yaml until this issue is resolved.

Answer 46 · 2013-01-30T02:54:12.000Z

@lawrencepit I'm -1 on anything like YAML.safe! that introduces global state. I think YAML.safe_load would be best as it makes grepping easier.

Answer 47 · 2013-01-30T03:33:52.000Z

@lawrencepit Yeah, changing YAML.load's default behaviour would break pretty much everything (esp RubyGems). I'm against introducing complex state, it should be an explicit method-call or option which indicates that a specific YAML blob should be parsed in safe-mode. Explicit > implicit.

Answer 48 · 2013-01-30T03:49:36.000Z

@envygeeks @postmodern and @charliesome: I'm kind of torn on this one. I agree in general with avoiding global state and maintaining backwards compatibility; but I actually like @lawrencepit's suggestion a lot from an ease of use standpoint. App developers can change YAML.load to YAML.safe_load everywhere in their own application; but what if their apps depend on gems that also use YAML.load? The benefit of a permanent toggle like YAML.safe! (or I was thinking: YAML.disable_arbitrary_object_deserialization!) is that it can be called once—e.g., in a Rails initializer—and render the rest of an app secure against this particular exploit by default.

Yes, it may prove to be a breaking change in certain cases; but I'd argue those cases are likely to be more rare than the default case where user input needs to be deserialized safely. In these cases app developers could explicitly call YAML.orig_load or YAML.load("--- ...", :unsafe => true), or whatever, when they need to deserialize trusted input to arbitrary objects.

I'm thinking I probably will provide this option in safe_yaml in the next version. (This is probably strictly better than the gem's default behavior now, which is to clobber YAML.load no matter what. In retrospect I see that's probably not the best approach.)

Of course, we all know that the long-term solution is not a standalone gem but a safe option provided through Psych itself, hence this thread.

Answer 49 · 2013-01-30T03:59:30.000Z

I would make it so that the initiator of the global state could fix the problem on behalf of the software he or she breaks, in that if Gemcutter absolutely needs to deserialize an Object, we can whitelist it in safe_load!. Something like YAML.safe_load!(Object1, Object2) At the same time I can't disagree with the dislike of it, it introduces the need to create extra unit's and it introduces new code complexity, it also introduces the ability for one lib to break another lib by abusing safe_load! by not knowing that it's for the implementer of the lib, not the creator of the lib. That will happen eventually and it's gonna create some huge friction.

Answer 50 · 2013-01-30T04:06:29.000Z

The problem with a global switch, is that other software (RubyGems) may need to deserialize arbitrary YAML from trusted sources after safe_load! is called. Also there is no control on when safe_load! can be called. Developers should first determine whether the YAML input is from a trusted source (ex: not an arbitrary user). I sort of wish we tainted all user-input.

Whitelisting certain Classes is acceptable; perhaps maybe have a global whitelist?

Answer 51 · 2013-01-30T04:15:39.000Z

Pinging @drbrain so he throw in his opinion from the side of RubyGem.

Answer 52 · 2013-01-30T04:46:43.000Z

@envygeeks I hadn't thought about Gemcutter, but that is an excellent case for the need for a white-list (Gem::Version, Gem::Dependency, etc). Given that Gemcutter is a Rails 3.2 app and uses RubyGems to unpack/deserialize the gziped YAML metadata, rubygems.org is probably vulnerable right now.

Answer 53 · 2013-01-30T04:47:55.000Z

FWIW, Gem::Requirement was a vector in my RCE PoC. You should be very careful when defining a whitelist, especially for classes defining yaml_initialize

Answer 54 · 2013-01-30T05:02:48.000Z

@charliesome you're exploit also used a bunch of Rack and ActiveSupport classes.

Answer 55 · 2013-01-30T05:12:23.000Z

Point still stands though. Whitelisting introduces too much complexity for my liking. All I really want is a way to load YAML as if the ! parts didn't exist. Nothing more, nothing less.

Answer 56 · 2013-01-30T06:02:25.000Z

@postmodern @envygeeks I told the Gemcutter developers more than a week ago now. I suppose they've deployed a patch as for github's jekyll. Anyway, would you mind proposing a patch here? It's probably time to stop all of this and you seem to have sufficient knowledge for a first proposal.

Answer 57 · 2013-01-30T07:14:05.000Z

@blambeau is there an issue open or a CVE registered? It looks like they only upgraded to Rails 3.2.11, which fixes the Rails vulnerabilities. However, rubygems.org is still reading the metadata of pushed gems.

Answer 58 · 2013-01-30T07:18:12.000Z

@postmodern Not that I know, but I haven't be much involved. As far as I know, they were working on a patch.

Answer 59 · 2013-01-30T07:19:43.000Z

@blambeau who is they? If it's Github that's already fixed... the patch you saw for Jekyll was an after the fact upstream pull request of something that was already patched in, if it's RubyGems somebody needs to get on their ass about it because RubyGems already has enough problems trying to stay up, this will only make it worse.

Answer 60 · 2013-01-30T07:32:53.000Z

I was thinking about the way Psych handles user defined tags (which it basically inherited from Syck), and the idea of a whitelist. The two don't really mesh. Psych currently has these methods for globally defining tags.

add_builtin_type(tag_type, &block)
add_domain_type(domain, tag_type, &block)
add_private_type(tag_type, &block)
add_ruby_type(tag_type)

There's also add_tag(tag, klass) though I am not exactly sure how that works compared to the others. The above four all work basically the same, they just adjust the tag_type to meet different naming conventions.

I can understand the original intention for creating global interfaces for defining tag types --after all, they are analogous to classes and classes are defined globally. But, while that might had made sense when _why originally put together Syck, it's become clear, via this thread, that the reality is actually specific to the use case and not safely global at all. It's clear because using a safe_load would make all such type definitions impotent, and a whitelist would use them selectively. When you really get down to it, it's not good to have global type definitions at all. These types are generally application specific and crossing them between various dependencies is nothing but a recipe for an eventual pain in the ass.

So it strikes me that the answer to this whole issues lies in re-factoring the way Psych handles types altogether. By default YAML.load would work as it already does handling YAML core tags, the yaml extra tags (like omap) and the Ruby tags. For refined loading, definitions for the later could all be provided as constants. Then we would do, for example,

    myapp_tags = YAML::RUBY_TAGS.merge(
      :foo => Foo,
      '!tag:hospital.com,2003:patent' => Patent.method(:new),
      '!bar' => ->{ |val| Bar[val] },
      /(x|y)/ => { |type, val| Factory(type, val) }
    )

    YAML.load_file('foo.yml', :tags=>myapp_tags)

In this way, every application that uses YAML has full control over how that app handles type instantiation without any chance of global conflicts. And it is easy to use. In the example I tried to provide each possible way a tag could be defined --using Symbol, String or Regexp for the tag, and a Class, Proc/lambda or Method instance for the instantiation.

Also, this actually should not be too difficult to code. For the most part, it just means that the load method would dynamically alter the internal type tables and restore them to the default after each use. To optimize we could have a YAML::Loader instance that accepts the tags hash, and can be reused, e.g.

    yaml_loader = YAML::Loader.new(:tags=>myapp_tags)
    yaml_loader.load_file('foo.yml')

That would also make it easier for some projects to refactor their code.

Answer 61 · 2013-01-30T08:05:22.000Z

@envygeeks "they" are guys who work for the whole ruby community on a daily basis, mostly for free. I know that the issue here is important, but "getting on their ass" is not necessarily appropriate. That being said, I'm very concerned with this issue too which is the reason why I called them in the first place. But don't get to a new drama, please. I suppose this should make the job. @tenderlove @drbrain @qrush @evan @evanphx, could we have the current status on your side? Any help/free time needed? Please call if it's the case.

Answer 62 · 2013-01-30T08:13:17.000Z

@blambeau You are creating the drama not me. You are reading into something negatively and then telling somebody it's wrong when you should have just kept your fingers away from the keyboard. But I would expect nothing more than everybody to confuse it with "oh hey, must be mad" right?

Also, how about instead of pinging them you file a ticket on their projects? Or should I get off my ass to do it?

Answer 63 · 2013-01-30T08:27:59.000Z

@envygeeks all my apologies if I miss the right way to handle it then. Be my guess, I think you'll probably do better (the ruby ecosystem is sooooo strange. any way you try something, it's always the wrong one; strange).

Answer 64 · 2013-01-30T08:32:36.000Z

I don't think you did it wrong, I just think it might be better to file a ticket or email them (probably better to email since this could be critical) some people (especially people with major libs) get far too many pings via Github so sometimes they don't catch the pings. My ping to them earlier was ok if it got ignored, but this one might turn out to be important. Maybe @postmodern or @charliesome knows how to get ahold of them via email.

Answer 65 · 2013-01-30T08:34:19.000Z

@envygeeks That's what I said. I had a (email) contact about this with them a week ago. As far as I know, they were working on a patch.

Answer 66 · 2013-01-30T08:37:31.000Z

@blambeau ah okay, that's why I asked who "they" were and my "get on their ass" was just a hint to ping them again to see what the status is and see if they need help and where, I'm more than willing to try and help though I don't know if they want it.

Answer 67 · 2013-01-30T19:54:13.000Z

I wrote a dirty monkey-patch gem that does this today: https://github.com/rapid7/psych_shield

Answer 68 · 2013-01-30T20:24:49.000Z

Also, the load method should probably be renamed unsafe_load or at least put a clear warning somewhere. Right now it's pretty hidden that psych is not for parsing untrusted input.

Answer 69 · 2013-01-30T23:03:15.000Z

@jordan-thoms that's an API breaking behavior. While I agree with you, I cannot agree with you because it breaks the API and it's expected behavior one should never break their spec unless absolutely necessary.

Answer 70 · 2013-01-30T23:07:01.000Z

I think he was suggesting to output a deprecation warning when using YAML.load, which behaviour would remain exactly the same otherwise. The warning would tell you to use YAML.safe_load or YAML.unsafe_load.

Answer 71 · 2013-01-30T23:07:54.000Z

Can we expedite this issue a little? The recent rubygems.org fiasco shows that this is something that's really necessary.

The rubygems.org team knew about the vulnerability for about a week before the exploit gem was uploaded. If YAML.safe_load existed, it would have been easier for them to fix the issue and prevent the hack from ever happening.

Answer 72 · 2013-01-30T23:09:59.000Z

I would fight acceptance unless the deprecation also mentioned that you could use YAML.load("---", safe: true) and vice verse and only if you do not include the second argument with safe in the hash. Since the safe could be a backwards trigger to YAML.safe and YAML.unsafe

Answer 73 · 2013-01-30T23:16:48.000Z

@charliesome and @envygeeks: I will admit it's an imperfect solution, but for now any site w/ YAML.load-related vulnerabilities can use safe_yaml and protect against deserialization exploits. Where the old behavior is needed they can change YAML.load calls to YAML.orig_load.

I think tonight I'll implement the interface @envygeeks just suggested as part of safe_yaml. Maybe Psych can then go ahead and do the same or something similar.

Answer 74 · 2013-01-30T23:47:37.000Z

@envygeeks Yes, it's an API breaking behavior, and the API needs to be broken.

The expectation that parsing a data serialization format won't eval its contents as code is so much more fundamental than the expectation that methods will continue to exist with the same names.

Answer 75 · 2013-01-31T06:23:37.000Z

@charliesome Totally agree. Developers need an easy fix.

Now I am starting to think that a safe/allow option would be the least intrusive solution. This would allow enabling safe-mode with YAML.load and YAML.load_file. The option would accept either a Boolean or an Array of whitelisted Classes.

I might take a stab at shoehorning a whitelist into Psych, based on the existing whitelist implementations.

Answer 76 · 2013-01-31T07:26:09.000Z

FWIW, I've implemented several of the suggestions here in safe_yaml. I don't think my design decisions are likely to make everybody happy; but at least the library is there for applications that need a quick security patch.

Answer 77 · 2013-01-31T07:42:08.000Z

@charliesome @dtao @postmodern @envygeeks As far as I understand the safe_yaml/safe_load proposal here, it won't be enough for rubygems.org. Gem metadata use object deserialization.

Answer 78 · 2013-01-31T07:46:36.000Z

Yeah they briefly mentioned that earlier when I suggested we use @dtao's gem for the sake of quickness but there was so much going on I didn't ask why because they also said that Evan and Tenderlove were all over it trying to get a patch up ASAP and I was trying to help in other places too.

Answer 79 · 2013-01-31T08:01:20.000Z

It's looks like rubygems.org is going to use this custom whitelist. However, it will take a while for a new version of Psych to be released, so safe_load is a valid option for other apps.

Answer 80 · 2013-01-31T09:29:16.000Z

+1 for whitelist of basic ruby classes

Answer 81 · 2013-01-31T16:19:11.000Z

I'm definitely -1 for safe_load. To me it is clear that Yaml is not mainly supposed to be used as a Marshal library. So it should be safe by default. load and all other related methods should be safe by default. If you want to use an unsafe version (for dump/marshalling purposes) you should be explicit about it like load_unsafe or use dump and marshal since those names are known to be insecure by definition when used with untrusted data.

For those who mentioned using JSON, this format is not nearly as rich as YAML. It doesn't even support date/time types.

Answer 82 · 2013-01-31T16:44:54.000Z

+1 @rosenfeld I can't agree more.

EDIT: -1, therefore

Answer 83 · 2013-01-31T19:08:32.000Z

+1 @rosenfeld and others for a whitelist by default.
A whitelist is the best way to go, since the problem is deserializing a class whose writer never thought about deserialization.

Initial whitelist = basic core types

Optional parameter specify whitelist, either as an ok Class, or as a filter using ===.

As an extreme example, to load basic types, plus OkClass, descendants of OkModule or any class defined in MyNamescope but disallow Floats:

MY_WHITE_LIST = [
  OkClass,
  OkModule,
  ->(klass){ klass.name.start_with? 'MyNamescope::' },
  ->(klass){ raise "No floats buddy" if klass == Float },
]
YAML.load(str, MY_WHITE_LIST)

to load anything (like currently):

YAML.load(str, Kernel)

Precise list of basic core types is as @dtao's, including symbols:
Hash, Array, String, Numerics, Date, Time, Symbol, true/false/nil
I'd include Set too.

Answer 84 · 2013-01-31T19:48:46.000Z

There is a misconception running through this thread: Classes do not have a one-to-one correspondence with tags.

Classes have a default tag, but you could define any tag you like and have it resolve to any class you like. So any white lists have to be by tag, not by class.

But it's even deeper then that b/c: Your tags might not be my tags; my tags may not be your tags. Different apps can use the same tags for different classes (actually it's instances, not classes, but I think Psych could be simplified to limit it to classes).

Answer 85 · 2013-01-31T20:17:36.000Z

Hey. So, I've gotten pretty far in refactoring Psych's deserialization code, solving this issue in full. You can see my current progress here.

But is there someone familiar with the code that I can rap with? There are some sticking points that I need feedback on. (The first being, what's up with encode_with and init_with?)

Answer 86 · 2013-01-31T23:00:51.000Z

This still feels like inversion of control to me. It is not YAML's job to try to guess which classes are safe. It is not some poor dev 10 levels down the include chain's. It is the class's responsibility. Certainly, we want the caller to be able to override default behaviour, but default should be up to the class.

By the way, YAML "safeness" is NOT an inheritable attribute of a class! Ruby has dirty inheritance, so just because []= is safe for C does not mean that it is safe for D < C.

Trans indicates that there is actually a level of indirection, which is fine. YAML is still instantiating instances of classes, and it is the class which knows what happens when methods are called on its instances. Not the tag.

Answer 87 · 2013-01-31T23:02:20.000Z

Are we really back on setters?

Answer 88 · 2013-01-31T23:39:53.000Z

@envygeeks Elucidate, please.

Answer 89 · 2013-01-31T23:44:15.000Z

Not worth it.

Answer 90 · 2013-02-01T00:25:07.000Z

So my work is based on the specification on Schemas. To create a "whitelist" you instantiate and flesh out a new Schema instance.

  my_schema = YAML::Schema.new do |s|
    s.tag '!foo', Foo
    s.tag '!bar', Bar
    s.tag /!baz:(.*?)/ do |tag, md|
       Factory(md[1])   # must return a class
    end
  end

Then you can pass that schema in as the white list.

  YAML.load_file('foo.yml', :schema=>my_schema)

Answer 91 · 2013-02-01T00:41:06.000Z

@trans I like that a lot

Answer 92 · 2013-02-01T02:17:38.000Z

If this goes the way of a whitelist, keep in mind not all classes in the list may be present at runtime, so stringification (as in psych_shield) is preferable.

Answer 93 · 2013-02-01T02:54:18.000Z

@hmoore-r7 In the case of a safe-load method/option, if the Classes are not loaded when YAML.load is called then Psych will raise an error when it tries to resolve the Class. Also, I worry about allowing arbitrary Strings into a whitelist.

Answer 94 · 2013-02-01T04:13:13.000Z

@postmodern I was referring to a solution above that provided the schema/whitelist at load time versus YAML.load time. I may have misread the context, but it is an important distinction when creating a whitelist.

EDIT: It looks like @trans's approach wouldn't eval until YAML.load was called, I misread, thanks

Answer 95 · 2013-02-01T08:58:49.000Z

Not to add more noise to the thread, but I have added safe-loading support with whitelisting. We override resolve_class and added a revive_symbol method, to control all access to Classes/Modules and interment of Symbols. @tenderlove also requested that I add safe_load, safe_load_file and safe_load_stream methods, instead of shoe-horning in a :whitelist option.

Feedback welcome. The more implementations the better I say.

Answer 96 · 2013-02-01T09:17:02.000Z

👎 to @tenderlove's suggestion, I really wish the whitelist option was there too :(

Answer 97 · 2013-02-01T13:20:46.000Z

@postmodern I wouldn't characterize it as "shoe-horn". It really wasn't difficult to add the option. I am concerned about adding additional public interface methods b/c then going forward they have to be supported --but when we get down to doing things as they really should be, they aren't necessary.

What do I mean by this? Your WhitelistedToRuby solution is nice. It's straight-forward, it fits cleanly into Psych's design and, of course, addresses the immediate issue at hand. But... this issue brings to light deeper issues in Psych's design. I have tried to make clear these issues in my previous posts. As it turns out, all of the issues derive from one core issue: Psych does not follow the YAML specification with regards to Schemas. For example, from the spec:

The failsafe schema is guaranteed to work with any YAML document. It is therefore the recommended schema
for generic YAML tools. A YAML processor should therefore support this schema, at least as an option.

It is clear from the specification that what schema is used is meant to be an option. By supporting this, not only is the immediate issue at hand addressed, but all the other issues I mention as well.

I know its a more difficult line of development to implement schema support, and requires more extensive changes to the code, but we can also take it as an opportunity to really improve some of the design. And, it seems to me, it's an opportune time to do it with Ruby 2.0 and all.

Answer 98 · 2013-02-01T13:25:00.000Z

It's gonna take a lot of work to get that kinda change into Ruby 2.0 considering it's release date is supposed to be a couple of weeks away, I mean I understand things could be delayed but isn't it already feature frozen?

Answer 99 · 2013-02-01T13:27:00.000Z

@envygeeks is correct. There's no way this will make its way into Ruby 2.0.0 (unless @tenderlove pulls some strings, but given it's a few days away from code freeze, that's unlikely)

Answer 100 · 2013-02-01T13:50:18.000Z

No it can't go into Ruby 2.0. People will need to gem install psych regardless. I only meant b/c of Ruby 2.0 the overall time frame is a good one to make larger improvements to Psych.