ruby/psych

YAML.safe_load fails when a string contains a non-existent date

Opened this issue ยท 22 comments

Fjan commented

YAML.safe_load will raise an exception when you try to load text that happens to contain a sequence of numbers that looks like a date but is not:

s="2016-02-31"
YAML.safe_load(s.to_yaml)
# =>  Psych::DisallowedClass: Tried to load unspecified class: Date

Using YAML.load instead of safe_load works fine and text that contains a correct date works fine too. But this can be used to raise an exception on any application that uses YAML.safe_load on user provided text (accidentally or otherwise)

I guess this is because Psych leans on the Date class to determine what is valid date or not. I'm not really keen on writing my own date validation logic. Do you have a suggestion for how to fix this?

Fjan commented

I took a brief look at the code but I don't understand yet why a valid date will not raise and exception. So it's doing validation somewhere anyway.

A workaround is to simply allow the Date class. Another option would be use modified tokeniser for safe_load so it only recognises allowed classes and turns the rest into strings.

I took a brief look at the code but I don't understand yet why a valid date will not raise and exception. So it's doing validation somewhere anyway.

I don't follow. A valid date will raise an exception with safe_load:

irb(main):003:0> Psych.safe_load '2013-01-01'
Psych::DisallowedClass: Tried to load unspecified class: Date
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/class_loader.rb:97:in `find'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/class_loader.rb:28:in `load'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/class_loader.rb:39:in `block (2 levels) in <class:ClassLoader>'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/scalar_scanner.rb:70:in `tokenize'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/to_ruby.rb:60:in `deserialize'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/to_ruby.rb:123:in `visit_Psych_Nodes_Scalar'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/visitor.rb:16:in `visit'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/visitor.rb:6:in `accept'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/to_ruby.rb:32:in `accept'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/to_ruby.rb:311:in `visit_Psych_Nodes_Document'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/visitor.rb:16:in `visit'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/visitor.rb:6:in `accept'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych/visitors/to_ruby.rb:32:in `accept'
    from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/2.3.0/psych.rb:302:in `safe_load'
    from (irb):3
    from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:11:in `<main>'
irb(main):004:0>

A workaround is to simply allow the Date class. Another option would be use modified tokeniser for safe_load so it only recognises allowed classes and turns the rest into strings.

Both of these are major changes. I wouldn't be willing to do that in a bugfix release. I think returning strings would be fine, but I'm definitely not in favor of allowing Dates by default.

Fjan commented

I mean this:

Psych.safe_load('2013-01-01'.to_yaml)  # => "2013-01-01" 
Psych.safe_load('2013-02-31'.to_yaml)  # Psych::DisallowedClass

So some date parsing is apparently already happening somewhere, I just haven't been able to find it in the code, perhaps we can fix it there. I agree that allowing Date as an extra class in a bug fix release would not be appropriate, but this is a good workaround for people who have a problem and want a quick fix:

Psych.safe_load('2013-02-31'.to_yaml,[Date])   # => "2013-02-31"

When you do '2013-01-01'.to_yaml, Psych looks at the string and realizes that it is ambiguous, so the resulting YAML is quoted:

irb(main):002:0> require 'psych'
=> true
irb(main):003:0> '2013-01-01'.to_yaml
=> "--- '2013-01-01'\n"
irb(main):004:0>

Note the single quotes around the date string in the resulting YAML.

Since '2013-02-31' isn't a valid date, it considers this to not be an ambiguous value, so it doesn't add the quotes around it:

irb(main):004:0> '2013-02-31'.to_yaml
=> "--- 2013-02-31\n...\n"
irb(main):005:0>

The single quotes tell the parser "this is absolutely a string, do not check for other values". The second value isn't ambiguous when dumping the YAML, but is ambiguous when parsing. Maybe that helps?

Fjan commented

Yes, I just realised that. So another avenue would be to fix .to_yaml to always quote strings that look like a date, that's probably much easier to do.

Perhaps we should have a .to_safe_yaml that can be used on user supplied input and can be relied upon to only create scalars. Sigh. But it's probably better to move to JSON (which I was hoping to avoid).

Any news on this?

[2] pry(main)> str = "tt: 2012-09-24T13:00:00"
=> "tt: 2012-09-24T13:00:00"
[7] pry(main)> YAML.safe_load(str)
Psych::DisallowedClass: Tried to load unspecified class: Time
from /usr/share/ruby/psych/class_loader.rb:97:in `find'

What is the problem here?

What I'm really after is stop Psych from parsing the date and then dumping back to YAML in a different format. I want to read the YAML, change some values and dump back to YAML. The problem is that once the date is parsed, then dumping back to YAML results in

---
tt: 2012-09-24 16:00:00.000000000 +03:00

This is why I tried whether safe_load will help. It is same problem as described here http://stackoverflow.com/questions/32730275

I just got hit by this after the hashie library updated and they now call safe_load. I would really like to request that this method just returns strings for anything that is deemed unsafe to be parsed.

The difference is that on the end of the chain I can work with sanitized data. If I really want a Date object I can do the conversion myself. Or I can use it as a string, but if reading in a user supplied configuration file leads to a backtrace, than that's it. Game over. Epic fail. Given that a library like hashi is calling this means I have no control over the call to load vs safe_load, and even if I did, I would prefer to be safe and decide how I validate date strings before converting them, but I don't want it to become impossible to read in a perfectly valid yaml file.

My program doesn't need to have values automagically converted, but asking every user of my library for the eternity to never ever put an unquoted date in a yaml file is impossible.

We are running into this issue as well. A user provided String that is attempted to be serialized and put into the database is 0000-00-00:

require 'psych'

original_valid_content = "---\n- >-\n  test\n- >-\n  1976-01-02"
original_invalid_content = "---\n- >-\n  test\n- >-\n  0000-00-00"

# Works
deserialized_content = Psych.safe_load(original_valid_content)
new_serialized_content = Psych.dump(deserialized_content)
# => "---\n- test\n- '1976-01-02'\n"
new_deserialized_content = Psych.safe_load(new_serialized_content)

# Breaks:
deserialized_content = Psych.safe_load(original_invalid_content)
new_serialized_content = Psych.dump(deserialized_content)
# => "---\n- test\n- 0000-00-00\n"
new_deserialized_content = Psych.safe_load(new_serialized_content)
# => Psych::DisallowedClass: Tried to load unspecified class: Date

Looking at tenderlove's comment here maybe it might make most sense to be more aggressive with quoting on dump?

Ensure that a string that is not a valid date like 0000-00-00 gets quoted and treat it as "ambiguous". What would be a drawback of that type of solution?

Ensure that a string that is not a valid date like 0000-00-00 gets quoted and treat it as "ambiguous". What would be a drawback of that type of solution?

Honestly, I can't think of any drawback. I'd merge a commit that does that.

On a different subject, maybe in the future we should add a safe_dump. I want to guarantee that objects can round trip through dump and load, but I'm not sure if we can make the same guarantee about dump -> safe_load. This case seems like we can though.

Is there any way around the issue? A date notation like 20190101 works fine, however I do need 2019-01-01...

If you want to make sure a YAML value is always treated like a string, you can cast it explicitly by prepending !!str .
Example:

pry(main)> YAML.safe_load("!!str 2019-01-01")
=> "2019-01-01"

Can anyone explain what is the problem with @najamelan's suggestion?

I would really like to request that this method just returns strings for anything that is deemed unsafe to be parsed.

Users can put quotes around dates themselves, but in many cases that may prevent interoperability between Ruby/Psych-based apps (in my case gollum) and other applications that parse YAML, when the latter do support the parsing of dates. If Psych just returned a string for anything that looks like a date, the user wouldn't need to worry about how to specify their data.

ping

Users can put quotes around dates themselves, but in many cases that may prevent interoperability between Ruby/Psych-based apps

Fully agree with @dometto.

The YAML spec does support "date" natively: https://yaml.org/spec/1.2.2/

Example 2.22 Timestamps

date: 2002-12-14

Example 2.23 Various Explicit Tags

not-date: !!str 2002-04-28

It is clear that quoting or escaping the date is not the correct solution (according to the spec).

Is there a plan to resolve this issue? @najamelan's suggestion of "just returns strings for anything that is deemed unsafe to be parsed." seems reasonable as it allows for the user to handle any "unsafe" cases.

Thanks!

Just a comment about the 1.2.2 spec and the timestamp example: None of the recommended schemas in YAML 1.2 provide a timestamp tag anymore, and the example 2.22 is a leftover that should probably have been removed.
I created an issue: yaml/yaml-spec#268

Thanks @perlpunk for clarifying the spec -- is the YAML way for handling a date/time now a string?

@ronaldtse yes, in YAML 1.2 it would be loaded as a string, and the app can turn it into a date.
Or the specific YAML loader can add a custom tag resolver the same way it was working in 1.1.
In 1.1 there was no standard set of tags to be implemented, so some implement merge, date, binary, and some didn't.
In 1.2 there is a recommended Core schema that every YAML library should implement. That will hopefully increase interoperability. But it has less tags than the complete 1.1 tag repository.

Was dealing with a similar date parsing issue in js-yaml: nodeca/js-yaml#477, and I'm sorry to disagree with @ronaldtse, but the spec is definitely impossibly ambiguous about those timestamp examples and we can't simply use RFC 3339 (or even a faithfully minimal extension of it) to support those examples.

To quote myself:

Just found this and I've got to say, the spec is way to ambiguous on its definition of timestamps. YAML 1.2 ยง10.1. Failsafe Schema doesn't actually provide any guidance about what to do with Example 2.22 Timestamps even though it's implied that it shouldn't need an explicit tag. As an editor of other specs, I'd call this a spec bug, not just an implementation one. So congrats, you're both right!

I'm inclined to say we should defer to RFC 3339 Date and Time on the Internet: Timestamps ยง5.6 Internet Date/Time Format, which covers the canonical, iso8601 and date examples, but I really don't know what to do about spaced: 2001-12-14 21:59:43.10 -5. If it were me, I'd say allowing space would go from RFC 3339's:

   time-numoffset  = ("+" / "-") time-hour ":" time-minute
   time-offset     = "Z" / time-numoffset

   partial-time    = time-hour ":" time-minute ":" time-second
                     [time-secfrac]
   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-time       = full-date "T" full-time

to instead have:

   full-time       = partial-time [" "] time-offset
   date-time       = full-date ("T" / "t" / " ") full-time

But then we don't either allow 2019-03-13 20:18:42 +0000 (no colon) or 2001-12-14 21:59:43.10 -5 (not even remotely close to a time-offset).

I think that a well specified standard would require not just that the date-time match the ABNF, but that it be a valid date. If we deferred to RFC3339, it says:

    date-mday       = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                             ; month/year

Which pretty clearly implies 2016-02-31 is not an RFC3339 date.

Personally, I think the horses have left the barn and the spec should suggest a reasonable middle ground. Most YAML parsers have date parsing, it's acting differently everywhere, both inbound and outbound. We create standards to avoid this (well, in theory ;-).

Psych::ClassLoader::ALLOWED_PSYCH_CLASSES = [ Time ]

module Psych
  class ClassLoader
    ALLOWED_PSYCH_CLASSES = [] unless defined? ALLOWED_PSYCH_CLASSES
    class Restricted < ClassLoader
      def initialize classes, symbols
        @classes = classes + Psych::ClassLoader::ALLOWED_PSYCH_CLASSES.map(&:to_s)
        @symbols = symbols
        super()
      end
    end
  end
end 

It works for me http://sundivenetworks.com/archive/2021/tried-to-load-unspecified-class-time-psych-disallowedclass.html