yaml/pyyaml

PyYAML 4.2 Release Plan

ingydotnet opened this issue ยท 48 comments

Synopsis

See the project planning page: https://github.com/yaml/pyyaml/projects/1

  • Make release/4.2 with current master
  • Make PR to revert #74 ( #194 )
  • PyYAML team fixes broken pyyaml-build system
    • Builds wheels with libyaml-2.1 linked in
  • Fix any other 4.2 blocker issues
  • Merge in successor to (#74, #189) IF it reaches approval consensus
  • Release 4.2 to PyPI
  • Continue to work on successor if not part of 4.2
    • Bump version to 5.1 when merged

The PyYAML Release Situation

The most recent PyYAML, 3.12, was released Aug 2016. At that time, Kirill
turned over maintenance of PyYAML and LibYAML to @sigmavirus24 and @ingy .
Since then about 20 PRs have been applied to PyYAML and about 40 to LibYAML.

PyYAML has a release builder: https://github.com/yaml/pyyaml-build
It builds PyYAML wheels against specific versions combinations of
(Python, PyYAML, LibYAML).

This builder no longer works and it's complicated by the fact that the build
process for libyaml has been changed. The PyYAML team is working hard to fix it.

The 4.1 release attempt was rushed out because we knew that PyYAML 3.12 doesn't
work with Python 3.7 which went out this week. We had a fix for that in master,
and so we tried to get it out in time for 3.7.

We thought we had a Jenkins build system that would build the wheels as soon as
the sdist was uploaded. So we pushed the release only to find out minutes later
that this build system wasn't set up to build with libyaml. We were going to
have to use the pyyaml-build system.

After 48 hours of work on the windows/wheels system we decided to pull the plug
on 4.1. We didn't have wheels and we were getting reports of other things that
were wrong. We didn't have a sense that the build system was going to get fixed
soon, and we are all volunteers with limited time.

Soon after the release I learned about PR #74 and was completely surprised to
find that something this big went in without my seeing it. Looking back now I
remember that I had a lot going on in my life at that specific time.

#74 is a non backwards compatible change at the most basic level. It changes
how the dump and load functions behave.

The intent of the change is a good one:

  • Currently PyYAML has the sugar-API: dump, load, safe_dump, safe_load
  • PyYAML has had that API since version 3.1 (April 2006)
  • load() is trivial to exploit on untrusted data
  • Change load and dump to be aliases to safe_load and safe_dump
  • Add alarming (danger_*) new functions for the old load and dump

But this change has contentions:

  • It's non-back-compat and is going to affect a ton of existing code
  • The name danger is misleading when used in completely safe ways
    • In addition, danger_dump is not known to be exploitable in any way
  • PyYAML has known about this and had a safe_ solution in place from the start
    • People are just not feeling comfortable with the defaults

The change is important, worthy of a major release, but is not ready to
be part of PyYAML in its current form. A new PR, building from #74 and #189
should be worked on.

The Current Plan Forward

We need to get PyYAML released soon, if only for the Python 3.7 release. We
can't make any release at all until the build system works again. IOW, we
couldn't even re-release 3.12 right now.

The #74 API change is big and it is more important to get it right than to rush
it out. ie It load() may be a big can of gasoline, but nothing's on fire. ie
#74 doesn't "fix" anything. It just changes a default to something that's
always been safe and available.

There are 60 other changes that I'd like to tackle in the next release, while
at the same time taming a broken release process. My hope is that when we
figure this out, it will be easy to put out PyYAML releases on a regular basis.

We went from 3.12 to 4.x because this was a big release. It's big with or
without #74. I would like to see PyYAML 4.2 get out in the next few days.

If the successor to #74 / #189 is ready and approved by the time we are ready
to upload 4.2, it can go in.

If not, then I think it should be the focus of a 5.1 release. It's a big enough
change to trigger a major release. It should be in the first release of either
4.x or 5.x.

imo #74 is the most important change to keep in the 4.x release -- without it this is still a thing:

>>> yaml.danger_load("!!python/object/apply:os.system\nargs: ['echo I am a shell, wow']")
I am a shell, wow
0

I'd be pretty sad to see load() go back to its dangerous defaults.

Bump of major version number allows compatibility break, but it should be explained for everyone. And removal of release is worse, than just blacklisting release number as buggy (you can see openstack global requirements with a lot blacklisted versions).
P.S.
Today is bench of open non-destructive pull requests with bugfixes, which affect users and I expect, that some of them can be in the next release

Thank you for working on this @ingydotnet. I really appreciate both the work going into sorting this out, and the clear communication above. I'm very pleased to hear that the project is moving forward again ๐Ÿ‘

cdent commented

This is clearly a tricky situation but it seems like this plan can address it:

  • Whatever the eventual outcome on #74 and #189 it's clear that 4.1 was broken in several ways (including bad wheels and inconsistencies like #187). Whether removal was the right thing (#192) is moot at this point because it's already happened, some kind of plan needs to happen for moving forward.
  • There's desire to have a working release for Python 3.7.
  • There isn't agreement on how safe-by-default ought to work and the implementation in #74 broke people in weird ways (by changing what safe means). Getting agreement on how it should be done is important for the long-term health of PyYAML, otherwise, people will be presenting complaints and issues to a group of maintainers who never decided what right was, and it'll get confusing and stressful.
  • If something was unsafe for several years, and it unsafe for a few more weeks, is that the end of the world?

So this plan seems okay. It addresses the need for a py3.7 release, avoids breaking people (other than #192) and gives time to figure out how to do safe-by-default well.

Perhaps a bare Python 3.7 compatibility release could go into something like 3.13 and 4.2 could then include #74 and #189 and recent changes?

The missing compatibility with 3.7 currently breaks usage of PyYAML. This is definitely worse than not having #74 and #189 done properly I think.
Note: Homebrew just rolled out the Python 3.7 update. All packages using PyYAML are broken right now.

that would unblock python3.7 for me:

git checkout 3.12 -b 3.12.post1
sed -i 's/3\.12/3.12.post1/g' -- setup.py
virtualenv venv
venv/bin/pip install cython
venv/bin/python setup.py sdist

This produces a source distribution with working .c file that's installable in python3.7:

$ ls dist/
PyYAML-3.12.post1.tar.gz
$ venv/bin/pip uninstall cython -y >& /dev/null
$ venv/bin/pip install dist/PyYAML-3.12.post1.tar.gz
Processing ./dist/PyYAML-3.12.post1.tar.gz
Building wheels for collected packages: PyYAML
  Running setup.py bdist_wheel for PyYAML ... done
Successfully built PyYAML
Installing collected packages: PyYAML
Successfully installed PyYAML-3.12.post1
$ venv/bin/python --version
Python 3.7.0
$ venv/bin/python -c 'from yaml.cyaml import CLoader'
$

EDIT (so it's clear what the above does):

it applies this diff:

diff --git a/setup.py b/setup.py
index 9dc5e8d..0a1db91 100644
--- a/setup.py
+++ b/setup.py
@@ -1,6 +1,6 @@
 
 NAME = 'PyYAML'
-VERSION = '3.12'
+VERSION = '3.12.post1'
 DESCRIPTION = "YAML parser and emitter for Python"
 LONG_DESCRIPTION = """\

(just a version bump)

and then produces a new source distribution with freshly built .c file (due to new-enough cython being installed alongside the setup.py invocation)

@ddanier @asottile your wish has (almost) come true...

@nitzmahone and I just released https://pypi.org/project/PyYAML/3.13b1/#files which is a rebuild of 3.12 with the latest cython. It includes 9 out of the 10 windows wheels we want to ship.

I expect the final 3.13 to be out in 24-48 hours. @nitzmahone is running tons of ansible tests to give us confidence.

We realized today that putting out a 3.13 PyYAML with a 0.1.7 LibYAML could happen a couple days sooner than a 4.2 with 0.2.2, so we've decided that having PyYAML work on Python 3.7 comes first.

This hasn't slowed down our work on 4.2, which we expect out possibly a week after 3.13. There's even a decent chance it can include a proper safe-load-by-default fix. @ddanier, remember that 4.2 is first about 60+ PRs to PyYAML and LibYAML since Aug 2016 (not just the #74 #189 #198 change).

Thanks a lot!

cc @kalefranz @msarahan (for awareness)

Just a follow up to yesterday's comment about the 3.13 release. We are scheduled to upload 3.13rc1 in about 6 hours from now.

We realized the tomorrow is a major US holiday and probably a terrible day to make a final release, so we are hoping to get 3.13 out ASAP on July 5th.

As promised, PyYAML 3.13rc1 is on PyPI now.

We'll try to get a final 3.13 released asap on July 5th.

congrats on the 3.13 release ๐ŸŽ‰

can confirm it's functioning on python3.7 for me :D

$ tox -e py37
...
___________________________________ summary ____________________________________
  py37: commands succeeded
  congratulations :)

PyYAML-3.13 was released yesterday.

Onwards and upwards to 4.2

(thanks @asottile :)

FYI. Added this link to the main issue text: https://github.com/yaml/pyyaml/projects/1

Just for clarification, is 3.13 considered a stable release? Also along those lines, is https://pyyaml.org/wiki/PyYAML (history notes) just out of date or no longer considered source of truth?

3.13 is considered a stable release.

https://pyyaml.org/wiki/PyYAML was just out of date. Fixed in yaml/pyyaml.org@356b5bf. Thanks @dthkao :)

The pyyaml.org docs will probably be moved to http://readthedocs.org/ but it is canonical for now.

Thanks for the confirmation! I figured as such considering the recent project changes and since pypi ws also at 3.13 but just wanted to make sure.

Guys, since 3.x version is vulnerable when do you plan to release 4.2?

@syabro Note that nothing is "vulnerable" as long as you use yaml.safe_load for untrusted data.

yilmi commented

@The-Compiler I don't mean to be dismissive here but I don't think that's an acceptable argument.

At the moment, we have a default function yaml.load() that is vulnerable and likely plenty of projects that haven't been made aware of this vulnerability that are still using it. Therefore creating vulnerability in these projects depending on how they process yaml content.

Having a secure default should be the target for a module such as PyYAML which is used in a lot of projects.

That said, I can definitely understand the challenges behind that change.

@The-Compiler then why not replace yaml.load with yaml.safe_load in the library itself? ๐Ÿ˜‰

@yilmi I can see where you're coming from, but with the same argument, you could say Python itself is vulnerable because its pickle.load doesn't happen to be called pickle.unsafe_load. IMHO, if it's made clear that yaml.load shouldn't be used for untrusted data, and projects still do so, then that might still be an issue with an unclear API, but the vulnerability is in those projects, not PyYAML itself.

@syabro Because that'd break everything that relies on using yaml.load like it works currently, I presume. IIRC that was also because the release was pulled again, but it's been a while. To clarify, my main point is that as an user of PyYAML who is aware of this issue, there's nothing to be worried about - if you're already using yaml.safe_load, there's nothing an update/fix for this will bring you.

(Note that I'm not involved with PyYAML in any way - I'm just an user of it myself, and never really felt like this was a problem based on my understanding of PyYAML's documentation, which clearly says "Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function. Check the yaml.safe_load function though.".)

One reason for reverting #74 was that it makes load an alias to safe_load, meaning they both use the same (safe) loader. Now code that does yaml.add_constructor(...) changes the SafeLoader, which could be again a security risk if you are using the SafeLoader in the same program and trust it to be safe. @cdent wrote about this. This has all been discussed.

Another issue is that the unsafe methods are now called danger_*, which we don't all like very much, and which is extremely misleading for dumping. danger_dump is not dangerous. (I suggested python_* because it can load and dump python objects and the YAML tags start with python, but not much feedback was received.)

We just couldn't agree on a nice alternative API so far. Additionally we had to fight a lot with getting a hotfix release out for python 3.7.

The current plan is, IIRC, to release a new version, but with #74 reverted, so that we have all the other bugfixes and features out, and then think again about how to make an API change that makes safe to be the default.

The only other problem is the lack of free time :-/

yilmi commented

@The-Compiler I'd say that taking a similar example (like python pickle) doesn't justify having an unsecure default in PyYAML, even if it's documented.
Two wrongs don't make a right and again having a secure default is IMHO the right thing to do.

I did follow this issue for some time now and already acknowledged the technical difficulty of making such a change in my first comment, just as you did in your reply to @syabro. There's no judgement behind my comment, one could easily tell me to find time to work on it myself and prepare a PR to push a fix. I just don't want us to accept a situation just because that's what have been there for a while or because it is documented.

There was an attempt to make it secure by default as mentioned by @perlpunk and I hope we'll find a way to get through some of the difficulties were identified.

Gby56 commented

As a general rule of thumb to me, documentation isn't a safeguard, and especially when the default method isn't stating anything in its name.

Perhaps adding a default argument like unsafe = False would be the cleanest API change, even non-breaking in some case (pure speculation) ?

  1. PyYAML has been exactly like this since 2006. This is not a crisis situation.
  2. Making safe_load the default loader is a non-starter. YAML is a full serialization language, not a config loader.
  3. Making a safer loader is an option but this is a lot of work, and requires that people get deep into our dev process, not just tell us what they think should happen, and then ask us why it's not done yet.
  4. If you want to engage in this work join #pyyaml on irc.freenode.net and engage with us in conversation and spend a couple months making this happen

@ingydotnet First I want to be clear that fully appreciate the efforth you put on bringing pyyaml back to life as it did lack maintenance for quite a long period of time.

Still, there is only aspect where I have to disagree with your: regarding unsafe behaviour not being implicit. Your 2nd bullet does not make much sense to me.

Using unsafe implicitly is a historical mistake and with current growing number of security issues I think it is mandatory for pyyaml, which is mainly a library to put security first and not to have defaults which are security risks.

I am talking about thousants of tools that may introduce security risks without even knowing.

Changing the default behaviour in 4.2 would be a good step in the right direction, even if it may break few users. If this happens fixing take no more than a minute and the project could decide which path they want to go safe or unsafe, based on their needs.

As a library pyyaml has the responsability of not promoting unsafe code use, this is why I do advocate for making safe_load the default.

For those of us who want/need the existing behavior of yaml.load, what is the backward compatibility path? Assuming the current yaml.load gets renamed to yaml.unsafe_load, shouldn't that rename happen now with yaml.load as a alias to maintain the existing behavior. That would allow a period of time for us to update our code to maintain the desired behavior. Then, when a future release changes the behavior of yaml.load to match yaml.safe_load, our projects won't be broken. And of course, for those who want yaml.safe_load behavior, that has already been available for some time, so nothing breaks for them.

Oh, and thank you for doing the right thing and backing out #74.

@waylan I'll answer your question and also make a statement on what's going to happen next.

yaml.load() will continue to be the full, "unsafe" load it is today. It's long name will be yaml.python_load and yaml.load will alias it. To be clear, yaml.load() semantics will never change. The way it has been for over 12 years is how it will remain. ie That ship has sailed. However we are going to encourage people not to use it, and if they do, to know what they are buying into.

The way we will bring this issue to the fore, is that yaml.load() will issue a warning that it is potentially unsafe, and that warning will point to a URL that always has the latest information on the issue. You can disable this warning in one of these ways:

  1. Use yaml.safe_load() instead
  2. Use yaml.python_load() explicitly
  3. Call yaml.disable_warnings(yaml.WarningUnsafeLoad) to disable all such warnings, even from calls in modules that you don't own

The new release will be called 5.1 and I'll write up a "PyYAML 5.1 Release Plan" issue when the time is right.

There seems to be some suggesting that people use one of the failed 4.2bx releases to get #74 behavior. This is a bad idea. 3.13 is the current supported release. I could delete the 4.2b-s from PyPI but I haven't. I almost certainly will after 5.1 goes out.

I will close #207 after 5.1 goes out and close #243 now.

That ship has sailed.

The promise of software is that it can change. This is why it is called soft ware, it is malleable compared to hardware.

yaml.load() will continue to be the full, "unsafe" load it is today.

This seems a step in the wrong direction. The important part is to avoid unsafe defaults. Warnings don't show raise in Python by default, so only an environment configured with enabled warnings would be protected from unsafe loads by an yaml.WarningUnsafeLoad warning class. Prudent users with all warnings enabled will likely be already using safe_load (or avoiding YAML altogether), so the proposal to use a warning is targeting the wrong audience, and not actually addressing the real issue: bad defaults.

The more typical use of a warning would be to deprecate existing yaml.load behavior - indicate that yaml.load behavior will be changing in backwards-incompatible way in the next backwards-compat breaking release (for PyYAML I guess this would be the major version number change). And, possibly, suggest to the user how to update their code in order to keep the old behavior should they desire.

The fact that #243 is prematurely closed, and there have been no commits to master for over 7 months, suggests the maintainers do not really take this CVE seriously.

@cdent Thanks, edited for accuracy

cdent commented

Warnings don't show in Python by default, so only the users who have explicitly enabled warnings will see your yaml.WarningUnsafeLoad.

Ignoring the rest of your comment for a moment, this quoted statement isn't fully accurate. Some, but not all, warnings are ignored: https://docs.python.org/3/library/warnings.html#warning-categories

You're probably thinking of DeprecationWarning which is indeed ignored by default.

To add a practical case to this non-trivial debate:

At the company I work in, another developer used yaml.load in a piece of code, because another piece of code used json.load in the same way. Had I not reviewed his code, with my knowledge of this CVE, the change would have landed in production, potentially loading harmful YAML data from untrusted sources.

The problem here, is that PyYAML is considered a standard, and is used exactly as other standard serialized data loaders. Many developers will not read the full doc, and expect it to work just like standard lib json.

I understand that breaking change is scary for maintainers, but all renowed projects do it (Django is a good example of that), with deprecation warnings, and transition versions. As a developer, I do expect breaking changes when a library bump its major version, that's why we do unit tests to prevent regressions.

I have to agree with @thunderk here. And I need the old default behavior. As long as there is a transition release, I'm fine with the default being changed.

Normally I wouldn't chime in on a long thread, but I stumbled across this issue while working with a project of ours that depends on Cloud Custodian (https://github.com/cloud-custodian/cloud-custodian) wondering why it was being flagged as having a high severity vulnerability.

Leaving the existing unsafe yaml.load behavior isn't really an option. I work as part of my company's software security team, and all dependency vulnerability scanners work the same way - they look at the dependency (pyyaml) and the version. None, outside of specialized data flow-based static analysis tools, look at usage. In addition, as others here have stated there's a major usability problem where the "default" behavior (load) is unsafe.

I can't speak for other companies, but at mine we don't have time to look at the usage each and every dependency across the enterprise - we literally have tens of thousands. If a package shows up as being vulnerable, we either upgrade to the latest version or we rip it out.

Cloud Custodian is already considering removing pyyaml as a result of this (cloud-custodian/cloud-custodian#3413), and I assume other packages within the Python community may be doing the same.

If the current plan @ingydotnet is to not fix or have a breaking change on the load function, that means the CVEs remain open and the only option for dependent packages and most companies is to rip out all usage of pyyaml. Full stop - there is no other option.

https://tools.cisco.com/security/center/viewAlert.x?alertId=58783

Apologies if I'm being a bit blunt here, no disrespect is intended. Just wanted to give some insight into the reality of how vulnerability management takes place at large enterprises and why the current plan won't work. As others have said, I appreciate all the hard work that has gone into pyyaml.

I would strongly suggest putting out a fixed version soon. Changing the behavior of load() with a new major release (3.x -> 4.x) should be no problem at all - that's what release notes are for. If people are having problems they did not follow the upgrade process carefully enough. I think this still is better than leaving a potential threat open and deeming people to - by mistake - not fix their code.

Hope this get's resolved soon. And thanks again for all your great work!

PyYAML 5.1 is expected to be released in the next 2 weeks. Version 5.1b3 was released today.

Version 5.1 addresses the issues here and in #207.

The release applies many pull requests, and all the ones people have been requesting the most. This includes the addressing the unsafe nature of the load function: #257.

A quick TL;DR for people like me who was confused in the situation and ended up here:

  • 4.2 is not releasing (this key comment is folded by GitHub)
  • 5.1 is coming soon
  • It will be deprecated to callload() without Loader= in 5.1, the default behavior is changing to FullLoader (or full_load()) which is almost as complete for serialization as UnsafeLoader/Loader, but it avoids arbitrary code execution.

@fantix That is a good summary. Thanks.

The longer version is here: https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
which is redirected to by https://msg.pyyaml.org/load which is included in the deprecation warning issued by load() in PyYAML 5.1.

The release is expected to land in the next 16 hours.

PyYAML 5.1 has been released.

https://pypi.org/project/PyYAML/5.1/

Thanks @ingydotnet. Where can users find the release notes and latest docs? The different default_flow_style is also a breaking change. (https://pyyaml.org/wiki/PyYAMLDocumentation still seems to be 3.13 docs)

@wimglenn Thanks for pointing that out. We'll update the docs soon.

As the docs are not yet updated, there is something I would like to ask for clarification on.

The change to the API is clear enough from https://msg.pyyaml.org/load, but what is not clear is the actual difference in behavior between SafeLoader, FullLoader, and UnsafeLoader. Which set of YAML tags is accepted by each?

For example, assuming SafeLoader has not changed, it presumably accepts "Standard YAML tags" but not "Python-specific tags" or "Complex Python tags". And presumably, UnsafeLoader accepts all three types of tags. But which subset specifically is accepted by FullLoader? I get the impression it accepts "Standard YAML tags" and "Python-specific tags" but not "Complex Python tags". However, without any clear documentation, I can't be certain.

@waylan yes, the docs need to be updated.
FullLoader accepts the same tags as UnsafeLoader, but it will not import modules, and it will not do function calls.

FullLoader accepts the same tags as UnsafeLoader, but it will not import modules, and it will not do function calls.

But don't the "Complex Python tags" import modules and call functions, etc?

!!python/name:module.name
!!python/module:package.module
!!python/object:module.cls
!!python/object/new:module.cls
!!python/object/apply:module.f

Are you saying those don't work on both FullLoader and UnsafeLoader. I would have assumed they worked on UnsafeLoader but not FullLoader?

@waylan The commit message here might be useful for you: 0cedb2a

There are simple command line examples you can play with at the end of that message.

All of the interesting stuff happens in this module: https://github.com/yaml/pyyaml/blob/master/lib/yaml/constructor.py (and https://github.com/yaml/pyyaml/blob/master/lib3/yaml/constructor.py for Python 3).

This list of Full (also inherited by Unsafe) tags is here: https://github.com/yaml/pyyaml/blob/master/lib/yaml/constructor.py#L630

Also see #265

yilmi commented

For the record, how the early decision to not move faster has affected other projects - https://www.bleepingcomputer.com/news/security/googles-tensorflow-drops-yaml-support-due-to-code-execution-flaw/

I'm sure it's not the only example... something to have in mind for future discussions

For the record, how the early decision to not move faster has affected other projects - https://www.bleepingcomputer.com/news/security/googles-tensorflow-drops-yaml-support-due-to-code-execution-flaw/

I think this is a funny example, because the code from TensorFlow deliberately uses unsafe_load, and then complains about its unsafety. But note that I only skimmed the article.