GoogleCloudPlatform/gsutil

gsutil regex - behaviour change (v5.17)

b-becker-k opened this issue · 6 comments

Hi,
We use gsutil with the -x argument and a regex to upload code to a gcloud storage bucket. This was working fine all the way to 5.16, and broke with the 5.17 release.

The full command, note regex after -x argument:
gsutil -m -q rsync -c -dr -x '(?!src/|tests/|remote_debug|project_root_folder_marker.txt$)|(src/project/node_modules/)|(.*__pycache__.*)|(.*pyc$)|(.*gstmp$)' /home/circleci/project gs://<bucket>/

In the past this would have uploaded these example files

remote_debug/
tests/
project_root_folder_marker.txt
src/project/test.py

but left out these

src/project/node_modules/test.json
src/project/__pycache__/test.py

It currently only uploads remote_debug and project_root_folder_marker.txt

I guess this is likely to be linked to the regex fix applied to 5.17

I'm actually struggling to understand why this ever worked in the first place. The regex you provided here would tend to match a lot of different sub strings (i.e. exclude a lot of sources from rsync), and the negative lookahead should never have really prevented that from happening because it's nullified by the other or clauses.

To make this a bit more simplistic:

(?!a|b|c)|(bc)

This would match the bc in abc even though the negative lookahead would seem to exclude not only the a prefix but the bc as well, because it's part of an or clause and not part of the main group. You would need the negative lookahead to be included in one of the other or groups to have the desired effect here. Maybe something more like:

(src/project/node_modules/)|(?!src/|tests/|remote_debug|project_root_folder_marker.txt$)(.*__pycache__.*)|(.*pyc$)|(.*gstmp$)

The one big difference between the current code and what was there previously is that now the -x regex will match directories first so that there isn't unnecessary iteration into directories that would have all their files excluded later on anyway. I suspect that might have been something your regex was exploiting, but I think what you're showing here was a bug that was inadvertently fixed rather than a bug that was inadvertently introduced.

I guess we will have to rewrite that regex to avoid that behaviour then, thanks for taking the time to look at the issue.

The change introduced in 5.17 is effectively a breaking change for anyone using -x with negative lookahead, as that flow presupposes you're "including" from a list of objects to be excluded, only now that list of objects is shorter because entire directories are being skipped rather than processing being completed object-by-object. I'm moving the new behavior into a new flag (because it is technically faster) and bringing back the old behavior to the -x flag. See #1642

That would be great for our use case until we get around to making it work with the new flag, thanks.

gh2k commented

breaking change

Yeah, fwiw this broke our CI and caused us to ship a build without javascript assets on our CDN. (Lesson learned: Don't use gcr.io/cloud-builders/-related tools in production workflows without pinning the version in couldbuild.yml) :-)

Thanks for the fix.

Very sorry for that! We usually take great pains not to introduce breaking changes like this. We've added additional test coverage to ensure these behaviors are well-documented and don't cause regression for people in the future.