peak/s5cmd

Path globbing syntax is not documented

PerMildner opened this issue · 2 comments

Looking at README.md and s5cmd --help, I see no details about the glob syntax.

In particular, it seems s5cmd understands the "double star" ** syntax for matching any folder depth, but this is not mentioned in the README.md examples.

Even a single star matches any folder depth. The asterisk is not bound by path separators:

$ s5cmd ls "s3://foo/*.txt"
2024/04/13 09:38:00               3  bar/baz.txt

The reason ** works is that every occurrence of * is replaced by the regular expression .* (as you can see, ? is also supported and matches a single character):

s5cmd/strutil/strutil.go

Lines 63 to 68 in c1c7ee3

// WildCardToRegexp converts a wildcarded expression to equivalent regular expression
func WildCardToRegexp(pattern string) string {
	patternRegex := regexp.QuoteMeta(pattern)
	patternRegex = strings.Replace(patternRegex, "\\?", ".", -1)
	return strings.Replace(patternRegex, "\\*", ".*", -1)
}
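
To make the effect concrete, here is a minimal standalone sketch that copies the conversion above and applies it to the key from the listing. The helper name matches and the ^...$ anchoring are my assumptions for illustration; they are not taken from the s5cmd source.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wildCardToRegexp mirrors the conversion quoted above: the pattern is
// escaped, then the escaped "?" and "*" are turned back into regex wildcards.
func wildCardToRegexp(pattern string) string {
	patternRegex := regexp.QuoteMeta(pattern)
	patternRegex = strings.Replace(patternRegex, "\\?", ".", -1)
	return strings.Replace(patternRegex, "\\*", ".*", -1)
}

// matches is a hypothetical helper (not part of s5cmd). It assumes the
// generated expression is anchored at both ends before it is applied to a key.
func matches(pattern, key string) bool {
	re := regexp.MustCompile("^" + wildCardToRegexp(pattern) + "$")
	return re.MatchString(key)
}

func main() {
	fmt.Println(wildCardToRegexp("*.txt"))        // .*\.txt
	fmt.Println(matches("*.txt", "bar/baz.txt"))  // true: ".*" also consumes "/"
	fmt.Println(matches("a?.txt", "ab.txt"))      // true: "?" matches one character
	fmt.Println(matches("a?.txt", "abc.txt"))     // false
}
```

Since . in the resulting expression matches / like any other character, a single * already spans folder levels, and ** simply becomes .*.* with the same effect.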

In addition, s5cmd passes the S3 API an empty delimiter, instead of /, whenever the URL contains a "*" or "?":

s5cmd/storage/url/url.go

Lines 264 to 270 in c1c7ee3

if loc := strings.IndexAny(u.Path, globCharacters); loc < 0 {
	u.Delimiter = s3Separator
	u.Prefix = u.Path
} else {
	u.Prefix = u.Path[:loc]
	u.filter = u.Path[loc:]
}
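
As a rough illustration of that branch, the sketch below reproduces the split in isolation. The listing type and the splitGlob function are hypothetical names of mine, and the constant values are assumed from the description above ("*" or "?" as glob characters, / as the separator).

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values, based on the description in this thread.
const (
	globCharacters = "?*"
	s3Separator    = "/"
)

// listing is a hypothetical stand-in for the URL fields touched by the
// quoted snippet.
type listing struct {
	Prefix    string
	Filter    string
	Delimiter string
}

// splitGlob follows the same logic as the quoted branch: without a wildcard
// the whole path is the prefix and "/" is the delimiter; with a wildcard the
// path is cut at the first wildcard and the delimiter stays empty.
func splitGlob(path string) listing {
	loc := strings.IndexAny(path, globCharacters)
	if loc < 0 {
		return listing{Prefix: path, Delimiter: s3Separator}
	}
	return listing{Prefix: path[:loc], Filter: path[loc:]}
}

func main() {
	a := splitGlob("foo/bar/")
	fmt.Printf("%q %q %q\n", a.Prefix, a.Filter, a.Delimiter) // "foo/bar/" "" "/"

	b := splitGlob("foo/*.txt")
	fmt.Printf("%q %q %q\n", b.Prefix, b.Filter, b.Delimiter) // "foo/" "*.txt" ""
}
```

With an empty delimiter the listing under the prefix is not grouped by folder, so keys at every depth come back and are then checked against the filter.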

This could be changed so that u.Delimiter is also set to / in the else branch unless the URL contains **, but I think that would be crude and incomplete: a URL can combine * and ** wildcards in several places, so it probably needs more logic elsewhere.
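
For comparison, here is a sketch of what a separator-aware conversion could look like. This is not how s5cmd behaves today and is not a proposed patch; globToRegexp and match are hypothetical names, and the semantics (* and ? stay within one path segment, ** crosses segments) are an assumption borrowed from common glob conventions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// globToRegexp is a hypothetical alternative conversion:
//   "**" -> ".*"     (any characters, including "/")
//   "*"  -> "[^/]*"  (any characters within one path segment)
//   "?"  -> "[^/]"   (one character within a path segment)
// Everything else is escaped and matched literally.
func globToRegexp(pattern string) string {
	runes := []rune(pattern)
	var b strings.Builder
	for i := 0; i < len(runes); i++ {
		switch runes[i] {
		case '*':
			if i+1 < len(runes) && runes[i+1] == '*' {
				b.WriteString(".*")
				i++ // consume the second star
			} else {
				b.WriteString("[^/]*")
			}
		case '?':
			b.WriteString("[^/]")
		default:
			b.WriteString(regexp.QuoteMeta(string(runes[i])))
		}
	}
	return b.String()
}

// match anchors the expression at both ends before testing the key.
func match(pattern, key string) bool {
	return regexp.MustCompile("^" + globToRegexp(pattern) + "$").MatchString(key)
}

func main() {
	fmt.Println(match("*.txt", "baz.txt"))        // true
	fmt.Println(match("*.txt", "bar/baz.txt"))    // false: "*" stops at "/"
	fmt.Println(match("**.txt", "bar/baz.txt"))   // true: "**" crosses "/"
	fmt.Println(match("**/*.txt", "bar/baz.txt")) // true
}
```

Changing the regular expression alone would not be enough, since the delimiter sent to S3 would also have to depend on where ** appears in the pattern.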

Thanks for looking at this.

I think the thing I did not see in the documentation was something that explicitly and clearly says "Even a single star matches any folder depth". Perhaps this is what Usage means by "s5cmd supports multiple-level wildcards for all S3 operations" but it is not clear.

Personally I prefer clear specification-style descriptions in --help and README.md before showing the examples, rather than just relying on the user to guess meaning from examples, but I am sure not everyone would agree.