golang/go

proposal: spec: disallow LTR/RTL characters in string literals?

karalabe opened this issue · 34 comments

TL;DR https://play.golang.org/p/LPkPTRF7fC

The above code looks quite plain and obvious, except it does something completely different than you'd expect (feel free to run it). The obvious thing that should happen is that it counts the number of bits set in the given string. The non-obvious thing that happens is that the mask is actually 0, not 0x01.

So, what happened there? The trick in the above code relies on the invisible Unicode characters that mark the following text as right-to-left or left-to-right. Since Go permits arbitrary Unicode characters in string literals, it also allows me to have a string of the form "bla bla blaabc" containing those marks. Since we're dealing with valid Unicode sequences here, any modern editor/website will actually interpret those special characters, visually reversing the content between the two marks out to the end of the line (while it remains part of the string literal).

In my playground code this is abused by having the following source code:

str, mask := "Hello, World!<rtl>10x<ltr>", 0

Which will be displayed by all modern editors/websites as:

str, mask := "Hello, World!", 0x01

The security aspect of this issue is social engineering attacks. The "displayed" line of my sample code looks obvious beyond doubt, so no one will ever inspect such a thing; yet it managed to flip one bit in a mask (imagine doing this to file-permission umasks). The issue is that such code could easily get past review and into a final product.

The tricky part is how to avoid these issues. My only meaningful suggestion is to disallow these two special characters in string literals. This does break the Go 1.0 compatibility guarantees, but I think it's worth it (I can't really come up with a meaningful use case for them). Using them in a single-line string will mangle the display of the source code, so it makes no sense to do it, and using them in a multi-line string is questionable at best. I think requiring users to explicitly use \x notation to add these characters is a good compromise to protect source code sanity.

cznic commented

String literals must be able to hold any byte sequence whatsoever. Using a string to hold binary data is not uncommon and this proposal will introduce a forbidden byte sequence.

I understand the problem (nice trick BTW), but I don't think this is a way to solve it.

@cznic,

String literals must be able to hold any byte sequence whatsoever.

Yes, but not in their raw form. Only if they're escaped.

Even today:

      s := "`\"`"

... you need to escape the double quote. We could require that RTL/LTR characters also need to be escaped.

cznic commented

We could require that RTL/LTR characters also need to be escaped.

Yes, but that's unfortunately only a partial solution: https://play.golang.org/p/ZL_QF7xc-e.

cznic commented

It would need to be forbidden from `` as well.

I don't like to say it, but then we are back to square 1 (modulo the backtick). To make things even worse, there's also the Go 1 compatibility promise, and I'm quite unsure whether such a breaking change qualifies under it.

That's the catch 22 :)

Perhaps a nice compromise (alas, it makes for ugly parsing rules and introduces corner cases) is to forbid the two characters in single-line strings (irrespective of the string type) and only allow them in multi-line strings on lines that are fully inside the literal (i.e. not the same line the opening or closing tick is on). In reality I think it's enough to disallow them on the closing tick's line.

These "forbidden" cases would already screw up the display of the code even now, so I cannot imagine anyone benevolently using them in such a way. Yet by limiting the restriction to the closing mark only, we can retain the functionality of RTL characters that some may have in backticked code.

All in all, I'd personally just forbid them everywhere, but I'm not accustomed to RTL use cases, so I can't really comment on how people use them in completely valid and meaningful ways.

This could be handled by a linter of some kind, but then people need to be aware of that tool and run it.

I know this doesn't match all the criteria for go vet, but adding it there may be best.

If go vet is added to go test per #18084 it would be run by default in a lot of situations making it harder to sneak this into a codebase.

Of course, that doesn't close the hole: it just makes it smaller.

Given how unlikely this is to be done on purpose without malicious intent, breaking Go 1 (which is allowed when it comes to security) and making it illegal across the board probably wouldn't cause anyone any pain.

I think solving this problem in the language is the wrong place.
How about in your code review tool? It would be as simple as highlighting weird/unusual UTF8, homographs, ... Then a human can decide whether it is ok or not.

If you're not doing code reviews, you have bigger problems.

GitHub flat out stated they don't care about these kinds of attack vectors. Given that most Go code is hosted on GitHub, most reviews will be done there too.

Also these two characters are not homoglyphs, they are invisible.

They are currently invisible. I am proposing to make them visible in a code review tool.
I don't feel the need to second-guess GitHub's decision. If you want to protect against this attack, don't use GitHub. There are plenty of code review tools to choose from, one may take up your cause.

Developers use GitHub, whether you like it or not. Telling the entire ecosystem to go use something else isn't really feasible.

But even if one project or another would actually go and use something else, you still have the issues when vendoring in code. Do you have the capacity to go over every line of every package you depend on to see if there's something malicious accidentally added to it? In a perfect world, the answer is yes, but we don't live in such a world.

It is easy to blame the user for using a tool wrong, but if something is easy to use wrong, is the user really to blame? In this instance it's easy to say "just use another tool", but that requires users to be aware of the issue in detail, to know exactly what other tool to use and why, and even then to know what to look for.

Compared to requiring everyone to change their entire tooling, a minor spec change can solve it elegantly for all, without needing to make the community aware of one more complexity that can end very badly if taken lightly.

cznic commented

Compared to requiring everyone to change their entire tooling, a minor spec change can solve it elegantly for all, without needing to make the community aware of one more complexity that can end very badly if taken lightly.

While the minor spec change may solve this particular problem, the root cause stays: If you don't properly review [untrusted] code with the proper tools you are to blame yourself for the consequences. There are many other ways how to sneak malicious code into a repository in a way which is obfuscated/hidden/confusing on the same level as what this issue is about. Resolving only one of the cases is not a solution.

If you don't properly review [untrusted] code with the proper tools

That's what I'm trying to accomplish here :) Add support for avoiding this into the proper tool, the language.

There are many other ways how to sneak malicious code into a repository in a way which is obfuscated/hidden/confusing on the same level as what this issue is about.

Please provide examples for such. I've yet to see anything so deceitful as this one.

Resolving only one of the cases is not a solution.

It is exactly infinitely better than resolving 0.

cznic commented

Please provide examples for such. I've yet to see anything so deceitful as this one.

For example, without going into details, any exec.Command(var1, var2, ...), i.e. not using string literals. Level 0, admitted, but many people will simply never look at the code at all, let alone at where the values come from and/or how they are computed at runtime. rot13 would be enough to make many just move on out of sheer laziness.

(I've recently published something with a string-encoded blob of gzip of gob of VM code of a C program, the source of which is not even included in the repository. Moreover, gob outputs for the same input are not reproducible, and neither is the gzip. Here not even a proper code review tool can help.)

It is exactly infinitely better than resolving 0.

Well, the difference is 1, yes, but the ratio is actually NaN ;-)

slrz commented

These things really seem more appropriate for code review or diff viewing tools. Git already marks up trailing whitespace in screaming red, it could do the same for LTR/RTL marks.

Restricting Go doesn't fully solve the problem anyway: there are lots of shell scripts and Makefiles in people's repos that you could use instead.

rsc commented

The open questions for this issue, #20210, and #20115 are:

  • How much should the Go language & implementation be in the business of defending against Unicode-based visual similarity attacks?
  • What tools should do it? (Language spec, vet, something else?)

I don't think we know the answers to any of this.

Without careful thought, it seems like it might be OK for vet to contain these kinds of checks, provided the scope is limited to Unicode similarity problems and the implementation scope can be cleanly restricted. But I don't know.

/cc @robpike @mpvl @nigeltao @golang/proposal-review

Without discounting the dangers, I don't believe the language is the place to solve this, plus it's a very slippery slope to start stepping around the rich list of funny Unicode characters that can be exploited. The specification could become a mess.

Tooling perhaps, but not the language specification.

/cc @SamWhited who also works on golang.org/x/text/secure/precis, amongst other things.

@rsc #20215 "crypto/elliptic: different ecdsa.Verify result between p256 amd64 and generic implementations with a zero hash" seems unrelated. Did you have something else in mind?

If there's a clean, stable, simple specification of what to avoid, I'm open to bradfitz's idea of requiring escaping: "\u200F" for U+200F RIGHT-TO-LEFT MARK.

If there isn't, I agree with robpike's concerns.

I don't know whether there is, though. @mpvl?

@cznic I'm confused as to how prohibiting RTL inside backtick string literals (and regular string literals) brings us "back to square 1". Can you elaborate?

Personally, I agree with Rob that the language itself should not deal with this. I suppose I could be convinced that there was a simple set of rules to apply, or a certain class of character that should never be allowed in a literal, but I suspect there won't be or you'll just run into a number of other places RTLO's or other characters could be put into the code and made to confuse people (maybe you could do something similar with an inline comment block, or with a rune literal, etc.)

This is a matter of display, not syntax, and security concerns related to the display of code are best addressed by your favorite editor or vetting tool (Vim for instance shows me that as "Hello, World!<202e>10x<202d>").

cznic commented

@nigeltao

I'm confused as to how prohibiting RTL inside backtick string literals (and regular string literals) brings us "back to square 1". Can you elaborate?

I think my "I don't like to say that, but then we are back to square 1 (modulo the backtick)." refers to my earlier "String literals must be able to hold any byte sequence whatsoever. Using a string to hold binary data is not uncommon and this proposal will introduce a forbidden byte sequence.".

TBH, I'm not sure if I have maybe thought about something different at that time as now my post seems confusing to me too, or at least not clear.

Maybe something - perhaps the language, perhaps vet - should simply forbid literal non-printing characters above U+007E.

Following a discussion with some colleagues, some notes in no particular order:

It's not merely RTL overrides. https://play.golang.org/p/ifwGIJbjBA looks confusing without having non-printing characters. That example is:

      const s = "S123S" //456

but looks like:

      const s = "456// "S123S

where S is U+0633 ARABIC LETTER SEEN.

For "go get" path confusion, BiDi in general is a pain, let alone punycode. There's RFC 3987 Example 5:
Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
Visual representation: "http://ab.cd.HG/FE/ij/kl.html"

https://bugs.chromium.org/p/chromium/issues/detail?id=683314 is a Chromium bug about "аррӏе.com" where that first term is 5 Cyrillic letters, not the 5 Latin letters "apple". In Go: https://play.golang.org/p/j-50CRKctO. I believe that, in Chromium, they have a narrowly defined filter for Cyrillic or Greek homographs of Latin glyphs.

For homograph attacks, not just LTR/RTL, there's http://www.unicode.org/reports/tr39/#Confusable_Detection although IIUC it may be too aggressive, as it marks "m" (em) and "rn" (ar en) as confusable.

The Unicode Cf "Other, Format" category is http://www.fileformat.info/info/unicode/category/Cf/list.htm

In summary, it's complicated, and therefore the response is probably in the tools rather than the language.

rsc commented

Thanks @nigeltao. It definitely sounds to me like we should be experimenting in a tool (maybe vet) and not in the spec. Does someone want to come up with a list of what a vet "unicode" check would check?

rsc commented

Updated my comment above - meant 20115 not 20215.

rsc commented

Does someone want to come up with a list of what a vet "unicode" check would check?

rsc commented

I think we agree that the first step is to add a vet check and only after building experience with it think about actual language restrictions (or not). Marking this proposal-hold until there is a proposal (a short list would be fine) of what a vet "unicode" check would check.

/cc @nigeltao @mpvl

In #40717 I propose 3 codepoints to add to that list for a vet check: U+115F, U+1160, U+3164

"'Trojan Source' Bug Threatens the Security of All Code": https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/

Apparently some researchers with the University of Cambridge have given this (type of) bug a cool-sounding name now.