golang/go

proposal: cmd/vet: detect homograph attacks

lukechampine opened this issue · 14 comments

A homograph attack is an attack that exploits the visual similarity of two glyphs. Traditionally, this has been used in phishing attacks to trick a user into visiting a malicious domain that looks identical to the real domain. However, homographs can also be used to sneak malicious source code past review. This is possible in any language that supports Unicode source code.

Here is a simple example of a homograph attack in Go source code:

package main

import (
	"log"
	"os"
)

func main() {
	log.SetFlags(0)
	f, err := os.Create("test")
	if err != nil {
		log.Fatal(err)
	}
	f.Close()
	if _, еrr := f.Write([]byte("data")); err != nil {
		log.Fatal(еrr)
	}
}

(Playground link)

The expected output is write test: file already closed, but the actual program prints nothing. This happens because e and е are homographs, and thus err and еrr refer to different variables. This example is not very threatening, but it should demonstrate that sophisticated attacks could be constructed using this mechanism.
In my analysis, strings are the most likely vector for a homograph attack. For example, a homograph could be inserted in a switch statement over runes or strings, such that a particular case branch would never be taken. A homograph could also be used in an import path, although sites like GitHub seem to do a good job of preventing users from registering names or repos that contain homographs. Finally, as in the example above, homographs could be inserted in variable names where scoping rules make the duplication difficult to detect.

I propose that go vet make a "best effort" to detect homograph attacks. This is a bit nuanced because there are many valid reasons to include Unicode characters in source code; distinguishing malicious homographs from harmless ones is probably impossible in general. However, I believe that a few simple heuristics can catch the vast majority of viable attacks. For example, go vet could flag identifiers that mix ASCII with known homographs.
I have already developed a simple tool for this purpose here, though it is currently too strict to be used in projects that contain harmless homographs. Perhaps an external tool is sufficient, but adding it to go vet increases the chance that a security-conscious project will be protected from such attacks. At any rate, I recommend that open-source projects at Google run any publicly-submitted patches through a homograph detector.

mvdan commented

How do you suggest your tool could be made free from false positives? That should happen before inclusion in vet is considered.

Edit: I've noticed the heuristics comment, but I'd be more convinced by an implementation in your tool.

You could make this slightly less prone to false positives if you collect two sets, one of plain identifiers, the other of homograph-containing identifiers mapped to their apparent strings, and if those two sets overlap, that's suspicious.

Homographs attacks in strings (targeting file names or URLs, for example) are more problematic because strings can be broken up, for example "goo"+"gle" and "g00g"+"1e" (using not-a-real-homograph for clarity) to defeat the simple set-matching. Even reporting all such strings and requiring a //vet:homograph comment to note permitted cases is still vulnerable to trickier string constructions in arrays of bytes, etc. We could go down the rabbit-hole of flow-analysis (if the string can be copied to a filename context or a URL context, care a lot more about its provenance), and perhaps we should.

@mvdan is it a requirement that go vet be free from false positives? The docstring says:

Vet uses heuristics that do not guarantee all reports are genuine problems

There are also quite a few issues on the tracker about vet false positives, e.g. #11843, in which Rob says:

The shadow code is marked experimental. It has too many false positives to be enabled by default, so this is not entirely unexpected, but don't expect a fix soon. The right way to detect shadowing without flow analysis is elusive.

To me this suggests that go vet may contain false-positive code, although particularly sensitive checks should not be enabled by default. Of course, the tool should strive to keep false positives to a minimum. I will have to think about some strategies for this.

As far as I can tell, if go vet reports a false positive, it's always possible to refactor the code such that the logic is identical but go vet is silent. If a homograph check is added, I would want it to meet that standard. For example, @dr2chase suggested comparing plain identifiers to homograph-containing identifiers. If this check returns a false positive, it should be trivial to rename your identifiers such that go vet stops complaining. I would hope that most projects that use go vet allow it to twist their arm a little rather than disable it, even for small sections of code.
For projects that can afford to do so, my recommendation is to ban homographs from appearing anywhere in their source code (except for comments). This is what my glyphcheck currently enforces.

mvdan commented

is it a requirement that go vet be free from false positives?

I believe there has been a push recently to make vet (with no flags) free from false positives, and to get it running on the Go codebase (standard library and toolchain).

I assume you want this check to be in the group that would be run by default, and not in the group that have to be enabled explicitly due to false positives.

mvdan commented

For context re vet with no false positives: #18084

Rob's comment that you referenced predates this proposal and effort.

I think a good place to start would be the overlap check that @dr2chase suggested, and a check for mixing ASCII and homographs in string literals (including import paths). This should catch the majority of viable attacks, and both checks can be circumvented by refactoring if needed. I will update glyphcheck with these changes and run it on the standard library to see how many false positives are flagged.

mvdan commented

That sounds like a good idea. To find false positives, on top of the Go source code itself, you could use @rsc's corpus of well-known Go code: https://github.com/rsc/corpus

If I may ask a naive question, what is the threat model here, and how will vet address that threat?

If the concern is import paths, it might be better for go get to reject such import paths.

If the concern is variable names, where will that code come from such that vet is a relevant factor in helping to catch it?

If the concern is strings containing urls and user-provided data, maybe it would be better to have good library support for detecting these, say in net/url or a golang.org/x/text package.

If the concern is strings containing urls and user-provided data, maybe it would be better to have good library support for detecting these, say in net/url or a golang.org/x/text package.

https://godoc.org/golang.org/x/text/secure/precis#Profile.Compare (or https://godoc.org/golang.org/x/net/idna for URL's)

Using the new detection rules, the standard library contains only one string literal that gets flagged: "μs" in the time package.
The corpus contains 39 files with false positives, mostly in Unicode-heavy packages such as golang.org/x/text/language/display. Example: "afarabkhazianuavestanínafrikaansakanamháricuaragonésárabeasamésaváricuay"

There are also some packages where homographs are mixed with explicit escape sequences. Example from golang.org/x/text/currency: "\x00\x02Kz\x01$\x02A$\x02KM\x03৳\x02Bs\x02R$\x01P\x03р.\x03CA$\x04CN¥"

Most of the other false-positives are foreign-language names that contain accented characters. Example from github.com/spf13/hugo: Preserve taxonomy names as written ("Gérard Depardieu" vs "gerard-depardieu").

I did not detect any variable names that contain a mix of ASCII and homographs.

These results show that most packages containing false-positives contain a lot of false positives, so the maintainers of those packages would probably disable homograph checking. This shouldn't incur much of a security risk as long as these packages contain mostly text and little logic.
However, there are definitely some cases where false-positives would be annoying, such as hugo. A better strategy is needed for these packages.

@josharian:

If the concern is import paths, it might be better for go get to reject such import paths.

This seems reasonable to me. Since import paths are typically domain names, we could employ the same strategies used to thwart the IDN homograph attack. Non-URL import paths probably aren't a threat.

If the concern is variable names, where will that code come from such that vet is a relevant factor in helping to catch it?

The attack scenario is code submitted via pull request to an open-source project, either by an anonymous user or a compromised team member. Running vet on all pull requests would detect the attack.

If the concern is strings containing urls and user-provided data, maybe it would be better to have good library support for detecting these, say in net/url or a golang.org/x/text package.

I definitely support runtime detection of malicious strings, but it's a separate problem. In fact, it's probably a more important problem to address, since users are much more numerous than developers.

See also #20209. TL;DR, I think we should solve this in code review tools, not the language.

rsc commented

Blocked on #20209.

Unicode® Technical Standard #39 UNICODE SECURITY MECHANISMS

FYI.

Oops, sorry, I didn't see the status is Hold. Forwarding to 20209...