#[no_mangle] is unsafe in the presence of name collisions

Question

#[no_mangle] is unsafe in the presence of name collisions

geofft opened this issue 10 years ago · 32 comments

On some platforms (at least GNU/Linux, but I hear Windows and several others too), if you link together two static libraries that both export a symbol of the same name, it's undefined which symbol actually gets linked. In practice on my machine, the first library seems to win. This lets you defeat type/memory safety without the unsafe keyword, by having two crates export a #[no_mangle] pub fn with different signatures but compatible calling conventions:

// one.rs
#![crate_type = "lib"]

#[no_mangle]
pub fn convert(x: &'static i32) -> Result<i32, f32> {
    Ok(*x)
}

// two.rs
#![crate_type = "lib"]

#[no_mangle]
pub fn convert(x: &'static f32) -> Result<i32, f32> {
    Err(*x)
}

// transmute.rs
extern crate one;
extern crate two;             

fn main() {
    static X: f32 = 3.14;
    let y: i32 = two::convert(&X).unwrap();
    println!("{}", y);
}

geofft@titan:/tmp/transmute$ rustc one.rs
geofft@titan:/tmp/transmute$ rustc two.rs
geofft@titan:/tmp/transmute$ rustc transmute.rs -L .
geofft@titan:/tmp/transmute$ ./transmute 
1078523331

Despite the stated call to two::convert, it's actually one::convert that gets called, which interprets the argument as a &'static i32. (It may be clearer to understand with this simpler example, which doesn't break type safety.)

On at least GNU/Linux but not other platforms like Windows or Darwin, dynamically-linked symbols have the same ambiguity.

I don't know what the right response is here. The following options all seem pretty reasonable:

Acknowledge it and ignore it. Maybe document it as a possible source of unsafety, despite not using the unsafe keyword.
Have #[no_mangle] export both a mangled and un-mangled name, and have Rust crates call each other via mangled names only, on the grounds that #[no_mangle] is for external interfaces, not for crates linking to crates. ("External interfaces" includes other Rust code using FFI, but FFI is unsafe.) This is analogous to how extern "C" fns export both a Rust-ABI symbol as well as a C-ABI shim, and a direct, safe Rust call to those function happens through the Rust ABI, not through the C ABI. I'm pretty sure that all production uses of #[no_mangle] are extern "C", anyway (see e.g. #10025).
Deprecate #[no_mangle] on safe functions and data, and introduce a new #[unsafe_no_mangle], so it substring-matches unsafe. (#[no_mangle] on unsafe functions or mutable statics is fine, since you need the unsafe keyword to get at them.)

All of these are, I think, backwards-compatible.

Answer 1 · 2015-09-03T00:22:38.000Z

I think this even leads to a linker error on some platforms (depending on the situation), so in that sense I'd also be fine just having a set of all #[no_mangle] symbols for each crate and if they overlap we generate a hard compile time error. While strictly not backwards compatible I believe it's practically fine and it's basically what most crates what anyway

Answer 2 · 2015-09-03T00:42:15.000Z

You can #[no_mangle] override an existing mangled function too:

fn main() {
  println!("ok")
}

#[no_mangle]
#[allow(non_snake_case)]
pub fn _ZN2io5stdio6_print20h94cd0587c9a534faX3gE() {
    unreachable!()
}

This makes the playpen timeout on stable 1.2: http://is.gd/sxdUl3

Answer 3 · 2015-09-03T02:24:17.000Z

While strictly not backwards compatible I believe it's practically fine

Yeah, I'd argue that having multiply-defined #[no_mangle] symbols is, in essence, undefined behavior (in the sense that nobody defined it), so making it a hard error seems fine.

Answer 4 · 2015-09-03T10:38:08.000Z

I doubt there’s anything sensible (other than writing our own linker) that can be done here to truly fix the issue.

I’m not against the idea of making less of these cases possible, though. e.g.

#[no_mangle]
fn main(){}

doesn’t work already. But most of the proposed solutions (exception being keeping more symbols in memory) sound like a non-option to me.

Answer 5 · 2015-09-08T20:04:17.000Z

@nagisa

#[no_mangle] will always be always unsafe as you can use it to hijack C library functions, .e.g make malloc do something evil.

Answer 6 · 2015-09-08T20:11:10.000Z

@arielb1 i didn’t say anything about safety; only about our capability to do anything about this issue.

Either way, if you have to be wary of code you control unknowingly hijacking other code or doing evil things, something went very wrong where Rust (or any language, really) can’t help.

Answer 7 · 2015-09-15T15:01:38.000Z

@nagisa

JS is designed for that use-case. Rust is theoretically supposed to fill that role too, but both rustc and the type-system are way not sufficiently trustworthy for that.

Answer 8 · 2016-10-19T23:24:35.000Z

I agree that unmangled symbol collisions should be a compilation error. Breakage here is extraordinarily unlikely, and the only code that it could break is code that won't be working as intended anyway, which is exactly the sort of the code that we ought to break.

Answer 9 · 2016-10-20T20:49:07.000Z

Reassigning to lang team. What a pain.

Answer 10 · 2016-11-03T21:28:11.000Z

triage: P-low

Seems like a real problem, if not that concerning. @rust-lang/lang feels that the fix proposed by @alexcrichton in this comment is the correct one. Check at link time for duplicate #[no_mangle] symbols in rust code; don't try to solve similar problems that might occur with C code and other stuff out of our purview.

Answer 11 · 2018-01-27T02:40:20.000Z

Why is calling #[no_mangle] functions safe in the first place?

Also #[no_mangle] feels like a first step towards #[naked], which should also require unsafe to be called.

Answer 12 · 2018-01-27T04:35:27.000Z

Why is calling #[no_mangle] functions safe in the first place?

They can't do anything that normal functions can't do. And there isn't in theory any way to use them to violate safety (this issue is a bug; it would be totally fine in theory for rustc to fail to compile code with any symbol collisions in any objects, whether they come from Rust or from another language).

#[naked], which should also require unsafe to be called.

As far as I can tell, this is basically because naked functions have no guaranteed ret opcode, so #[naked] fn foo() -> {println!("hi!")} will just let the instruction pointer run past the end of a function, which is rather memory-unsafe. If they did (and if there were a more coherent story about accessing arguments and returning values from pure Rust code), I'm having trouble seeing what makes them unsafe.

That is, the reason #[naked] functions should require unsafe is that the compiler expects a particular thing to be done by the implementation in order for them to be safely callable, and it has no way to check that. (I sort of think that, like unsafe traits, it should be possible for the implementor to say "Trust me on this" and others to be able to call the function from safe code, but maybe this is a discussion for the naked function RFC, not here.) There's nothing special that #[no_mangle] functions need to take care to do; they're just the same as normal functions except for their linkage name.

Answer 13 · 2018-01-27T05:19:17.000Z

They can't do anything that normal functions can't do.

It's not the function itself that is unsafe but the call. The call to something unmangled is FFI-ish, therefore looks (and is) unsafe.

#[naked] functions are unsafe per se, so calls to them should be unsafe not directly, but as a consequence.

Answer 14 · 2018-01-29T17:25:40.000Z

There are some other examples of abusing #[no_mangle] that may be worth mentioning.

For example, this hack overrides main with custom assembly:

#![no_main]

#[link_section=".text"]
#[no_mangle]
pub static main: [u32; 9] = [
    3237986353,
    3355442993,
    120950088,
    822083584,
    252621522,
    1699267333,
    745499756,
    1919899424,
    169960556,
];

Sigh.

Answer 15 · 2018-01-29T18:02:58.000Z

#![forbid(unsafe)] code should be assumed free of such hacks...

Answer 16 · 2018-12-03T19:51:41.000Z

#![forbid(unsafe)] code should be assumed free of such hacks...

Yeah, my thinking exactly. Maybe that should forbid no_mangle as well, or so?

Answer 17 · 2019-06-25T15:56:29.000Z

The playground example in this comment no longer times out on Rust 1.35 stable and prints "ok" without any apparent issues.

Answer 18 · 2019-06-25T16:00:46.000Z

That just means that the mangled name of the print function changed.

Answer 19 · 2019-06-25T16:03:48.000Z

@RalfJung Thanks, that makes sense. Just wanted to point it out for future readers, as I had stumbled across it and was surprised to see it work despite the issue not being resolved.

Answer 20 · 2019-06-25T16:21:11.000Z

I think the current version (for Rust 1.35) would be

fn main() {
  println!("ok")
}

#[no_mangle]
#[allow(non_snake_case)]
pub fn _ZN3std2io5stdio6_print17ha4c0b9f4da5c9e13E() {
    unreachable!()
}

but that doesn't run either, it just makes the linker fail:

/rustc/3c235d5600393dfe6c36eeed34042efad8d4f26e//src/libstd/io/stdio.rs:752: multiple definition of `std::io::stdio::_print'

Answer 21 · 2020-08-27T08:48:37.000Z

In this comment, @dtolnay brings up the idea of "unsafety at the item level". I feel like that is also relevant for this issue here, which basically concerns an unsafe function-level attribute.

Answer 22 · 2022-02-03T14:27:27.000Z

#[export_name] has the same theoretical problem as #[no_mangle] here, right? https://doc.rust-lang.org/reference/abi.html#the-export_name-attribute

Answer 23 · 2022-02-03T14:32:07.000Z

@bstrie there's a tracking issue for all attributes with unsafe behaviors here, which also tracks adding lints for them: #82499

Answer 24 · 2023-04-26T01:06:42.000Z

This issue will be resolved by RFC 3325 if it is accepted and stabilized.

Answer 25 · 2024-01-11T02:36:56.000Z

I think this is just a compiler bug. Am I missing something?

On ELF for example, if we linked #[no_mangle] symbols as STB_GLOBAL, we would get a linker error with multiple definitions. I can only infer that we must be linking them as STB_WEAK, but don't see any reason why we should be. That's the binding we would use if we wanted to combine multiple definitions of the same thing, e.g. multiple monomorphizations of a crate's generics.

Speculating a bit more, there's a chance that weak binding was a convenient way to allow multiple codegen units to instantiate the same function, but I think we could prevent that pretty easily for #[no_mangle] and #[export_name] functions.

I would be surprised if the equivalent linker behavior did not exist in the executable formats of every major platform we support.

Answer 26 · 2024-01-11T03:53:31.000Z

A linker will trawl through static libraries looking for missing symbols. Once it finds a relevant symbol it essentially extracts that object file from the static library then stops looking. Similarly, if a symbol is already resolved it won't go looking in static libraries for others of the same name.

In this scenario we can always use no_mangle to override symbols in static libraries and this behaviour may even be desirable.

Answer 27 · 2024-01-11T04:48:49.000Z

I believe most linkers have options that can force them to link an object file. But yes, there are still opportunities for interactions when there's a global symbol namespace. The important thing for Rust is that the behavior is predictable and consistent.

Answer 28 · 2024-01-11T04:53:33.000Z

Sure but we can't make it a safe attribute? Even if we can guard against specific issues.

Answer 29 · 2024-01-11T05:08:38.000Z

Overriding malloc was mentioned earlier: #28179 (comment)

Here's a trivial example of that:

fn main() {
    println!("Hello!");
}

#[no_mangle]
pub fn malloc() -> usize { 1 }

i.e. safe code that crashes with a segfault.

Answer 30 · 2024-01-11T05:41:19.000Z

I agree that the attribute still needs to be unsafe because of its potential interactions with native code, the ability to override the same symbol in shared libraries, and other potential weird linking behavior the compiler can't check.

That said, I think we could remove a lot of the fangs with thoughtful consideration of better linking defaults. That would need to come with ways of overriding them via link attributes.

Answer 31 · 2024-08-28T09:24:10.000Z

This is fixed by #123757. :)

@tmandry I don't know enough about ELF linkers to understand what could be done here by changing linker flags. Could you open a new issue describing that? I think this one here is already filled with too much discussion going in a different direction.

Answer 32 · 2024-08-28T18:06:54.000Z

Super exciting to see this issue from 2015 / Rust 1.2 resolved - thanks to everyone who made it happen! I agree that the unsafety check is the best that can be done here.

Regarding the comments about static linking - it appears to me that there's a distinction between linking an object file and linking an archive file containing that exact same object file. The linker complains about collisions in STB_GLOBAL symbols in two .o files but not in two .a files (or .rlib files, which are also ar archives).

$ cat main.c
#include <stdio.h>

extern int collision(void);

int main(void) {
	printf("%d\n", collision());
}
$ cat collide1.c
int collision(void) {
	return 1;
}
$ cat collide2.c
int collision(void) {
	return 2;
}
$ cc -c collide1.c
$ cc -c collide2.c
$ cc -o main main.c collide1.o collide2.o
/usr/bin/ld: collide2.o: in function `collision':
collide2.c:(.text+0x0): multiple definition of `collision'; collide1.o:collide1.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
$ ar rcs libcollide1.a collide1.o
$ ar rcs libcollide2.a collide2.o
$ cc -o main main.c -lcollide1 -lcollide2 -L.
$ ./main
1
$ cc -o main main.c -lcollide2 -lcollide1 -L.
$ ./main
2
$ readelf -s collide1.o | grep collision
     9: 0000000000000000    15 FUNC    GLOBAL DEFAULT    1 collision
$ readelf -s libcollide1.a | grep collision
     9: 0000000000000000    15 FUNC    GLOBAL DEFAULT    1 collision

(GCC 9.4 and ld 2.34 as shipped in Ubuntu 20.04)

But you do get the error if you use --whole-archive:

$ cc -o main main.c -Wl,--whole-archive -lcollide2 -lcollide1 -Wl,--no-whole-archive -L.
/usr/bin/ld: ./libcollide1.a(collide1.o): in function `collision':
collide1.c:(.text+0x0): multiple definition of `collision'; ./libcollide2.a(collide2.o):collide2.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status

GNU ld's documentation of that option says that it makes the linker "include every object file in the archive in the link, rather than searching the archive for the required object files." I'm interpreting that to mean that with the two static archives, it decides that it's found collision in libcollide1's collide1.o, and when going through libcollide2, it doesn't need anything from collide2.o and skips it.

This seems like relatively intentional behavior on the part of the linker, and I'm not sure it would work / be a good idea for rustc to use --whole-archive for rlibs, but maybe that's worth exploring. I don't think it will help for collisions with libc and other standard libraries, though, because if you leave off the -Wl,--no-whole-archive flag at the end, you get a whole lot of collisions from libraries implicitly added by the compiler driver.