`hash` generates different results on identical objects (even with same memory address)
dipterix opened this issue · 11 comments
I thought `hash` is supposed to generate the same results for identical objects. Could you help me with the following cases?
options(keep.source = TRUE)
a <- function(){}
rlang::hash(a)
#> [1] "eca7e650f5be54ba7c122fbc88ed0811"
a <- function(){}
rlang::hash(a)
#> [1] "7bcbcbd5583248d94607c79bab4b70f0"
a()
#> NULL
rlang::hash(a)
#> [1] "a70206d89b1b3cc96363d3413aea8ed6"
a()
#> NULL
rlang::hash(a)
#> [1] "c3b887fbb758842048691f04406c130c"
Created on 2024-01-16 with reprex v2.1.0
Also
memF <- memoise::memoise(function(f){ f() })
a <- function(){
  message("a is evaluated")
}
memF(a)
#> a is evaluated
memF(a)
#> a is evaluated
memF(a)
#> a is evaluated
memF(a)
Created on 2024-01-16 with reprex v2.1.0
Try `rlang::zap_srcref()` on the function.
`zap_srcref` removes the source reference. I turned `keep.source` off and the result is still weird.
options(keep.source = FALSE)
a <- function(){}
rlang::hash(a)
#> [1] "f53c7a98354786f99f3bdd8a1f655827"
a <- function(){}
rlang::hash(a)
#> [1] "57f55bbc9d540a599eb94edb2b4422d1"
a()
#> NULL
rlang::hash(a)
#> [1] "b5c51b4d7ecdc52b7ad405be5e10d998"
a()
#> NULL
rlang::hash(a)
#> [1] "4489adf065cb6c71e077228d56c46424"
Created on 2024-01-16 with reprex v2.1.0
R code is parsed before it is evaluated, and source references are attached by the parser. Setting the option in the same source that defines the function therefore comes too late to affect that function.
Call `zap_srcref()` before `hash()` if you need a stable hash.
options(keep.source = FALSE)
a <- rlang::zap_srcref(function(){})
rlang::hash(a)
#> [1] "b7e8ef5f48c3aa74a30b6ca39fbac850"
a()
#> NULL
rlang::hash(a)
#> [1] "f5852273190358bbe0b6e8328b37a4d6"
a()
#> NULL
rlang::hash(a)
#> [1] "c40ea5774ecd8ba9dccf890d710f0b46"
Created on 2024-01-16 with reprex v2.1.0
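For anyone following along, here is a minimal base R sketch of the point about the parser. It assumes nothing beyond `parse()`, `eval()`, and `utils::removeSource()` (which plays a role similar to `rlang::zap_srcref()`); the helper `has_srcref()` is a made-up name for illustration.

```r
# Parse the same source text with and without source references.
src <- "function() {}"
with_ref    <- eval(parse(text = src, keep.source = TRUE))
without_ref <- eval(parse(text = src, keep.source = FALSE))

# Hypothetical helper: does the closure carry a "srcref" attribute?
has_srcref <- function(f) !is.null(attr(f, "srcref"))

# The attribute is decided at parse time, not by options() at evaluation time.
print(has_srcref(with_ref))
print(has_srcref(without_ref))

# Stripping the source reference afterwards with base R.
stripped <- utils::removeSource(with_ref)
print(has_srcref(stripped))
```

This only addresses source references. As the rest of the thread shows, it does not by itself explain the hash changing after the function is called.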
oh that's the bytecode I bet
From the JIT
Gotcha, then how can I get rid of it and produce stable results?
You could do something like this:
my_hash <- function(x) {
  if (is.function(x)) {
    # Attach a marker to disambiguate from an actual list
    x <- c("my_unique_function_marker", as.list(x))
  }
  rlang::hash(x)
}
On our side we should consider ignoring bytecode when computing the hash (and possibly the srcrefs).
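A rough base R check of why the `as.list()` trick sidesteps bytecode: `as.list()` on a function returns its formals and body, and the body comes back as the original expression whether or not the function has been byte-compiled. The sketch below uses `compiler::cmpfun()` to force compilation instead of waiting for the JIT; `as_parts()` is a made-up helper that mirrors the conversion inside `my_hash()`.

```r
f <- function(x) x + 1
g <- compiler::cmpfun(f)  # same function, byte-compiled explicitly

# Hypothetical helper mirroring the conversion done inside my_hash().
as_parts <- function(fn) c("my_unique_function_marker", as.list(fn))

# Compare the list representation before and after compilation.
same_parts <- identical(as_parts(f), as_parts(g))
print(same_parts)
```

One caveat with this approach: the list holds only formals and body, so the function's environment is dropped. Two closures with identical code but different enclosing environments would hash the same.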
I see https://github.com/wch/r-source/blob/67c905672a7f4dd00d12d9a0f1763bc46b985bb5/src/main/serialize.c#L1023C5-L1045C6 that if
    if (R_compile_pkgs && TYPEOF(s) == CLOSXP && TYPEOF(BODY(s)) != BCODESXP &&
        !R_disable_bytecode &&
        (!IS_S4_OBJECT(s) || (!inherits(s, "refMethodDef") &&
                              !inherits(s, "defaultBindingFunction")))) {
        /* Do not compile reference class methods in their generators, because
           the byte-code is dropped as soon as the method is installed into a
           new environment. This is a performance optimization but it also
           prevents byte-compiler warnings about no visible binding for super
           assignment to a class field.
           Do not compile default binding functions, because the byte-code is
           dropped as fields are set in constructors (just an optimization).
        */
        SEXP new_s;
        R_compile_pkgs = FALSE;
        PROTECT(new_s = R_cmpfun1(s));
        WriteItem (new_s, ref_table, stream);
        UNPROTECT(1);
        R_compile_pkgs = TRUE;
        return;
    }
then R's `serialize` will compile functions during serialization, provided `BODY(s)` is not a `BCODESXP`. Maybe we should consider temporarily enabling
Sys.setenv("_R_COMPILE_PKGS_" = "1")
Sys.setenv("R_DISABLE_BYTECODE" = "0")
in `rlang::hash`? This should resolve most JIT cases and trim the source reference.
I tried `Sys.setenv("_R_COMPILE_PKGS_" = "1")` and it seemed to work only in reprex, not interactively... However, if I compile the function explicitly, the hashes are the same. There must be something missing...
options(keep.source = TRUE)
Sys.setenv("_R_COMPILE_PKGS_" = "1")
a <- compiler:::tryCmpfun( function() {})
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
Created on 2024-01-16 with reprex v2.1.0
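The explicit-compilation result above can be approximated without rlang or the internal `compiler:::tryCmpfun()`. This sketch uses the exported `compiler::cmpfun()` and compares raw `serialize()` output before and after calling the function. Treating the serialized bytes as a stand-in for what `rlang::hash()` sees is an assumption on my part, not something confirmed in this thread.

```r
# Create the closure in the global environment so that serializing it does not
# pull in variables from whatever local environment this code runs in.
a <- compiler::cmpfun(eval(quote(function() NULL), globalenv()))

# Snapshot the serialized form, call the function a few times, snapshot again.
before <- serialize(a, NULL)
a()
a()
after <- serialize(a, NULL)

# If the body is already bytecode, calling the function should leave nothing
# for the JIT to change.
print(identical(before, after))
```

If the two snapshots match, pre-compiling is one way to get a function whose serialized form stays fixed across calls, at the cost of hashing the compiled form and not the source.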
Disabling JIT will not help here because JIT kicks in when a function is called and `hash()` doesn't call the function.