r-lib/rlang

`hash` generates different results on identical objects (even with same memory address)

dipterix opened this issue · 11 comments

I thought `hash` was supposed to generate the same result for identical objects. Could you help me with the following cases?

options(keep.source = TRUE)
a <-   function(){}
rlang::hash(a)
#> [1] "eca7e650f5be54ba7c122fbc88ed0811"
a <- function(){}
rlang::hash(a)
#> [1] "7bcbcbd5583248d94607c79bab4b70f0"
a()
#> NULL
rlang::hash(a)
#> [1] "a70206d89b1b3cc96363d3413aea8ed6"
a()
#> NULL
rlang::hash(a)
#> [1] "c3b887fbb758842048691f04406c130c"

Created on 2024-01-16 with reprex v2.1.0

Also

memF <- memoise::memoise(function(f){ f() })

a <- function(){
  message("a is evaluated")
}

memF(a)
#> a is evaluated
memF(a)
#> a is evaluated
memF(a)
#> a is evaluated
memF(a)

Try rlang::zap_srcref() on the function.

zap_srcref() removes the source references. I turned keep.source off, and the hash still changes between calls.

options(keep.source = FALSE)
a <-   function(){}
rlang::hash(a)
#> [1] "f53c7a98354786f99f3bdd8a1f655827"
a <- function(){}
rlang::hash(a)
#> [1] "57f55bbc9d540a599eb94edb2b4422d1"
a()
#> NULL
rlang::hash(a)
#> [1] "b5c51b4d7ecdc52b7ad405be5e10d998"
a()
#> NULL
rlang::hash(a)
#> [1] "4489adf065cb6c71e077228d56c46424"

R code is evaluated only after it has been parsed, and source refs are attached by the parser, so setting the option from within the same source has no effect on that source.
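To illustrate the parse-time point, here is a minimal base-R sketch. It parses the same text twice and toggles `keep.source` on the `parse()` call itself rather than through `options()`. The helper name `has_srcref` is made up for this example.

```r
# Parse identical source text twice; only the parse-time setting differs.
src <- "function() {}"
f_with    <- eval(parse(text = src, keep.source = TRUE))
f_without <- eval(parse(text = src, keep.source = FALSE))

# Hypothetical helper: does the closure carry a "srcref" attribute?
has_srcref <- function(fn) !is.null(attr(fn, "srcref"))

has_srcref(f_with)     # srcref was attached by the parser
has_srcref(f_without)  # nothing to attach, so nothing to hash
```

Whether the srcref exists is decided by the time `parse()` returns, so changing the option later in the same file cannot change it.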

Call zap_srcref() before hash() if you need a stable hash.

options(keep.source = FALSE)
a <- rlang::zap_srcref(function(){})
rlang::hash(a)
#> [1] "b7e8ef5f48c3aa74a30b6ca39fbac850"
a()
#> NULL
rlang::hash(a)
#> [1] "f5852273190358bbe0b6e8328b37a4d6"
a()
#> NULL
rlang::hash(a)
#> [1] "c40ea5774ecd8ba9dccf890d710f0b46"

oh that's the bytecode I bet

From the JIT
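A rough way to check the bytecode hypothesis, sketched under some assumptions: `.Internal(bodyCode())` is not public API and is used here only for inspection, `is_compiled` is a made-up helper name, and when (or whether) the JIT compiles a closure depends on the R version and JIT settings, so no output is shown for those calls.

```r
# Hypothetical helper: TRUE when the closure's body is stored as bytecode.
# .Internal(bodyCode()) is an internal, unsupported entry point.
is_compiled <- function(fn) typeof(.Internal(bodyCode(fn))) == "bytecode"

f <- function() NULL
is_compiled(f)        # state of a freshly created closure

for (i in 1:3) f()    # give the JIT a chance to compile it
is_compiled(f)        # may differ from the first result

# Explicit compilation through the public compiler API, for comparison.
g <- compiler::cmpfun(f)
is_compiled(g)
```

If the second result differs from the first, the serialized form of `f` has changed even though its source has not, which would be consistent with the changing hashes above.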

Gotcha, then how can I get rid of it and produce stable results?

You could do something like this:

my_hash <- function(x) {
  if (is.function(x)) {
    # Attach a marker to disambiguate from an actual list
    x <- c("my_unique_function_marker", as.list(x))
  }
  rlang::hash(x)
}
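A quick check of that workaround might look like the sketch below. It repeats `my_hash()` so the snippet is self-contained; the test function `f` and the number of calls are arbitrary choices for illustration.

```r
my_hash <- function(x) {
  if (is.function(x)) {
    # Attach a marker to disambiguate from an actual list
    x <- c("my_unique_function_marker", as.list(x))
  }
  rlang::hash(x)
}

f <- function(x) x + 1

before <- my_hash(f)
for (i in 1:3) f(1)   # call it a few times so the JIT can kick in
after <- my_hash(f)

identical(before, after)
```

Note that `as.list()` on a function keeps only the formals and the body, so this hash ignores the closure's environment and attributes. Two functions with the same code but different enclosing environments would get the same hash, which may or may not be what you want for memoisation.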

On our side we should consider ignoring bytecode when computing the hash (and possibly the srcrefs).

I see in https://github.com/wch/r-source/blob/67c905672a7f4dd00d12d9a0f1763bc46b985bb5/src/main/serialize.c#L1023C5-L1045C6 that if

    if (R_compile_pkgs && TYPEOF(s) == CLOSXP && TYPEOF(BODY(s)) != BCODESXP &&
        !R_disable_bytecode &&
        (!IS_S4_OBJECT(s) || (!inherits(s, "refMethodDef") &&
                              !inherits(s, "defaultBindingFunction")))) {

        /* Do not compile reference class methods in their generators, because
           the byte-code is dropped as soon as the method is installed into a
           new environment. This is a performance optimization but it also
           prevents byte-compiler warnings about no visible binding for super
           assignment to a class field.

           Do not compile default binding functions, because the byte-code is
           dropped as fields are set in constructors (just an optimization).
        */

        SEXP new_s;
        R_compile_pkgs = FALSE;
        PROTECT(new_s = R_cmpfun1(s));
        WriteItem (new_s, ref_table, stream);
        UNPROTECT(1);
        R_compile_pkgs = TRUE;
        return;
    }

then R's serialize() will compile the function during serialization, provided BODY(s) is not already a BCODESXP. Maybe we should consider temporarily enabling

Sys.setenv("_R_COMPILE_PKGS_" = "1")
Sys.setenv("R_DISABLE_BYTECODE" = "0")

in rlang::hash? This should resolve most JIT cases and trim the source reference.

I tried Sys.setenv("_R_COMPILE_PKGS_" = "1"), and it seemed to work only in reprex, not interactively. However, if I compile explicitly, the function hashes are the same. There must be something missing...

options(keep.source = TRUE)
Sys.setenv("_R_COMPILE_PKGS_" = "1")
a <- compiler:::tryCmpfun(  function() {})
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"
a()
#> NULL
rlang::hash(a)
#> [1] "bcbe14ef3a47a590767f94f36c0765d8"

Disabling JIT will not help here because JIT kicks in when a function is called and hash() doesn't call the function.