nuprl/MultiPL-E

R unit tests atomic vector comparison

Closed this issue · 7 comments

Hi Arjun and Co.,

I was testing generations for R and realized that when a program outputs a vector that we expect to be atomic (all elements have the same type), the unit tests still compare it against a list (which allows elements of different types). The downside is that when a program outputs c(1,2) and is checked against list(1,2), the unit test reports it as incorrect.

An example unit test can be found in even_odd_palindrome:

candidate <- even_odd_palindrome
    if(!identical(candidate(123), list(8, 13))){quit('no', 1)}
    if(!identical(candidate(12), list(4, 6))){quit('no', 1)}
    if(!identical(candidate(3), list(1, 2))){quit('no', 1)}
    if(!identical(candidate(63), list(6, 8))){quit('no', 1)}
    if(!identical(candidate(25), list(5, 6))){quit('no', 1)}
    if(!identical(candidate(19), list(4, 6))){quit('no', 1)}
    if(!identical(candidate(9), list(4, 5))){quit('no', 1)}
    if(!identical(candidate(1), list(0, 1))){quit('no', 1)}
}

where we are obviously expecting an atomic vector of length 2 whose two elements have the same type (integers for counts).

There are a couple of other places where this happens, causing perfectly good generations to fail unit tests. Could y'all help look into how to fix it in the transpiler?

This might be one of the reasons R performance is absurdly low.

Thanks so much!

Here I am updating a list of all programs whose unit tests are affected:

HumanEval_5_intersperse
HumanEval_6_parse_nested_parens
HumanEval_7_filter_by_substring
HumanEval_8_sum_product
HumanEval_9_rolling_max
HumanEval_20_find_closest_elements
HumanEval_21_rescale_to_unit
HumanEval_22_filter_integers
HumanEval_25_factorize
HumanEval_29_filter_by_prefix
HumanEval_30_get_positive
HumanEval_33_sort_third
HumanEval_37_sort_even
HumanEval_58_common
HumanEval_62_derivative
HumanEval_68_pluck
HumanEval_70_strange_sort_list
HumanEval_74_total_match
HumanEval_81_numerical_letter_grade
HumanEval_88_sort_array
HumanEval_96_count_up_to
HumanEval_100_make_a_pile
HumanEval_101_words_string
HumanEval_105_by_length
HumanEval_113_odd_count
HumanEval_117_select_words
HumanEval_120_maximum
HumanEval_123_get_odd_collatz
HumanEval_125_split_words
HumanEval_130_tri
HumanEval_148_bf
HumanEval_152_compare
HumanEval_155_even_odd_count
HumanEval_159_eat
HumanEval_163_generate_integers

To that end, I wonder whether the same rule could be applied to the inputs as well (i.e. when an input is a Python list and all of its elements have the same type, transpile it to an atomic vector).

One example program would be HumanEval_127_intersection

where the tests are

candidate <- intersection
    if(!identical(candidate(list(1, 2), list(2, 3)), 'NO')){quit('no', 1)}
    if(!identical(candidate(list(-1, 1), list(0, 4)), 'NO')){quit('no', 1)}
    if(!identical(candidate(list(-3, -1), list(-5, 5)), 'YES')){quit('no', 1)}
    if(!identical(candidate(list(-2, 2), list(-4, 0)), 'YES')){quit('no', 1)}
    if(!identical(candidate(list(-11, 2), list(-1, -1)), 'NO')){quit('no', 1)}
    if(!identical(candidate(list(1, 2), list(3, 5)), 'NO')){quit('no', 1)}
    if(!identical(candidate(list(1, 2), list(1, 2)), 'NO')){quit('no', 1)}
    if(!identical(candidate(list(-2, -2), list(-3, -2)), 'NO')){quit('no', 1)}
}

and ideally the inputs should be c(1,2) not list(1,2)
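A minimal sketch of that input-side rule, written against plain Python values rather than the string fragments the transpiler actually passes around. The names `scalar_kind` and `to_r` are hypothetical, and the handling of booleans and None is an assumption for illustration; none of this is code from humaneval_to_r.py.

```python
def scalar_kind(v):
    """Classify a Python scalar by the R type it would map to (assumed mapping)."""
    if isinstance(v, bool):  # check bool first: bool is a subclass of int
        return "logical"
    if isinstance(v, (int, float)):
        return "numeric"
    if isinstance(v, str):
        return "character"
    return None  # nested containers, None, and anything else are not scalars


def to_r(value):
    """Render a Python literal as R source text (illustrative only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    if isinstance(value, (list, tuple)):
        items = [to_r(v) for v in value]
        kinds = {scalar_kind(v) for v in value}
        # Use c() only for a non-empty container whose elements are all
        # scalars of one kind; everything else stays a list().
        homogeneous = len(kinds) == 1 and None not in kinds
        ctor = "c" if homogeneous else "list"
        return ctor + "(" + ", ".join(items) + ")"
    raise TypeError(f"unsupported literal: {value!r}")


print(to_r([1, 2]))            # homogeneous numbers
print(to_r([1, "a"]))          # mixed types
print(to_r([[1, 2], [3, 4]]))  # nested containers
```

Under this sketch, the first intersection test would call candidate(c(1, 2), c(2, 3)) instead of candidate(list(1, 2), list(2, 3)), while mixed or nested containers keep the list() constructor.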

@arjunguha Sorry to bother you repeatedly (and for tagging), but it seems that 35 / 161 (22%) of the R tests are compromised for the reason I described in the issue. This likely needs a fix in the transpiler code that generates the tests.

Looking at the gen_list and gen_tuple functions in dataset_builder/humaneval_to_r.py, there used to be 2 commented-out lines that I think would be the exact solution to this problem, but they were somehow reverted in this commit. @canders1 Could you help me understand why that decision was made? I'm not an expert in R, so I could totally be missing something here.

I think something like this should work

    def is_atomic(self, l):
        '''Return True when all elements of l appear to have the same R type.
           Inputs are all strings, so we need to determine what type they are in R
        '''
        def get_r_type(e: str):
            if e.startswith("c("):
                return "vector"
            elif e.startswith("list("):
                return "list"
            elif e == "NULL":
                return "nan"
            else:
                # https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-represents-a-number-float-or-int
                # Drop one "-" and one "." so that ints and floats both pass isdigit()
                return "numeric" if e.replace("-", "", 1).replace(".", "", 1).isdigit() else "string"
        type_set = {get_r_type(e) for e in l}
        # Zero or one distinct type counts as atomic
        return len(type_set) <= 1


    def gen_list(self, l):
        '''Translate a list with elements l
           A list [ x, y, z ] translates to c(x, y, z) when all elements
           have the same type, and to list(x, y, z) otherwise
        '''
        # if len(set(types)) <= 1:
        if self.is_atomic(l):
            return "c(" + ", ".join(l) + ")"
        return "list(" + ", ".join(l) + ")"

    # there are no r tuples, but r lists are mostly immutable?
    def gen_tuple(self, t):
        '''Translate a tuple with elements t
           A tuple (x, y, z) translates to c(x, y, z) when all elements
           have the same type, and to list(x, y, z) otherwise
        '''
        # if len(set(types)) <= 1:
        if self.is_atomic(t):
            return "c(" + ", ".join(t) + ")"
        return "list(" + ", ".join(t) + ")"
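To try the heuristic outside the translator class, here is a self-contained sketch that restates it as free functions and runs it on a few already-translated fragments. One caveat it illustrates: when every element is itself a c(...) call, the check passes, and in R c(c(1, 2), c(3, 4)) flattens into a single length-4 vector. The stricter variant `is_atomic_strict` is a hypothetical alternative, not part of the proposal above, and the label "null" is only an internal tag.

```python
def get_r_type(e: str) -> str:
    """Guess the R type of an already-translated expression from its text."""
    if e.startswith("c("):
        return "vector"
    if e.startswith("list("):
        return "list"
    if e == "NULL":
        return "null"
    is_number = e.replace("-", "", 1).replace(".", "", 1).isdigit()
    return "numeric" if is_number else "string"


def is_atomic(elements):
    """The proposed check: at most one distinct inferred type."""
    return len({get_r_type(e) for e in elements}) <= 1


def is_atomic_strict(elements):
    """Hypothetical stricter check: non-empty and all scalars of one kind."""
    kinds = {get_r_type(e) for e in elements}
    return len(kinds) == 1 and kinds <= {"numeric", "string"}


def gen_list(elements, check=is_atomic):
    ctor = "c" if check(elements) else "list"
    return ctor + "(" + ", ".join(elements) + ")"


print(gen_list(["8", "13"]))                                    # c(8, 13)
print(gen_list(["1", "'a'"]))                                   # list(1, 'a')
print(gen_list(["c(1, 2)", "c(3, 4)"]))                         # c(c(1, 2), c(3, 4))
print(gen_list(["c(1, 2)", "c(3, 4)"], check=is_atomic_strict)) # list(c(1, 2), c(3, 4))
```

Whether nested or empty containers should stay as list() depends on what the reference R solutions are expected to return, so treat the strict variant as a point for discussion rather than a fix.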

Thanks! Do you want to do a PR? (Against the dev branch.) CC @mhyee

mhyee commented

Hi @PootieT,

I agree that when all elements have the same type, we should use c() instead of list().

I'm taking a look at your proposed fix, and will open a PR.

@PootieT any opinions on the proposed fix here?

looks good!