rstudio/reticulate

Missing methods in autocomplete for subclassed dicts

Closed · 1 comment

Hello - I'm trying to replicate the code snippet that's located in the top box of this HuggingFace NLP course: https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt.

For reference, here's the Python code that I'd like to transcribe into reticulate:

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Here's the R code I wrote to import the datasets module and pull the raw data:

library(reticulate)
virtualenv_create("test")
py_install(c("datasets", "transformers"), envname = "test")
datasets <- import("datasets", convert = FALSE)
raw_datasets <- datasets$load_dataset("glue", "mrpc")

You'll notice that, in Python, the raw_datasets object has a method called map. When I start typing raw_datasets$..., map doesn't come up; all I see are the three dictionary keys ("train", "validation", "test"). I do see it if I start typing raw_datasets$train$...

However, I see map as an attribute on the object when I pull it into R:
py_list_attributes(raw_datasets)

The map entry is missing from autocomplete because raw_datasets is an instance of a Python dict subclass, and reticulate special-cases Python dictionaries so that completion shows only items, not attributes. This is because $ is overloaded to handle both attributes and items, so we have to make reasonable guesses about which to prefer in each case (and make some effort not to present items and attributes as interchangeable in the R interface).

reticulate/R/python.R

Lines 475 to 480 in c9ce74c

if (inherits(x, "python.builtin.dict")) {
  names <- py_dict_get_keys_as_str(x)
  names <- names[substr(names, 1, 1) != '_']
  Encoding(names) <- "UTF-8"
  types <- rep_len(0L, length(names))
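
To make the ambiguity concrete, here is a minimal Python sketch of the situation described above. SplitDict is a hypothetical stand-in for the dict subclass that load_dataset returns, not the actual class from the datasets package. It only shows that a dict subclass carries two separate namespaces (its keys and its attributes), and that a completion built from keys alone will never list a method like map.

```python
# Hypothetical stand-in for the object returned by load_dataset():
# a dict subclass that also defines a method named `map`.
class SplitDict(dict):
    def map(self, fn):
        """Apply fn to every value and return a new SplitDict."""
        return SplitDict({key: fn(value) for key, value in self.items()})


splits = SplitDict(train=[1, 2], validation=[3], test=[4, 5, 6])

# Items: the names a key-based completion would offer.
item_names = sorted(splits.keys())

# Attributes: what dir() reports, with leading-underscore names dropped
# (the same kind of filter the R excerpt applies to dict keys).
public_attrs = [name for name in dir(splits) if not name.startswith("_")]

print(item_names)              # ['test', 'train', 'validation']
print("map" in public_attrs)   # True
print("map" in item_names)     # False
```

Because `$` on the R side has to pick one of these two namespaces for completion, listing only the keys for dict-like objects hides map even though the method is still reachable.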

Now that @ is an S3 generic, it might be a good idea to add an @ method that is a dedicated Python attr accessor.