About the interface for transforming a DataFrame into a Hugging Face dataset with an Image-typed column.
svjack opened this issue · 2 comments
I reviewed how this project builds its DataFrame. When a DataFrame has an image_url column,
it seems the fast image display works by rendering each cell as an img src, passing it to "repr_html", and showing it as HTML in the Jupyter notebook rather than downloading the actual image.
(The YAML config file also defines the image formatter used to display images at different sizes.)
As your documentation says, the "to_"-prefixed methods (such as to_csv and to_arrow) all drop
the image column, while the "write" and "read" methods save only the "config" and do not trigger the actual download.
This design makes the image conversion lossy, so it is inconvenient when I want to quickly initialize a Hugging Face dataset from your DataFrame (e.g. Dataset(df.to_arrow())).
I think you should add a trigger that actually downloads the images in the image column, wrapped in a timeout
decorator. Since you already have a _write_empty_image definition, adding this function should be easy.
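A minimal sketch of what such a download trigger could look like, using only the Python standard library. The names `fetch_bytes` and `download_image_column` are hypothetical and are not part of Meerkat; the sketch uses the `timeout` argument of `urllib.request.urlopen` in place of a timeout decorator, and records `None` for a failed download where an empty placeholder image could be written instead.

```python
import urllib.request
from typing import Callable, List, Optional


def fetch_bytes(url: str, timeout: float = 5.0) -> bytes:
    """Download the raw bytes at `url`, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_image_column(
    urls: List[str],
    fetch: Callable[[str, float], bytes] = fetch_bytes,
    timeout: float = 5.0,
) -> List[Optional[bytes]]:
    """Hypothetical helper: materialize an image_url column as raw bytes.

    A URL that times out, is unreachable, or is malformed yields None,
    so one bad link does not abort the whole column.
    """
    images: List[Optional[bytes]] = []
    for url in urls:
        try:
            images.append(fetch(url, timeout))
        except (OSError, ValueError):
            # URLError and socket timeouts are OSError subclasses;
            # a malformed URL raises ValueError.
            images.append(None)
    return images
```

The resulting list could then be wrapped as `{"bytes": b, "path": None}` entries and cast to an image feature on the Hugging Face side; that step is an assumption here and should be checked against the `datasets` documentation.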
Generally, we avoid downloading / reading in images as far as possible, but it sounds like this might be a case where that isn't appropriate.
Would you mind pasting in a small example code snippet that shows what you would like to do from Meerkat -> HF Dataset? We can take a closer look.