AlexMili/torch-dataframe

Internal storage

gforge opened this issue · 8 comments

I think that it would be beneficial to use Torch's tensors as internal storage instead of tables. This would:

  • give us more efficient storage (the flexibility of tables probably comes at a cost)
  • reduce the risk of conversion issues in to_tensor()
  • separate float/double from integers, which would be beneficial in the output functions.

The API changes would probably mostly affect get_column(), where as_tensor should default to true. This isn't something that I plan to pursue at the moment, but I figured I'd add it here as it could be worth considering.

It is a good idea.

Do you mean we would keep strings and integers in a table and doubles and floats in one or more tensors? Or do you suggest using CharStorage to store our strings in tensors?

I was thinking of using int or long for integers and float/double for floats (see types) while keeping strings in tables. I haven't looked at CharStorage, but it could be an interesting option.

We can look at CharStorage as a further enhancement. Right now your suggestion is great ;)

One important thing that I changed was _infer_schema. You previously checked a proportion of the rows to infer the types. This is problematic with small datasets such as our test datasets, and checking 1000 rows should be rather cheap, so I changed the code to:

function Dataframe:_infer_schema(max_rows)
    rows_to_explore = math.min(max_rows or 1e3, self.n_rows)

With the integer functionality we should probably replace 'number' with 'integer' and 'float'. The concept is that a column's schema type falls back along integer --> float --> string.
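A minimal sketch of how such an inference could work, in plain Lua without Torch. It is not the actual torch-dataframe implementation: the names `TYPE_RANK`, `value_type`, and the free function `infer_schema` with its `columns`/`n_rows` arguments are hypothetical and only illustrate the row cap and the integer --> float --> string fallback.

```lua
-- Hypothetical ranking: a column type can only move up this order, never down.
local TYPE_RANK = { integer = 1, float = 2, string = 3 }

-- Classify a single raw cell value (for example a string read from a CSV).
local function value_type(value)
    local number = tonumber(value)
    if number == nil then
        return "string"
    end
    if number % 1 == 0 then
        return "integer"
    end
    return "float"
end

-- Infer one type per column from at most `max_rows` rows (default 1000).
-- `columns` maps column name -> array of raw values (an assumed layout);
-- `n_rows` is the number of rows in the dataset.
local function infer_schema(columns, n_rows, max_rows)
    local rows_to_explore = math.min(max_rows or 1e3, n_rows)
    local schema = {}
    for name, values in pairs(columns) do
        local col_type = "integer"  -- start at the narrowest type
        for i = 1, rows_to_explore do
            local t = value_type(values[i])
            if TYPE_RANK[t] > TYPE_RANK[col_type] then
                col_type = t  -- widen the type; never narrow it again
            end
        end
        schema[name] = col_type
    end
    return schema
end

local schema = infer_schema({
    id     = { "1", "2", "3" },
    weight = { "1", "2.5", "3" },
    label  = { "1", "b", "3.5" },
}, 3)
print(schema.id, schema.weight, schema.label)
```

Note that this sketch does not handle missing values, and a column with no explored rows stays at 'integer'; both would need a decision in real code.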

Just got a hint that may solve the string storage issue: https://github.com/torch/tds

Looks great! Furthermore, it would allow more complex string operations in the future :)

Feature implemented; it will be merged into develop once the doc script is updated and a working update to 1.6 is added. Until then it's in the feature branch internat_storage.

A few issues seem to remain:

  • Non-luajit fails in Travis
  • get_mode fails with categorical columns