AlexMili/torch-dataframe

Code organization

Closed this issue · 5 comments

I was thinking about a better way to organize the growing main file and tests. It's getting bigger and bigger after each of your great contributions, which is great. But it's also getting harder and harder to read and follow up.
I was thinking about splitting each file in multiples one but I don't know yet the splitting logic. Do you have any idea ?

I agree that this is a must. Currently I'm not aware of any way to split the class function definitions into several files (I tried a basic test but that didn't work out). I've posted on Gitter but not gotten any replies, I'll perhaps try StackOverflow once I'm done with the categories.

My current thought is that we should move all helper functions to a utils-file, e.g. the entire utils section, the _to_html and my new __to_string could probably be moved. There is also some double documentation at the moment - I'll post a new issue regarding this.

How to split a class into several files

I got a suggestion from @andreaskoepf to simply use the return value. Here's an example of an approach we could use:

main.lua

test = require("test")
assert(loadfile("test2.lua"))(test)

local a = test.new()
a:a()
a:b()
print("end")

The test.lua file

Here we create the class and return it to the main for passing onto the subsections:

test = torch.class('test')
function test:__init()
  self.data = {}
end

function test:a()
  print("a")
end

return test

The test2.lua

As we pass unnamed parameters to the file we need to extract the first and set the test class:

local params = {...}
local test = params[1]
function test:b()
  print("b")
end

This gives the desired output:

$ th main.lua
a
b
end

Possible structure

I haven't thought that thoroughly about this but the main themes seem to be:

  • Core
  • Load data (csv + table)
  • Save data
  • Print to screen
  • Print to itorch
  • Subset data (head/tail/sub/where)
  • Manipulate data
  • Utilities

Another thing that may be worth considering is if we should imitate Pandas' Series. This would be nice but I'm not sure its worth the effort.

There was some additional tweaking necessary in the init.lua/rockspec for everything to work (see my misc-branch for how I've implemented this). I've summarized everything in a SO-question as this information was surprisingly hard to get by.

The current subfiles that I have are:

  • load_data
  • save_data
  • load_batch
  • missing_data
  • output
  • select_set_update
  • statistics

I'm thinking of adding a categorical file with the specific categorical functions. I've tried to keep main to core functions but we should probably split it a little further now that the infrastructure is in place.

Mind blown

I'm thinking of adding a categorical file with the specific categorical functions.

I agree, categorical functions would be better in a specific file.

I've tried to keep main to core functions but we should probably split it a little further now that the infrastructure is in place.

Yes we should split again by creating sub-directories in Extensionsto keep it clean but for now it's great enough to try the new architecture you did ^^.

What you could do, is PR the new functionalities, then PR the new architecture (or vis versa). What do you think ?

Glad you like it. We will unfortunately have to do the file split last as that is how I've organized the branches. I'll add the next PR tonight