JuliaData/Missings.jl

`length` and iteration?

Closed this issue · 6 comments

I think this package is going very well and I'm on board with most of it (e.g. the behavior of comparisons and logical operators), but I find the definitions of length, iteration, and the array interface methods pretty sketchy. Those would be OK if null always represented a missing number, but if it's going to represent a missing string (or something else) then those will give misleading results. Are we 100% sure we need those methods? The approach I favor is to add no methods to Null until it becomes extremely clear that they're needed to avoid major pain. Are there examples where that threshold has been reached for e.g. length?

I think length and iteration were carry-overs from porting NAtype, so I'm not too familiar with their use case (my approach in porting was actually as you described: only port something if it was needed somewhere). I think @ararslan mentioned at one point that null should behave like a Number in those cases, but I don't remember whether there was an explicit reason. Which array interface methods do you mean?

One way to test some of these out would be to take @nalimilan's branch that ports DataArrays to Nulls and see which tests fail: JuliaStats/DataArrays.jl#288

Let's remove iteration on Number in Base itself? :-)

More seriously, I agree we should try removing these methods and only reintroduce them if we realize they are really needed. Testing DataArrays first is a good idea, I'll do that after updating it to take into account recent changes in Nulls.jl.

Just tried it, there are only a few lines to change. Three of them explicitly tested the removed methods, so that's expected. Two others are of the form:

dvstr = @data ["one", "two", null, "four"]
all([length(x)::Int for x in dvstr] == [3, 3, 1, 4])

I think they also qualify as non-use cases.
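To make the problem with that test concrete, here is a minimal sketch that reproduces the same numbers without DataArrays. `FakeNull` and `fakenull` are hypothetical stand-ins invented for this illustration, not the package's actual type, and the `length` method is only an assumption about what a Number-like definition looks like.

```julia
# Hypothetical stand-in for the null value; NOT the real type from the package.
struct FakeNull end
const fakenull = FakeNull()

# Assumed Number-like definition of the kind being discussed: a null has length 1.
Base.length(::FakeNull) = 1

# A vector of strings with one missing entry, mirroring the test data above.
words = Any["one", "two", fakenull, "four"]

# Every entry reports a length, so the missing string silently counts as 1 character.
lens = [length(x) for x in words]
@assert lens == [3, 3, 1, 4]
```

The `1` in the third position says nothing about the string that is missing, which is why a test asserting it does not look like a real use case.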

So overall I think we should remove them. We could also imagine providing length, size and ndims, but having them return null. It's probably better to remove them first, and add the versions returning null only if they turn out to be useful.
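The two options can be compared side by side in a rough sketch. `StrictNull`, `PropagatingNull` and `length_or_error` are hypothetical names made up for this example; neither type is the package's real one.

```julia
# Option 1 (hypothetical): no collection methods are defined at all.
struct StrictNull end

# Option 2 (hypothetical): length, size and ndims return the null itself.
struct PropagatingNull end
Base.length(x::PropagatingNull) = x
Base.size(x::PropagatingNull) = x
Base.ndims(x::PropagatingNull) = x

# Helper for the demonstration: call length, and report a MethodError as a symbol.
function length_or_error(x)
    try
        length(x)
    catch err
        err isa MethodError || rethrow()
        :method_error
    end
end

# With no methods, the call fails loudly instead of returning a misleading number.
@assert length_or_error(StrictNull()) == :method_error

# With propagating methods, the missingness is carried through to the result.
@assert length_or_error(PropagatingNull()) === PropagatingNull()

# Ordinary values are unaffected either way.
@assert length_or_error("four") == 4
```

Option 1 surfaces the problem at the call site, while option 2 defers it to whoever consumes the result; starting with option 1 keeps the door open to adding option 2 later.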

For clarity, the full set of methods I'm talking about is: length, size, ndims, getindex, start, next, done
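For reference, this is the behavior that Number has in Base and that these methods would mirror. The snippet only exercises Base; `count_iterations` is a helper written for this example. Note that `start`, `next` and `done` are the iteration protocol of Julia versions before 1.0; a plain `for` loop is used here so the sketch does not depend on which protocol is in place.

```julia
# Count how many times a value yields an element when iterated.
function count_iterations(x)
    n = 0
    for _ in x
        n += 1
    end
    return n
end

x = 5

# A number acts like a zero-dimensional container holding itself.
@assert length(x) == 1
@assert size(x) == ()
@assert ndims(x) == 0
@assert x[1] == 5
@assert count_iterations(x) == 1

# A string, by contrast, has a length that depends on its content.
@assert length("four") == 4
```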

> I think @ararslan mentioned at one point that null should behave like a Number in those cases

I did? o_O

I think that null should be able to represent missing data of any kind, as it does in SQL, but I don't think it should silently behave like a number in all cases.

See #40.