JuliaData/Missings.jl

Why move away from the Nullable approach?

Closed this issue · 28 comments

This is a pretty basic question, but I guess I'm not entirely clear why the goal is now to give up on something like Nullable? Is Union{T,Null} believed to eventually be faster? Or some other reason?

I did read the julep that John wrote, but it actually never explained why there is this general shift towards Union{T,Null}.

Union{T, Null} is the representation of missingness that makes the most sense; it's intuitively how one would think that missing values should work. Indeed, it's how they work in nearly all other statistical software. You have a scalar value that has type T or is null. The problem with Nullable is that the container-based approach is incredibly awkward for data analysis. We can tack on all of the automated machinery we want to try to pretend that it's not a container, but there will always be more cases where something doesn't quite work nicely. Not everyone wants to have to do all of their analysis via macros that perform lifting or unwrapping or whatever; being able to interact with missing values as scalars is what people are used to and expect.

This was the approach taken by DataArrays. We wanted missing values to be scalars. The only problem was performance due to type instability, which is why people started exploring the use of Nullable for these tasks. It addressed a significant concern, but had significantly poorer usability. It became increasingly clear that it just wasn't the right solution. So instead the plan is to just make the natural representation, Union{T, Null}, fast.

So instead the plan is to just make the natural representation, Union{T, Null}, fast.

An important point is that while we long took for granted that the Union approach was doomed to be slow, @vtjnash said he would be able to make it efficient in time for Julia 1.0. The impact of type instability inside functions can be reduced to a mere branch for unions, and the memory layout of Array{Union{T, Null}} can be changed to be (almost) the same as DataArray/NullableArray: an array of values plus an array of type tags, with each tag stored in one byte.
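To make the layout described above concrete, here is a rough sketch. `Null`, `null`, `TaggedVector` and `dense` are hypothetical names defined only for this illustration; this is not how Julia Base implements union arrays, just the general idea of a values array paired with one tag byte per element.

```julia
# Hypothetical stand-in for the package's null type.
struct Null end
const null = Null()

# Sketch of the described layout: a dense values array plus one tag byte per element.
struct TaggedVector{T} <: AbstractVector{Union{T,Null}}
    values::Vector{T}     # payload; slots that hold a null contain arbitrary data
    tags::Vector{UInt8}   # 0x00 = null, 0x01 = value of type T
end

Base.size(v::TaggedVector) = size(v.values)
Base.IndexStyle(::Type{<:TaggedVector}) = IndexLinear()

# Indexing is a single branch on the tag and returns a plain scalar, not a wrapper.
Base.getindex(v::TaggedVector, i::Int) =
    v.tags[i] == 0x01 ? v.values[i] : null

# When no element is null, the values array is already a Vector{T}; no copy is needed.
dense(v::TaggedVector) = all(t -> t == 0x01, v.tags) ? v.values : nothing

v = TaggedVector([10, 0, 30], UInt8[0x01, 0x00, 0x01])
w = TaggedVector([1, 2], UInt8[0x01, 0x01])
```

Here `v[2]` yields `null` while `v[1]` yields an ordinary `Int`, and `dense(w)` hands back the underlying values array itself.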

And to be clear, I think there are certainly use-cases where Nullable makes more sense. In a lot of regular, day-to-day programming and APIs, using Nullable to indicate explicit possible missingness/unknown values is great, clarifies API contracts, and can mesh well w/ the Nullable-as-a-container broadcast functionality.

But as @ararslan mentioned, Union{T, Null} is a more natural representation coming from the data analysis, database worlds, where things tend to be typed T, but allow null values.

I'm probably also a little at fault here for the confusion: a while ago (probably 6-7 months now) I called for an offline discussion on the future of DataFrames/Nullability; I tried to invite the major community players that I knew personally and any core developers who would be willing to discuss. It was a time when things were very unclear with regard to a strong sense of purpose/direction for DataFrames/Nullability, and I felt that having a real-time discussion to hash out the most outstanding issues would help re-invigorate and focus some of the key players/packages. Obviously, those kinds of discussions inevitably leave people out, and you have since emerged as one of the key people in this space. On the other hand, I felt like the discussion really served its purpose, since it helped clarify (as @nalimilan mentioned) some of the compiler possibilities that most of us had previously written off as "probably not going to be possible or any time soon".

All of that said, I want to re-iterate that we're all still very much in an experimentation phase; I'm by no means convinced that the Union{T, Null} approach is going to completely work out and be a long-term solution, but it has appealing properties that make me want to experiment and give it a shot. I'm also firmly committed to trying to collaborate as much as possible w/ fellow community members and developers. I think we all win when we share feedback and iterate together on coming up with the best designs. Open source is inherently hard sometimes due to the lack of personal connections and asynchronous workflow, but I think we've been able to take it a step further in the Julia ecosystem by making efforts to reach out and be inclusive.

Thanks @quinnj, that is helpful. And let me stress, I really want to sort this out with you guys as well, my goal for Query.jl has always been to not have any missing value specific code at all in it :)

I think I have two worries with the Union{T,Null} approach right now. One is that I haven't seen a fully specified description of how this is meant to work, including the lifting story etc. My worry here is that it looks more "natural" at this point because no one has gone through all the use cases and discovered the issues with that representation. The second is pure timing. I'm not sure what the current timeline for Julia 1.0 is. The last "official" word I have seen is this, and if this is even halfway true I'm worried that this whole transition can't be pulled off for Julia 1.0.

Having said that, I think the approach here is obviously the right one: experiment with this and see where it leads us. At the same time I would very much like to see a similar attempt to sort out the remaining issues with the container based approach, so that at the end of the day we can compare both approaches and see which one works better.

I'll write up one proposal for the container based approach in a few moments. If that is ok I'll do it here, maybe we can just centralize the whole discussion about this in this repository here?

One is that I haven't seen a fully specified description of how this is meant to work

That's what John Myles White was describing in his Julep.

including the lifting story etc

It's unlikely that automatic lifting will be a thing, as Jeff Bezanson is firmly opposed to it. It's intended that the functionality being explored in this package will be moved into Base eventually, at which time any function in Base that can deal with null values will be overloaded as such. Jeff's idea is then that not providing methods that handle missing values is a way of saying, "This function doesn't have a natural way to deal with missing values." It's expected that package authors would follow suit with their functions.
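As a minimal sketch of what this whitelist style looks like in practice (with a hypothetical `Null` type and a made-up user function `double`, neither taken from this package): functions that have a natural null semantics get explicit methods, and everything else simply fails with a MethodError.

```julia
# Hypothetical stand-in for the package's null type.
struct Null end
const null = Null()

# Whitelisted: addition has a natural meaning for nulls, so methods are provided.
Base.:+(::Null, ::Number) = null
Base.:+(::Number, ::Null) = null
Base.:+(::Null, ::Null) = null

# A made-up function whose author provided no null method.
double(x::Number) = 2x

a = null + 1   # null propagates through the whitelisted operator
b = 1 + null

# double(null) has no applicable method, so it throws instead of guessing.
err = try
    double(null)
catch e
    e
end
```

The absence of a `double(::Null)` method is the signal described above: the author has not said what the function should do with a missing value.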

I consider the container-based approach to be somewhat of a failed experiment. It was good that it was explored, but there's a reason why thinking has shifted back to the DataArrays-style approach.

Here is one proposal for a general missing data value approach for the julia data science area.

We use DataValue as it exists in DataValues.jl to represent missing data throughout. This type is similar to Nullable, but with some key differences. Let's also assume that JuliaLang/julia#21875 makes it into Base. At that point we would have the following situation:

Scalars

For basic arithmetic infix operators on DataValues, the following would work:

DataValue(3) + DataValue(2) # returns DataValue(5)

DataValue(3) + 2 # returns DataValue(5)

DataValue() + 2 # returns DataValue{Int}()

This works because we provide methods for arithmetic infix operators that propagate null values.

For all comparison operators on DataValues, the following would work:

DataValue(3) == DataValue(3) # returns true

DataValue(3) == 3 # returns true

DataValue(3) > 2 # returns true

DataValue(3) > DataValue() # returns false

In general you can use the call-site operator . to create a lifted version of a function that propagates missing values. So these would all work:

log.(DataValue(3)) # returns DataValue(log(3))

DataValue(3) .+ DataValue(2) # returns DataValue(5)

DataValue(3) .== 3 # returns DataValue(true)

Arrays

Let's assume we have an array x = DataValue{Int}[DataValue(3), DataValue(5)].

Basic arithmetic infix operators would be used like this:

x .+ x # returns Array{DataValue{Int},1}

For comparisons one would use this:

x .== x # returns Array{Bool, 1}

x ..== x # returns Array{DataValue{Bool},1}

For any other function that one wants to apply element wise with the null propagating lifting semantics, one would write:

log..(x) # returns Array{DataValue{Float64},1}

For higher order functions, one could easily apply the null propagated lifted version of a scalar function like this:

map(log., x) # returns Array{DataValue{Float64},1}

Questions

Are there any other use cases that I have missed? I'm sure I have, but this proposal seems to handle all the issues that I remember having seen in the various debates about this.

I would think that x ..== x and log..(x) are controversial. I'm not sure myself, but it seems to me that there are arguments pro and con for both of these. ..== is ugly, but it also is a bit of a corner case. log..(x) I actually kind of like, at least it is very explicit that you are vectorizing and lifting. With both of these my sense is that we would need to actually have people try these and see whether folks like it or not to really come to a conclusion whether this is good or bad.

Again, I'm sure I missed use cases, so please point them out!

That's what John Myles White was describing in his Julep.

I think that was a great draft, but I never saw a clear and complete description of the lifting story spelled out. In my mind that is the part where things get complicated.

It's unlikely that automatic lifting will be a thing, as Jeff Bezanson is firmly opposed to it. It's intended that the functionality being explored in this package will be moved into Base eventually, at which time any function in Base that can deal with null values will be overloaded as such. Jeff's idea is then that not providing methods that handle missing values is a way of saying, "This function doesn't have a natural way to deal with missing values." It's expected that package authors would follow suit with their functions.

I don't understand. Is this going back to the whitelist approach to lifting?

I consider the container-based approach to be somewhat of a failed experiment. It was good that it was explored, but there's a reason why thinking has shifted back to the DataArrays-style approach.

Why? It solves pretty much every problem for Query.jl, and I think the proposal I made above could work for arrays of missing values. On the flip side, at least with the current idea, Union{T,Null} does not seem to work for Query.jl at all. Maybe we can figure it out, but that seems to be the current situation...

Nullable and DataValue both suffer from what John's Julep called the "counterfactual return type problem". Even with automatic lifting for standard operators, that approach is annoying because of the need to do f.(x) all the time since you want lifting in 90% of the cases. Then there's the issue of double dots.

So even if the need to add functions to a white list isn't ideal with the Union{T, Null} approach, it works better in practice, as the experience with DataArray shows (as opposed to Nullable about which we get many complaints). And maybe one day we will be able to convince Jeff...

Nullable and DataValue both suffer from what John's Julep called the "counterfactual return type problem".

Can you explain how that would show up? Doesn't the current . lifting implementation get around that just fine?

So even if the need to add functions to a white list isn't ideal with the Union{T, Null} approach, it works better in practice, as the experience with DataArray shows (as opposed to Nullable about which we get many complaints). And maybe one day we will be able to convince Jeff...

Why would white listing be easier with Union{T,Null} approach than with DataValue? If white listing is on the table again, why can't we just add white listed methods that operate on DataValue?

Can you explain how that would show up? Doesn't the current . lifting implementation get around that just fine?

I can't explain it better than the Julep, really.

Why would white listing be easier with Union{T,Null} approach than with DataValue? If white listing is on the table again, why can't we just add white listed methods that operate on DataValue?

It wouldn't be easier, but if we have to whitelist, then why bother with the broadcasting approach? Why not use a standard Array{Union{T, Null}}, which has the advantage of simplifying/converting to Array{T} (without even the need for a copy) when there are no missing values?

I can't explain it better than the Julep, really.

I'm not sure I understand your point. Are you saying that the current . lifting for Nullable in base is broken? If not, I don't see the problem. Yes, it does use type-inference, but that seems to work just fine?

It wouldn't be easier, but if we have to whitelist, then why bother with the broadcasting approach? Why not use a standard Array{Union{T, Null}}, which has the advantage of simplifying/converting to Array{T} (without even the need for a copy) when there are no missing values?

Well, for one we know that the container based approach works for Query.jl and the Union{T,Null} approach doesn't as of right now (unless we find a solution to #6). If we do go with white listing, the container approach also seems to work for every other use case at least as well as the Union approach, right? Or are there known problems for container based storage with white listing that the Union{T,Null} approach would solve?

Plus, we could keep all the investment in the various packages that have made a move towards a container based solution and wouldn't have to start from scratch yet again?

We aren't starting from scratch. This approach is basically optimized, cleaned up DataArrays, which is one of the oldest packages AFAIK.

I'm not sure I understand your point. Are you saying that the current . lifting for Nullable in base is broken? If not, I don't see the problem. Yes, it does use type-inference, but that seems to work just fine?

Sure, it works, but we still need to use hacks calling Base.return_type, and if we fail to get a concrete type, we return Nullable{Union{}}. This makes it kind of pointless to have the type parameter if we can't rely on it in general.
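The issue can be sketched as follows. `Maybe`, `lift` and `half` are hypothetical names for this illustration (`Maybe` stands in for Nullable/DataValue), and `Base.promote_op` is an internal, inference-based helper used here only to show where the guess comes from. When the input is empty, `f` is never called, so the type parameter of the result has to come from inference rather than from an actual value.

```julia
# Hypothetical minimal wrapper standing in for Nullable/DataValue.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)     # empty: the value field is left unset
    Maybe{T}(x) where {T} = new{T}(true, x)
end
Maybe(x::T) where {T} = Maybe{T}(x)

# Lift f over the wrapper.
function lift(f, x::Maybe{T}) where {T}
    if x.hasvalue
        return Maybe(f(x.value))             # type comes from the actual result
    else
        # f is not called, so the "counterfactual" result type is asked of inference.
        # The answer may be abstract (or Union{}) when inference cannot pin it down.
        R = Base.promote_op(f, T)
        return Maybe{R}()
    end
end

half(x) = x / 2

full  = lift(half, Maybe(3))       # f ran; the wrapper holds 1.5
empty = lift(half, Maybe{Int}())   # f did not run; type parameter is inferred
```

For a simple function like `half` inference gives a concrete answer, but nothing guarantees that in general, which is the reliability concern raised above.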

Well, for one we know that the container based approach works for Query.jl and the Union{T,Null} approach doesn't as of right now (unless we find a solution to #6). If we do go with white listing, the container approach also seems to work for every other use case at least as well as the Union approach, right? Or are there known problems for container based storage with white listing that the Union{T,Null} approach would solve?

Plus, we could keep all the investment in the various packages that have made a move towards a container based solution and wouldn't have to start from scratch yet again?

How many packages? DataFrames, CategoricalArrays, DataStreams (and associated packages), and Query (which uses DataValue, so it's not plain Nullable already). Many other packages haven't followed the move because they found it too painful, so it's not like we had to rewrite lots of code. Actually, for CategoricalArrays, I rather think this will make the code simpler.

Are there three options on the table right now?

  1. Union{T,Null} with white list lifting.
  2. Nullable (or DataValue) with white list lifting.
  3. Nullable (or DataValue) with some white list lifting and . lifting (essentially what I outlined above).

Purely in terms of user experience, are there any arguments that would favor 1 over 2, or the other way around? I don't see anything in the discussion so far that would favor 1, but at least one that favors 2 (Query.jl doesn't work with 1 unless we find a solution for #6 ), but happy to be corrected. 3 seems open to me, I'm just not sure. There is an argument to be made that all the dots get annoying, but there is also an argument that having the dots actually makes things more explicit, which could be beneficial because it might avoid bugs. I think to really say anything about the merits of 3 over 1 or 2 we would have to implement it, test it out a while with users and see what feedback we get.

In terms of project management complexity and the likelihood that we will have something for julia 1.0 that works, 2 seems the clear winner. If we go with DataValue we need nothing new in julia base for this to work, which to me is a very strong argument in favor. Both 1 and 3 need changes in julia base to work fully. I'm not an expert to really tell whether the changes in base required for 1 or 3 are more complicated, but my gut feeling is that changes to the parser (lowering?) should be much simpler than the changes required for the Union{T,Null} story.

Am I missing something? There must be some more concrete argument in favor of Union{T,Null} than "it feels more natural", right? I would really appreciate if someone could spell that out for me.

Some more detailed responses on other points raised:

Why not use a standard Array{Union{T, Null}}, which has the advantage of simplifying/converting to Array{T} (without even the need for a copy) when there are no missing values?

The same is true for NullableArray, right?

Sure, it works, but we still need to use hacks calling Base.return_type, and if we fail to get a concrete type, we return Nullable{Union{}}. This makes it kind of pointless to have the type parameter if we can't rely on it in general.

This looks like a trade-off to me: certainly Union{T,Null} is also not exactly a general approach if it has to rely on white listing for lifted versions. Plus, you can always add white listed methods even if you are using . lifting, for cases where the . lifting doesn't work.

How many packages? DataFrames, CategoricalArrays, DataStreams (and associated packages), and Query (which uses DataValue, so it's not plain Nullable already). Many other packages haven't followed the move because they found it too painful, so it's not like we had to rewrite lots of code.

There is also ReadStat, TypedTable, StatsModels, TextParse, RCall and IterableTables.

@quinnj Just out of curiosity, would there be a benefit purely for the DataStreams ecosystem from Union{T,Null} over a container type?

My initial impression is yes, and I'm actually starting to work on branches across DataStreams, CSV, and DataFrames to do so to experiment with. I was planning on writing up more thoughts tomorrow as I have a big presentation tomorrow morning, so I'll try to collect some more thoughts and share after that's over with.

Are there three options on the table right now?

  1. Union{T,Null} with white list lifting.
  2. Nullable (or DataValue) with white list lifting.
  3. Nullable (or DataValue) with some white list lifting and . lifting (essentially what I outlined above).

Purely in terms of user experience, are there any arguments that would favor 1 over 2, or the other way around? I don't see anything in the discussion so far that would favor 1, but at least one that favors 2 (Query.jl doesn't work with 1 unless we find a solution for #6 ), but happy to be corrected. 3 seems open to me, I'm just not sure. There is an argument to be made that all the dots get annoying, but there is also an argument that having the dots actually makes things more explicit, which could be beneficial because it might avoid bugs. I think to really say anything about the merits of 3 over 1 or 2 we would have to implement it, test it out a while with users and see what feedback we get.

In terms of user experience, it's so much nicer that a[i] returns a normal scalar value when there are no nulls rather than a wrapper. See for example this thread and this issue.
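A small sketch of the difference being described, using a hypothetical `Null` type and a hypothetical `Wrapped` container (a stand-in for Nullable/DataValue, not either package's real type):

```julia
# Hypothetical stand-ins for illustration only.
struct Null end
const null = Null()

struct Wrapped{T}
    hasvalue::Bool
    value::T
end

u = Union{Int,Null}[1, null, 3]
w = [Wrapped(true, 1), Wrapped(false, 0), Wrapped(true, 3)]

from_union   = u[1] + 1         # the element is already an Int
from_wrapper = w[1].value + 1   # the element has to be unwrapped first

# Skipping nulls is ordinary filtering on the element type.
total = sum(x for x in u if !(x isa Null))
```

With the union representation, code written for plain `Int` values applies to non-null elements unchanged; with the wrapper, every access goes through the container even when nothing is missing.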

As you say, there's basically only one argument in favor of using a wrapper, it's #6. All other advantages of wrappers (lifting via broadcast...) can also be applied to Union{T, Null} if we want to. So we should work on finding a solution to #6, not arguing over the whole design because of this single issue.

In terms of project management complexity and the likelihood that we will have something for julia 1.0 that works, 2 seems the clear winner. If we go with DataValue we need nothing new in julia base for this to work, which to me is a very strong argument in favor. Both 1 and 3 need changes in julia base to work fully. I'm not an expert to really tell whether the changes in base required for 1 or 3 are more complicated, but my gut feeling is that changes to the parser (lowering?) should be much simpler than the changes required for the Union{T,Null} story.

I'm not an expert either, but @vtjnash seems to consider this as doable, and he already has branches improving the performance of Union. Anyway everything is already there in Julia 0.6 to make Union{T, Null} work; the needed changes are "only" to make them fast. That difference matters since it means we can work on it right now and even stabilize the API by Julia 1.0, even if everything isn't as fast as it will eventually be.

Am I missing something? There must be some more concrete argument in favor of Union{T,Null} than "it feels more natural", right? I would really appreciate if someone could spell that out for me.

I think you greatly underestimate the work required to get Nullable/DataValue-based framework to suit everybody's needs. As the links I gave above show, we're very far from providing a nice user experience. Maybe it works for Query.jl, but it doesn't for most other cases. Have a look at the list of NullableArrays issues to get a feel of the remaining problems (for example this one). And most of the packages you cite had (and still have for some of them) support for DataArrays (i.e. Union{T, NAtype}), so support for Union{T, Null} can be added quite easily.

I looked at all three examples you linked to, and they all seem like cases where the current Nullable with neither white listing nor nested . broadcasting approach doesn't work. But as far as I can tell both my proposal 2 and 3 would solve these cases entirely. So I still don't see a good argument why the Union{T,Null} approach would be any better. I'll walk through all three examples below.

All other advantages of wrappers (lifting via broadcast...) can also be applied to Union{T, Null} if we want to.

Ah, good. If that is possible, I guess we can split the discussion into a) what data structure is better and b) is white listing or . broadcasting better.

Anyway everything is already there in Julia 0.6 to make Union{T, Null} work; the needed changes are "only" to make them fast. That difference matters since it means we can work on it right now and even stabilize the API by Julia 1.0, even if everything isn't as fast as it will eventually be.

This strikes me as a really risky plan for julia 1.0. What if the optimizations don't make it into julia 1.0 for one reason or another? I would much prefer a strategy where any work on introducing Union{T,Null} into the package ecosystem is done after the work in julia base is finished. A lot obviously depends on the timeline for julia 1.0, which is a mystery to me at this point :)

I think you greatly underestimate the work required to get Nullable/DataValue-based framework to suit everybody's needs. As the links I gave above show, we're very far from providing a nice user experience. Maybe it works for Query.jl, but it doesn't for most other cases. Have a look at the list of NullableArrays issues to get a feel of the remaining problems (for example this one).

I looked through all of these and I saw lots of examples that can't be handled if we use a container based approach and rule out white listing or nested . broadcasting, but I haven't seen anything that wouldn't be solved by either of these two approaches.

My current thinking about all of this is that the container based approach is so unusable right now because the lifting approach taken in NullableArrays.jl really doesn't work. What we have right now is a half-way attempt at white listing functions, and then a (in my opinion failed) approach to try to automatically lift things in higher order functions like map etc. I completely agree with everyone that this leads to terrible usability. But I think an approach to containers that solves the lifting problem for the scalar case and then properly composes that with the vectorization story can work just fine. I think both proposal 2 or 3 from above would fit that bill, or at least I haven't seen a counter example.

The three examples

I just created a new package, DataValueOperations.jl, that provides white listed lifting for the DataValue type from DataValues.jl. So essentially, if you want to try proposal 3 (sans the nested . broadcasting) you can write using DataValues, and if you want to try proposal 2 you can write using DataValueOperations. In both cases one uses the DataValue type as a drop-in replacement for Nullable.

First example

The first example you linked to was this. Here is how that looks with the white listed DataValue approach:

using DataValueOperations

a, b = ?("14:00:00"), ?("15:15:00")

Dates.value(DateTime(a,"HH:MM:SS") - DateTime(b,"HH:MM:SS"))
c = DateTime(a,"HH:MM:SS") - DateTime(b,"HH:MM:SS")

For this to work I needed to define white listed methods for DateTime and -. As far as I can tell you would have to do the same for the Union{T,Null} case for this to work properly.

Here is how it looks with the . lifting approach

using DataValues

a, b = ?("14:00:00"), ?("15:15:00")

Dates.value.(DateTime.(a,"HH:MM:SS") .- DateTime.(b,"HH:MM:SS"))
c = DateTime.(a,"HH:MM:SS") .- DateTime.(b,"HH:MM:SS")

Second example

The second example you linked to was this. Essentially madeleineudell at one point said that if someone provided either white listed functions or . lifting, things would work for her use-case.

There was also the question about masking in that link. Here is how that looks with the white listed approach or the nested . broadcast proposal:

using DataValueOperations

a = [?(3.), ?(2.), ?(5.)]

a[a .> 2.]

Third example

The third example you linked to was this. Again, this works just fine with either approach 2 or 3, here is how it would look:

using DataValueOperations

A = [?(9), ?(8), ?(15)]

map(i->isnull(i) ? false : get(i) % 3 == 0, A)

f(i) = isnull(i) ? false : get(i) % 3 == 0

f.(A)

All these arguments only prove that containers are not worse in these examples than Union{T, Null}. But so far the only argument against Union{T, Null} is #6, which we haven't really investigated yet. That's not enough to compensate for the big advantage of having Array{Union{T, Null}} replace NullableArray without sacrificing performance. With containers, the memory layout of Array{DataValue{T}} would be a sequence of (value::T, isnull::Bool), which cannot be converted to Array{T} without making a copy.

Regarding the contingency plan, it's easy: continue working with DataFrames and DataArrays. Another advantage of Union{T, Null} is that it would lead to a very similar API, which could be made even more similar by replacing NAtype with Null.

Maybe we won't convince you, but these discussions have been going on for a long time and unless a big issue is discovered in the transition we're going to go ahead with this plan.

The biggest reason that I see for going with Union{T, Null} over Nullable{T} is avoiding the need to wrap. It's super annoying to me to have to wrap and then unwrap things all over the place. I realize this is not really an issue for those who only use high-level query interfaces because everything happens automatically under the hood, but I'm not one of those users. I tend to work at an intermediate layer where IO & data manipulation sometimes have to be streamlined and fine-tuned for performance on huge datasets and trying to use a high-level query interface just isn't an option. In those cases, it's much more natural to have a simpler representation of the data as a Union{T, Null}. In my mind, it's one less thing I have to force on my data. With Union{T, Null}, I know I'm getting the actual value T, or a Null value, that's it. Period. With Nullable{T}, I have to mentally put in effort up front to wrap values, which can be very frustrating in cases where I know there are no null values. For me, it's much easier to see code switching from Vector{Union{T, Null}} to Vector{T} than to do the same kind of conversion with a NullableArray. In those cases, I get an even bigger win by being able to operate directly on Vector{T}.

I'm happy to help dig into DataValues/Query and better understand what kind of changes would be required; I've currently made progress in getting an end-to-end CSV.read(file, DataFrame) working, but I still have some work on making it more robust and performant. I'm learning a lot in the process and once I feel like it's in a better state, I'll open an issue w/ Query.jl and we can take a look at things.

What about the following strategy: we try to pursue both approaches for a while. We move DataTables.jl over to use DataValue with full white list lifting. I just created a fork of NullableArrays.jl that is based on DataValue here. It needs a few more cleanups but otherwise is pretty functional. Let's see how far we can get with the container approach on that side. The main difference from the current Nullable approach would be that we can easily implement the white list approach without type piracy concerns (or rather, I have pretty much implemented that already).

At the same time you guys pursue the Union{T,Null} approach. If everything comes together before julia 1.0 for that, fantastic. If we find a solution that makes this compatible with Query.jl, I'll be on board. The natural place for that work seems to be in DataFrames.jl, right?

This is a hedging strategy that should increase the likelihood that we have a performant story ready for julia 1.0. I really just don't want to stop all work on container based approaches at this point because of the Union{T,Null} approach. Once that is further along and it is clear that things will work out with the julia 1.0 schedule and the wider ecosystem, sure, but at this point I feel strongly that it is too early for such a decision.

unless a big issue is discovered in the transition we're going to go ahead with this plan.

Would a fundamental incompatibility of the Union{T,Null} design with Query.jl count as a big issue in your opinion?

Would a fundamental incompatibility of the Union{T,Null} design with Query.jl count as a big issue in your opinion?

Not until it's 100% clear there is no way to get it working. Jacob proposed his help but you don't seem to care.

There's certainly nothing wrong (and nothing stopping) trying both approaches and having a branch of DataTables that uses DataValue. I'm going to start working on a branch for DataFrames that uses Nulls.jl.

I guess I'm wondering if you (David) have any other real concerns with the Union{T, Null} approach? I consider #6 merely a "just need to find the time to figure out the solution" as opposed to some fundamental blocker. I understand that you've already put in some great efforts into Query/DataValue, which is commendable and probably makes you hesitant to want to consider switching to some other notion of nullability. I also recognize that it's not the end of the world if we end up just having two different approaches that we let co-exist in the Julia ecosystem, just as we've come to the conclusion that there are valid use-cases for both Union{T, Null} and Nullable{T} in Base.

Anyway, I'm definitely in the spirit of wanting to collaborate and encourage healthy debates, but also want to avoid discouraging others' ideas or making things personal. I've tried to lay out why exactly I'm not a fan of the Nullable wrapping approach, which is enough to motivate me to try another approach. We can agree to disagree, but I also always have lofty unification goals for the Julia ecosystem.

I guess I'm wondering if you (David) have any other real concerns with the Union{T, Null} approach? I consider #6 merely a "just need to find the time to figure out the solution" as opposed to some fundamental blocker.

I think it is the symptom of a very fundamental problem pretty unrelated to Query.jl. I'll elaborate over in #6.

I understand that you've already put in some great efforts into Query/DataValue, which is commendable and probably makes you hesitant to want to consider switching to some other notion of nullability.

No, I'd be happy to investigate and switch Query.jl over to a different representation of missing values if we can find a solution that works. Also happy to help search for that solution :)

I also recognize that it's not the end of the world if we end up just having two different approaches that we let co-exist in the Julia ecosystem

I think we should experiment with multiple approaches until we fully understand the pros and cons of all of them, but I am certainly game to try to converge eventually to one representation. I just don't think we are there yet. I did get the vibe from various comments here and over on the forums (not from @quinnj, though) that a decision has been made to make the switch, and I'm pushing back against that. Instead, I would like the official message to be "we are experimenting with a new approach (Union{T,Null}) and will make a decision after we have more experience with that approach."

Anyway, I'm definitely in the spirit of wanting to collaborate and encourage healthy debates, but also want to avoid discouraging others' ideas or making things personal.

I completely agree and I should say that I've found every single comment from you (@quinnj) a model example of that spirit. I read over my comments here, and I believe they are all about technical merits, never get personal and I also stressed over and over again that I'm fully in favor of experimenting with this Union{T,Null} approach and would like to join that approach if we can make it work from a technical point of view. Having said that, I do not appreciate comments like "[...] you don't seem to care." That style of communication is unprofessional IMHO.

Having said that, I do not appreciate comments like "[...] you don't seem to care." That style of communication is unprofessional IMHO.

Sorry if it sounded harsh, but I have taken the time to discuss arguments in depth, and when Jacob proposed help (after I did the same several times) you didn't reply to his offer and instead proposed another plan which would require quite some work on our part. I wouldn't qualify this as very professional either. These seemingly endless discussions have already prompted John to abandon his Nullable Julep, which may give you an idea of where I'm coming from. I think we all need to talk less and experiment more; that's the most productive way forward. Nobody is going to release a version of DataFrames/DataTables without being certain that it can work with the whole ecosystem.

@quinnj said:

I'm probably also a little at fault here for the confusion: a while ago now I actually called for an offline discussion on the future of DataFrames/Nullability (probably 6-7 months ago now); I tried to invite the major community players that I knew personally & any core developers that would be willing to discuss. It was a time when things were very unclear with regard to a strong sense of purpose/direction for DataFrames/Nullability, and I felt like having a real-time discussion to hash out the most outstanding issues would help re-invigorate and focus some of the key players/packages. Obviously, those kinds of discussions inevitably leave people out, and you have emerged as one of the key people in this space. On the other hand, I felt like the discussion really served its purpose since it helped clarify (as @nalimilan mentioned) some of the compiler possibilities that most of us had previously written off as "probably not going to be possible or any time soon".

Ahh! This is the piece of information I had been missing in previous discussions - I definitely sensed a shift in consensus here and couldn't find a matching discussion online. (Unfortunately, me trying to get to the bottom of the reasons here was taken the wrong way... sorry)

What I wanted to add here is that one of the conceptual (and practical) difficulties for me with Nullable{T} was that it was a container with either zero or one elements, and it was broadcast and indexed as such. I feel that introducing Null will let Nullable{T} be semantically a length-1 container with either a T or Null, which will make unwrapping and setting them (JuliaLang/julia#21912) much easier in many cases. I'll point out that with both, we can let users choose either approach, with a simple interface for Nulls (say with a comprehensive set of white-listed functions in Base), and manual unwrapping of Nullable (for safety).
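
For readers following along later, the container-versus-scalar distinction can be sketched roughly as follows in current Julia. This is only an illustration under stated assumptions: it uses `missing`/`Missing` from Base (the successor to `Null` from Nulls.jl), and `Wrapped` and `unwrap` are hypothetical stand-ins for `Nullable` and its unwrapping, not the actual Base API of the time.

```julia
# Hypothetical stand-in for a Nullable-style wrapper: holds zero or one value.
struct Wrapped{T}
    hasvalue::Bool
    value::T
    Wrapped{T}() where {T} = new{T}(false)      # empty: `value` is left unset
    Wrapped{T}(v) where {T} = new{T}(true, v)   # holds `v` (converted to T)
end

# Hypothetical helper: explicit unwrapping with a fallback for the empty case.
unwrap(w::Wrapped, default) = w.hasvalue ? w.value : default

# Container approach: every computation has to unwrap (or lift) explicitly.
xs_wrapped = [Wrapped{Int}(1), Wrapped{Int}(), Wrapped{Int}(3)]
total_wrapped = sum(unwrap(w, 0) for w in xs_wrapped)

# Scalar approach: the missing value is an ordinary element of the array.
xs_union = Union{Int, Missing}[1, missing, 3]
doubled = xs_union .* 2                   # `missing` propagates through `*`
total_union = sum(skipmissing(xs_union))  # skip missing entries explicitly
```

In the wrapper version, arithmetic is not defined on `Wrapped` itself, so each call site decides how to unwrap; in the union version, ordinary functions see either a plain `Int` or `missing`.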

Closing as I think this discussion has run its course. The TL;DR is that Union{T, Null} is a simpler and more convenient approach than Nullable{T} and has received a broad consensus and validation from package devs and core Julia developers alike in terms of past performance and usability concerns.

has received a broad consensus and validation from package devs

I know I'm repeating myself, but I don't agree with this assessment. I've always said that if someone finds a solution for how Query.jl can work with Nulls.jl I'd be thrilled, but I haven't seen one and haven't been able to come up with one myself (not for a lack of trying).