datajoint/datajoint-matlab

perform joins on primary key attributes only

Closed this issue · 7 comments

DataJoint only supports natural joins. In the vast majority of cases, the joins are performed on the primary key of one of the operands.

Starting in version 2.9.0, I propose to designate the * (mtimes) operator as the primary join. The primary join performs the natural join using common attributes that are included in the primary keys of at least of the operands. If a dependent attribute shares the same name in both relations, an error will be issued prompting the user to project out or rename the offending attribute.

This will allow users to have identically named generic dependent attribute names such as 'comment' or 'timestamp' without causing poorly detectable bugs.

The operator .* (times) will be turned into the full join, which is the regular natural join on all identically named attributes, as implemented by * now.

Warning to old DataJoint users: .* was used as the semijoin operator in DataJoint 1. Since DataJoint 2.0, semijoin is implemented as the & operator but .*remained for backward compatibility. Several recent versions have been showing error messages when .* to allow reclaiming this operator for other purposes.

Explanation

Dependencies between data are implemented as foreign keys referencing primary keys of parent tables. In most cases, joins are performed between directly or indirectly dependent relations. The join is always performed on primary key attributes. This change will help avoid unintended inclusions of dependent attributes in the join.

Cases when a join needs to be performed on purely dependent attributes are outside the prescribed design patters prescribed by DataJoint. I cannot think of a case when such joins might be necessary. To still allow such joins, we implement the full natural join that joins on all identically named attributes.

I'm not sure whether I like that idea. How would you handle non-primary, dependent attributes (i.e. -> below the divider)? My understanding of your suggestion is that in such a case one would have to use .* instead of * if the table that has the attribute as PK is not included in the join. Agreed, these cases are probably not too frequent, but they're equally subtle and hard to catch as two attributes that accidentally have the same name.

While the frequency with which either of the two issues arises is probably a matter of how one uses DJ, I think your suggested change has one big disadvantage: one has to think about which join to use potentially every time one does a join (i.e. very frequently). With the current scheme, in contrast, one has to be careful only when designing the tables, which happens much less frequently.

In summary, it's (1) a pretty major change, (2) introduces a new potential problem that is hard to identify and my be unexpected for many users, and (3) has the potential for silly bugs on old code, where .* is still used as semijoin.

In the cause you described, the join will be correctly done by the primary join operator * because all attributes to be joined on will be in the primary key of the referenced base relation.

All the cases I have seen in practice will be addressed correctly by the * operator without any downsides. The advantage of this change is to preclude inadvertent inclusion of other attributes in the join. So, you can basically forget that .* even exists. Maybe we shouldn't even include it.

To further address your concern (2), the problem is not hard to identify since the primary join * will throw an error if any of the dependent attributes in the joined relations have identical names. Concern (3) only pertains to users who started using DataJoint 1.0 back before 2011, which I think was three people total, besides me and you. We can make datajoint 2.9 throw an error for using .*. We can wait with implementing the full join .* for a few versions since I cannot think of any practical use for it other than for completeness.

Ok, I didn't realize .* was already throwing an error. So that means our old code for processing ephys data doesn't use .*.

And throwing an error when there are matching non-PK attributes and * is used sounds reasonable.

Also, the case I thought could happen doesn't make sense, since if both dependent tables have the same attribute as non-PK, then joining those two should either work properly on the PK or it doesn't make sense.... Never mind.

So go ahead – sorry for the confusion!

This still leaves the question of semjoins and antijoins. Should they also be performed on primary key attributes only? I think they should be. Maybe this can be controlled by a flag that can be turned on or off using dj.set.

Hmm, that's a bit of an issue. I think there are two options.

  1. We don't ever need to include non-key attributes in joins, then we can just get rid of it for all types of joins and don't even use .*
  2. We do need it sometimes, then we need the capability for all types of joins, which can be done in one of two ways:
    • Leave things the way they are at the moment and force users to use unique names (I don't think it's a big problem)
    • Make joins on key attributes the default and provide implementations of the full join. Since we don't have .- and .& as operators, the only way I see is a function syntax like (semi/anti)join(A, B) (equivalent to A * B, A & B, A - B) and (semi/anti)join(A, B, 'full'), where the latter includes all attributes. Using dj.set doesn't seem like a good idea to me, since making code dependent on the user's environment sounds a lot like asking for trouble.

Yes, I am inclined toward 1. On the future branch now, the there is no .*. I cannot think of cases when it would ever be necessary. Therefore, I am thinking to update the semjoin and antijoin to match the join behavior.

OK, let's go for 1. But let's have a few people test those changes on the future branch before we commit to it.