lancedb/lance

Best way to add nullable column with NULL for existing rows?

tonyf opened this issue · 4 comments

tonyf commented

I want to add a new column to an existing dataset. The new column should be of type string and can be nullable. The existing table should get all NULL values for this column. This could just be a schema update but replace_schema is not implemented. If I use add_columns it will modify all the existing data. Additionally, add_columns infers the dtype from the UDF which means it gets type null which can't be later casted to str

Any tips on how to handle this?

I would think you should still be able to use add_columns. If you want it to always be null, then I don't you need a UDF. A SQL expression would work just fine:

dataset.add_columns(dict(null_str = "cast(null as string)"))

Additionally, add_columns infers the dtype from the UDF which means it gets type null which can't be later casted to str

The batch_udf decorator has an output_schema parameter, btw. So you don't have to use type inference if you want to pass explicit types instead.

tonyf commented

Hm, this is still an expensive operation since every row needs to be updated.

According to the docs

Data fragment. A fragment is a set of files which represent the
different columns of the same rows.
If column exists in the schema, but the related file does not exist,
treat this column as nulls.

Would be great if this column could just be added to the schema instead

If column exists in the schema, but the related file does not exist, treat this column as nulls.

Which docs are this from? I don't think we have this implemented. Given our limited null support in the V1 file format, it didn't yet make much sense to invest much in null support at the table format level.

Hm, this is still an expensive operation since every row needs to be updated.

This isn't as expensive as you might imagine. Unlike other table formats, Lance can split columns across multiple files. So it just needs to add a all null file for each fragment. It doesn't need to rewrite any of the existing columns.

tonyf commented

Found that here https://lancedb.github.io/lance/format.html#fragments

But, makes sense. Plus its just a one-time op so not the end of the world

Thanks!