TileDB-Inc/TileDB-Py

Writing sparse arrays with no duplicate coordinates allowed

weidinger-c opened this issue · 4 comments

Hi, I want to read in point cloud data (x, y, z as dimensions and some values as attributes) and keep only unique coordinates; "duplicate" coordinates should be skipped. For that I found the setting "sm.dedup_coords" = "true", but I am not sure how I should apply it when creating a new sparse TileDB array.
Currently I always get:
tiledb.cc.TileDBError: [TileDB::Writer] Error: Duplicate coordinates (280435, 5.86544e+06, 561.929) are not allowed
(these examples did not help: https://github.com/TileDB-Inc/TileDB-Py/blob/dev/examples/config.py)

Additionally, I did not find any way to get or print the current config that is used for a TileDB array. The only thing I could get was the "schema", but not the configuration.

Hi @weidinger-c,

  1. The coordinate uniqueness requirement for a given array is controlled at the schema level by the allows_duplicates parameter to ArraySchema. (note: only applicable to sparse arrays)
  2. The sm.dedup_coords setting is a runtime config option. Here's a demo for how to change it.
import tiledb, numpy as np, tempfile

tiledb.default_ctx({"sm.dedup_coords": "true"})

dims = []
dims.append(tiledb.Dim('X', (0, 1023), 1024, dtype=np.uint32))
attr = tiledb.Attr(name='', dtype=np.uint8)
schema = tiledb.ArraySchema(domain=tiledb.Domain(*dims), attrs=[attr], sparse=True)

uri = tempfile.mkdtemp()
tiledb.Array.create(uri, schema)

with tiledb.open(uri, "w") as A:
    A[np.array([1,2,1])] = np.array([1,2,3])

with tiledb.open(uri) as B:
    print(B[:])

Additionally, I did not find any way to get or print the current config that is used for a TileDB array.

Config options only control runtime behavior and are not stored with the array; defaults apply to every call unless you override them. Configuration options are applied to a "Context", which can be controlled with:

  1. tiledb.default_ctx - can be called once at process start, before any other tiledb calls. Will apply to all calls in the process, except those within a scope_ctx block.
  2. tiledb.scope_ctx - can be called (or nested) via a with block, and will apply the context to all calls inside the block. Here's a demo:
import tiledb, numpy as np, tempfile

dims = []
dims.append(tiledb.Dim('X', (0, 1023), 1024, dtype=np.uint32))
attr = tiledb.Attr(name='', dtype=np.uint8)
schema = tiledb.ArraySchema(domain=tiledb.Domain(*dims), attrs=[attr], sparse=True)

uri = tempfile.mkdtemp()
tiledb.Array.create(uri, schema)

# creates a new context with config applied
with tiledb.scope_ctx(tiledb.Ctx({"sm.dedup_coords": "true"})):
    with tiledb.open(uri, "w") as A:
        A[np.array([1,2,1])] = np.array([1,2,3])

with tiledb.open(uri) as B:
    print(B[:])

Adding the line tiledb.default_ctx({"sm.dedup_coords": "true"}) now results in:

tiledb.cc.TileDBError: Global context already initialized!

I just read that this function needs to be called before any other tiledb function call. In my case, the only earlier call was deleting the old array with:

if tiledb.object_type(tiledb_array_name) == "array":
    tiledb.remove(tiledb_array_name)

I am now able to write the points into the database, but unfortunately the process just exits when I read points back. I will create a separate issue for this.