man-group/ArcticDB

Sort and finalize can reorder the fields in a segment and crash with static schema

Describe the bug

Sort and finalize uses merge_descriptors to generate the field descriptor for the newly added segment. Then, during the merging phase, it creates an Aggregator and strips the fields from the merged descriptor, leaving the aggregator to build the field collection itself. This later causes a crash on write, because the segment's field descriptor differs from the one in the header.

In the reproduction code below, merge_descriptors orders the fields in the field descriptor by order of appearance: index - 0, a - 1, b - 2. The final sorted segment adds rows one by one in index order. The earliest row (2024-01-01) comes from the second staged DataFrame, which only has column b, so b is reported first and gets index 1. Column a is reported afterwards and gets index 2.
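The ordering mismatch can be illustrated with a toy model in plain Python. This is only a sketch: the two helper functions are hypothetical stand-ins written for this illustration, not ArcticDB's merge_descriptors or Aggregator, and lists of column names stand in for real field descriptors.

```python
# Toy model only: plain lists and dicts stand in for ArcticDB's descriptors.
# Both helpers are hypothetical and do not reflect ArcticDB's internals.

def merge_in_order_of_appearance(staged_columns):
    """Collect column names across staged segments, keeping first-seen order."""
    merged = []
    for columns in staged_columns:
        for name in columns:
            if name not in merged:
                merged.append(name)
    return merged


def discover_from_sorted_rows(rows):
    """Collect column names in the order they are first seen while
    adding rows one by one, sorted by their index value."""
    discovered = ["index"]
    for _, row in sorted(rows, key=lambda r: r[0]):
        for name in row:
            if name not in discovered:
                discovered.append(name)
    return discovered


# Columns of the two staged DataFrames, in staging order.
staged_columns = [["index", "a", "b"], ["index", "b"]]

# The same data as (index, present columns) pairs, in staging order.
rows = [
    ("2024-01-02", {"a": 1.0, "b": 22250}),
    ("2024-01-03", {"b": -53979}),
    ("2024-01-01", {"b": -53973}),
]

header_fields = merge_in_order_of_appearance(staged_columns)
segment_fields = discover_from_sorted_rows(rows)

print(header_fields)   # ['index', 'a', 'b']
print(segment_fields)  # ['index', 'b', 'a']
```

The two lists contain the same fields in a different order, which is the kind of disagreement between header and segment described above.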

Steps/Code to Reproduce

import numpy as np
import pandas as pd
import arcticdb

ac = arcticdb.Arctic("lmdb://test")
lib = ac.get_library("test", create_if_missing=True)

idx1 = pd.DatetimeIndex([
    pd.Timestamp("2024-01-02")
])
df1 = pd.DataFrame({
    "a": np.array([1], dtype="float"),
    "b": np.array([22250], dtype="int64")
}, index=idx1)

b = np.array([-53979, -53973], dtype="int64")

idx = pd.DatetimeIndex([
    pd.Timestamp("2024-01-03"),
    pd.Timestamp("2024-01-01")
])

df2 = pd.DataFrame({"b": b}, index=idx)

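# Stage both DataFrames without finalizing them.
lib.write("sym", df1, staged=True)
lib.write("sym", df2, staged=True)
# Sort the staged rows by index and write the final version.
lib.sort_and_finalize_staged_data("sym")
lib.read("sym")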

Expected Results

Sort and finalize should create a segment whose field descriptor matches the one in the header, and it should not throw.

OS, Python Version and ArcticDB Version

Python: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
OS: Windows-10-10.0.22631-SP0
ArcticDB: dev

Backend storage used

No response

Additional Context

No response