polarsignals/frostdb

What is the state of schema v1alpha2?

gernest opened this issue · 18 comments

parca uses v1alpha1 and so does frostdb. Since v1alpha2 is not used, we can remove it for now; we can always bring it back when we have a use for it.

By "not used" in frostdb I mean that conversions are made back to v1alpha1 before the schema is consumed.

I wanted to introduce a binary type for []byte columns and realised I had to duplicate the effort.
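For illustration only, here is a minimal sketch of the kind of conversion shim and duplicated definition described above; the column types below are hypothetical stand-ins for the generated v1alpha1/v1alpha2 schema types, not frostdb's actual API:

package main

import "fmt"

// Hypothetical stand-ins for the generated schema proto types; only the
// fields needed for this sketch are included.
type v1alpha1Column struct {
	Name    string
	Type    string // no binary type here, so []byte columns need a workaround
	Dynamic bool
}

type v1alpha2Column struct {
	Name string
	Type string // a new "binary" type added here also has to be mapped below
}

// toV1Alpha1 sketches "conversions are made back to v1alpha1 before being
// consumed": every feature added to v1alpha2 must also be expressible in
// v1alpha1, which is the duplicated effort mentioned above.
func toV1Alpha1(c v1alpha2Column) v1alpha1Column {
	return v1alpha1Column{Name: c.Name, Type: c.Type}
}

func main() {
	fmt.Println(toV1Alpha1(v1alpha2Column{Name: "stacktrace", Type: "binary"}))
}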

We want to switch to v1alpha2 ASAP. We just haven't had the bandwidth for it. I think we definitely don't want to delete it.


@asubiotto So, is it okay if I phase out v1alpha1? Or is the plan to maintain both in tandem?

Assuming I have the bandwidth, what course of action will align with your roadmap?

  • Phase out v1alpha1 completely, submit relevant patch to parca as well.
  • Freeze v1alpha1 and add new features to v1alpha2; keep v1alpha1 indefinitely.
  • Maintain both but make v1alpha2 primary.
  • ... anything else ?

I would say freeze v1alpha1 and add new features to v1alpha2. There are more projects than just parca relying on frostdb schemas, so it should be a slow deprecation. I think @thorfour has more context on what the plan here should look like since he wrote v1alpha2.

v1alpha2 will work like magic with our generic record builder, because it can automatically generate a v1alpha2 schema from the arrow.Record that is being inserted into the table.

There will be some challenges though: we can't possibly have a single static schema like we do now. Due to the nature of dynamic columns, the schema will change as new dynamic columns are added. There is a good chance different LSM levels will have different schemas.

Also, a lot of assumptions in the dynparquet package will have to change.

For this to work we need to attach the v1alpha2 schema to the arrow.Record; this will spill into the parquet files as well.
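As a rough illustration of "automatically generate the schema from the arrow.Record", here is a minimal sketch that just walks the fields of an arrow schema; the column names and the arrow module version in the import path are assumptions, and this is not frostdb's actual record builder:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow" // module version is an assumption
)

// describeSchema derives a simple column listing from the schema carried by
// an arrow.Record (rec.Schema() in practice), instead of relying on a
// hand-written static table schema.
func describeSchema(s *arrow.Schema) []string {
	var cols []string
	for _, f := range s.Fields() {
		cols = append(cols, fmt.Sprintf("%s: %s (nullable=%t)", f.Name, f.Type, f.Nullable))
	}
	return cols
}

func main() {
	// Hypothetical example schema; real records would carry their own.
	s := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.PrimitiveTypes.Int64},
		{Name: "value", Type: arrow.PrimitiveTypes.Int64},
		{Name: "labels.region", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)
	for _, c := range describeSchema(s) {
		fmt.Println(c)
	}
}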

I would say freeze v1alpha1 and add new features to v1alpha2.

I think this is hard (for me at least; I kind of tried it already). The way v1alpha2 treats dynamic columns requires you to know them beforehand.

My impression of v1alpha2 was that we were moving away from dynamic columns. If we still support dynamic columns, then feature parity between the two needs to exist in some form. There are no dynamic columns in v1alpha2.

I think func Test_ParquetToArrowV2(t *testing.T) illustrates an example with dynamic columns in v1alpha2.


@asubiotto

Nope, it does not. That case is for nested columns. There is a difference between nested columns and dynamic columns from a schema perspective.

  • nested columns are leaf nodes; they exist in the schema
  • dynamic columns are not in the schema; only the group or root node of the soon-to-materialise columns exists in the schema, e.g. labels

A group node without children is not a dynamic column.
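To make the distinction concrete, here is a small arrow-schema analogy (a sketch with made-up label names, not frostdb code): nested columns have leaves that are declared up front, while dynamic columns only materialise concrete leaves such as labels.region at write time, so they cannot be enumerated in a static schema.

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow" // module version is an assumption
)

func main() {
	// Nested column: the leaves ("labels.region", "labels.zone") are known and
	// exist in the schema as children of a struct/group node.
	nested := arrow.NewSchema([]arrow.Field{
		{Name: "labels", Type: arrow.StructOf(
			arrow.Field{Name: "region", Type: arrow.BinaryTypes.String, Nullable: true},
			arrow.Field{Name: "zone", Type: arrow.BinaryTypes.String, Nullable: true},
		)},
	}, nil)

	// Dynamic columns: only the "labels" root is declared in the table schema;
	// the concrete columns show up per part at write time, e.g. one part has
	// "labels.region" while another has "labels.pod".
	partA := arrow.NewSchema([]arrow.Field{
		{Name: "labels.region", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)
	partB := arrow.NewSchema([]arrow.Field{
		{Name: "labels.pod", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)

	fmt.Println(nested)
	fmt.Println(partA)
	fmt.Println(partB)
}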

I have double-checked SchemaFromDefinition and there is no part where v1alpha2 sets (*ColumnDefinition).Dynamic = true.

I could be wrong though; maybe there is an implicit assumption that I missed?

Please be patient with my dumb questions and mediocre explanations. I just want to understand how things work; there is a limit on what I can grok without context.

Isn't it equivalent for all intents and purposes? The test defines the dynamic columns the following records will use. I might be missing something though; I think this is something @thorfour knows a lot more about.


Mmmh! I'm confused now. If we need to define columns before using them, are they still dynamic? As I understand it, in v1alpha1 we don't define them; instead we describe them by marking the dynamic field.

Okay, so schema v2 was created because we wanted a way to define arbitrarily nested structs in FrostDB, which is impossible with v1 since it only supports flat schemas. The goal of v2 is to become a superset of v1 so that v1 could be ripped out. We never moved to v2 because I never got the recursive conversion for arbitrary nesting working, so it's been languishing (I'm not even certain it still works).

That said... my dream for FrostDB is to remove the idea of pre-defined schemas as a requirement entirely. We already give FrostDB schemas when we write to it (via arrow record schemas). So it should actually be smart enough to just determine the schema at compaction time based on the arrow or parquet schemas it's compacting (it already kind of does this; it just only really looks for dynamic columns, when it could merge all unique columns into a schema).

And if a user wanted to define a specific storage definition for a column, that could still be a table option they define, but otherwise compaction would use some sort of "sane defaults". So all this is to say: I think I'd rather not pursue "fixing" v2, and instead work towards deprecating schemas entirely.
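A minimal sketch of the "merge all unique columns" idea at compaction time, under the assumption that name collisions have compatible types (conflict resolution is left out); this is not frostdb's actual compaction code:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow" // module version is an assumption
)

// mergeSchemas builds the output schema as the union of all unique columns
// across the schemas of the parts being compacted, rather than relying on a
// pre-defined table schema.
func mergeSchemas(schemas ...*arrow.Schema) *arrow.Schema {
	seen := map[string]bool{}
	var fields []arrow.Field
	for _, s := range schemas {
		for _, f := range s.Fields() {
			if !seen[f.Name] {
				seen[f.Name] = true
				fields = append(fields, f)
			}
		}
	}
	return arrow.NewSchema(fields, nil)
}

func main() {
	a := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.PrimitiveTypes.Int64},
		{Name: "labels.region", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)
	b := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.PrimitiveTypes.Int64},
		{Name: "labels.pod", Type: arrow.BinaryTypes.String, Nullable: true},
	}, nil)
	fmt.Println(mergeSchemas(a, b))
}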

Thanks, it makes sense now. v2 is indeed out of place. I also believe we can move away from schemas, because both arrow and parquet support metadata, which means all the information needed to convert/merge between the two can be part of the data itself.

It would be awesome: frostdb becoming a storage and query API for arrow.Record.
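For example (a sketch only; the metadata keys below are made up, not an existing frostdb convention), arrow schemas can carry key/value metadata alongside the data:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow" // module version is an assumption
)

func main() {
	// Hints needed for the arrow <-> parquet conversion or merge policy could
	// travel with the data itself as schema metadata; keys are illustrative.
	md := arrow.NewMetadata(
		[]string{"frostdb.sorting_columns", "frostdb.encoding.labels"},
		[]string{"timestamp", "dictionary"},
	)
	s := arrow.NewSchema([]arrow.Field{
		{Name: "timestamp", Type: arrow.PrimitiveTypes.Int64},
	}, &md)
	fmt.Println(s.Metadata())
}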

@thorfour so, toward your vision, can we start by

  • phasing out v1alpha2? We should leave v1alpha1 in place because it is used and actually works.

I will propose next steps afterwards. I'm kinda short on work to do on frostdb, and I'm embarrassed to ask for assignments.

I think I can handle removing v1alpha2. Going schemaless will take a while and a bunch of design choices to make sure we don't regress in overall ergonomics and UX.

It will be possible for me to do a full audit and chart some kind of outline for a roadmap toward the schemaless goal once v2 is out of the picture.

Yes, we can remove v2 entirely I believe. And if you want to start tackling schemaless that would be cool!

Thanks

I don’t know that I agree. I think there are various reasons a known schema can be useful:

  1. migrations
  2. validation
  3. query optimization
  4. storage layout optimizations (encoding, compression, etc.)

I can see the first-time-use UX of not having to predefine a schema being nice, but for long-tail usage I think we're better off knowing and ensuring records conform to a schema.


Interesting points. I think arguments can be made both for schemas and for schemaless; it just boils down to what works best for the current use case.

  1. My understanding of schemaless isn't that there is no schema at all. As levels get compacted, they collapse and eventually resolve into a single parquet file or arrow.Record, which does have a schema.

  2. migrations: We are not doing migrations now that we have a schema; I'm not sure it will make a big difference without one.

  3. validation: see point 1.

  4. query optimization: see point 1; we will still have access to the arrow.Record and parquet schemas for parts in the LSM tree.

  5. storage layout optimizations (encoding, compression, etc.): these have been covered. I think they will be per-table options that automatically apply to matching column paths (this is what the arrow package does for the arrow -> parquet conversion); see the sketch after this list.
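A sketch of what per-table options applied to matching column paths could look like; the option struct and prefix matching below are hypothetical, not an existing frostdb API:

package main

import (
	"fmt"
	"strings"
)

// columnOption is a hypothetical per-table storage option, keyed by a
// column-path prefix rather than baked into a schema.
type columnOption struct {
	pathPrefix  string // e.g. "labels."
	encoding    string // e.g. "RLE_DICTIONARY"
	compression string // e.g. "zstd"
}

// optionFor picks the first option whose prefix matches the column path;
// anything without a match falls back to "sane defaults".
func optionFor(opts []columnOption, columnPath string) (columnOption, bool) {
	for _, o := range opts {
		if strings.HasPrefix(columnPath, o.pathPrefix) {
			return o, true
		}
	}
	return columnOption{}, false
}

func main() {
	opts := []columnOption{
		{pathPrefix: "labels.", encoding: "RLE_DICTIONARY", compression: "zstd"},
		{pathPrefix: "timestamp", encoding: "DELTA_BINARY_PACKED", compression: "zstd"},
	}
	for _, col := range []string{"labels.region", "labels.pod", "timestamp", "value"} {
		if o, ok := optionFor(opts, col); ok {
			fmt.Println(col, "->", o.encoding, o.compression)
		} else {
			fmt.Println(col, "-> defaults")
		}
	}
}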

Both cases look sensible to me. Personally I like the flat nature of v1alpha1; it's simple to reason about and practical.

Yeah, there isn't "no schema"; the schema is just derived from the records that have been written into FrostDB.
We only use schemas at compaction time, and there's no real reason we need one pre-defined at compaction time when we have the combined schemas from all the things we're compacting to determine the final schema.

Storage layout is the only thing a schema is likely still useful for, since arrow doesn't have a 1:1 storage layout mapping to parquet, so we'd need to come up with a design around that. But that could be a table override option, or even just metadata in the records that hints at what the layout should be.

Can you provide concrete examples of your bullet points @brancz that describe why a pre-defined schema does something that not having one can't?

Okay, we had a discussion around this and agreed that removing the stale v2 is the right path forward, but we are not going to pursue removing pre-defined schema definitions.

So instead of ripping out schema definitions, we should work to replace schema v2 with a new, proper version that correctly defines arbitrarily nested schemas, and move to that in lieu of v1. So basically re-do what v2 was trying to do; we just think it'll be easier to rip it out and rewrite it than to try to make it work with how much FrostDB has changed since it was implemented.