LibertyDSNP/parquetjs

Are bloom filters supported on LIST types?


Hi there 👋

Firstly, thank you for this amazing library!

I'm curious to know how to add bloom filters to LIST types.

For example, given this schema:

{
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" }
            }
          }
        }
      }
    }
  }
}

How do you add a bloom filter for the querystring.list.element.key field?

[
  {
    column: "querystring.list.element.key",
    numDistinct: 100
  }
]

I assume the above won't work? (Sorry in advance if that literally is how you do it!)

Thanks in advance!

Hi @ljwagerfield !

I've been looking into this. It might work, but I think there may also be a bug in the column naming that is causing issues.

Your guess at how it might work is approximately how I think it should work (or close to it).

So I think this is a bug: the library's bloom filter support does not currently handle nested fields at all, although most of the pieces are in place for it to do so.

For someone who wants to work on making this possible here are some notes:

  • writeBloomFilters needs to handle all possible nested "columns", not just the top-level ones. getColumn in reader.ts is a good example of descending into the nested groups.
  • Currently, writeBloomFilters writes a filter if opts.bloomFilters has a column named key or value, instead of respecting the full path querystring,list,element,key.
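
To illustrate the two notes above, here is a minimal, library-independent sketch of walking a nested schema and matching bloom filter options against full column paths. The `SchemaNode` type and the `leafColumnPaths` and `matchBloomFilterColumns` helpers are hypothetical names made up for this example, not part of parquetjs, and the real writeBloomFilters would probably look different.

```typescript
// Hypothetical, simplified schema node (not the library's actual type):
// a node with `fields` is a group; anything without `fields` is a leaf column.
interface SchemaNode {
  type?: string;
  repeated?: boolean;
  fields?: Record<string, SchemaNode>;
}

// Recursively collect the full path of every leaf column.
function leafColumnPaths(
  fields: Record<string, SchemaNode>,
  prefix: string[] = []
): string[][] {
  const paths: string[][] = [];
  for (const name of Object.keys(fields)) {
    const node = fields[name];
    const path = prefix.concat(name);
    if (node.fields) {
      // Group: descend into the nested fields.
      paths.push(...leafColumnPaths(node.fields, path));
    } else {
      // Leaf: record the full path.
      paths.push(path);
    }
  }
  return paths;
}

// Keep only the requested columns whose full, comma-joined path names a leaf,
// so a bare "key" no longer matches a nested column.
function matchBloomFilterColumns(
  fields: Record<string, SchemaNode>,
  requested: string[]
): string[] {
  const leaves = leafColumnPaths(fields).map((p) => p.join(","));
  return requested.filter((column) => leaves.indexOf(column) !== -1);
}

// The schema from this issue, as a plain object.
const schemaFields: Record<string, SchemaNode> = {
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" },
            },
          },
        },
      },
    },
  },
};

console.log(matchBloomFilterColumns(schemaFields, ["querystring,list,element,key"]));
console.log(matchBloomFilterColumns(schemaFields, ["key"]));
```

With this approach the first call keeps the fully qualified column, while the bare name key matches nothing, which is the behavior the second note asks for.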

Here is a simple setup for what "should" work, but doesn't, due at least in part to the notes above about writeBloomFilters.

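// Assumes the library is installed as the @dsnp/parquetjs package.
import parquet from "@dsnp/parquetjs";

const main = async () => {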

  const file = "parquet-testing/issue-98.parquet";

  const schema = new parquet.ParquetSchema({
    querystring: {
      type: "LIST",
      fields: {
        list: {
          repeated: true,
          fields: {
            element: {
              fields: {
                key: { type: "UTF8" },
                value: { type: "UTF8" }
              }
            }
          }
        }
      }
    }
  });

  try {
    const writer = await parquet.ParquetWriter.openFile(schema, file, {
      bloomFilters: [
        {
          column: "querystring,list,element,key",
        },
      ],
    });

    await writer.appendRow({
      querystring: {
        list: [
          { element: { key: "foo", value: "bar" } },
          { element: { key: "foo2", value: "bar2" } },
        ],
      },
    });
    await writer.close();
  } catch (error: any) {
    console.log("I'm in the write catch!", error)
  }

  try {
    const reader = await parquet.ParquetReader.openFile(file);
    const cursor = reader.getCursor();
    console.log("row", await cursor.next());
    const metadata = reader.getMetadata();
    console.log("metadata", metadata);
    const bloomFilters = await reader.getBloomFiltersFor(["querystring,list,element,key"]);
    console.log("bloomFilters", bloomFilters);
  } catch (error: any) {
    console.log("I'm in the read catch!", file, error)
  }
}

main()

Aha, interesting!

So the Parquet specification does support bloom filters on lists (I wasn't even sure of this), and this library is close to implementing that.

That's awesome!

I don't have any spare cycles at present (or the Parquet knowledge!) to contribute, unfortunately, so am happy if you want to close.

Great to know, though, and thanks again!

I'll leave it open for now. With the quick test I wrote, I might be able to get a fix in if I can get the time. Others might want to pick it up as well.

Awesome, thanks so much! 🚀