yahoo/maha

using druid maha lookups as a replacement for lookups-cached-global

Closed this issue · 4 comments

Hi,
Currently we are using the lookups-cached-global extension for loading lookups in Druid (version 0.12.3). We load lookups from different MSSQL and MySQL servers. We load around 50-100 lookups, of which the top 10 have around 10-15 million entries each. Because the lookups are so large, we are hitting a lot of issues (long GC pauses, queries failing) while loading lookups on historicals and brokers. So I would like to use your extension as a replacement for lookups-cached-global.
Are there any queries that could be affected?
Do you support extracting lookups from MySQL servers?

Of your 50-100 lookups, how many have the same key?

How long does it currently take to load the lookups?

You could convert your lookups to RocksDB-based lookups, where you create new snapshots once a day and publish updates via Kafka. This would require you to build a new RocksDB instance once a day, zip it up, and publish it to HDFS. It also means you would need a daemon process to do change data capture and publish the updated or new rows to Kafka.
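The daily snapshot step can be sketched roughly as below. This is a hedged illustration only, not maha's actual code: a plain JSON file stands in for the real RocksDB instance, and the HDFS upload and Kafka producer (which the real pipeline would need) are assumed to be wired in separately. All function names here are hypothetical.

```python
import json
import os
import tempfile
import zipfile

def build_snapshot(rows, out_dir):
    """Write all lookup rows to a snapshot file and zip it for publishing.

    In the real pipeline this would build a RocksDB instance instead of a
    JSON file; the resulting archive would then be pushed to HDFS.
    """
    snap_path = os.path.join(out_dir, "lookup_snapshot.json")
    with open(snap_path, "w") as f:
        json.dump(rows, f)
    zip_path = snap_path + ".zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.write(snap_path, arcname="lookup_snapshot.json")
    return zip_path

def cdc_record(key, value):
    """Shape of an update a CDC daemon might publish to Kafka between snapshots."""
    return json.dumps({"key": key, "value": value})

archive = build_snapshot({"a": "aa", "b": "bb"}, tempfile.mkdtemp())
print(archive.endswith(".zip"))  # True
```

The point of the once-a-day zip is that consumers bootstrap from the snapshot and then replay only the Kafka updates since that snapshot, rather than reloading all 10-15 million rows on every change.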

In your 50-100 lookups, if many of them share the same key space, you could replace them with our JDBC lookup, since it allows multiple values to be loaded in one lookup, avoiding duplication of the key space. E.g. with lookups-cached-global you have one key to one value: Map(a -> aa, b -> bb) and Map(a -> 123, b -> 456); our JDBC lookup allows just one lookup: Map(a -> (aa, 123), b -> (bb, 456)). At query time, you just specify which column you want in the extraction function.
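The key-space merge above can be illustrated with a small sketch. This is not maha's actual API, just a model of the idea: two single-value maps with the same keys collapse into one multi-value map, and a hypothetical `extract` helper plays the role of the extraction function picking a column at query time.

```python
# Two single-value lookups (lookups-cached-global style) over the same keys.
lookup_name = {"a": "aa", "b": "bb"}
lookup_id = {"a": 123, "b": 456}

# One multi-value lookup (JDBC-lookup style): each key maps to a tuple of columns.
merged = {k: (lookup_name[k], lookup_id[k]) for k in lookup_name}

def extract(lookup, key, column):
    """Mimic the extraction function: pick one column of the value at query time."""
    return lookup[key][column]

print(merged["a"])              # ('aa', 123)
print(extract(merged, "b", 0))  # 'bb'
```

With this layout the 10-15 million keys are stored once instead of once per lookup, which is where the memory saving comes from.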

We haven't properly monitored the loading time. For one large lookup (around 10 million entries), it takes around 45 minutes.

@vsharathchandra might be easier to talk about this on gitter or hangouts

Okay sure, will contact you on gitter.