zinggAI/zingg

Make Zingg More Usable - Part 1. Blocking

Closed this issue · 5 comments

Sometimes Zingg jobs are slow or fail due to a poorly learnt blocking tree. This can happen due to a variety of reasons.
For example, a user may add significantly more trainingSamples than the labels Zingg learnt through labeling. Or the data may have a natural bias, with lots of nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.

Let us add a new phase debugBlocking which will block the incoming data and output

  • Counts per block, for example:
    getPipeUtil().write(blocked.select(ColName.HASH_COL).groupByCount(ColName.HASH_COL, ColName.HASH_COL + "_count"), getPipeForDebugBlockingLocation(timestamp));
  • 10% of the records from the top 3 blocks by count, so that people can see which records are contributing to the issue and add appropriate training data
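The two outputs above can be sketched without Spark as a rough illustration. This is not Zingg code: the class DebugBlockingSketch, the Blocked type and the method names are hypothetical, and an in-memory list stands in for the blocked dataset. The sampling uses a seeded shuffle and rounds the 10% up, which is an assumption about how the sample would be taken.

```java
import java.util.*;

public class DebugBlockingSketch {
    // Hypothetical stand-in for a blocked record: an id plus its block hash.
    static final class Blocked {
        final String id;
        final int hash;
        Blocked(String id, int hash) { this.id = id; this.hash = hash; }
        String id() { return id; }
        int hash() { return hash; }
    }

    // Output 1: number of records per block hash.
    static Map<Integer, Long> countsPerBlock(List<Blocked> blocked) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (Blocked b : blocked) {
            counts.merge(b.hash(), 1L, Long::sum);
        }
        return counts;
    }

    // Hashes of the n largest blocks, largest first (ties broken by hash).
    static List<Integer> topBlocks(Map<Integer, Long> counts, int n) {
        List<Map.Entry<Integer, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<Integer, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<Integer, Long>comparingByKey()));
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Output 2: a random fraction of one block's records (rounded up).
    static List<Blocked> sampleBlock(List<Blocked> blocked, int hash,
                                     double fraction, long seed) {
        List<Blocked> members = new ArrayList<>();
        for (Blocked b : blocked) {
            if (b.hash() == hash) members.add(b);
        }
        Collections.shuffle(members, new Random(seed));
        int n = (int) Math.ceil(members.size() * fraction);
        return new ArrayList<>(members.subList(0, n));
    }

    public static void main(String[] args) {
        List<Blocked> data = new ArrayList<>();
        for (int i = 0; i < 25; i++) data.add(new Blocked("a" + i, 7));
        for (int i = 0; i < 5; i++) data.add(new Blocked("b" + i, 3));
        data.add(new Blocked("c0", 1));

        Map<Integer, Long> counts = countsPerBlock(data);
        System.out.println("counts: " + counts);
        for (int hash : topBlocks(counts, 3)) {
            List<Blocked> sample = sampleBlock(data, hash, 0.10, 42L);
            System.out.println("block " + hash + " sample size: " + sample.size());
        }
    }
}
```

In the real phase the counts and samples would be Spark datasets written through the pipe utilities; the sketch only shows the shape of the two results.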

We can save results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples

The timestamp is the same for both outputs.

This is a new phase.
Define a new class Blocker which holds the blocking logic copied from Matcher. It will take the blocking tree and return blocks.
In Matcher.getBlocked, call new Blocker<S,D,R,C,T>.getBlocked(getBlockingTreeUtil)

In BlockingTreeDebugger, call the same.

If there is more than one source, we need to group the hashes by source.
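A minimal sketch of the per-source grouping, assuming each blocked row carries a source name alongside its hash. The class PerSourceBlockCounts and the Row type are hypothetical and only illustrate the grouping; they are not part of Zingg.

```java
import java.util.*;

public class PerSourceBlockCounts {
    // Hypothetical row: which source it came from and its block hash.
    static final class Row {
        final String source;
        final int hash;
        Row(String source, int hash) { this.source = source; this.hash = hash; }
    }

    // Count records per block hash, separately for each source.
    static Map<String, Map<Integer, Long>> countsPerSource(List<Row> rows) {
        Map<String, Map<Integer, Long>> counts = new TreeMap<>();
        for (Row r : rows) {
            counts.computeIfAbsent(r.source, s -> new TreeMap<>())
                  .merge(r.hash, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("crm", 7), new Row("crm", 7), new Row("crm", 3),
                new Row("billing", 7), new Row("billing", 9));
        System.out.println(countsPerSource(rows));
    }
}
```

Keeping the counts per source makes it visible when a single source is responsible for an oversized block.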

see also #893

zingg.sh --phase debugBlocking --conf config.json --zinggDir /location

what will the run command look like?

--zinggDir is optional