Make Zingg More Usable - Part 1. Blocking
Closed this issue · 5 comments
Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons.
For example, when a user adds significantly more trainingSamples than the data labeled through Zingg. Or because of a natural bias in the data, with lots of nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.
Let us add a new phase, debugBlocking, which will block the incoming data and output:
- Counts per block:
  getPipeUtil().write(blocked.select(ColName.HASH_COL).groupByCount(ColName.HASH_COL, ColName.HASH_COL + "_count"), getPipeForDebugBlockingLocation(timestamp));
- 10% of the records from the top 3 blocks by count, so that people can see which records are contributing to the issue and add appropriate training
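The two outputs above can be sketched in plain Java collections, without Spark, just to pin down the intended semantics. This is only an illustration: the Rec class and the method names are hypothetical and not part of Zingg, and the sample takes the first 10% of each block (at least one record) instead of a random sample so that the result is reproducible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java illustration only; Rec and all method names are hypothetical, not Zingg API.
class BlockStatsSketch {

    /** A record reduced to its block hash and an identifier. */
    static final class Rec {
        final int hash;
        final String id;
        Rec(int hash, String id) { this.hash = hash; this.id = id; }
    }

    /** Number of records per block hash, in order of first appearance. */
    static Map<Integer, Long> countsPerBlock(List<Rec> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r.hash, LinkedHashMap::new, Collectors.counting()));
    }

    /** Hashes of the n largest blocks, largest first. */
    static List<Integer> topBlocks(Map<Integer, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** For each of the topN blocks, keeps the first `fraction` of its records (at least one). */
    static Map<Integer, List<String>> sampleTopBlocks(List<Rec> records, int topN, double fraction) {
        Map<Integer, Long> counts = countsPerBlock(records);
        Map<Integer, List<String>> samples = new LinkedHashMap<>();
        for (Integer hash : topBlocks(counts, topN)) {
            long size = counts.get(hash);
            long keep = Math.max(1, (long) Math.ceil(size * fraction));
            samples.put(hash, records.stream()
                    .filter(r -> r.hash == hash)
                    .limit(keep)
                    .map(r -> r.id)
                    .collect(Collectors.toList()));
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        for (int i = 0; i < 15; i++) records.add(new Rec(101, "big-" + i));
        for (int i = 0; i < 5; i++) records.add(new Rec(202, "mid-" + i));
        records.add(new Rec(303, "small-0"));
        System.out.println(countsPerBlock(records));
        System.out.println(sampleTopBlocks(records, 3, 0.10));
    }
}
```

In the real phase the same steps would run on the blocked dataset, and a random sample is probably preferable to the first-N shortcut used here.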
We can save the results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples.
The timestamp is the same for both locations.
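A small sketch of how the two locations could be derived from a single timestamp so that they always line up. The class and method names are hypothetical, and plain string joining is used on the assumption that zinggDir may point to a non-local store; the actual phase would go through Zingg's own pipe utilities.

```java
// Hypothetical helper; not part of Zingg.
class DebugBlockingPaths {

    /** Common prefix: zinggDir/modelId/blocks/timestamp */
    static String baseDir(String zinggDir, String modelId, long timestamp) {
        return String.join("/", zinggDir, modelId, "blocks", Long.toString(timestamp));
    }

    /** Location of the per-block counts. */
    static String countsDir(String zinggDir, String modelId, long timestamp) {
        return baseDir(zinggDir, modelId, timestamp) + "/counts";
    }

    /** Location of the sampled records from the largest blocks. */
    static String blockSamplesDir(String zinggDir, String modelId, long timestamp) {
        return baseDir(zinggDir, modelId, timestamp) + "/blockSamples";
    }

    public static void main(String[] args) {
        // Take the timestamp once and reuse it for both outputs.
        long timestamp = System.currentTimeMillis();
        System.out.println(countsDir("/location", "100", timestamp));
        System.out.println(blockSamplesDir("/location", "100", timestamp));
    }
}
```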
This is a new phase.
Define a new class Blocker which holds the blocking logic copied from Matcher. It will take the blocking tree and return blocks.
In Matcher.getBlocked, call new Blocker<S,D,R,C,T>().getBlocked(getBlockingTreeUtil()).
In BlockingTreeDebugger, call the same.
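The shape of that refactoring, as a heavily simplified sketch: one class owns the blocking logic, and both the matcher and the debugger delegate to it. Everything here is hypothetical. The real Blocker would carry the <S,D,R,C,T> type parameters and work on Zingg's dataset abstraction, whereas this version uses a single record type, an in-memory list, and a stand-in BlockingTreeSketch interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the learnt blocking tree: maps a record to a block hash. */
interface BlockingTreeSketch<R> {
    int hash(R record);
}

/** Hypothetical shared blocking logic, used by both matching and debugging. */
class BlockerSketch<R> {
    Map<Integer, List<R>> getBlocked(List<R> records, BlockingTreeSketch<R> tree) {
        Map<Integer, List<R>> blocks = new LinkedHashMap<>();
        for (R record : records) {
            blocks.computeIfAbsent(tree.hash(record), k -> new ArrayList<>()).add(record);
        }
        return blocks;
    }
}

/** The matcher no longer blocks on its own; it delegates. */
class MatcherSketch {
    static <R> Map<Integer, List<R>> getBlocked(List<R> records, BlockingTreeSketch<R> tree) {
        return new BlockerSketch<R>().getBlocked(records, tree);
    }
}

/** The debugger reuses the same blocker and only reports block sizes. */
class BlockingTreeDebuggerSketch {
    static <R> Map<Integer, Integer> blockSizes(List<R> records, BlockingTreeSketch<R> tree) {
        Map<Integer, Integer> sizes = new LinkedHashMap<>();
        new BlockerSketch<R>().getBlocked(records, tree)
                .forEach((hash, block) -> sizes.put(hash, block.size()));
        return sizes;
    }

    public static void main(String[] args) {
        // Toy "tree": block names by their length.
        BlockingTreeSketch<String> byLength = name -> name.length();
        List<String> names = Arrays.asList("ann", "bob", "carol", "dave");
        System.out.println(MatcherSketch.getBlocked(names, byLength));
        System.out.println(blockSizes(names, byLength));
    }
}
```

Sharing the blocker means the debug output reflects exactly the blocks that a match or link job would see.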
If there is more than one source, we need to do the group by of the hashes per source.
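A minimal sketch of the per-source grouping, again with plain collections and hypothetical names (SourcedHash, countsPerSource). It assumes each blocked row carries the name of the source it came from next to its hash.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not Zingg API.
class PerSourceBlockCounts {

    /** A blocked row reduced to its source name and block hash. */
    static final class SourcedHash {
        final String source;
        final int hash;
        SourcedHash(String source, int hash) { this.source = source; this.hash = hash; }
    }

    /** Counts rows per (source, hash) pair: source -> hash -> count. */
    static Map<String, Map<Integer, Long>> countsPerSource(List<SourcedHash> rows) {
        Map<String, Map<Integer, Long>> out = new TreeMap<>();
        for (SourcedHash row : rows) {
            out.computeIfAbsent(row.source, s -> new TreeMap<>())
               .merge(row.hash, 1L, Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<SourcedHash> rows = Arrays.asList(
                new SourcedHash("customers", 101),
                new SourcedHash("customers", 101),
                new SourcedHash("vendors", 101),
                new SourcedHash("vendors", 202));
        System.out.println(countsPerSource(rows));
    }
}
```

Keeping the counts per source makes it visible when one input alone is responsible for an oversized block.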
see also #893
What will the run command look like?
zingg.sh --phase debugBlocking --conf config.json --zinggDir /location
--zinggDir is optional.