keep-the-receipts/data-extraction

KWAZULU-NATAL pass 2

Opened this issue · 9 comments

@jbothma Somehow missed there was a pass 1 of the KZN data as well. Looking through it, it looks exactly the same as pass 2. Should it be extracted anyway?

Hey! The idea with pass 1 and pass 2 is that two different people extract the same data independently, so we can compare the results and spot errors. You only need to do one of them.

Ah! Cool, makes sense

@JasonTame TBH, in retrospect we should have probably called them Pass A and Pass B since there is no implied sequence here, the intent is merely to have two people do the same task so that we can compare the outputs for correctness.
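
For anyone wondering what comparing the two passes could look like in practice, here is a minimal sketch. It assumes both passes are exported as CSV with the same header and row order. The function name `diff_passes` and the `supplier`/`amount` columns are made up for illustration and are not taken from the project's tooling.

```python
import csv
import io
from itertools import zip_longest


def diff_passes(pass_a_csv, pass_b_csv):
    """Return (line_number, column, value_a, value_b) for every cell that differs.

    Line numbers count the header as line 1. Rows are compared by position,
    so this assumes both extractions list rows in the same order.
    """
    rows_a = list(csv.reader(io.StringIO(pass_a_csv)))
    rows_b = list(csv.reader(io.StringIO(pass_b_csv)))
    header = rows_a[0] if rows_a else []
    mismatches = []
    # zip_longest pads the shorter file so missing rows show up as differences.
    for line_no, (a, b) in enumerate(zip_longest(rows_a, rows_b, fillvalue=[]), start=1):
        for idx in range(max(len(a), len(b))):
            value_a = a[idx].strip() if idx < len(a) else ""
            value_b = b[idx].strip() if idx < len(b) else ""
            if value_a != value_b:
                column = header[idx] if idx < len(header) else f"column {idx + 1}"
                mismatches.append((line_no, column, value_a, value_b))
    return mismatches


# Hypothetical sample data: the two passes disagree on one amount.
pass_a = "supplier,amount\nAcme Ltd,1000.00\nBeta CC,250.50\n"
pass_b = "supplier,amount\nAcme Ltd,1000.00\nBeta CC,256.50\n"

for mismatch in diff_passes(pass_a, pass_b):
    print(mismatch)  # (3, 'amount', '250.50', '256.50')
```

A mismatch only says that the two passes disagree, not which one is right, so each flagged cell still needs to be checked against the source PDF.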

I'll take this one

Sincerest apologies, and I hope this doesn't become a blocker for anyone else, but something came up and I couldn't dedicate the time this needed. It ended up being rather challenging to process: the tabular data needs a lot of work and the OCR is not 100% accurate, so you really have to pay attention and make numerous manual fixes. I did some of the tables and have pushed the draft data, along with the outstanding rows in CSV format, to my fork in case anybody feels they can finish this up. I'm unassigning myself in case a night owl can get it done; otherwise I'll be ready to finish up tomorrow. Kudos to @JasonTame for knocking out the first copy of this, though...

May I continue where @zacharlie left off? Looks like it’s the only unassigned issue

@Fruitymo It was completed in #257 and should probably be reassigned to me.

But this one is a bit challenging because of the volume of data and the OCR output, which was rife with errors. As described in the PR, I recommend that this particular sheet be manually cross-referenced, row by row, with the original PDF and the latest existing dataset I submitted.
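
To narrow down which rows need that manual check, something along these lines might help. It is only a rough sketch using Python's standard `difflib`, assuming each row is available as a single line of text. `rows_needing_review` and the sample rows are hypothetical, and the output is a to-do list for checking against the PDF, not a verdict on which value is correct.

```python
import difflib


def rows_needing_review(draft_rows, reference_rows, cutoff=0.6):
    """Pair each draft row that has no exact match with its closest reference row.

    Returns a list of (draft_row, closest_reference_row_or_None).
    """
    reference_set = set(reference_rows)
    flagged = []
    for row in draft_rows:
        if row in reference_set:
            continue  # identical in both datasets, lower priority for review
        closest = difflib.get_close_matches(row, reference_rows, n=1, cutoff=cutoff)
        flagged.append((row, closest[0] if closest else None))
    return flagged


# Hypothetical rows: the draft has a letter "O" where the reference has a zero.
draft = ["Acme Ltd,1000.00", "Beta CC,25O.50"]
reference = ["Acme Ltd,1000.00", "Beta CC,250.50"]

for draft_row, closest in rows_needing_review(draft, reference):
    print(f"check against PDF: {draft_row!r} (closest existing row: {closest!r})")
```

Rows that match exactly in both datasets can still share the same OCR error, so this reduces the manual work but does not replace the row-by-row check.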

That should perhaps be a new issue though. Perhaps @jbothma or @schalkventer can advise.

It's a painstaking process, but I don't think the automated integrity checks help much here at all... Unless someone has a more refined OCR processing technique we can validate against.