piegu/language-models

How did you create DocLayNet-small?

mit1280 opened this issue · 0 comments

Hi @piegu,

Thank you for creating the DocLayNet datasets (small, base and large). They save a lot of time when fine-tuning a model for a downstream task.

I have a question about the bounding boxes. I checked your notebooks and found that the dataset contains two kinds of bounding boxes: line level and block level (paragraph). I trained a model using "bboxes_block", and it performs well. However, my input data has only line-level bounding boxes, so I am wondering how you created the DocLayNet dataset that is on Hugging Face. My hunch is that you used an OCR engine (pytesseract), but I would still like to hear it from you.

Thanks in advance!