Original Training image with XML labels to extract data from documents
Omua opened this issue · 4 comments
Hi,
I'm working on a page layout analysis and information extraction tool, and I found that dhSegment might work well for this task. However, I'm not sure whether dhSegment can be trained with XML-based annotations (TextRegion, SeparatorRegion, TableRegion, ImageRegion, points defining the bounds of each region...) in addition to the RGB-coloured label images. I see on the main page of the project that there is a Layout Analysis example under the Use Cases section. That is the case that most closely resembles the one I want to implement. I also want to extract text from the detected regions.
How can I do that? Can I still use dhSegment, or do I have to implement my own detector?
Thanks.
Regards.
Hi,
dhSegment takes a pair of images as input: the original image and the labelled image, in which the regions you want to extract are annotated with different 'colors'. It is not restricted to any annotation format, as long as you are able to convert your annotations to the above-mentioned labelled image.
So to answer your question: if you want to feed XML files directly to dhSegment, no, it will not work. But if you generate the corresponding labelled images, then yes, you'll be able to train a model.
There are already some implemented functions in the PAGE.py file to parse files in PAGE-XML format and generate the corresponding masks. You can also have a look at the exps/diva/utils.py file, which may give you some hints on how to adapt it to your specific experiment (the Layout Analysis example is the DIVA experiment with DIVA-HisDB data).
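To make the XML-to-labelled-image conversion concrete, here is a minimal standalone sketch. It does not use the functions in PAGE.py; it only illustrates the idea with the standard library and NumPy. It assumes the XML stores each region outline in a `Coords` element with a `points="x1,y1 x2,y2 ..."` attribute and that the `Page` element carries `imageWidth` and `imageHeight`. The colour map and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET
import numpy as np

# Hypothetical colour map: region tag -> RGB colour in the labelled image.
REGION_COLOURS = {
    "TextRegion": (255, 0, 0),
    "ImageRegion": (0, 255, 0),
    "TableRegion": (0, 0, 255),
    "SeparatorRegion": (255, 255, 0),
}

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def parse_points(points_attr):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points_attr.split()]

def fill_polygon(label, polygon, colour):
    """Paint every pixel whose centre lies inside the polygon (even-odd rule)."""
    height, width = label.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5  # pixel centres
    inside = np.zeros((height, width), dtype=bool)
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        crosses = (y1 > py) != (y2 > py)
        x_at_py = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < x_at_py)
    label[inside] = colour

def page_xml_to_label(xml_text):
    """Build an RGB label array (black background) from PAGE-style XML text."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local_name(el.tag) == "Page")
    width, height = int(page.get("imageWidth")), int(page.get("imageHeight"))
    label = np.zeros((height, width, 3), dtype=np.uint8)
    for region in page.iter():
        colour = REGION_COLOURS.get(local_name(region.tag))
        if colour is None:
            continue
        coords = next((c for c in region if local_name(c.tag) == "Coords"), None)
        if coords is not None and coords.get("points"):
            fill_polygon(label, parse_points(coords.get("points")), colour)
    return label

# Tiny made-up page: one text region and one image region.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.png" imageWidth="8" imageHeight="6">
    <TextRegion id="r1"><Coords points="1,1 4,1 4,3 1,3"/></TextRegion>
    <ImageRegion id="r2"><Coords points="5,3 8,3 8,6 5,6"/></ImageRegion>
  </Page>
</PcGts>"""
label = page_xml_to_label(SAMPLE)
```

The resulting array can then be saved as an image next to the original page, so that the two form the input pair described above. Older PAGE-XML files that store `Point` child elements instead of a `points` attribute would need a different `parse_points`.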
Ok, thanks!
Right now I'm using the PAGE.py functions to convert the XML files I currently have into the labelled images that dhSegment takes as input. After that, I should be able to train the system to recognize the type of documents I need to analyze.
But what about extracting the text to postprocess it and analyze what is written? Is that possible?
After thinking about my last question, I think I have the solution.
After training dhSegment, the output will be the page regions classified by different colours. I then have to analyze that output image. Knowing beforehand which colour corresponds to which element, I can take the coordinates of each region and extract it from the original image. Only then can I analyze it properly, because I know exactly what type of information is in that region (table, image, text...).
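A rough sketch of that last step, assuming the prediction has already been turned into an RGB image that uses the same colours as the training labels. It is not dhSegment's own post-processing; the colour mapping and function names are hypothetical. It finds each connected blob of a given colour, takes its bounding box, and crops the same box from the original image.

```python
import numpy as np

# Hypothetical mapping; it must match the colours used in the training labels.
COLOUR_TO_CLASS = {
    (255, 0, 0): "text",
    (0, 255, 0): "image",
}

def connected_boxes(mask):
    """Yield (top, left, bottom, right) for each 4-connected blob in a boolean mask."""
    seen = np.zeros_like(mask, dtype=bool)
    height, width = mask.shape
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        top, left, bottom, right = start[0], start[1], start[0], start[1]
        while stack:  # iterative flood fill over the blob
            y, x = stack.pop()
            top, bottom = min(top, y), max(bottom, y)
            left, right = min(left, x), max(right, x)
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < height and 0 <= nx < width and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    stack.append((ny, nx))
        # bottom/right are exclusive so the box can be used directly as a slice
        yield int(top), int(left), int(bottom) + 1, int(right) + 1

def extract_regions(original, prediction):
    """Return a list of (class_name, box, crop) for every predicted region."""
    regions = []
    for colour, name in COLOUR_TO_CLASS.items():
        mask = np.all(prediction == np.array(colour, dtype=prediction.dtype), axis=-1)
        for top, left, bottom, right in connected_boxes(mask):
            crop = original[top:bottom, left:right]
            regions.append((name, (top, left, bottom, right), crop))
    return regions

# Made-up 6x8 page: two text blobs and one image blob.
original = np.arange(6 * 8).reshape(6, 8)
prediction = np.zeros((6, 8, 3), dtype=np.uint8)
prediction[1:3, 1:4] = (255, 0, 0)
prediction[4:6, 1:3] = (255, 0, 0)
prediction[3:6, 5:8] = (0, 255, 0)
regions = extract_regions(original, prediction)
```

Each crop labelled "text" could then be passed to an OCR tool of your choice to get the written content. On real predictions, small noisy blobs would probably need to be filtered out first, for example by a minimum area.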
How do I train dhSegment using my own dataset?