lquirosd/P2PaLA

"No region type defined for r1 at 00001096"

vndee opened this issue · 4 comments

vndee commented

I've used P2PaLA to train a Document Layout Analysis model for zone segmentation on the PRImA Layout Analysis Dataset. The PRImA dataset uses PAGE XML with the 2010 schema version.
When the model loads data from the corpus, I get:

No region type defined for r1 at 0000001096
Element type "Node" undefined on color dic, set to default=175

This message appears for every <TextRegion> in the XML file. After the training phase finished, I looked at result/test and nothing had been predicted.
I don't know what it means. Please help me.
Thanks.

The PRImA 2010 schema is too restrictive regarding region definition; for that reason we use a slightly modified version developed by the Transkribus team. In this version there is a new attribute to define the structure of the document. For example:
In the 2010 PRImA schema:
<TextRegion type="page-number" id="region_1489561772109_198">
Is now updated to:
<TextRegion type="page-number" id="region_1489561772109_198" custom="readingOrder {index:3;} structure {type:page-number;}">

This new encoding allows us to define any type of TextRegion we need, not only those defined in the PRImA schema.

P2PaLA looks for the "structure" entry in the custom attribute, not the "type" attribute. The message you got is just a warning that a region will be ignored because its type is unknown.
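To illustrate what "looking for the structure entry" amounts to, here is a minimal Python sketch that pulls the structure type out of a `custom` attribute string like the one above. The helper name `structure_type` and the regular expression are assumptions made for illustration; this is not P2PaLA's own parsing code.

```python
import re

# Hypothetical helper, not taken from P2PaLA: matches the "type" entry
# inside the "structure {...}" group of a Transkribus-style custom attribute.
_STRUCTURE_TYPE = re.compile(r"structure\s*\{[^}]*?type:\s*([^;}\s]+)")


def structure_type(custom):
    """Return the structure type stored in a custom attribute, or None."""
    match = _STRUCTURE_TYPE.search(custom or "")
    return match.group(1) if match else None


custom = "readingOrder {index:3;} structure {type:page-number;}"
print(structure_type(custom))
```

A region whose `custom` attribute is missing, or has no `structure {...}` group, yields `None` here, which corresponds to the "No region type defined" situation described above.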

To convert your files to be compatible you can use a simple sed command:

sed 's:type="\([^ ]\+\)":type="\1" custom="structure {type\:\1;}":g' in_file > out_file
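If sed is not available, roughly the same rewrite can be sketched in Python. The function names below are hypothetical, and the pattern is slightly stricter than the sed one (it stops at the closing quote instead of at the first space), so treat it as an approximation of the command above rather than an exact port.

```python
import re
from pathlib import Path

# Every type="X" attribute gets a matching custom="structure {type:X;}" added.
_TYPE_ATTR = re.compile(r'\btype="([^"]+)"')


def add_structure_attr(xml_text):
    """Append a Transkribus-style custom attribute after each type attribute."""
    return _TYPE_ATTR.sub(r'type="\1" custom="structure {type:\1;}"', xml_text)


def convert_file(in_file, out_file):
    """Read a PAGE XML file, rewrite it, and save the result to out_file."""
    text = Path(in_file).read_text(encoding="utf-8")
    Path(out_file).write_text(add_structure_attr(text), encoding="utf-8")


line = '<TextRegion type="page-number" id="region_1489561772109_198">'
print(add_structure_attr(line))
```

Like the sed command, this rewrites every `type="..."` attribute it finds, on any element, and running it twice on the same file would add a second `custom` attribute, so it is best applied once to the original files.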
vndee commented

I've converted all the XML files to the Transkribus version and now I get something like:

Element type "header"undefined on color dic, set to default=175

So can you explain how to define the color for a type in the color dic?

You have to define which regions (header, paragraph, etc.) you want to analyze and the type of each region (TextRegion, ImageRegion, GraphRegion, ...). From the help:

  --regions REGIONS [REGIONS ...]
                        List of regions to be extracted. Format: --regions r1
                        r2 r3 ... (default: ['$tip', '$par', '$not', '$nop',
                        '$pag'])
  --merge_regions MERGE_REGIONS [MERGE_REGIONS ...]
                        Merge regions on PAGE file into a single one. Format
                        --merge_regions r1:r2,r3 r4:r5, then r2 and r3 will be
                        merged into r1 and r5 into r4 (default: {})
  --nontext_regions NONTEXT_REGIONS [NONTEXT_REGIONS ...]
                        List of regions where no text lines are expected.
                        Format: --nontext_regions r1 r2 r3 ... (default: None)
  --region_type REGION_TYPE [REGION_TYPE ...]
                        Type of region on PAGE file. Format --region_type
                        t1:r1,r3 t2:r5, then type t1 will assigned to regions
                        r1 and r3 and type t2 to r5 and so on... (default:
                        None)
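
As a rough illustration of the formats described in the help text, the following Python sketch builds the `--regions` and `--region_type` arguments from a mapping of region type to region names. The helper `region_flags` is hypothetical, and the `image` region name is an assumption; `header` and `paragraph` are the names mentioned in this thread. Use whatever names actually appear in your PAGE files.

```python
def region_flags(region_types):
    """Build --regions / --region_type arguments from {type: [regions]}."""
    # Flatten all region names, keeping the order they were given in.
    regions = [name for names in region_types.values() for name in names]
    flags = ["--regions", *regions, "--region_type"]
    # One "t1:r1,r3" entry per region type, as described in the help text.
    flags += [f"{rtype}:{','.join(names)}" for rtype, names in region_types.items()]
    return flags


mapping = {
    "TextRegion": ["header", "paragraph"],
    "ImageRegion": ["image"],  # assumed region name, for illustration only
}
print(" ".join(region_flags(mapping)))
```

The resulting arguments can be appended to the usual P2PaLA invocation; every region listed under `--regions` should also appear under some type in `--region_type`.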
vndee commented

Ok, thank you.