ufal/treex

Merging documents

Closed this issue · 4 comments

Do we have a block that merges two documents? I have two files, which should have the same number of bundles and only a single zone in each bundle (though the problem could be generalized to multiple zones). I want to read both documents into one so that each bundle has two zones, one from original document 1 and the other from original document 2.

I cannot solve this by putting two Read blocks in one scenario (with different zone selectors) because only one reader per scenario is allowed.

As far as I know, we don't... I once also needed an AlignedReader for Treex files, but I had no time to implement it (and no desire to delve into the depths of Treex::PML).

If the two documents contain only a-trees, it is possible to export them to the CoNLL format and then use Read::AlignedCoNLL, e.g. with parameters en_selector1='!en1*.conll' en_selector2='!en2*.conll'. The problem with merging two treex files is that they may contain duplicate node IDs, so an aligned treex reader (or rather a block such as AddTreexFile) would need to handle this by renaming all IDs and updating every link that refers to them.
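To make the ID problem concrete, here is a minimal sketch in plain Python of what such a merge would have to do. It does not use Treex or Treex::PML. The data model (a document as a list of bundles, a bundle as a dict from zone name to a list of nodes, a node as a dict with `id`, `form` and `links`) and the names `rename_ids` and `merge_documents` are assumptions made up for illustration only.

```python
def rename_ids(bundles, suffix):
    """Return a copy of the bundles with the suffix appended to every node ID
    and to every link target, so that links stay consistent after renaming."""
    renamed = []
    for bundle in bundles:
        new_bundle = {}
        for zone, nodes in bundle.items():
            new_bundle[zone] = [
                {
                    **node,
                    "id": node["id"] + suffix,
                    "links": [target + suffix for target in node.get("links", [])],
                }
                for node in nodes
            ]
        renamed.append(new_bundle)
    return renamed


def merge_documents(doc1, doc2, suffix="_doc2"):
    """Merge two documents bundle by bundle (hypothetical data model).

    Every node ID in doc2 is renamed first, so the merged document cannot
    contain the same ID twice unless doc1 already uses the suffixed form.
    """
    if len(doc1) != len(doc2):
        raise ValueError(
            f"bundle counts differ: {len(doc1)} vs {len(doc2)}"
        )
    doc2 = rename_ids(doc2, suffix)
    merged = []
    for bundle1, bundle2 in zip(doc1, doc2):
        clash = bundle1.keys() & bundle2.keys()
        if clash:
            raise ValueError(f"zone names used in both documents: {sorted(clash)}")
        # Each merged bundle holds the zones of both source bundles.
        merged.append({**bundle1, **bundle2})
    return merged


doc1 = [{"en_1": [{"id": "n1", "form": "Hello", "links": []},
                  {"id": "n2", "form": "world", "links": ["n1"]}]}]
doc2 = [{"en_2": [{"id": "n1", "form": "Hi", "links": []},
                  {"id": "n2", "form": "there", "links": ["n1"]}]}]
merged = merge_documents(doc1, doc2)
```

A fixed suffix is the simplest renaming scheme. It assumes that the first document does not already contain IDs ending in that suffix, and a real implementation would also have to find links stored in places other than a single `links` attribute.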

Thanks, Read::AlignedCoNLL works for me (after adjusting it to handle the 2006 flavor of the format). Closing the issue.

FYI, I just stumbled upon the block Misc::AddZonesFromFile, which is basically what we need, but I don't know whether it handles duplicate IDs properly.