textexploration/mtas

ids for s and w in FOLIA


I apologize if this is documented - I couldn't find it:

I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:

<s class="line" xml:id="s3">
    <w xml:id="s3.w1">
        <t>are</t>
        <lemma class="be"/>
        <pos class="VBB"/>
    </w>
    <w xml:id="s3.w2">
        <t>you</t>
        <lemma class="you"/>
        <pos class="PNP"/>
    </w>
    <w xml:id="s3.w3">
        <t>ready</t>
        <lemma class="ready"/>
        <pos class="AV0"/>
    </w>
</s>

And I've tried several variants in the indexing configuration file such as:

<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
    <pre>
        <item type="string" value="word.id" />
    </pre>
    <post>
        <item type="attribute" name="#" />
    </post>
</token>

So far, I haven't been able to find or do anything with the xml:ids.

What I'd like to understand/do is:

  1. How to represent xml:id on both sentence and token level in the config file
  2. How to integrate them into a CQL query
  3. How to access the ids programmatically after having done a query

For (3), I currently test my attempts like so:

  // Prefixes to retrieve for every position covered by the current match.
  List<String> prefixes = new ArrayList<>();
  prefixes.add("t");
  prefixes.add("word.id");
  // endPosition() is exclusive, hence the -1 for the last position.
  List<MtasTreeHit<String>> allHits =
      mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange(
          "content", index, prefixes, spans.startPosition(), spans.endPosition() - 1);
  // Print the retrieved terms in token order.
  allHits.sort((o1, o2) -> Integer.compare(o1.startPosition, o2.startPosition));
  for (MtasTreeHit<String> hit : allHits) {
    System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ") / ");
  }

I'd be grateful if somebody could point me in the right direction. Thanks in advance.

To represent the xml:id on both the sentence and the token level, address the attribute by its namespace and local name in the mappings:

<mappings>
	<mapping type="word" name="w">
		<token type="string" offset="false" realoffset="false" parent="false">
			<pre>
				<item type="string" value="word.id" />
			</pre>
			<post>
				<item type="attribute" namespace="http://www.w3.org/XML/1998/namespace" name="id" />
			</post>
		</token>
	</mapping>
	<mapping type="group" name="s">
		<token type="string" offset="false">
			<pre>
				<item type="string" value="sentence.id" />
			</pre>
			<post>
				<item type="attribute" namespace="http://www.w3.org/XML/1998/namespace" name="id" />
			</post>
		</token>
	</mapping>
</mappings>
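
The reason the attribute is given as namespace plus name="id" is that the xml: prefix is always bound to http://www.w3.org/XML/1998/namespace, so a namespace-aware parser sees the attribute as (that URI, "id") rather than as a plain attribute called "xml:id". The following is a minimal sketch of that lookup using only the JDK's StAX parser. It is independent of Mtas and makes no claim about how Mtas reads the file internally; the class and method names (XmlIdDemo, collectIds) are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of Mtas.
class XmlIdDemo {

  /** Maps each xml:id found in the input to the local name of its element. */
  static Map<String, String> collectIds(String xml) throws XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
    Map<String, String> ids = new LinkedHashMap<>();
    while (reader.hasNext()) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // XML_NS_URI is "http://www.w3.org/XML/1998/namespace".
        String id = reader.getAttributeValue(XMLConstants.XML_NS_URI, "id");
        if (id != null) {
          ids.put(id, reader.getLocalName());
        }
      }
    }
    reader.close();
    return ids;
  }

  public static void main(String[] args) throws XMLStreamException {
    String xml = "<s class=\"line\" xml:id=\"s3\">"
        + "<w xml:id=\"s3.w1\"><t>are</t></w>"
        + "<w xml:id=\"s3.w2\"><t>you</t></w>"
        + "</s>";
    System.out.println(collectIds(xml));
  }
}
```

Looking the attribute up by local name alone with no namespace would not match here, which is consistent with the earlier attempts in the question finding nothing.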

Search with CQL for [word.id="s3.w2"] (token level) or <sentence.id="s3"/> (sentence level).
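
For point (3), once word.id is indexed, the test code from the question should return the ids alongside the "t" terms, since "word.id" is already in its prefix list. What remains is pairing each token's text with its id. The sketch below shows one way to do that by grouping on position. It deliberately uses a hypothetical stand-in class (Hit) with prefix, value and startPosition fields instead of the Mtas hit type, so it runs on its own; how the prefix is obtained from a real hit is left open here and is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of Mtas.
class HitGrouping {

  /** Stand-in for one retrieved term; NOT an Mtas class. */
  static final class Hit {
    final String prefix;
    final String value;
    final int startPosition;

    Hit(String prefix, String value, int startPosition) {
      this.prefix = prefix;
      this.value = value;
      this.startPosition = startPosition;
    }
  }

  /** Groups hits by position and renders "position: text [id]" in position order. */
  static List<String> describe(List<Hit> hits) {
    // TreeMap keeps positions sorted, so no separate sort step is needed.
    Map<Integer, Map<String, String>> byPosition = new TreeMap<>();
    for (Hit hit : hits) {
      byPosition
          .computeIfAbsent(hit.startPosition, p -> new HashMap<>())
          .put(hit.prefix, hit.value);
    }
    List<String> lines = new ArrayList<>();
    for (Map.Entry<Integer, Map<String, String>> entry : byPosition.entrySet()) {
      Map<String, String> terms = entry.getValue();
      lines.add(entry.getKey() + ": " + terms.get("t") + " [" + terms.get("word.id") + "]");
    }
    return lines;
  }

  public static void main(String[] args) {
    // Sample data mirroring the first two tokens of sentence s3, in arbitrary order.
    List<Hit> hits = Arrays.asList(
        new Hit("word.id", "s3.w2", 1),
        new Hit("t", "you", 1),
        new Hit("t", "are", 0),
        new Hit("word.id", "s3.w1", 0));
    for (String line : describe(hits)) {
      System.out.println(line);
    }
  }
}
```

The same grouping would extend to the sentence id by adding "sentence.id" to the prefix list, bearing in mind that a sentence-level term spans several positions rather than one.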