ids for s and w in FOLIA

Question

Opened this issue a year ago · 1 comment

I apologize if this is documented - I couldn't find it:

I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:

<s class="line" xml:id="s3">
    <w xml:id="s3.w1">
        <t>are</t>
        <lemma class="be"/>
        <pos class="VBB"/>
    </w>
    <w xml:id="s3.w2">
        <t>you</t>
        <lemma class="you"/>
        <pos class="PNP"/>
    </w>
    <w xml:id="s3.w3">
        <t>ready</t>
        <lemma class="ready"/>
        <pos class="AV0"/>
    </w>
</s>

And I've tried several variants in the indexing configuration file such as:

<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
    <pre>
        <item type="string" value="word.id" />
    </pre>
    <post>
        <item type="attribute" name="#" />
    </post>
</token>

So far, I haven't been able to find or do anything with the xml:ids.

What I'd like to understand/do is:

1. How to represent xml:id on both sentence and token level in the config file
2. How to integrate them into a CQL query
3. How to access the ids programmatically after having done a query

For (3), I currently test my attempts like so:

  // Prefixes to fetch for each hit: the token text and the word id.
  List<String> prefixes = new ArrayList<>();
  prefixes.add("t");
  prefixes.add("word.id");
  List<CodecSearchTree.MtasTreeHit<String>> allHits =
      mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange(
          "content", index, prefixes, spans.startPosition(), spans.endPosition() - 1);
  // Order the returned terms by token position before printing.
  allHits.sort((o1, o2) -> Integer.compare(o1.startPosition, o2.startPosition));
  for (CodecSearchTree.MtasTreeHit<String> hit : allHits) {
      System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ")" + " / ");
  }

I'd be grateful if somebody could point me in the right direction. Thanks in advance.

Answer 1 · 2024-05-13T18:35:42.000Z

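To represent the xml:id on both sentence and token level, address the attribute by its XML namespace and the local name id rather than by the prefixed name: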

<mappings>
	<mapping type="word" name="w">
		<token type="string" offset="false" realoffset="false" parent="false">
			<pre>
				<item type="string" value="word.id" />
			</pre>
			<post>
				<item type="attribute" namespace="http://www.w3.org/XML/1998/namespace" name="id" />
			</post>
		</token>
	</mapping>
	<mapping type="group" name="s">
		<token type="string" offset="false">
			<pre>
				<item type="string" value="sentence.id" />
			</pre>
			<post>
				<item type="attribute" namespace="http://www.w3.org/XML/1998/namespace" name="id" />
			</post>
		</token>
	</mapping>
</mappings>

Search with CQL for [word.id="s3.w2"] or <sentence.id="s3"/>
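
As a side note on why the mapping uses namespace plus name="id": the xml prefix is always bound to http://www.w3.org/XML/1998/namespace, so a namespace-aware parser sees xml:id as the local name id in that namespace. The following is only a minimal sketch with the JDK's DOM parser to illustrate that resolution; it does not use MTAS at all, and the class name XmlIdDemo, the method xmlIds and the reduced sample string are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlIdDemo {
    // Reduced version of the sample from the question (hypothetical, no FoLiA namespace).
    static final String SAMPLE =
        "<s class=\"line\" xml:id=\"s3\">"
      + "<w xml:id=\"s3.w1\"><t>are</t></w>"
      + "<w xml:id=\"s3.w2\"><t>you</t></w>"
      + "</s>";

    // Collects the xml:id values of all elements with the given local name.
    static List<String> xmlIds(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Without this, attributes are only addressable by their prefixed name.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagNameNS("*", localName);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                // Same pair as in the mapping: XML namespace URI + local name "id".
                ids.add(element.getAttributeNS(XMLConstants.XML_NS_URI, "id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("w ids: " + xmlIds(SAMPLE, "w"));
        System.out.println("s ids: " + xmlIds(SAMPLE, "s"));
    }
}
```

Here XMLConstants.XML_NS_URI is the same URI as the namespace attribute in the mapping above, which is why name="id" together with that namespace selects the xml:id attribute.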