iljackb/Mixtepec_Mixtec

problems with tagging <m> within strings

Opened this issue · 11 comments

In issue #88 we concluded that, to make the content more searchable and usable, we would not keep all the <c>'s from the transcriptions: we would remove every <c> except those marking a morpho-semantically significant tone, and these would be changed to <m>, leaving the following structure:

           <u who="#TS" xml:id="d1e112" n="2" start="1.48" end="2.98" xml:lang="mix">
               <seg xml:lang="mix" xml:id="d1e113" notation="orth" type="S">
                  <w xml:id="d1e114" synch="#T14">sketa</w>
                  <w xml:id="d1e116" synch="#T19">ntikii</w>
               </seg>
               <seg xml:lang="mix" xml:id="d1e118" notation="ipa" type="S" sameAs="#d1e113">
                  <w xml:id="d1e119" synch="#T14" sameAs="#d1e114">skɛ<m xml:id="d1e225">˥</m>t̪a<m xml:id="d1e120">↘</m></w>
                  <w xml:id="d1e132" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
               </seg>
            </u>

However, while this is an improvement, it is still problematic: when searching for phonological content, wherever there is an <m> (which also means that the tone encoded therein is particularly significant) it is not possible to search for the full phonetic string.

So there are three possible solutions I can envision:

  1. Live with it

  2. Copy the string into an attribute like @orig and search for phonetics in the attribute values (though that contradicts the usage in this project, in which I'm using @orig to keep track of where I've normalized)

  3. Make another copy of the IPA contents and don't include the <m>'s;
    However, this raises the questions of:

    • these would have to be linked to either the orthographic or the original IPA contents
      which would be best to point to? Could we instead also have the orth <seg> point to it?

    • they would have to be typed, which is a problem given that @type is already used to classify the type of segment (thus @subtype wouldn't be consistent) and @notation is still ="ipa"

Below is an example in which I use @function="full" on the <seg> and which also points to the orthographic <seg>:

           <u who="#TS" xml:id="d1e112" n="2" start="1.48" end="2.98" xml:lang="mix">
              <seg xml:lang="mix" xml:id="d1e113" notation="orth" type="S">
                 <w xml:id="d1e114" synch="#T14">sketa</w>
                 <w xml:id="d1e116" synch="#T19">ntikii</w>
              </seg>
              <seg xml:lang="mix" xml:id="d1e118" notation="ipa" type="S" sameAs="#d1e113">
                 <w xml:id="d1e119" synch="#T14" sameAs="#d1e114">skɛ<m xml:id="d1e225">˥</m>t̪a<m xml:id="d1e120">↘</m></w>
                 <w xml:id="d1e132" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
              </seg>
              <seg xml:lang="mix" xml:id="d1e128" notation="ipa" type="S" sameAs="#d1e113" function="full">
                 <w xml:id="d1e129" synch="#T14" sameAs="#d1e114">skɛ˥t̪a↘</w>
                 <w xml:id="d1e142" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
              </seg>            
           </u>

Using this, a search for all phonetic strings would then have to match both @notation="ipa" and @function="full"; and to get the full phonetic string (to copy into a dictionary, for example) it would have to match the same and also follow the pointer to the @xml:id of a <w> which is a child of <seg notation="orth">.
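To make that lookup concrete, here is a minimal sketch in Python (standard library only, not the XQuery used in this project). It assumes the elements are in no namespace, as in the snippets above; real TEI files usually declare the TEI namespace, so the paths would need adjusting. The function name `orth_ipa_pairs` is just an illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predeclared XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed copy of the example above (assumption: no TEI namespace).
SAMPLE = """<u who="#TS" xml:id="d1e112" xml:lang="mix">
  <seg xml:id="d1e113" notation="orth" type="S">
    <w xml:id="d1e114" synch="#T14">sketa</w>
    <w xml:id="d1e116" synch="#T19">ntikii</w>
  </seg>
  <seg xml:id="d1e128" notation="ipa" type="S" sameAs="#d1e113" function="full">
    <w xml:id="d1e129" synch="#T14" sameAs="#d1e114">skɛ˥t̪a↘</w>
    <w xml:id="d1e142" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
  </seg>
</u>"""


def orth_ipa_pairs(u):
    """Return (orthographic, full IPA) word pairs for one <u> element."""
    # Index the orthographic words by their xml:id.
    orth = {w.get(XML_ID): w.text
            for w in u.findall("seg[@notation='orth']/w")}
    pairs = []
    for seg in u.findall("seg[@notation='ipa']"):
        # Keep only the flattened copy, i.e. @function="full".
        if seg.get("function") != "full":
            continue
        for w in seg.findall("w"):
            # Follow the @sameAs pointer back to the orthographic <w>.
            target = w.get("sameAs", "").lstrip("#")
            if target in orth:
                pairs.append((orth[target], w.text))
    return pairs


u = ET.fromstring(SAMPLE)
pairs = orth_ipa_pairs(u)
for orth_form, ipa_form in pairs:
    print(orth_form, ipa_form)
```

The point is only that every query for a full phonetic string needs two conditions plus one pointer hop, which is the cost of option 3.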

What do you think @laurent?

Now that I think about it, hadn't we managed to implement an XSLT search that flattens strings?

I have already done it myself! But the problem isn't how to do it, it's how to encode and annotate it in a way that allows for easy access but also maximally accurate annotation.

Actually, I remember what you were talking about: it was something to retrieve the content, but it was based on searching for the translations. The goal, and the basis of this issue, is to figure out a way to search the Mixtec, specifically the phonetic and/or orthographic strings.

That's what I mean, if we can manage to search in decent conditions, I would not delete fine grained markup too much...

Sorry, I misunderstood your first comment originally. What I said I did was just to make a flat copy to convert the phonetics with the <c>'s for every character.

So the only thing I do to search the strings is basic XQuery (I generally use XQuery to search and only use XSLT to convert into another format). I search as follows, e.g.: //seg[@notation='ipa']/w[contains(.,'skɛ˥t̪a↘')] (which isn't possible unless I make that flattened copy)

I wouldn't know how to do that. I assume this is with XSLT, not XQuery? I like making things XQuery friendly because in Oxygen you can do 'search whole project' and it gathers from files in different folders, but in XSLT you have to specify a single directory (unless I'm mistaken).

I'm thinking it may also be possible to search using "string-join" in XQuery but I'm not sure yet...
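As a rough illustration of the flattening idea, here is a sketch in Python (standard library only, so not XQuery itself) that joins the text nodes of each <w> at search time instead of storing a second copy. In XQuery the analogous expression would be something like string-join(.//text(), ''), though I haven't tested that against this corpus. The sample assumes no TEI namespace, and `flat` and `find_words` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The IPA <seg> from the example above, with <m> kept inside the first <w>
# (assumption: no TEI namespace declared).
SAMPLE = """<seg xml:id="d1e118" notation="ipa" type="S" sameAs="#d1e113">
  <w xml:id="d1e119" synch="#T14" sameAs="#d1e114">skɛ<m xml:id="d1e225">˥</m>t̪a<m xml:id="d1e120">↘</m></w>
  <w xml:id="d1e132" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
</seg>"""


def flat(w):
    """Concatenate every text node under <w>, ignoring the <m> markup."""
    return "".join(w.itertext())


def find_words(seg, needle):
    """Return the <w> elements whose flattened text contains `needle`."""
    return [w for w in seg.findall("w") if needle in flat(w)]


seg = ET.fromstring(SAMPLE)
hits = find_words(seg, "skɛ˥t̪a↘")
hit_ids = [w.get(XML_ID) for w in hits]
print(hit_ids)
```

If something like this works in XQuery as well, the <m> markup could stay in place and the duplicated <seg function="full"> might not be needed just for searching.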