funderburkjim/elispsanskrit

root-class-pada list from MWvlex

Closed this issue · 49 comments

The pysanskrit conjugation algorithms depend on root-class-pada information. Currently, this is extracted from the MWvlex repository.

To get this information more in the form needed by pysanskrit, a consolidation of this information was made, resulting in mwvlex_cp.txt. Some details regarding this file and its extraction are in the readme here.

comparison to verbdata

There is now comparable root-class-pada information from SanskritVerb and from MWvlex.

A first programmatic comparison is sanverb_mwvlex_cp.txt.

Here are summary statistics from the comparison:

  • 652 roots where the class-pada information is identical from the two sources
  • 282 roots where the mwvlex information extends that of verbdata (sanverb)
  • 121 roots where the verbdata information extends the mwvlex information
  • 457 roots where there is incommensurate class-pada information in the two sources
  • 442 roots that appear only in MWvlex
  • 172 roots that appear only in SanskritVerb verbdata.
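The six categories above correspond to set relations between the class-pada codes of a root in the two sources. A minimal sketch, assuming each source has been reduced to a set of codes per root; the function names are hypothetical and not taken from the pysanskrit code, and the sample values are Case 01 (aNk) as quoted later in this thread:

```python
def parse_cps(field):
    """Split a class-pada string such as '01A,10P' into a set of codes."""
    return {cp for cp in field.split(",") if cp}


def classify(sanverb_cps, mwvlex_cps):
    """Name the relation between the two sources for one root.

    Either argument may be None when the root is absent from that source.
    """
    if sanverb_cps is None:
        return "mwvlex only"
    if mwvlex_cps is None:
        return "sanverb only"
    if sanverb_cps == mwvlex_cps:
        return "identical"
    if sanverb_cps < mwvlex_cps:   # proper subset
        return "mwvlex extends sanverb"
    if sanverb_cps > mwvlex_cps:   # proper superset
        return "sanverb extends mwvlex"
    return "incommensurate"


print(classify(parse_cps("01A"), parse_cps("01A,10P")))
```

Anything that is neither equal, a subset, nor a superset falls through to "incommensurate", which is why that group is the one most sensitive to later normalizations.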

I expect there is much useful information in the cases where there are differences, which may come
to light in time.

652 roots where the class-pada information is identical from the two sources

Good news.

282 roots where the mwvlex information extends that of verbdata (sanverb)

I am not sure about the process by which the mwvlex information was derived.
I did a test check for the first entry Case 01: aNk has 01A(sanverb) < 01A + 10P(mwvlex)
The corresponding dictionary entry is as follows.
[image: capture of the dictionary entry]
The second entry 10P seems to be part of 10th gaNa 'aNka' in sanverb.
So the major difference is as follows -

There is a special category of verbs known as 'अदन्तधातवः' i.e. verbs ending with 'a' in 10th gaNa.
There is a specific grammatical need for the last 'a'. Usually all other verbs end in a consonant in 'without anubandha' form. These verbs end with a vowel 'a'. When this is put into use, there is a grammatical rule अतो लोपः (६.४.४८) which elides the terminal 'a'.

MW (and most dictionaries for that matter) seem to keep them in consonant ending form 'aNk' instead of vowel ending form 'aNka'.

That is why the majority of the so-called differences are seen.

If this is force matched, there would be larger correspondence.

This phenomenon is seen only in 10th gaNa 'a' ending verbs like aMsa, aNka, aNga etc.

121 roots where the verbdata information extends the mwvlex information

A cursory look at the first two entries shows a possible explanation.

Case 01: aMh has 01A,10P + 10A(sanverb) > 01A,10P (mwvlex)
Case 02: aww has 01A,10P + 10A(sanverb) > 01A,10P (mwvlex)

Here the difference is only in the 10A.

There is a grammatical rule णिचश्च (1.3.74) which mandates that all the 10th gaNa verbs are उभयपदि by nature. See http://localhost/SanskritVerb/Data/allsutrani/1.3.74.htm.
The explanation reads कटं कारयते। कर्त्रभिप्राये इत्येव, कटं कारयति परस्य।. Therefore there are two optional forms A and P.

So, wherever there is 10th gaNa verb, it has to have both 'A' and 'P'.
Therefore, mwvlex misses out on this detail.
Force match this and you will have greater correspondence.

@funderburkjim I will do further analysis once these discrepancies are forcematched / ignored.
Then we will have real problems.

I am not sure about the process by which the mwvlex information was derived.

mwvlex_cp.txt is derived from the markup of mw.xml, specifically that part of the markup that pertains to verbs.

Technically, there are three steps in this derivation:

  • mw.xml -> verb_step0a.txt done by MWvlex/step0/redo.sh. The file verb_step0a contains records both for simple roots, and also for
    prefixed roots, denominatives, and some other categories.
  • verb_step0a.txt -> verb_cp.txt done by MWvlex/step1/redo.sh. verb_cp pertains only to the simple roots of verb_step0a, and for these it shows a simplification of the class-pada information. One record per mw.xml record.
  • verb_cp.txt -> mwvlex_cp.txt : Aggregates the information of verb_cp.txt to one line per root spelling, and also, for each root, aggregates the class-pada information.
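The third step is a plain group-by on root spelling. A minimal sketch, assuming (this is not confirmed by the thread) that each verb_cp.txt record can be read as `root:cp,cp`, the same shape as the aggregated lines; the input records below are made up for illustration and `aggregate` is a hypothetical name:

```python
from collections import defaultdict


def aggregate(records):
    """Merge 'root:cp,cp' records into one sorted line per root spelling."""
    merged = defaultdict(set)
    for record in records:
        root, _, cps = record.strip().partition(":")
        merged[root].update(cp for cp in cps.split(",") if cp)
    return [f"{root}:{','.join(sorted(merged[root]))}" for root in sorted(merged)]


# Hypothetical input: two MW records for each of two root spellings.
for line in aggregate(["aNk:01A", "aNk:10P", "kaRW:01A,01P", "kaRW:10P"]):
    print(line)
```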

The original <vlex> markup of mw.xml was done by an iterative program-assisted process by me long ago; and thus is subject to errors of one kind or another.

10th gaNa verbs are उभयपदि

mwvlex1_cp.txt reflects adjustment of the pada information for mwvlex class 10 roots to always include P and A.

There were 347 roots with such adjustments, and these are detailed in mwvlex1_cp_log.txt.
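The adjustment itself is small. A minimal sketch, assuming class-pada codes are held as a set of strings per root; `force_class10_ubhayapada` is a hypothetical name, not the routine that produced mwvlex1_cp.txt:

```python
def force_class10_ubhayapada(cps):
    """Return a copy of cps in which any class-10 code implies 10A and 10P."""
    adjusted = set(cps)
    if any(cp.startswith("10") for cp in cps):
        adjusted.discard("10")        # a bare class code is superseded
        adjusted |= {"10A", "10P"}
    return adjusted


print(sorted(force_class10_ubhayapada({"01A", "01P", "10P"})))
```

Roots without any class-10 code pass through unchanged, so the 347 adjusted roots would be exactly those whose set differs before and after the call.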

When compared to verbdata (sanverb1_mwvlex1_cp.txt), approximately 100 additional roots now show identical class-pada information.

The revised summary statistics are:

  • 745 (v. 652 previously) roots where the class-pada information is identical from the two sources
  • 333 (v. 282) roots where the mwvlex1 information extends that of verbdata (sanverb)
  • 41 (v. 121) roots where the verbdata information extends the mwvlex1 information
  • 393 (v. 457) roots where there is incommensurate class-pada information in the two sources
  • 442 (same) roots that appear only in MWvlex
  • 172 (same) roots that appear only in SanskritVerb verbdata.

The increase in the first category, and the decreases in the 3rd and 4th categories make sense.

However, the increase in the 2nd category is surprising.

After further examination, that increase in the 2nd category actually does seem ok.

Consider the example of root kaRW:

01A,10A,10P(sanverb)

01A,01P,10P(mwvlex) (before adjustment)
01A,01P,10A,10P(mwvlex1)  (after adjustment)

comparison of mwvlex to sanverb yields an incommensurate relation before adjustment
comparison of mwvlex1 to sanverb yields a 'subset' relation after adjustment  (mwvlex1 augments
 sanverb by the 01P item).

So kaRW moved from the incommensurate (4th) group to the 2nd group.

Incidentally, on that kaRW example. A check of MW shows that he really does have both padas. He
refers to Westergaard DAtupAWa viii, 11 and 34,40 (class 10 reference).

How does verbdata derive its pada designation ? Is it from the udAtta, anudAtta DAtupAWa section headings?

There are 45 roots (according to sanverb_cp.txt) which show a class10 with only 1 pada :

arTa:10A
kuts:10A
kusm:10A
kuha:10A
ganD:10A
garva:10A
gal:01P,10A
gUr:04A,10A
gfha:10A
cit:01P,10A
juq:06A,06P,10P
qap:10A
tantr:10A
tarj:01P,10A
tUR:10A
das:04P,10A
nizk:10A
pada:10A
piS:06A,06P,10P
puR:06A,06P,10P
bast:10A
Barts:10A
Bal:01A,10A
BrUR:10A
mantr:10A
mid:01A,01P,04A,04P,10P
mfga:10A
yakz:10A
ru:02P,10A
laj:01P,10A
lal:01P,10A
vaYc:01P,10A
vast:10A
viC:06A,06P,10P
vIra:10A
vfz:01P,10A
Sam:04P,10A
SUra:10A
saNgrAma:10A
satra:10A
sTUla:10A
spaS:01A,01P,10A
syam:01P,10A
hast:10A
hizk:10A

Based on the above comment, I expected these to show 'u' for the pada designation in verbdata.txt.
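A list like the one above can be reproduced with a simple filter. A minimal sketch, assuming sanverb_cp.txt holds one `root:cp,cp` line per root as in the lines quoted above; `single_pada_class10` is a hypothetical helper name:

```python
def single_pada_class10(lines):
    """Yield 'root:cps' lines whose class-10 codes cover only one of A or P."""
    for line in lines:
        line = line.strip()
        _, _, cps = line.partition(":")
        padas = {cp[2:] for cp in cps.split(",") if cp.startswith("10")}
        if padas in ({"A"}, {"P"}):
            yield line


sample = ["arTa:10A", "gal:01P,10A", "juq:06A,06P,10P", "aNka:10A,10P"]
print(list(single_pada_class10(sample)))
```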

According to mwvlex1_cp.txt, there are 88 roots that

  • only occur in sanverb (SanskritVerb), and
  • whose spelling ends in short 'a'.

In fact, these comprise all the roots in sanverb_cp.txt that end in 'a'. In other words, none of the
roots in mwvlex_cp end in 'a'.

Further, all of these have class 10 forms.

I expect that most of these are actually alternate spellings of MW roots, the difference being
simply that MW doesn't spell them with that final 'a'.

One way to handle this is to assume that this spelling difference is a convention, and to generate
a sanverb1_cp.txt listing that drops the final 'a' on all these 88. Then, we can compare this list to
mwvlex1_cp.txt, with the expectation that several additional roots will be brought into agreement in
the two systems. This process also should check within sanverb for roots spelled both with and without a final 'a', and merge the class-pada information for the two in any such cases.
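A minimal sketch of that normalization, assuming the sanverb data is a mapping from root spelling to a set of class-pada codes. It drops the final 'a' from every class-10 root ending in 'a' and merges on collision; `drop_final_a` is a hypothetical name:

```python
def drop_final_a(sanverb):
    """Respell class-10 roots ending in short 'a' without it, merging codes."""
    merged = {}
    for root, cps in sanverb.items():
        has_class10 = any(cp.startswith("10") for cp in cps)
        key = root[:-1] if root.endswith("a") and has_class10 else root
        merged.setdefault(key, set()).update(cps)
    return merged


sanverb = {"aNka": {"10A", "10P"}, "aNk": {"01A"}, "aMsa": {"10A", "10P"}}
for root, cps in sorted(drop_final_a(sanverb).items()):
    print(root, ",".join(sorted(cps)))
```

The check is case-sensitive, so roots ending in long 'A' are left alone.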

How does verbdata derive its pada designation ? Is it from the udAtta,
anudAtta section heading?

It seems that Mihail hand coded it. I am not sure. I started with an excel
file of dhaatupaatha at starting, but there were some corrections. So made
a PHP array and made corrections thereto. But it seems that the pada
designation were added by hand. Doesn't seem to be based on udAtta,
anudAtta etc.

There are 45 roots (according to sanverb_cp.txt) which show a class10
with only 1 pada :

Oh. I forgot to mention two major exceptions to the general rule of
ubhayapadi of 10th gaNa.
आगर्वादात्मनेपदिनः
आकुस्मादात्मनेपदिनः
There are two sets in the 10th gaNa which are explicitly mentioned to be
Atmanepadi.
They have been enumerated in function.php as separate lists.
A test check of 45 entries shows that these are from these two exceptional
categories.

So the new rule of thumb is
10th gaNa are ubhayapadi, except AgarvIya and AkusmIya which are Atmanepadi.
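Expressed as code, the rule is a lookup with a default. A minimal sketch; the function name is hypothetical, and the three spellings in the sample set are taken from the 10A-only list earlier in this thread and are only assumed here to belong to the two exception lists:

```python
def class10_padas(root, atmanepadi_roots):
    """Return the class-10 codes for a root under the rule of thumb above."""
    if root in atmanepadi_roots:
        return {"10A"}
    return {"10A", "10P"}


# Stand-in for the union of the AgarvIya and AkusmIya lists (assumed members).
atmanepadi = {"garva", "kusm", "mantr"}

print(sorted(class10_padas("garva", atmanepadi)))
print(sorted(class10_padas("aNk", atmanepadi)))
```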

Sorry for earlier faulty statement.

One way to handle this is to assume that this spelling difference is a convention, and to generate a sanverb1_cp.txt listing that drops the final 'a' on all these 88. Then, we can compare this list to mwvlex1_cp.txt, with the expectation that several additional roots will be brought into agreement in the two systems.

Seems logical. But reverse should also be possible. You will also like to
provide a reverse entry into sanverb database if you want to see aNka
forms.

Regarding the list of 45:
There is a small number therein which show only Parasmaipada in the class 10.

juq:06A,06P,10P
piS:06A,06P,10P
puR:06A,06P,10P
mid:01A,01P,04A,04P,10P
viC:06A,06P,10P

What about these?

As to the others of the 45, I will assume that they are Atmanepada only. Will revise mwvlex1 accordingly (TODO).

As to the others of the 45, I will assume that they are Atmanepada only. Will revise mwvlex1 accordingly

Let us not assume. Let us verify. There may have been a stray case of Atmanepadi fault also, like parasmaipadis shown above. Parasmaipadis seem to be errors.
AgarvIya.txt and AkusmIya.txt
are the files generated from function.php via extract.php for these two special verb sets.

As regards parasmaipadi verbs in 10th class, here is the analysis.
$verbdata is a kind of concordance. It has been manually collected. Sometimes some verbs that are seen in some other dhAtupATha are also mentioned in this. So treat earlier four as superfluous. I am not able to find their corresponding entries in माधवीयधातुवृत्तिः / धातुप्रदीपः / क्षीरतरङ्गिणी. But even then, they should be ubhayapadi.

juq:06A,06P,10P -> Not found in traditional dhAtupAThas.
piS:06A,06P,10P -> Not found in traditional dhAtupAThas.
puR:06A,06P,10P -> Not found in traditional dhAtupAThas.
mid:01A,01P,04A,04P,10P -> Not found in traditional dhAtupAThas.
viC:06A,06P,10P -> Should be ubhayapadi.

All made ubhayapadi in $verbdata via drdhaval2785/SanskritVerb@c5d7d53.

Let us verify

By this do you mean to use the two lists AgarvIya.txt and AkusmIya.txt ?

OR

Have you already separately checked that the updated $verbdata in function.php is now correct with regard to those two Atmanepadi lists?

Regarding sanverb_mwvlex1_cp.txt

It seems that the logic was not completely implemented.

SANVERB CPS < MWVLEX1 CPS
Case 01: aNk has 01A(sanverb) < 01A + 10A,10P(mwvlex1)

STEM IN SANVERB ONLY
Case 02: aNka:10A,10P in sanverb

When we read these both together i.e. when we disregard the terminal 'a' in sanverb, the final output would be
aNk 01A,10A,10P in both the databases.

I guess this normalization has not been factored in yet.
This will give some further reductions in differences.

When this is correctly implemented - there will be decrease in other discrepancies too.
e.g.

STEM IN MWVLEX1 ONLY
Case 02: aMs:00 in mwvlex1

STEM IN SANVERB ONLY
Case 01: aMsa:10A,10P in sanverb

These will be similar now.

@funderburkjim
aMs:00
Does it mean that MW is silent regarding gaNa / pada ?

I could see something like
Case 374: samaya:00P in mwvlex1

Does it mean that it is parasmaipadi, but gaNa is not specified in MW?

In such cases, it would be safe to incorporate the gaNa / pada of SanVerb into MWvlex.
That would reduce the statistics in 393 cases of DIFFERENT CPS
e.g.

Case 03: am has 01P,10A,10P(sanverb) != 00A(mwvlex1)
Case 04: amB has 01A(sanverb) != 00(mwvlex1)
Case 05: ay has 01A,01P(sanverb) != 00(mwvlex1)

There are also cases where the gaNa is specified but pada is not specified. In such cases also it is safe to take the pada from sanverb e.g.

Case 33: kIl has 01P(sanverb) != 01(mwvlex1)
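The codes in these cases can be split mechanically. A minimal sketch, assuming every code is a two-digit class followed by an optional pada letter, with '00' standing for an unspecified class; `parse_cp` is a hypothetical name:

```python
def parse_cp(code):
    """Split a class-pada code into (class, pada); None marks a silent field."""
    gana, pada = code[:2], code[2:]
    return (None if gana == "00" else gana, pada or None)


for code in ("00", "00P", "01", "10A"):
    print(code, parse_cp(code))
```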

Now the steps needed are

  1. Fetch the new $verbdata from function.php (There were some corrections of duplicates in #32 (comment)).
  2. Regenerate the verbdata.txt
  3. Account for AgarvIya / AkusmIya #34 (comment)
  4. Account for terminal 'a' in sanverb #34 (comment)
  5. Account for places where MWvlex is silent on gaNa / pada. #34 (comment)
  6. Regenerate the statistics.

Then, we will have a much smaller workable data set.

Agree with your interpretation involving '00', '00P', etc. in mwvlex. With the possible caveat of
coding errors or typos in MW, or errors in extraction via MWvlex/step0, these indeed are cases
where MW is silent on gaRa, pada or both.

I agree that a good procedure in these cases is to have mwvlex adopt the verbdata information (assuming there is an unambiguous matching of root spellings).

The accomplishment of these steps is next on my agenda.

@drdhaval2785 Questions regarding the term iDAgama.

From verbdata.txt, I see that the value of the iDAgama field is sew (with 'i') or aniw (without 'i').

  • why no vew (either with or without i) ?
  • One meaning of Agama in MW is a grammatical augment, a meaningless syllable or letter inserted in any part of the radical word which seems relevant.
    • How to interpret the 'D' in iDAgama ? If it were spelled (in slp1) iqAgama that would seem
      a correct sandhi joining of iw and Agama, but the D is puzzling.

In AgarvIya.txt, sTUla appears twice. Since there is only one record in verbdata.txt with sTUla, I assume
this duplication has no significance, and have removed the second one.

This applies to item 4 in Dhaval's list above, roots with 'a'

roots_a.txt itemizes 63 cases where

  • there is a root in sanverb whose spelling ends in 'a', and
  • there is a corresponding root in mwvlex whose spelling is the same but without the ending 'a'.
    [ There are no roots in mwvlex_cp that already end in 'a']

In 28 of these cases, sanverb also has the root without the 'a', like

aNka:10A,10P#aNk:01A#aNk:01A,10P

(first field is sanverb_cp root with 'a', 
2nd field is sanverb root without 'a' (or empty if not in sanverb),
 third field is the corresponding mwvlex_cp root).

aMsa:10A,10P##aMs:00
is an example where there is no without-'a' form in sanverb.
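A minimal sketch of reading one roots_a.txt record, assuming exactly the three '#'-separated fields described above, each either empty or of the form `root:cp,cp`; `parse_roots_a` is a hypothetical name:

```python
def parse_roots_a(line):
    """Split one roots_a.txt record into three (root, cps) pairs or None."""
    def parse_field(field):
        if not field:
            return None          # empty middle field: no without-'a' form
        root, _, cps = field.partition(":")
        return root, set(cps.split(","))

    with_a, without_a, mw = (parse_field(f) for f in line.strip().split("#"))
    return with_a, without_a, mw


print(parse_roots_a("aMsa:10A,10P##aMs:00"))
```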

I think that consistency of verbdata and the two lists AkusmIya.txt and AgarvIya.txt needs to be
established before I proceed here.

To that end, here is some data.

The 10 roots from AgarvIya look fine.

Here are roots from AkusmIya that may be problems:

  • These roots have no class 10 records in verbdata
    • yu, gf, SaWa!, mada! , vida!, truwa!
  • These roots have a 10 A record AND a 10 u record
    • spaSa!, qipa! , divu!, daSi!, kuRa!, lakza! , kuwWa!, mAna!-
  • smaya! - not in verbdata, nor is smaya
  • These roots are not in verbdata, but the spelling without ! is in verbdata, with both 10A and 10u
    • vikza!, kUwa!

@drdhaval2785 Will wait for your resolution of these.

From verbdata.txt, I see that the value of the iDAgama field is sew (with 'i') or aniw (without 'i').
why no vew (either with or without i) ?

To be frank, I don't know why this is the case. But the database I received from Mihail seems to have these two categories only.
I do some preprocessing to get the correct iDAgama. It is not too difficult. I reproduce the relevant portion of the code here. As you can see in the code snippet, I take the iDAgama data from $verbdata only when there is no Paninian sutra to the contrary. This takes care of 'vew' Agama as far as my derivations are concerned. There are not many sUtras (in fact only four) which specify a 'vew' Agama.

    /* idAgama decision */
    if (in_array($lakAra,array("lfw","lfN","luw","ASIrliN","luN","liw","ArDaDAtukalew"))||$san===1) // checking whether ArdhadhAtuka lakAra or not.
    {
        /* smipUGraJjavazAM sani (7.2.74) */
        if ( in_array($fo,array("zmiN","f","pUN","aYjU!","aSU!")) && $san===1)
        {
            $id_dhAtu="sew";
            gui2('7.2.74');
        }
        /* kirazca paJcabhyaH (7.2.75) */
        elseif ( in_array($fo,array("kF","gF","DfN","dfN","praCa!")) && $san===1)
        {
            $id_dhAtu="sew";
            gui2('7.2.75');
        }
        /* iT sani vA (7.2.41) */
        elseif ( (in_array($fo,array("vfN","vfY")) || preg_match('/F$/',$verb_without_anubandha) ) && $san===1)
        {
            $id_dhAtu="vew";
            gui2('7.2.41');
        }
        /* sanIvantardhabhrasjadambhuzrisvRyUrNubharajJapisanAm (7.2.49) */
        elseif ( (in_array($fo,array("fDu!","Brasja!","damBu!","SriY","svf","yu","UrRuY","quBfY","jYapa!","zana!")) || preg_match('/iv$/',$verb_without_anubandha) ) && $san===1)
        {
            $id_dhAtu="vew";
            gui2('7.2.49');
        }
        /* tanipatidaridrANAmupasaGkhyAnam (vA) */
        elseif ( in_array($fo,array("tanu!","patx!","daridrA"))  && $san===1)
        {
            $id_dhAtu="vew";
            gui2('7.2.49');
        }
        /* sani grahaguhozca (7.2.12) */
        elseif ($san===1 && (preg_match('/[uUfFx]$/',$verb_without_anubandha)||$fo==="graha!"||$fo==="guhU!") && $fo!=="UrRuY")
        {
            $id_dhAtu="aniw";
            gui2('7.2.12');
        }
        elseif (anekAca($verb_without_anubandha) || $yaG===1 || $sanAdi==="Ric" )
        {
            $id_dhAtu="sew";
            gui2('seTverb');
        }
        /* svaratisUtisUyatidhUJUdito vA (7.2.44) */
        elseif (in_array($fo,array("svf","zUN","DUY")) || in_array($fo,$Uditverbs))
        {
            $id_dhAtu="vew";
            gui2('7.2.44');
        }
        /* RddhanoH sye (7.2.70) */
        elseif ( (ends(array($verb_without_anubandha),array("f",),1) || $fo==="hana!") && in_array($lakAra,array("lfw","lfN")) )
        {
            $id_dhAtu="sew";
            gui2('7.2.70');
        }
        /* se'sici kRtacRtacCRdatRdanRtaH (7.2.57) */
        elseif (in_array($fo,array("kftI!","cfta!","Cfda!","tfda!","nfta!","nftI!","u!Crdi!r")) && (in_array($lakAra,array("lfw","lfN")) || $san===1) )
        {
            $id_dhAtu="vew";
            gui2('7.2.57');
        }
        /* gameriT parasmaipadeSu (7.2.58) */
        elseif ( in_array($fo,array("gamx!",))  && (in_array($lakAra,array("lfw","lfN")) || $san===1 ) && ($verbpada==="p"||$vsuf==="yak"))
        {
            $id_dhAtu="sew";
            gui2('7.2.58');
        }
        /* na vRdbhyazcaturbhyaH (7.2.59) */
        elseif ( $verbset==="BvAdi" && in_array($fo,array("vftu!","vfDu!","SfDu!","syandU!",)) && (in_array($lakAra,array("lfw","lfN")) || $san===1 ) )
        {
            $verbpada="u";
            $id_dhAtu="aniw";
            $suffix = $tiG;
            gui2('7.2.59');
        }
        /* tAsi ca klRpaH (7.2.60) */
        // sakArAdi. tAsi done elsewhere.
        elseif ( in_array($fo,array("kxpa!",)) && (in_array($lakAra,array("lfw","lfN")) || $san===1 ) && $verbpada==="p")
        {
            $id_dhAtu="aniw";
            gui2('7.2.60');
        }
        /* radhAdibhyazca (7.2.45) */
        elseif (in_array($fo,array("raDa!","RaSa!","tfpa!","dfpa!","druha!","muha!","zRuha!","zRiha!")) && $verbset==="divAdi")
        {
            $id_dhAtu="vew";
            gui2('7.2.45');
        }
        /* niraH kuSaH (7.2.46) */
        elseif (in_array($fo,array("kuza!")) && $us==="nis" )
        {
            $id_dhAtu="vew";
            gui2('7.2.46');
        }
        elseif (verb_itfinder($first)===array("sew"))
        {
            $id_dhAtu="sew";
            gui2('seTverb');
        }
        elseif (verb_itfinder($first)===array("aniw"))
        {
            $id_dhAtu="aniw";
            gui2('7.2.10');
        }
    }
    else
    {
        $id_dhAtu="";
    }

How to interpret the 'D' in iDAgama ? If it were spelled (in slp1)

Lines 19 and 20 of function.php (Comment section) have the key.

  • The description part uses Howard Kyoto protocol.
  • The coding uses SLP1 transliteration.

There may be some places where there is some ambiguity, but mostly I have followed this convention.
Therefore, iDAgama is in HK protocol.
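To make the two conventions concrete, here is a minimal sketch of a converter from the HK-style spelling used in the descriptions to SLP1. The table covers only the letters that differ between the two schemes, uses longest-match-first tokenization, and ignores ambiguous letter sequences; it is an illustration, not code from function.php:

```python
# Letters that differ between the two schemes; everything else is unchanged.
HK_TO_SLP1 = {
    "lRR": "X", "RR": "F", "lR": "x", "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J", "Th": "W", "Dh": "Q",
    "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "R": "f", "G": "N", "J": "Y", "T": "w", "D": "q", "N": "R",
    "z": "S", "S": "z",
}


def hk_to_slp1(text):
    """Convert text to SLP1, trying 3-, 2-, then 1-letter tokens."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in HK_TO_SLP1:
                out.append(HK_TO_SLP1[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])   # same letter in both schemes
            i += 1
    return "".join(out)


print(hk_to_slp1("iDAgama"), hk_to_slp1("seT"), hk_to_slp1("aniT"))
```

With this table, iDAgama comes out as iqAgama, and seT and aniT come out as the sew and aniw values seen in verbdata.txt.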

sTUla

I agree

aNka:10A,10P#aNk:01A#aNk:01A,10P

Here, per dhAtupATha, there are two verbs 'aNka' of 10th gaNa (अ॑ङ्क॑ - पदे लक्षणे च, चुरादि १०.०४७३) and 'aki!' of 1st gaNa (अ॑किँ॒ - लक्षणे, भ्वादि ०१.००९२).
sanverb takes care of them as separate verbs.
When it comes to MW, it is handled as per morphology of the verb without anubandha (which is same 'aNk' in both cases). Both of them are handled under the same headword.
capture
Therefore, there is no issue in merging these two separate verbs if we are trying to match MW.

I have verified every entry, and the data in Sanskrit Verb has corresponding verbs in other gaNas also.
So, there is nothing to be worried about the amalgamation.

@funderburkjim
Whenever I am talking about Mihail's database, I am talking about this file - https://github.com/drdhaval2785/SanskritVerb/blob/master/Data/Panini-dhatu-index1.xlsx.
$verbdata is derived from this file.
I am not sure how I got it. It was mostly from Marcis / Shalu. I am not sure.
But the database was very nearly perfect. So I decided to make a replica of it and then make corrections in the replica ($verbdata).

yu, gf, SaWa!, mada! , vida!, truwa! not in 10th set.

The AgarvIya and AkusmIya lists were hand coded, and not derived from $verbdata itself. There are some errors in the $verbdata, which missed the 10th set for these verbs.
Corrected them.
It seems to be a coding issue. This kind of error is seen only in 10th class verbs, and only when there are more than two verbs having the same 'verb without anubandha' form. Will need to correct it programmatically for other such uncaught verbs of 10th gaNa.

These roots have a 10 A record AND a 10 u record
spaSa!,

It seems to be a problem in analysis. There are only two entries of spaSa! in $verbdata. Both are reproduced here.
It is 10A and 01u (Not 10u).
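The class and pada can be read off each record by position. A minimal sketch, assuming the colon-separated field order seen in the two spaSa! records quoted in this comment; the field names are hypothetical labels, not names used in function.php:

```python
from typing import NamedTuple


class VerbRecord(NamedTuple):
    root: str        # root with anubandha, e.g. 'spaSa!'
    meaning: str
    bare_root: str   # root without anubandha
    gana: str        # two-digit class
    pada: str        # 'A', 'u', ...
    idagama: str     # 'sew' or 'aniw'


def parse_verbdata(line):
    """Pick the fields of interest out of one quoted verbdata record."""
    f = line.strip().strip('"').split(":")
    return VerbRecord(f[0], f[1], f[2], f[3], f[5], f[6])


records = [
    '"spaSa!:grahaRasaMSlezaRayoH:spaS:10:0200:A:sew:स्प॑शँ॒:1256:1275:1309:spaS2_spaSaz_curAxiH+grahaNasaMSleRaNayoH:"',
    '"spaSa!:bADanasparSanayoH:spaS:01:1032:u:sew:स्प॑शँ॒॑:558:579:590:spaS1_spaSaz_BvAxiH+bAXanasparSanayoH:"',
]
for rec in map(parse_verbdata, records):
    print(rec.bare_root, rec.gana + rec.pada)
```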

"spaSa!:grahaRasaMSlezaRayoH:spaS:10:0200:A:sew:स्प॑शँ॒:1256:1275:1309:spaS2_spaSaz_curAxiH+grahaNasaMSleRaNayoH:"
"spaSa!:bADanasparSanayoH:spaS:01:1032:u:sew:स्प॑शँ॒॑:558:579:590:spaS1_spaSaz_BvAxiH+bAXanasparSanayoH:"

qipa!

There are two separate verbs डि॑पँ॒ - सङ्घाते, चुरादि १०.००९८ (आकुस्मीय) - आत्मनेपदी.
डि॑पँ॑ - क्षेपे, चुरादि 10.0145 (not आकुस्मीय) - उभयपदी.

divu!

दिवु परिकूजने - आत्मनेपदी
दिवु अर्जने - उभयपदी

daSi!

दशि दर्शने - आत्मनेपदी
दशि भाषार्थः - उभयपदी

kuRa! (There doesn't seem to be a kuRa!. It should be kURa!)

Separate verbs.
[श्रावणे निमन्त्रणे च] सङ्कोचनेऽपि - उभयपदी
सङ्कोचने - आत्मनेपदी

lakza!

दर्शनाङ्कनयोः - उभयपदी
आलोचने - आत्मनेपदी

kuwWa! (कुट्ट) not कुट्ठ

प्रतापने - आत्मनेपदी
छेदनभर्त्सनयोः - उभयपदी

mAna!-

स्तम्भे - आत्मनेपदी
पूजायाम् - उभयपदी

Therefore, the data in sanverb regarding this problem is correct.
Nothing to be updated.

smaya! - not in verbdata, nor is smaya

The book from which I fetched the AkusmIya list has print error here. (Ashtadhyayi sahajabodha of Pushpa Dikshit).
capture

Whereas the correct root is 'syama!'
capture

Corrected in verbdata.

vikza!

vizka changed to vizka! for 10.0207

kUwa!

kUwa changed to kUwa! for 10.0225

@drdhaval2785 Will wait for your resolution of these.

@funderburkjim
The corrections are made and documented above.
drdhaval2785/SanskritVerb@f699092 commit made the changes to $verbdata and its derivatives.

Now you can move further in the track.

Seems I've missed a lot of fun, guys. It's because I do not receive notifications on Jim's repo and he did not let us know what is going on. 💃

This phenomenon is seen only in 10th gaNa 'a' ending verbs like aMsa, aNka, aNga

Yeah, but there are quite many of this kind.

So, wherever there is 10th gaNa verb, it has to have both 'A' and 'P'. Therefore, mwvlex misses out on this detail.

That means not only mwvlex missed it, but also MW source?

assume that this spelling difference is a convention

Fully agree.

anudAtta section heading?

It seems that Mihail hand coded it. I am not sure. I started with an excel
file of dhaatupaatha at starting, but there were some corrections. So made
a PHP array and made corrections thereto. But it seems that the pada
designation were added by hand. Doesn't seem to be based on udAtta,
anudAtta etc.

Not sure if Dhaval understood Jim's question.

Howard Kyoto

Minor detail. It is Harvard, not Howard. (Harvard University and Kyoto University)

spaSa!

I misread the log file; agree that spaSa! is not an issue; it is 10A and 01u, as you say.

As I understand your explanation of the other cases ( qipa! , divu!, daSi!, kuRa!, lakza! , kuwwa!, mAna!),
where both 10A and 10u appear in verbdata: it is just a feature of the dhatupatha that there are two distinct entries with class 10 and different pada designations for these roots.

As far as the mw-based pseudo-dhatupatha that I am using as a basis for constructing conjugations
is concerned, these two forms are not distinguished, so the class-pada information for, say, qip, would be 10A,10P since
one of the two forms of qipa! is 10u.
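A minimal sketch of that condensation, assuming each dhatupatha entry has been reduced to a (root-without-anubandha, class, pada) triple and that pada 'u' expands to both A and P; `condense` is a hypothetical name:

```python
def condense(entries):
    """Merge (bare_root, class, pada) entries, expanding pada 'u' to A and P."""
    table = {}
    for bare_root, gana, pada in entries:
        padas = "AP" if pada == "u" else pada
        table.setdefault(bare_root, set()).update(gana + p for p in padas)
    return table


# The two class-10 entries of qipa!, one Atmanepadi and one ubhayapadi.
entries = [("qip", "10", "A"), ("qip", "10", "u")]
print(sorted(condense(entries)["qip"]))
```

The distinction between the two entries (and their meanings) is lost in the merged set, which is the trade-off accepted here.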

It remains to be seen whether this (pysan) condensation of a 'real' dhatupatha to the 'pseudo' one will run into
more serious obstacles. But this obstacle doesn't seem serious to me.

That means not only mwvlex missed it, but also MW source?

Right. The default understanding is that MW has incomplete class-pada information in some cases.

There has not been hand-checking of MW scans in all these cases, so it is possible that a few cases will be
due to mistakes in the digitization of and/or programmatic interpretation of the digitization.

re: sTUla

@drdhaval2785 Suggest you remove the duplicate in $AgarvIya array of function.php.

After rerunning check of class10s for revised verbdata,
only two questions remain, in AkusmIya:

  • 'gf' of AkusmIya - has no class 10 form
  • SaWa!: Its class 10 form is marked as 'u' (should be 'A'?)

Further observations, but not problems:

  • 29 of AkusmIya roots have exactly 1 class10 record in verbdata, and pada is 'A' (as expected)
  • All 10 of AgarvIya have this desired property
  • 7 of AkusmIya have two class 10 records, one with 'A' pada, and one with 'u' pada. These have been
    vetted by Dhaval above. The roots are:
    • qipa! divu! daSi! kURa! lakza! kuwwa! mAna!

qip, would be 10A,10P since
one of the two forms of qipa! is 10u.

Fair enough, as there would be A and pa in at least one verb.

But this obstacle doesn't seem serious to me.

Not to me either.

'gf' of AkusmIya - has no class 10 form

It is a print error in my book. The AkusmIya one is 'gF' (capital F). Corrected to gF.

SaWa!: Its class 10 form is marked as 'u' (should be 'A'?)

Done.

Have now carried out the 6-fold way suggested by Dhaval above, and based on latest update to SanskritVerb.

The updated programmatic comparison between root-class-pada information from SanskritVerb and from MWvlex is now in file sanverb1_mwvlex1_cp.txt.

Here are summary statistics from the comparison:

  • 1013 (v. 652) roots where the class-pada information is identical from the two sources
  • 358 (v. 282) roots where the mwvlex information extends that of verbdata (sanverb)
  • 53 (v. 121) roots where the verbdata information extends the mwvlex information
  • 122 (v. 457) roots where there is incommensurate class-pada information in the two sources
  • 408 (v. 442) roots that appear only in MWvlex
  • 107 (v. 172) roots that appear only in SanskritVerb verbdata.

The numbers in parentheses (like (v. 652)) are the numbers from the first comparison above.

Here are the sources of changes thus far:

  • Small number of changes to SanskritVerb (per various of Dhaval's comments above)
  • sanverb1_cp contains the root-class-pada information aggregated from the verbdata file of SanskritVerb. The aggregation is done on the basis of the 'root-without-anubandha' spelling of the roots with one adjustment:
    • The adjustment is to the class10 roots (a) whose spelling has the pattern Xa (ends in 'a') and (b) for which the spelling 'X' is a root spelling in MWvlex (per mwvlex_cp.txt). For this group of 63 roots (see roots_a for the list), the sanverb root spelling is changed to 'X' (the 'a' is dropped). For about half of these (example aNka), there already existed a SanskritVerb root whose spelling was 'X' (e.g., aNk is also a root in sanverb). In these cases, the class-pada information of the Xa and X forms from sanverb was merged into the new X spelling for sanverb.
  • mwvlex1_cp contains the root-class-pada information from MWvlex, with two adjustments, each taking into account the class-pada information from sanverb. So, consider a particular root of
    MWvlex which also occurs as a root in sanverb1_cp (i.e., as a root-without-anubandha, with the
    adjustment noted above).
    • If this root has class 10 forms in mwvlex and class 10 forms in sanverb, then the class 10 forms of sanverb are preferred. Since verbdata is now consistent with AkusmIya and AgarvIya lists, we are now
      taking into account the special class10 Atmanepada injunction for roots in these two lists.
    • If this root has class 10 forms in mwvlex, but no class 10 forms in sanverb, then the default class 10 ubhayapada is applied (i.e., both 10A and 10P class-padas appear for this root).
    • Similarly, if the class10 root appears only in mwvlex, then ubhayapada applies for class 10 forms.
  • For some of the roots in MWvlex (refer to mwvlex_cp), there is missing class or pada data.
    • If there is missing class information, (example aMs), then the class-pada information from the corresponding root of sanverb1_cp (if the root is found here) is assigned to that root in mwvlex1.
    • If the mwvlex root has class information without pada information, then the pada information for that class in sanverb1_cp (if available) is used in mwvlex1.
      • An example is root akz. The given class-pada information from mwvlex is 01,05 (first class and
        5th class, no pada information). Sanverb information is '01P'. The final mwvlex1 information is
        01P,05.
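The two fill-in rules for missing data can be sketched as follows. This is a simplified illustration under stated assumptions, not the code that produced mwvlex1_cp.txt: codes are sets of strings, '00' marks a missing class, a two-character code is a class without pada, and `fill_missing` is a hypothetical name:

```python
def fill_missing(mw_codes, sanverb_codes):
    """Fill silent class or pada fields in MW codes from sanverb codes."""
    if not sanverb_codes:
        return set(mw_codes)              # root not found in sanverb1_cp
    if all(code.startswith("00") for code in mw_codes):
        return set(sanverb_codes)         # no class in MW: adopt sanverb's
    filled = set()
    for code in mw_codes:
        if len(code) == 2 and code != "00":
            # Class without pada: take sanverb's padas for that class, if any.
            matches = {s for s in sanverb_codes
                       if s.startswith(code) and len(s) > 2}
            filled |= matches or {code}
        else:
            filled.add(code)
    return filled


print(sorted(fill_missing({"01", "05"}, {"01P"})))       # the akz example
print(sorted(fill_missing({"00"}, {"10A", "10P"})))      # the aMs example
```

Class 05 of akz stays without a pada because sanverb has nothing for that class, which matches the remaining gaps noted for mwvlex1_cp.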

The changes made in going from mwvlex to mwvlex1 are itemized in mwvlex1_cp_log.txt.
As the akz example shows, there is still some missing class-pada information for a few roots in mwvlex1_cp (121 roots have missing class information; 119 roots have missing pada information).

I think discussion in this issue is probably at a good pause point, and should be considered finished. Let further adjustments to the root-class-pada story for SanskritVerb and MWvlex be continued in another issue.