Document and Publish
In 2021, let's consider:
a) Creating a good document explaining the software design.
b) Publishing on arxiv.org and in a relevant conference/publication if feasible.
There are gaps (like a good test suite, and a better way to measure metrics, as in #84).
Thoughts, @avinashvarna @vvasuki?
Chuck the "good document". Use a website instead (e.g. https://jyotisham.github.io/jyotisha/software/ , hosted from https://github.com/jyotisham/jyotisha/tree/master/hugo-source ) for the design, and a readthedocs-type service for the API docs (a minimal config sketch is below).
For propaganda via conference/publication, you anyway need a separate doc (with some content copied from the site).
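To make the readthedocs suggestion concrete, a minimal Sphinx `docs/conf.py` could look roughly like the sketch below; the file layout, theme, and extension choices are assumptions for illustration, not anything the repo currently has.

```python
# docs/conf.py -- minimal Sphinx configuration sketch for hosting API
# docs on a readthedocs-type service. Illustrative only: the docs/
# layout, theme, and extension list are assumptions, not the repo's
# current setup.
project = "sanskrit_parser"

extensions = [
    "sphinx.ext.autodoc",    # pull API docs from package docstrings
    "sphinx.ext.napoleon",   # support Google/NumPy docstring styles
    "sphinx.ext.viewcode",   # link rendered docs to highlighted source
]

# Requires the sphinx_rtd_theme package on the doc build environment.
html_theme = "sphinx_rtd_theme"
```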
Good idea! More documentation is always better. I am not very familiar with the Sanskrit NLP conferences, but if there is a novel component in our implementation, then we can submit it.
- I don't think @kmadathil meant a word doc, and I am fine with wiki/website as appropriate.
- Regarding metrics, I read a few papers published recently on this topic. Several use the SandhiKosh corpus described in this paper as a benchmark. I will open an issue to add it to our testing and use the same metrics (a rough sketch of the kind of accuracy check is below).
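As a rough illustration of that accuracy check, something along these lines could work; the benchmark format (a joined form plus its expected split) and the `split_word` wrapper are assumptions for the sketch, not SandhiKosh's actual file format or our current API.

```python
# Rough sketch of a SandhiKosh-style split-accuracy metric.
# Assumes the benchmark is a list of (joined_form, expected_split) pairs
# and that split_word(text, limit) returns candidate splits from our
# splitter -- both are illustrative names, not the real interfaces.

def split_accuracy(benchmark, split_word, limit=10):
    """Fraction of entries whose expected split appears among the
    top `limit` candidate splits produced by the splitter."""
    correct = 0
    for joined_form, expected_split in benchmark:
        candidates = split_word(joined_form, limit=limit)
        if expected_split in candidates:
            correct += 1
    return correct / len(benchmark)

# Toy usage (hypothetical data and splitter):
# benchmark = [("tacchivaH", ["tat", "SivaH"])]
# print(split_accuracy(benchmark, my_splitter))
```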
An online document/web-page is fine, but we're at a sufficiently advanced stage to document our software design. Once it's documented, we can explore exactly how novel it is. I suspect at least part of it must be, since there aren't many competitors that start from any string and provide a grammatical parse, splitting sandhi along the way. (Are there any that do all of this?)
Avinash, can you post some of the recent literature on sandhi that you read? Thanks for posting SandhiKosh; we should migrate our test infrastructure to it as a first step.
Don't the INRIA and UoH tools provide a parse along with the splits? A search also found this link, but I couldn't find an interface/source - https://www.appliedsyntax.com/sanskrit-parsing
Most of the recent literature has focused on neural-network-based approaches. The papers on this list that use SandhiKosh as the benchmark, for example, are a good place to start. The latest paper I read was actually posted on the sanskrit-programmers list - https://www.linkedin.com/posts/prathosh-ap-phd-50ab9511_preprint-of-sandhi-paper-ugcPost-6716237804055199744-JBDu/
It would be good to survey the alternatives. UoH provides a parse, if you go by their publications. I haven't tried INRIA's parser. A good test set for parses (analogous to SandhiKosh) would give us a way to compare ourselves and look for areas of improvement.
From what I could read:
- UoH does not start from an unsplit sentence - they assume a sandhi-split sentence. However, their graph-based approach is most akin to our vakya analyzer.
- SHR (Sanskrit Heritage Reader - Goyal and Huet 2016) splits sentences, but does not do sentence analysis.
- IIT-Kgp (Pawan Goyal & co.) do splitting and joint splitting/morphological analysis, but using NN methods. They have the best results.
If we can grab the DCS10K dataset used by the IIT-Kgp papers, we can compare against them.
> I am not very familiar with the Sanskrit NLP conferences

There are only two around.
I have updated the documentation with more information about our internals.
> a) Creating a good document explaining the software design.

Installation instructions would be a good place to start.
@gasyoun - Install is as simple as `pip install sanskrit_parser`, which is documented.
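For anyone following along, the documented install plus a first call looks roughly like the sketch below; the `Parser` class, `split()` method, and their arguments are written from memory and may not match the current API exactly, so treat them as assumptions and check the package docs.

```python
# Sketch of installing and invoking the splitter. Run
# `pip install sanskrit_parser` first. The class/method names below are
# recalled from the project README and may have changed; this is not
# authoritative API documentation.
from sanskrit_parser import Parser

parser = Parser(output_encoding="SLP1")
# Ask for up to 10 candidate sandhi splits of an unsegmented string
for split in parser.split("astyuttarasyAMdiSi", limit=10):
    print(split)
```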
If your concern is about gensim's MS VC++ dependency, please take it up with them. Your alternative is to run Linux (which I recommend), but you'll need to install build-essential there as well. This is a gensim requirement, not one of our code's.
We will explore alternatives to gensim.
> Your alternative is to run Linux (which I recommend)

To split 4000 samasas? No, thanks. I will have to find another alternative.
> > Your alternative is to run Linux (which I recommend)
>
> To split 4000 samasas? No, thanks. I will have to find another alternative.

Since I like to keep track of such things, it'd be useful if you notify us of any alternative you find (preferably accompanied by a brief review).