goru001/inltk

telugu support

soumith opened this issue · 22 comments

Hey, great repository.
I'd like to add Telugu support. If there is a framework I should follow to download Telugu Wikipedia and train on it, I'd love some instructions so I can get going.

Thanks for the initiative!
I had a look at the Telugu Wikipedia homepage, and it looks like it does not have all of its pages indexed alphabetically on the homepage the way some other languages do. I faced a similar issue with Marathi, so the notebooks I used to scrape Marathi Wikipedia should be quite useful. So:

  1. Use this notebook to get the links to all the Telugu Wikipedia articles. It starts collecting article links from this page, then moves to the next page, collects from there, and so on. It keeps going for as long as it is able to add more article links, and stops once it can't. You should be able to get all the Telugu article links just by changing the starting page to this.
  2. Then use this notebook to scrape the articles corresponding to the URLs you saved in step 1. I don't think you will need to make any changes to this notebook, because article pages have the same structure irrespective of the language.
  3. Once you have the Wikipedia articles dataset, you can use this notebook as a reference to train the LM. Training the LM requires tokenization, for which I've been using sentencepiece; you can use the notebook here as a reference for that.
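The link-collection loop from step 1 can be sketched roughly as below. This is not the notebook's code, only a minimal standard-library illustration of the idea: collect article links from a listing page, follow the "next page" link, and stop once a page adds nothing new. The names `collect_article_links`, `is_article_href` and `next_label` are hypothetical, and both the `/wiki/` filter and the "Next page" link text are assumptions that would need adjusting for the Telugu listing pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collects (href, link text) pairs for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_html(url):
    """Download a page and decode it as UTF-8."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def is_article_href(href):
    # Assumption: article links look like /wiki/<title>, and a colon in the
    # title marks a non-article namespace (Special:, Help:, ...).
    prefix = "/wiki/"
    return href.startswith(prefix) and ":" not in href[len(prefix):]


def collect_article_links(start_url, fetch=fetch_html,
                          next_label="next page", max_pages=100000):
    """Walk listing pages from start_url and return unique article URLs."""
    seen, ordered, visited = set(), [], set()
    url = start_url
    for _ in range(max_pages):
        if url is None or url in visited:
            break
        visited.add(url)
        parser = _LinkParser()
        parser.feed(fetch(url))
        added, next_url = 0, None
        for href, text in parser.links:
            if not href:
                continue
            absolute = urljoin(url, href)
            if text.lower().startswith(next_label):
                # Assumption: the pagination link text starts with next_label.
                next_url = absolute
            elif is_article_href(href) and absolute not in seen:
                seen.add(absolute)
                ordered.append(absolute)
                added += 1
        if added == 0:
            break  # nothing new on this page, so stop
        url = next_url
    return ordered
```

Passing a different `fetch` callable (for example one that adds a delay between requests, or reads saved pages from disk) keeps the loop itself unchanged.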

And that should be it. It might also be worth scraping a Telugu news website to build a classification model on top of the LM. Let me know if I can help you with anything along the way!

Thanks a ton for the detailed pointers. @binga said he'd clean up what he already has over here: https://github.com/binga/fastai_notes/tree/master/experiments/notebooks/lang_models and send a PR to inltk (reference). I'll follow his lead and take up any tasks that he needs help with.

Okay - That'd be great!

I built a Telugu dataset containing 1,58,000 articles scraped from a newspaper website: https://github.com/AnushaMotamarri/Telugu-Newspaper-Article-Dataset . This dataset should be useful for classification. It is divided into 3 years, and the data under each year is further divided into several categories. Each file has a date and time, a title, and the content.

I also built another dataset with around 26,000 files scraped from 300 novels: https://github.com/AnushaMotamarri/Telugu-Books-Dataset .

The datasets can be downloaded directly from https://drive.google.com/file/d/1IbqM335M7imzG-2ZV0d8-JbRqCnyAii3/view and https://drive.google.com/file/d/1MDiP-_S2RtAN7c9TLnKi8I2pxIgONIP0/view respectively.
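For anyone wanting to use the newspaper dataset for classification, here is a minimal sketch of turning a year/category tree into labeled examples. It assumes the extracted archive is laid out as `<root>/<year>/<category>/<file>` with UTF-8 text files, which has not been checked against the actual download. The function name `iter_labeled_articles` is hypothetical, and the per-file fields (date and time, title, content) are left unparsed because their exact format is not described here.

```python
from pathlib import Path


def iter_labeled_articles(root):
    """Yield (category, raw_text) pairs from a <root>/<year>/<category>/<file> tree.

    The directory layout is an assumption; adjust the nesting if the real
    dataset is organised differently.
    """
    root = Path(root)
    for year_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for category_dir in sorted(p for p in year_dir.iterdir() if p.is_dir()):
            for path in sorted(p for p in category_dir.iterdir() if p.is_file()):
                # The folder name is used as the class label.
                yield category_dir.name, path.read_text(encoding="utf-8")
```

The category folder name serves as the label, so the same articles can later be fed to a classifier built on top of the LM.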

Here is the tokenizer I built for Telugu: https://github.com/AnushaMotamarri/TeluguTokenizer

I am currently working on creating a lemmatizer for the Telugu language.
I would like to contribute.

@AnushaMotamarri Thanks for reaching out! Would you like to contribute by building the language model? @binga will be contributing the LM to iNLTK, so to avoid duplicating effort, it'd be great if you could contribute Telugu NER or translation instead.

Yes.
Has any previous work on NER or translation been done in iNLTK for any other language? It would be great to have some standard references to get started with.

No, I'd just started with it. So nothing in iNLTK yet.

OK, I will work on them.

Asrst commented

Hey, I would love to contribute my part. Can I please collaborate with you guys?

@Asrst It will be worth tagging and asking @AnushaMotamarri or @binga if you can help them out with something, or elaborating on how you would like to contribute!

I would like to contribute as well. @goru001 maybe a Gitter channel would help for easier/faster conversation here 🤔

@sainathadapa Yes right! Here it is!

Hi all, I just wanted to introduce myself and see if I can help with adding Telugu support. Please let me know if you have any initial thoughts on where I can contribute.

PS: I posted on the Gitter channel but wasn't sure if it was being monitored, so I'm posting here.

Hi @praveenc1, it will be worth tagging and asking @AnushaMotamarri or @binga if you can help them out with something. Or else, you can start with anything: NER, coreference resolution, etc. Almost everything is unexplored territory.

@goru001 After training on a new language, how do I integrate that model into inltk to get sentence vectors?

I haven't found a great source for the Telugu language. We could build a collection by scraping data from Telugu web pages.

Hi, I can help you with a Telugu language source.

Hi all,
I would like to help you guys build this project. Can you please let me know where to get started and whom to reach out to?
Thanks.

Has any previous work been done on Tenglish (Telugu typed in English)? It is the Telugu we use every day when conversing on WhatsApp and the like.

With the latest release of iNLTK, v0.9, Telugu support has been added, thanks to @Shubhamjain27. Hence, closing this issue.

@hariperavali Tenglish (Telugu+English) support is not there yet; code-mixed support has been added for Hinglish, Tanglish and Manglish in v0.9. Feel free to work on it and raise a PR if you want to.

I need a Telugu SentiWordNet. I have tried many sites but could not find one. Please help me.