telugu support

Question

soumith opened this issue 6 years ago · 22 comments

Hey, great repository.
I'd like to add Telugu support. If you have a framework I should follow to download Telugu Wikipedia and train on it, I'd love some instructions so I can get going.

Answer 1 · 2019-03-29T03:52:48.000Z

Thanks for the initiative!
I had a look at the Telugu Wikipedia homepage, and it looks like it does not index all of its pages alphabetically there, as some other languages do. I faced a similar issue with Marathi, so the notebooks I used to scrape Marathi Wikipedia will be quite useful. So:

1. Use this notebook to get the links to all the Telugu Wikipedia articles. The notebook starts collecting article links from this page, then goes to the next page, collects links from there, and moves on to the next page again. It keeps doing this as long as it can add more article links, and stops once it can't. You should be able to get all the Telugu article links just by changing the starting page to this one.
2. Then use this notebook to scrape the articles corresponding to the URLs you saved in step 1. I don't think you will need to make any changes to this notebook, because article pages have the same structure irrespective of the language.
3. Once you have the Wikipedia articles dataset, you can use this notebook as a reference to train the LM. Training the LM requires tokenization, and for that I've been using sentencepiece. You can use the notebook here as a reference for that.
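
For anyone reading along without the notebooks open, the link-collection loop described in the first step can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the notebook's actual code: the function name `collect_article_links`, the `fetch` and `is_article` callables, and the "Next page" link label are all hypothetical and would need adjusting to whatever the real listing pages look like.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collects (href, link text) pairs for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_article_links(
    start_url: str,
    fetch: Callable[[str], str],        # hypothetical: returns the HTML of a URL
    is_article: Callable[[str], bool],  # hypothetical: True for article URLs
    next_label: str = "Next page",      # assumed text of the pagination link
    max_pages: int = 10000,             # safety cap, not part of the original idea
) -> List[str]:
    """Walk listing pages until a page contributes no new article links."""
    seen, ordered, visited = set(), [], set()
    url = start_url
    for _ in range(max_pages):
        if url is None or url in visited:
            break
        visited.add(url)
        parser = _LinkParser()
        parser.feed(fetch(url))

        next_url, added = None, 0
        for href, text in parser.links:
            absolute = urljoin(url, href)
            if text.startswith(next_label):
                next_url = absolute
            elif is_article(absolute) and absolute not in seen:
                seen.add(absolute)
                ordered.append(absolute)
                added += 1

        if added == 0:  # nothing new on this page: stop, as described above
            break
        url = next_url
    return ordered
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library, so it can be tried against a few saved pages before pointing it at the live site.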

And that should be it. It might also be worth scraping a Telugu news website to build a classification model on top of the LM. Let me know if I can help you with anything along the way!

Answer 2 · 2019-03-29T03:56:05.000Z

Thanks a ton for the detailed pointers. @binga said he'd clean up what he already has over here: https://github.com/binga/fastai_notes/tree/master/experiments/notebooks/lang_models and send a PR to inltk (reference). I'll follow his lead and take up any tasks that he needs help on.

Answer 3 · 2019-03-29T03:58:56.000Z

Okay - That'd be great!

Answer 4 · 2019-04-03T16:00:28.000Z

I have built a Telugu dataset containing 1,58,000 (158,000) articles scraped from a newspaper website: https://github.com/AnushaMotamarri/Telugu-Newspaper-Article-Dataset . This dataset should be useful for classification. It is divided into 3 years, and the data under each year is further divided into several categories. Each file has a date and time, a title, and content.

I have also built another dataset with around 26,000 files scraped from 300 novels: https://github.com/AnushaMotamarri/Telugu-Books-Dataset .

The datasets can be downloaded directly from https://drive.google.com/file/d/1IbqM335M7imzG-2ZV0d8-JbRqCnyAii3/view and https://drive.google.com/file/d/1MDiP-_S2RtAN7c9TLnKi8I2pxIgONIP0/view respectively.

Here is the tokenizer I built for Telugu: https://github.com/AnushaMotamarri/TeluguTokenizer

I am currently working on creating a lemmatizer for the Telugu language.
I would like to contribute.

Answer 5 · 2019-04-04T08:23:26.000Z

@AnushaMotamarri Thanks for reaching out! Would you like to contribute by building the language model? @binga will be contributing the LM to iNLTK, so to avoid duplicating effort, it would be great if you could contribute Telugu NER or translation instead.

Answer 6 · 2019-04-04T14:26:10.000Z

Yes.
Has any previous work on NER or translation been done in iNLTK for any other language? It would be great to have some standard references to get started with.

Answer 7 · 2019-04-05T06:08:54.000Z

No, I've only just started on it, so there's nothing in iNLTK yet.

Answer 8 · 2019-04-05T18:38:32.000Z

OK, I will work on them.

Answer 9 · 2019-04-06T06:36:56.000Z

Hey, I would love to contribute my part. Can I please collaborate with you guys?

Answer 10 · 2019-04-18T07:48:36.000Z

@Asrst It would be worth tagging @AnushaMotamarri or @binga and asking whether you can help them out with something, or elaborating on how you would like to contribute!

Answer 11 · 2019-04-18T07:50:49.000Z

I would like to contribute as well. @goru001 maybe a Gitter channel would help make conversation easier and faster here 🤔

Answer 12 · 2019-04-18T08:00:29.000Z

@sainathadapa Yes right! Here it is!

Answer 13 · 2019-04-22T23:50:30.000Z

Hi all, I just wanted to introduce myself and see if I can help with adding Telugu support. Please let me know if you have any initial thoughts on where I can contribute.

PS: I posted on the Gitter channel but wasn't sure whether it was being monitored, so I'm posting here too.

Answer 14 · 2019-04-23T18:31:46.000Z

Hi @praveenc1, it would be worth tagging @AnushaMotamarri or @binga and asking whether you can help them out with something. Otherwise, you can start with anything (NER, coreference resolution, etc.); almost everything is unexplored territory.

Answer 15 · 2019-11-11T05:19:17.000Z

@goru001 After training on a new language, how do I integrate that model into iNLTK to get sentence vectors?

Answer 16 · 2019-12-16T07:07:51.000Z

I haven't found a great source of Telugu-language text. We should build a collection by scraping data from Telugu webpages.

Answer 17 · 2020-08-10T08:44:18.000Z

Hi, I can help you with a Telugu language source.

Answer 18 · 2020-08-11T11:14:50.000Z

Hi all,
I would like to help you guys build this project. Can you please let me know where to get started and whom to reach out to?
Thanks.

Answer 19 · 2020-10-10T17:02:21.000Z

Has any previous work been done on Tenglish (Telugu typed in English)? That is the Telugu we use every day in conversations on WhatsApp and the like.

Answer 20 · 2020-10-12T05:36:30.000Z

Telugu support has been added in the latest release of iNLTK, v0.9, thanks to @Shubhamjain27. Hence, closing this issue.

Answer 21 · 2020-10-12T05:38:23.000Z

@hariperavali Tenglish (Telugu+English) support is not there yet; code-mixed support has been added for Hinglish, Tanglish and Manglish in v0.9. Feel free to work on it and raise a PR if you want to.

Answer 22 · 2021-01-12T01:30:34.000Z

I need a Telugu SentiWordNet. I have tried many sites but could not find one. Please help me.