tamil support

Question

loretoparisi opened this issue 5 years ago · 17 comments

Thank you for this project. It would be worth adding Tamil support!

Answer 1 · 2019-03-29T07:31:34.000Z

Thank you for the appreciation!
Yes, that's on the To Do list! It'd be great if you find someone who's willing to contribute!

Answer 2 · 2019-03-29T21:18:41.000Z

I'd like to try this. I saw your pointers for Telugu. I'll start from there!

Answer 3 · 2019-03-30T02:41:35.000Z

Gaurav and Amrit,

Last week I took your Hindi notebooks as a starting point for Tamil. The wiki crawler did not work for Tamil as is, so I switched to parsing the Wikipedia dump.

After a few days of experiments I was able to use the spm tokenizer on 60000 articles and get perplexity into the 35-40 range for Tamil.

Since I have everything in one large notebook (wiki parser, tokenizer, model), I was planning to split it into three notebooks. It is nice to see this great list of many more languages.

I will try to share it this weekend, either as is or as a proper pull request.
Keep up the great work!
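
For readers unfamiliar with the perplexity figure quoted above: under the usual definition, perplexity is the exponential of the mean negative log-probability per token, so a range of 35-40 corresponds to a cross-entropy loss of roughly 3.56 to 3.69 nats. The sketch below only illustrates that relationship; the thread does not say how the number was computed, and the function name and toy values are assumptions made for illustration.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the mean negative natural-log probability per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy case: a model that assigns probability 1/40 to every token
uniform_log_probs = [math.log(1 / 40)] * 5
print(round(perplexity(uniform_log_probs), 6))

# Loss values (in nats) that correspond to perplexities of 35 and 40
loss_low, loss_high = math.log(35), math.log(40)
print(round(loss_low, 2), round(loss_high, 2))
```

Lower is better: a perplexity of 40 means the model is, on average, as uncertain as if it were choosing uniformly among 40 tokens at each step.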

Answer 4 · 2019-03-30T02:46:40.000Z

Amrit, please continue with your version; you may well do better than my result :)

Answer 5 · 2019-03-30T03:55:05.000Z

I have shared my Tamil LM notebook. Please note that this version was run on GCP, but it should work on Colab or Kaggle. Thanks.
https://github.com/ravi-annaswamy/tamil_lm_spm_fai

Answer 6 · 2019-03-30T04:13:19.000Z

Here are some wiki articles imagined by the language model.

These are not facts, but we can see how well it has learned the grammar and even topicality and semantics. This quality was unthinkable just a year ago. Hats off to Jeremy Howard for creating the fastai code base and for tirelessly training everyone.

The model learns to properly open and close the tags. It creates well-formed URLs (though nonexistent ones). It learns to repeat the title.
And in each article it stays on related places and concepts!

""" விக்கிரமசிங்க விக்கிரமசிங்க ("vancarare") என்பது இலங்கையின் கிழக்கு மாகாணத்தில் யாழ்ப்பாண மாவட்டத்தில் அம்பாறை மாவட்டத்தில் உள்ள ஒரு கிராமம் ஆகும். இது யாழ்ப்பாண மாவட்டத்தின் தலைநகரை உள்ளடக்கி இருந்தது. இது யாழ்ப்பாண மாவட்டத்தின் வடமத்தியப் பகுதியில் அமைந்துள்ளது. இது 1770 இல் கட்டப்பட்டு, பின்னர் 1844 இல் பிரித்தானியர் ஆட்சிக்குட்பட்டது. இங்கு இந்துக்களும், இந்துக்களும் பெரும்பான்மையாகக் கொண்ட பேரூர், யாழ்ப்பாணம், மாவேலிக்கரசி, யாழ்ப்பாணம் ஆகிய இடங்களில் வாழ்கின்றனர். இலங்கையில் பெரும்பான்மையாக வாழும் அம்பக்கரக்கள், தமிழ் முஸ்லிம்கள், முஸ்லிம்கள், சிங்களவர், தமிழருக்கு ஒரு பிரிவினர் ஏனையோர் ஆவர். ஏனைய தமிழின மக்கள் தொகை 4, 33,000

மகாகவி மகாகவி, (பிறப்பு: பிப்ரவரி 10, 1954) இலங்கை அரசியல்வாதியும், நாடாளுமன்ற உறுப்பினரும் ஆவார். இவர் பேராதனைப் பல்கலைக்கழகத்தின் (united school of indian) தேசிய சபை (ac)யில் (mc) சட்டமன்ற உறுப்பினராக உள்ளார். இலங்கையின் நாடாளுமன்ற உறுப்பினராக இருந்தும், 2004 முதல் 2010 வரை நாடாளுமன்றத்தில் 35 ஆண்டுகள் பிரதிநிதித்துவப் பதவி வகித்தார். 2010 ஆம் ஆண்டில் ஐக்கிய தேசியக் கட்சியில் இணைந்து நாடாளுமன்ற உறுப்பினராகவும், ஐக்கிய மக்கள் சுதந்திரக் கூட்டணியின் உறுப்பினராகவும் தேர்ந்தெடுக்கப்பட்டார். அட்சாவின் மரணத்திற்குப் பின்னர், இவர் நாடாளுமன்றத்தின் அதிபராக தெரிவுசெய்யப்பட்டார். இவர் தற்போது ஐக்கிய மக்கள் சுதந்திரக் கூட்டணியில் 5வது மக்களவை

சபீதுர் கான் (துடுப்பாட்டக்காரர்) சபீதுர் கான் (இறப்பு: மார்ச் 10, 2016) ஒரு தென்னிந்தியத் திரைப்பட நடிகர். இவர் மூன்று தமிழ் திரைப்படங்களுக்கு பின்னணியிசைகளுக்கான பின்னணி இசையை இயக்கியுள்ளார். இவர் தற்போது தமிழ் பாடகி மற்றும் திரைப்படத் தயாரிப்பாளர். இவர் தற்போது தமிழ் திரைப்படங்களில் நடிக்கத் துவங்கினார். இவர் திரைப்படத் துறையில் சென்னைக்கு வரும் நடிகராக விளங்குகிறார். தற்போது திரைப்பட இயக்குநர் நிர்மலா "அறிமுகம்" திரைப்படத்தில் நடித்துள்ளார். இவர் "சௌந்தரபாணி" என்ற படத்தில் நடித்ததற்காக சிறந்த நடிகைக்கான தேசிய விருது பெற்றார். இவர் பெரியார், "சத்யஜித்குமாரர்" என்னும் படத்தில் நடித்து

குருதியிழையங்கள் குருதியிழையங்கள் அல்லது குருதியணுக்களின் குருதியணுக்கள் (ecg plasma) அல்லது குருதியணுக் கலங்கள் ("pyrmond periods") என்பவை புரத நோய்களை கட்டுப்படுத்துவதற்கும், குருதிக் கலங்களுக்குள் உள்ள தாக்கங்களை ஏற்படுத்தி, அவற்றை அணுகவும் பயன்படும் உயிரணுக்களைக் கொண்ட உயிரணுக்களைக் குறிக்கும். இவற்றில் முக்கியக் காரணிகள் உயிரணுக்களின் இனப்பெருக்க உறுப்புக்களே ஆகும். இழையங்கள், இழையங்களின் தொழிற்பாடு, இழையம், இழையம், இழையம் என்பன பொதுவாக ஒரு தனியன் தொகுதியாகவோ அல்லது ஒரே தொகுதியாகவோ இருக்கும். உயிரணுக்கள், இழையங்கள், கொம்புகள், தண்டுகள் போன்றன இழையுரு

பால்வீடசர் பால்வீடசர் ("marter") என்பது நாற்புறமும், புறப்பரப்பில் உள்ள மனிதரின் மூளையின் அமைப்பைப் கூறுவதுமாகும். இதன் வழியாகச் செல்லும் அளவு, உடலின் இன்னொரு பகுதி, நாற்புறமும், ஒருவருக்கும் இடையே அமைந்த மடக்கையின் உடல்கள், மற்றும் நாண்கள், நாண்கள், கண்கள், நாக்கு, முள் போன்ற சில உறுப்புகள் இணைந்து இருக்கும். இந்த நாக்கு மேல்முனையில் ஓடும். நாக்கு பகுதி வயிற்றில் செம்பு, கழுத்தின் கீழ் பகுதி, இளம், மார்பு, வயிறு, மார்பு உட்பட பாதங்களை பிடித்து விடுகிறது. நாக்கின் வெளிப்புறத்தில் இருக்கும். தாயின் உடலின் அடி"""

Answer 7 · 2019-03-30T15:36:53.000Z

Great work, @ravi-annaswamy. One question though: I see from the notebook you shared that you had 400 files, of which you used 200 to build the vocabulary and then the LM. Was that choice made because of computational constraints? Approximately how many articles did those 200 files have?

Also, do you have the bandwidth to build something on top of this LM - maybe a news classifier by scraping news articles from Tamil websites? That would be great!

Also, would you like to contribute your Tamil model to iNLTK? Let me know if you're willing to, and I'll help you along with the PR, or you can upload your model to Dropbox and I can add it to iNLTK.

Thanks a lot for your contribution!

Answer 8 · 2019-03-30T18:22:43.000Z

First of all, thanks Gaurav for the inspiration and the full recipe. It is an honor to be part of this initiative; with your recipe anyone could do it. So yes, I will share the models and data, similar to what you did. Some more notes below:

> One question though: I see from the notebook you shared that you had 400 files, of which you used 200 to build the vocabulary and then the LM. Was that choice made because of computational constraints? Approximately how many articles did those 200 files have?

Let me do a proper rerun and count those. I will provide an easy-to-reproduce version this weekend.

> Also, do you have the bandwidth to build something on top of this LM - maybe a news classifier by scraping news articles from Tamil websites? That would be great!

Yes, that is a good idea; I will do that in a week. Also, I love your iNLTK interface. It will help unify so many things across all cultures, and it is hopefully extendible to European, Chinese, African, and other languages. May I request that you add two more functions: get word vector (word) and get phrase embedding (phrase or sentence)?

> Also, would you like to contribute your Tamil model to iNLTK? Let me know if you're willing to, and I'll help you along with the PR, or you can upload your model to Dropbox and I can add it to iNLTK.

Yes please, that is an honour. Thanks.

Answer 9 · 2019-03-30T18:38:05.000Z

I think the PR is the right thing. My work is simply an extension of yours.


Answer 10 · 2019-03-31T12:18:59.000Z

Sure, let me know whenever you have the LM and Classifier.
Shout out in case you need any help!

Regarding those two new functions - Thanks for the idea! Let me put that into the To Do list, they'll be a great addition!

Answer 11 · 2019-03-31T12:19:36.000Z

Sorry, I closed it by mistake! Ignore that part!

Answer 12 · 2019-03-31T15:21:40.000Z

@ravi-annaswamy Just a heads-up, to save you some pain: when you export your model to upload to Dropbox, make sure your tokenizer class TamilTokenizer is imported from inltk.tokenizer, i.e.

1. Comment out the TamilTokenizer class in your notebook.
2. Create an inltk package (an inltk folder containing an empty __init__.py file) in the same directory as your notebook.
3. Create inltk/tokenizer.py and copy-paste your TamilTokenizer class there.
4. Import this tokenizer class in your notebook like this: from inltk.tokenizer import TamilTokenizer
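
The file layout described in these steps can be sketched in Python as follows. This is only an illustration under stated assumptions: a temporary directory stands in for the folder that holds the notebook, and the class body written to tokenizer.py is a placeholder, not the real TamilTokenizer from the notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the directory that holds the notebook (assumption)
notebook_dir = Path(tempfile.mkdtemp())

# An inltk folder with an empty __init__.py makes it a package
package_dir = notebook_dir / "inltk"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")

# tokenizer.py holds the class that used to live in the notebook
tokenizer_source = (
    "class TamilTokenizer:\n"
    "    # Placeholder body: paste the notebook's real class here\n"
    "    pass\n"
)
(package_dir / "tokenizer.py").write_text(tokenizer_source)

# Import the class from the package, as the notebook would
sys.path.insert(0, str(notebook_dir))
importlib.invalidate_caches()
TamilTokenizer = importlib.import_module("inltk.tokenizer").TamilTokenizer

# The class now reports the package path instead of __main__
print(TamilTokenizer.__module__)
```

In a notebook the last step is simply the import statement quoted above; importlib is used here only so the sketch can run as a standalone script.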

If you're wondering why the hell you need to do this, read the explanation below:
The reason is that when you export your model, the reference to your tokenizer class is exported/pickled along with it. When we then load that pickled file in iNLTK, it looks up the class definition at that reference. So if you export your model with the class defined in the notebook, the class will be saved with the reference __main__.TamilTokenizer, which we will not be able to find in iNLTK. Hence, to be able to read your model from iNLTK, you'll have to do the above four steps and then define your TamilTokenizer class (which, like the others, will inherit from LanguageTokenizer; see tokenizer.py). Things should be pretty straightforward from then on.
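
The behaviour described here can be reproduced with the standard pickle module alone. The sketch below is a minimal illustration, assuming the export ultimately goes through Python's pickle as the explanation says; the module name demo_inltk_tokenizer is a hypothetical stand-in for inltk.tokenizer, and the class body is a placeholder.

```python
import pickle
import sys
import types

MODULE_NAME = "demo_inltk_tokenizer"  # hypothetical stand-in for inltk.tokenizer

class TamilTokenizer:
    def __init__(self, lang):
        self.lang = lang

# Make the class look as if it were defined in an importable module
TamilTokenizer.__module__ = MODULE_NAME
TamilTokenizer.__qualname__ = "TamilTokenizer"
demo_module = types.ModuleType(MODULE_NAME)
demo_module.TamilTokenizer = TamilTokenizer
sys.modules[MODULE_NAME] = demo_module

payload = pickle.dumps(TamilTokenizer("ta"))

# The pickle records where to find the class, not the class body
stores_reference = MODULE_NAME.encode() in payload and b"TamilTokenizer" in payload

# A loader that lacks the module cannot resolve the reference
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    load_failed = False
except ImportError:
    load_failed = True

# A loader that has the module at the recorded path can
sys.modules[MODULE_NAME] = demo_module
restored = pickle.loads(payload)

print(stores_reference, load_failed, restored.lang)
```

The same mechanism explains the __main__ case: the loading process does have a __main__ module, but it does not contain TamilTokenizer, so the lookup fails there as well (typically with an AttributeError instead of an ImportError).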

Let me know how it goes!
Thanks!

Answer 13 · 2019-03-31T15:58:36.000Z

Gaurav, I really appreciate the care in the instructions to help me out.

Today I started a fresh Colab run for total reproducibility. After that I will try to do the packaging as you have suggested. I just did a repull of ta-wiki and rebuilt spm with an 8000-word vocab. I am just about to create the tokenizer, so I will try your instructions right now.

I am really amazed by your dedication and care in this.

Your example and drive (and that of others like Selva) are inspiring me to share more and more of what I have.

Answer 14 · 2019-05-17T02:36:29.000Z

@ravi-annaswamy Thanks for the great work! I've added your LM to iNLTK, and trained the language classifier to support Tamil (and Urdu).

Now, iNLTK's v0.3 supports Tamil as well. Thank you for your contribution.

I'll close this issue out, but feel free to re-open if there's anything.

Answer 15 · 2019-05-17T06:05:04.000Z

Awesome, thank you!


Answer 16 · 2020-11-15T13:48:08.000Z

Sir, I am trying to get embedding vectors from iNLTK, but I am getting an error in setup('ta'). Kindly help me. I am using it in Colab.

Answer 17 · 2021-01-04T12:28:52.000Z

@diviyalouis Sorry, I missed the notification for this. Are you still facing the issue? It would be great if you could post a notebook with steps to reproduce.