AI4Bharat/indicTrans

Chunks lost in translation

PS-AI opened this issue · 5 comments

PS-AI commented

Hi,
Thank you for your work on indicTrans. I have been using it to translate short paragraphs (3-4 sentences) into various supported Indic languages, and I noticed that some content gets lost in translation. For example, I am trying to translate this English sentence to Tamil:
"In order to make the French capital safer, quieter and less dirty,a speed limit of 30 kmph for cars came into force in Paris on Monday"
This is translated as:
பிரான்ஸ் தலைநகர் பாரிஸில் கார்களுக்கு மணிக்கு 30 கிலோமீட்டர் வேகத்தில் செல்லலாம் என்ற கட்டுப்பாடு விதிக்கப்பட்டுள்ளது

The chunk "In order to make the French capital safer, quieter and less dirty" is lost in the translation.

I assumed that with the Transformer architecture, long sentences could also be translated accurately.

I would like to know what could be done to fix this issue.

@PS-AI
I can reproduce this issue, but we are not sure why it is happening.

I tried rephrasing the sentence or changing the order (A speed limit was imposed ... to make the French capital safer, quieter, and less dirty), and in some cases I get the full translation.

So it might not be just an issue of sentence length (reordering the sentence seems to work); the translation system may instead be ignoring "safer, quieter and less dirty". "The French capital" is still reflected in the translation (பிரான்ஸ் தலைநகர் பாரிஸில் -> France's capital Paris).

@anoopkunchukuttan Sir, do you have any thoughts on this?

Another note: if you are using the command-line interface, you need to segment paragraphs into sentences before feeding them to the model (in the Python interface, this is handled automatically).

You can use the code snippet in our README to split paragraphs into lines before translating. Our models also have a max sequence length of 200 tokens, so sentences longer than 200 tokens will get truncated before being fed to the model.

P.S. This doesn't apply to the sentence you shared, as it is a single sentence and shorter than 200 tokens.
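
As a rough illustration of the two points above (this is not the README snippet), here is a minimal sketch that splits a paragraph into sentences and flags any that might exceed the 200-token limit. The function names `split_sentences` and `flag_long_sentences` are hypothetical, the regex splitter is a naive stand-in for a proper sentence tokenizer, and whitespace word counts are only an approximation of the model's own token counts, which will usually be higher.

```python
import re

# Limit quoted in this thread. The model counts its own tokens, so the
# whitespace-based count below is only a rough lower bound (assumption).
MAX_TOKENS = 200


def split_sentences(paragraph):
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def flag_long_sentences(sentences, max_tokens=MAX_TOKENS):
    """Return (sentence, word_count) pairs whose word count exceeds max_tokens."""
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_tokens:
            flagged.append((sentence, count))
    return flagged


if __name__ == "__main__":
    text = "A speed limit came into force in Paris. It applies to cars."
    sentences = split_sentences(text)
    for s in sentences:
        print(s)
    print("Possibly truncated:", flag_long_sentences(sentences))
```

Anything this flags is worth splitting further by hand before translating, since the truncated part would never reach the model.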

PS-AI commented

@gowtham1997 Thank you for your response. I am using the Python interface. The paragraph is split into lines before attempting translation.

The Samanantar paper mentions - "sentence pairs with longer sentences are unlikely to have high alignment on LaBSE representations and thus be included in Samanantar"

Could this be an issue with the data that IndicTrans was trained on (i.e., no such long sentences in the training data)?

I am not sure about the cause, @PS-AI.

But we have now released a newer version of the translation models (V0.3), and it seems to preserve the whole sentence during translation.

"In order to make the French capital safer, quieter, and less dirtier, a speed limit of 30 kmph for cars came into force in Paris on Monday"
translates to

"பிரான்ஸ் தலைநகரை பாதுகாப்பானதாகவும், அமைதியானதாகவும், தூய்மையற்றதாகவும் மாற்றுவதற்காக, கார்களுக்கான வேக வரம்பு மணிக்கு 30 கி. மீ. என பாரீஸ் நகரில் திங்கள்கிழமை முதல் அமலுக்கு வந்துள்ளது."

^ Note that I have changed "less dirty" to "less dirtier" (cleaner also works). For some reason, the model gives "அழுக்கானதாகவும்" for "less dirty", so it still needs some improvement.

Even Google Translate seems to make the same mistake for En to Tamil, which is surprising.

Thanks for bringing this to our notice. If you find similar sentences where the translation quality is bad, please feel free to let us know.
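
One rough way to look for similar cases is to compare output length to input length and review the pairs where the translation is unusually short. The sketch below does that with whitespace word counts. The names `length_ratio` and `flag_possible_omissions` are hypothetical, and the `min_ratio` default of 0.5 is an arbitrary assumption: Tamil and other Indic languages often need fewer words than English, so the threshold would have to be calibrated per language pair, and a flagged pair is only a candidate for manual review, not proof of lost content.

```python
def length_ratio(source, translation):
    """Ratio of translation word count to source word count."""
    source_words = len(source.split())
    if source_words == 0:
        raise ValueError("source sentence is empty")
    return len(translation.split()) / source_words


def flag_possible_omissions(pairs, min_ratio=0.5):
    """Return (source, translation, ratio) for pairs below min_ratio.

    min_ratio is an assumed, uncalibrated threshold.
    """
    flagged = []
    for source, translation in pairs:
        ratio = length_ratio(source, translation)
        if ratio < min_ratio:
            flagged.append((source, translation, ratio))
    return flagged


if __name__ == "__main__":
    # Placeholder strings; substitute real source/translation pairs.
    pairs = [
        ("one two three four five six", "uno dos"),
        ("one two three", "uno dos tres"),
    ]
    for source, translation, ratio in flag_possible_omissions(pairs):
        print(f"{ratio:.2f}  {source!r} -> {translation!r}")
```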