tremby/py-translate

Doesn't retain line separations

$ py-translate/translate -s fr $'un\ndeux\n\ntrois'
one two three

I should get

one
two

three

I've no idea if this is due to limitations in the API; this sort of thing does work correctly at http://translate.google.com.

I'd like to use your util to translate subtitle files, so this functionality is pretty essential.

Hmm... actually, I guess I can just parse the subtitle files and translate each caption separately rather than doing it all in one request. This will be more reliable anyway. I also took a look through the code and verified that the newlines are being sent out in the request and that the response contains only spaces in their place, so there's not much that can be done here.

translate.google.com uses a different URL; my request started with: http://translate.google.com/translate_a/t?client=t&text=un%0Adeux%0Atrois&hl=en

I guess there are two different versions of the API, so maybe translate.google.com uses the newer one. There is also a Python module on PyPI called pytranslate which uses that same URL. Unfortunately, it only outputs the first line of any translation. Well, no big deal, anyway. Thanks for making this available.

Just in case somebody is looking to do subtitle file translation, I might as well mention the module/CLI util I've written. It takes the approach mentioned in my above comment, and uses the pytranslate module I referred to there.

I noticed during debugging that it seems to get blocked after a large number of requests: I ran the same hour-long TV program through about 20 times in an hour or two, and was unable to do any more translations with it for the rest of the day. But today it is working again. Obviously this has little to do with my issue report, but it may be relevant to people using this script.

As you've found, the string is given to Google as is, and the output given is Google's output. I haven't found anything in the API docs which will preserve newlines/whitespace, but there are some things you could try.

  • Obviously as you've seen you could make multiple calls to the script, one for each line
  • The script could be modified to preserve newlines by making multiple calls to Google for each line (perhaps as an option, since sometimes newlines are meaningless as in linewrapped plain text)

Both of those have the problem, as you've also found, that Google might block you eventually. Have a look at the translation API docs -- there are some extra vars you can send, like your IP address, which according to what I read just now reduces the chance that your requests will be treated as abuse.

  • Replace newlines with some placeholder string -- <br> or #newline# or something like that -- something which won't get translated -- then undo that when you get the reply
  • There's an option in the API to treat the string to be translated as HTML. You could convert to HTML (which would just mean escaping characters like < and &, and replacing newlines with <br>) and modify the script to use that option.
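The HTML conversion described in the last bullet could look roughly like the sketch below. The function names are made up for illustration, and it assumes the service returns the `<br>` tags unchanged (possibly as `<br/>` or `<br />`), which I haven't verified.

```python
import html
import re


def text_to_html(text: str) -> str:
    # Escape &, < and > first, then mark each newline with a <br> tag.
    return html.escape(text, quote=False).replace("\n", "<br>")


def html_to_text(markup: str) -> str:
    # Turn <br>, <br/> and <br /> (any letter case) back into newlines,
    # then undo the entity escaping.
    restored = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    return html.unescape(restored)


original = "un\ndeux\n\ntrois & <quatre>"
assert html_to_text(text_to_html(original)) == original
```

The translation call itself would sit between the two functions, with the request flagged as HTML.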

The last option there looks the best to me. I may add an option to treat input as HTML and another to internally convert to HTML and back so that newlines are preserved.

It turns out that adding "format=text" to the request causes newlines to be preserved, so I've added that as a new -p/--preserve-newlines option. It's odd Google behaves that way, since the docs say "text" is the default. It'll do for now.
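As a rough illustration of what the option changes in the request, here is a sketch of building the query string. Only `format=text` comes from this thread; the function name and the `q` parameter name are assumptions for the example and may not match what the script actually sends.

```python
from urllib.parse import urlencode


def build_query(text: str, preserve_newlines: bool = False) -> str:
    # "q" is a hypothetical name for the text parameter.
    params = {"q": text}
    if preserve_newlines:
        # The extra parameter that made the newlines survive.
        params["format"] = "text"
    # urlencode percent-encodes each newline as %0A.
    return urlencode(params)


print(build_query("un\ndeux\n\ntrois", preserve_newlines=True))
# prints: q=un%0Adeux%0A%0Atrois&format=text
```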