vitalets/google-translate-api

Add support for translations over 5000 characters

rawr51919 opened this issue · 3 comments

Is it possible for us to implement something similar to what https://github.com/Localize/node-google-translate / https://www.npmjs.com/package/google-translate does in order to handle translations over 5000 characters? That's the only thing that API has over this one.

I think this should stay outside the library itself.
The straightforward way is to split the large text into 5k-character chunks manually:

let text = 'large text...';  // > 5k chars
const chunks = [];
// Cut the text into 5000-character pieces.
for (let i = 0; i < text.length; i += 5000) chunks.push(text.slice(i, i + 5000));

Promise.all(chunks.map(chunk => translate(chunk)))
  .then(results => results.map(res => res.text).join(''));

A more advanced approach is to cut the text only at sentence endings; I assume there is a library for that. Anyway, I suggest keeping the library scope minimal.
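For illustration, a sentence-aware chunker could look like the sketch below (regex-based and purely illustrative; a dedicated sentence-segmentation library would handle abbreviations and other edge cases better):

// Split text into chunks of at most maxLen characters, cutting only at sentence endings.
// Note: a single sentence longer than maxLen still ends up as one oversized chunk.
function chunkBySentences(text, maxLen = 5000) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && (current + sentence).length > maxLen) {
      chunks.push(current);
      current = '';
    }
    current += sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}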

So you're saying this is fixable on the caller's side, with whatever splitting approach the application chooses? I believe the repo I got the idea from uses this code for that:

// Split into multiple calls if string array is longer than allowed by Google (5k for POST)
var stringSets;
if (shouldSplitSegments(strings)) {
  stringSets = [];
  splitArraysForGoogle(strings, stringSets);
} else if (!Array.isArray(strings)) {
  stringSets = [[strings]];
} else {
  stringSets = [strings];
}

// Request options
var data = { target: targetLang };
if (sourceLang) data.source = sourceLang;

// Run queries async
async.mapLimit(stringSets, concurrentLimit, function(stringSet, done) {

  post('', _.extend({ q: stringSet }, data), parseTranslations(stringSet, done));

}, function(err, translations) {
  if (err) return done(err);

  // Merge and return translation
  translations = _.flatten(translations);
  if (translations.length === 1) translations = translations[0];
  done(null, translations);
});

And splitArraysForGoogle from that code:

// Return array of arrays that are short enough for Google to handle
var splitArraysForGoogle = function(arr, result) {
  if (arr.length > maxSegments || (encodeURIComponent(arr.join(',')).length > maxGetQueryLen && arr.length !== 1)) {
    var mid = Math.floor(arr.length / 2);
    splitArraysForGoogle(arr.slice(0, mid), result);
    splitArraysForGoogle(arr.slice(mid, arr.length), result);
  } else {
    result.push(arr);
  }
};
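
For reference, a hypothetical usage of that helper would look roughly like this (maxSegments and maxGetQueryLen are configuration values defined elsewhere in that repo):

// Build an array of string arrays, each small enough for a single request.
var stringSets = [];
splitArraysForGoogle(['first string', 'second string' /* ... */], stringSets);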

This approach would also work without the concurrent translation limit, but if we ever go through with it, the limit could be imposed to help avoid flooding the Google Translate servers.
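
For illustration, the concurrency limit could also be enforced without async/lodash by translating the chunks in fixed-size batches. A minimal sketch, assuming translate and chunks from the earlier snippets (the limit of 5 is arbitrary):

async function translateChunks(chunks, concurrentLimit = 5) {
  const results = [];
  for (let i = 0; i < chunks.length; i += concurrentLimit) {
    // Wait for each batch to finish before starting the next one.
    const batch = chunks.slice(i, i + concurrentLimit);
    const translated = await Promise.all(batch.map(chunk => translate(chunk)));
    results.push(...translated.map(res => res.text));
  }
  return results.join('');
}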

Yes, I mean this is fixable with application-level code, not inside the library.