albertosantini/node-rio

Large returning dataset results in Invalid Response

sanderd opened this issue · 11 comments

Hi,

For a recent project, I have a very large JSON result-set, calculated by R.
It is being sent as string (explicitely converted in R using jsonlite + toString)

However, as soon as the resulting data (string) gets too big, RIO exits with the following:

Connected to Rserve
Supported capabilities --------------

Sending command to Rserve
00000000: 0300 0000 b400 0000 0000 0000 0000 0000  ....4...........
00000010: 04b0 0000 6c69 6272 6172 7928 276a 736f  .0..library('jso
00000020: 6e6c 6974 6527 2920 0a63 616c 6c6d 6520  nlite')..callme.
00000030: 3c2d 2066 756e 6374 696f 6e28 7029 207b  <-.function(p).{
00000040: 200a 2020 7361 6d70 6c65 4c69 7374 203c  ....sampleList.<
00000050: 2d20 6c69 7374 2829 200a 2020 7361 6d70  -.list()....samp
00000060: 6c65 4c69 7374 2476 616c 7565 7320 3c2d  leList$values.<-
00000070: 2073 6571 2830 2c20 3530 3030 3030 302c  .seq(0,.5000000,
00000080: 2031 2920 0a20 2072 6574 7572 6e28 746f  .1)....return(to
00000090: 5374 7269 6e67 2874 6f4a 534f 4e28 7361  String(toJSON(sa
000000a0: 6d70 6c65 4c69 7374 2929 2920 0a7d 200a  mpleList)))..}..
000000b0: 0a63 616c 6c6d 6528 277b 7d27 2900 0101  .callme('{}')...
000000c0: 0101 0101                                ....

Data Header
00000000: 0100 0100 e065 5102 0000 0000 0000 0000  ....`eQ.........

Invalid response (expecting SEXP)

The following (test-)code reproduces this scenario (please note, jsonlite should be installed @ R):

describe('can receive large json result', () => {
    it('should return correctly', function(done) {
      this.timeout(30000);

      Rio.bufferAndEval(
        "library('RJSONIO') \n" +

        "callme <- function(p) { \n" +
        "  sampleList <- list() \n" +
        "  sampleList$values <- seq(0, 5000000, 1) \n" +
        "  return(toJSON(sampleList)) \n" +
        "} \n",
        {
          entryPoint: 'callme',
          data: {},
          callback: function(err, res) {
            if(err) {
              console.log('error!');
              console.log(err);

              expect(true).to.equal(false);
            } else {
              console.log('result:');
              console.log(res)
              done();
            }
          }
        }
      );
    });
  });

I've uploaded the RServe debug log @ http://www.sanderd.be/tmp/rservelog.gz
In this log, the following looks interesting to me (head dump + body dump, RIO seems to fail on head dump and skip body alltogether) - although I have no clue how RServe internally works. (SEXP etc is Chinese to me ;))

getStorageSize(00000000133EFC60,type=16,len=1) getStorageSize(00007FF5FD7B0010,type=9,len=38888910) = 38888920
= 38888928
result storage size = 48611160 bytes
Trying to allocate temporary send buffer of 48611328 bytes.
stored 00000000133EFC60 at 0000000013405048, 38888916 bytes
stored SEXP; length=38888928 (incl. DT_SEXP header)
OUT.sendRespData
HEAD DUMP [16]: 01 00 01 00 e0 65 51 02 00 00 00 00 00 00 00 00  |.....eQ.........
BODY DUMP [38888928]: 4a d8 65 51 02 00 00 00 62 d0 65 51 02 00 00 00 7b 22 76 61 6c 75 65     73 22 3a .....

edit - some additional info:
OS: Windows 10, same issue reported @ OS X latest by colleague
R: [1] "R version 3.2.2 (2015-08-14)"
RServe: Rserve 1.7-3 ()

(Any pointers towards how I could potentially help, shoot!)
Thanks,
Sander

Thanks for the feedback.

Firstly, try to use the package RJSONIO instead of jsonlite.

Hi @albertosantini

I've updated the code snippet to use RJSONIO. However, not the source of the problem. In R-studio (and logs), the JSON is correctly generated.

It's the transmission of the result, over RServe, to RIO, that's failing.

Thank you!

From R console:

> library("RJSONIO")
> 
> callme <- function(p) {
+     sampleList <- list()
+     sampleList$values <- seq(0, 5000000000, 1)
+     return(toJSON(sampleList))
+ }
> 
> x <- callme()
Error in seq.default(0, 5e+09, 1) : 'by' argument is much too small

Indeed no matter RJSONIO or jsonlite
Sometimes I get that error with 5000000.
With 5000000000 always.

Reference script:

#library("RJSONIO")
library("jsonlite")

callme <- function(p) {
    sampleList <- list()
    # sampleList$values <- seq(0, 5000000, 1)
    sampleList$values <- seq(0, 5000000000, 1)
    # return(toJSON(sampleList))
    return(toString(toJSON(sampleList)))
}

I give a try with RSclient to isolate the problem.

You could always replace the sampleList-code with a JSON-file you read from disk.
This seq-issue is because seq() as a protection built in to prevent memory explosion ;)

As long as the resulting output string is 'large' (~15mb at least?), I get the RServe+RIO error.

Off topic, but I think you figured that out already:
jsonlite toJSON returns a 'json' object. Rjsonio returns a 'string'. Hence the toString() when using jsonlite.

After hours of investigation, I give you a feedback.

It should be if (responseType & 0x0F !== 10) {.

  • There is a feature to be implemented. If you take the first bytes of body dump (client or server side, of course, is the same), you see the bytes 4a d8 65 51 02 ....

4a is the response type and we need to take only the first half a. Then there are three bytes for the length of the payload and, eventually, there is 02: it is the type of the packet.

https://github.com/albertosantini/node-rio/blob/master/lib/rio.js#L106

That kind of type is not implemented in rio.
Weirdly, when the length of buffer changes, the type of packet is changed.

Firstly I need to merge dev-2.x branch, then I will give another try to this issue.

At last the issue is about supporting large packet feature.
I added an example, ex7.js.

There is a last glitch before releasing a new version: JSON.parse fails, while json_parse is ok.

I need to investigate with Node.js, because JSON.parse doesn't work.

In the meantime, please, you may use master and give a try.

It seems there is a sort of bug (or edge case) in JSON.parse implementation.
The array contains the legit number 4.e+05.

I found the offending number using jsonlint.
I saved as file the rio response and used it as input for jsonlint.

$ jsonlint ex7.txt                                   
[Error: Parse error on line 1:                       
...5,  4e+05,  4e+05, 4.e+05, 4.0001e+05, 4          
-----------------------^                             
Expecting 'EOF', '}', ',', ']', got 'undefined']     

However in Node.js REPL or Chrome console the statement JSON.parse("[4.e+05]") raises the following message: SyntaxError: Unexpected token e or Uncaught SyntaxError: Unexpected token e.

So I was tempted to fill in issue in Chrome/V8 issue tracker, but, then, I tried it in Firefox: SyntaxError: JSON.parse: unterminated fractional number at line 1 column 4 of the JSON data.

I am afraid that implementation would not be changed very soon and I didn't fill any issue in Chrome, V8 or Firefox.

However you should not get that "bug", because you need to export the numbers with all the figures. Indeed, using scientific notation rounds the value when there is the string conversion. I think you should set options there before converting the R response to string. Otherwise you have the same number representation like ... 5e+06, 5e+06, 5e+06, 5e+06 ].

In a few hours I close the issue, publishing a new release.

Hi @albertosantini - thanks for your work on this. Sorry for not getting back earlier.

The issue you are describing around the JSON format is indeed not something to be fixed in node-rio, but jsonlite/rjsonio.

Looking forward to the release / will get back to that part of the project real soon.

2.1.0 published with large packet support.

Feel free to open new issues.
Thanks for the feedback.

I report here a few debugging details. Just as memo. :)

The most import thing is starting a Rserve debug instance.
When rio calls Rserve, it displays a few info.

The valuable info is at the end of the debug dump.
Indeed you added this info when you opened the issue. Great!
If the packet is very large you need to wait a few minutes.

...
OUT.sendRespData
HEAD DUMP [16]: 01 00 01 00 d0 fc 84 03 00 00 00 00 00 00 00 00  |................
BODY DUMP [59047120]: 4a c8 fc 84 03 00 00 00 62 c0 fc 84 03 00 00 00 5b 20 20 20 20 20 20 30 2c 20 20 20 20 20 20 31 2c 20 20 20 20 20 20 32 2c 20 20 20 20 20 20 33 2c 20 20 20 20 20 20 34 2c 20 20 20 20 20 20 35 2c 20 20 20 20 20 20 36 2c 20 20 20 20 20 20 37 2c 20 20 20 20 20 20 38 2c 20 20 20 20 20 20 39 2c 20 20 20 20 20 31 30 2c 20 20 20 20 20 31 31 2c 20 20 20 20 20 31 32 2c 20 20 20 20 20 31 33 2c ...  |J.......b.......[      0,      1,      2,      3,      4,      5,      6,      7,      8,      9,     10,     11,     12,     13,
Releasing temporary sendbuf and restoring old size of 2097152 bytes.
reply sent.
Connection closed by peer.
done.

In large data packet the first 16 bytes of BODY DUMP are the header of the payload: 4a c8 fc 84 03 00 00 00 62 c0 fc 84 03 00 00 00.

  • 4A 0x4A & 0x40 = 0x40 // 0x40 is a large packet, usually is zero.
  • 4A 0x4A & ~0x40 = 0x0A // this is the correct value (0x0A) for a response type
  • c8 fc 84 03 00 00 00 hex 0384FCC8 - dec 59047112 // 32 bytes packet length with the header (-8)
  • 62 0x62 & 0x40 = 0x40 // confirmed large packet, again
  • 62 0x62 & ~0x40 = 0x22 // dec 34 - packet type: array of string (nice, it is implemented)
  • c0 fc 84 03 00 00 00 hex 0384FCC0 - dec 59047104 // 32 bytes packet length of the payload (BODY length 59047120 minus 16 bytes of the header)

Was able to reproduce it in 2.4.1.
Node: v0.12.5
NPM - 2.11.2

Is this issue still open.

@yashdosi without any further detail, it is impossible to reproduce it.

Please, read the thread and when you collect new details, open a new issue.

You may give a look at the example 7.
Consider also to update the stack, even if , I think, it is not relevant, but v0.12.5 is really old.