rdfjs/rdfxml-streaming-parser.js

Problem parsing stream

happybeing opened this issue · 14 comments

I'm using RdfXmlParser in the browser and can parse an XML-RDF response successfully if I first get the body as text and pass it to this function:

  consumeXmlText (sourceResultStore, statusTextStore, xmlText, {mimeType, size, validateOnly}) {
    console.log('SourceResult.consumeXmlText()');
    let success = false;
    try {
      console.log('Size: ', xmlText.length);
      const xmlParser = new RdfXmlParser;
      const stream = new Readable;
      stream.push(xmlText);
      stream.push(null);
      stream.pipe(xmlParser)
        .on('data', console.log)
        .on('error', console.error)
        .on('end', () => console.log('All triples were parsed!'));
    } catch (e) {
      if (!validateOnly) throw e;
    }
    return success;
  }

The above outputs a bunch of triples and shows the completion message. All good.

I also have a version of the above function for consuming streams. I can see the chunks being consumed from my (browser WHATWG) stream and written to the XML parser (Node.js stream), but the .on('data') handler is never called. I do still get the completion message, and no errors are reported.

Below is the consumeXmlStream() function I'm using, which calls readableStreamToConsumer() to take the whatwg stream and write it to the XML parser.

  consumeXmlStream (sourceResultStore, statusTextStore, stream,  {mimeType, size}) {
    console.log('SourceResult.consumeXmlStream()');
    this.sourceResultStore = sourceResultStore;
    
    try {
      const rdfDataset = RdfDataset();
      const xmlParser = new XmlToDatasetParser;
      xmlParser
        .on('data', console.log)
        .on('error', console.error)
        .on('end', () => console.log('All triples were parsed!'));
      readableStreamToConsumer(stream, xmlParser);
    } catch(e) {
      console.error(e);
    }
  }

and...

function readableStreamToConsumer(readableStream, consumer) {
  const bodyReader = readableStream.getReader();

  function next () {
    bodyReader.read().then(readChunk);
  }

  function readChunk ({value, done}) {
    if (done) {
      consumer.end();
      return;
    }
    console.log('CONSUMER value:'); console.dir(value);
    consumer.write(value);
    next();
  }

  next();
}

As noted, I see the chunks in the console output, so data is being passed to consumer.write(value) and so to the XML parser, but it is as if nothing is being written before consumer.end() is called. In the console the chunks are Uint8Arrays, and converting the byte values to characters I can see the first one starts with <?xml.

The Node.js streams docs say write accepts Uint8Array, so I'm not sure what I'm doing wrong or if this is a bug. Here's the console output of the start of the first chunk logged before write:

Uint8Array(87046)
  [0…999]
    [0…99]
      0: 60
      1: 63
      2: 120
      3: 109
      4: 108
      5: 32
      6: 118
      7: 101
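The byte values in the dump can be checked by decoding them. The sketch below does that with the standard TextDecoder; makeChunkDecoder is a hypothetical helper name, not part of any library mentioned here. Passing { stream: true } is an assumption about how one would decode a response chunk by chunk, so that a multi-byte character split across two chunks is not corrupted.

```javascript
// First bytes of the chunk logged above.
const firstBytes = new Uint8Array([60, 63, 120, 109, 108, 32, 118, 101]);

// Hypothetical helper: returns a function that turns successive
// Uint8Array chunks into strings. { stream: true } makes the decoder
// hold back an incomplete multi-byte UTF-8 sequence until the next call.
function makeChunkDecoder(encoding = 'utf-8') {
  const decoder = new TextDecoder(encoding);
  return (chunk) => decoder.decode(chunk, { stream: true });
}

const decodeChunk = makeChunkDecoder();
console.log(decodeChunk(firstBytes)); // "<?xml ve"
```

In readChunk(), calling consumer.write(decodeChunk(value)) would hand the parser strings instead of bytes. Whether that changes the outcome here has not been verified in this thread.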

There are definitely triples there to be parsed, as verified by the consumeXmlText() function working as expected (see start). Any ideas? I'm stuck! Thanks.

To be honest, I don't have much experience with WhatWG streams, so I don't have an immediate answer for you.

If the encoding were wrong, I would expect at least an error, but that does not appear to be the case.

Are you sure that the data you are parsing is identical in Node and the browser? If not, then it could be that the data parsed in the browser simply does not contain any triples. (which could be explained by a missing baseIRI)

Thanks for responding Ruben. Yes, I'm sure the content is the same so it is baffling. It's the same query in both cases so I can't see how the content can be different.

It isn't a Node.js vs. browser issue. In the first instance I'm getting the response content as text and passing that, rather than making a stream and using that. BTW, both methods work with graphy's Turtle parser (which also requires Node.js streams).

The only difference in the code is that in the case that works I'm using response.body.text() and passing that, and in the second case I'm using the stream from response.body via readableStreamToConsumer(), which, as I say, works fine with graphy.

I would suspect graphy is doing something more than what the RDFJS stream spec requires, is this correct @blake-regalia?

Sorry, but I'm not entirely clear on what the problem is @theWebalyst , can you please clarify this part?

but it is as if nothing is being written before consumer.end() is called.

What exactly is happening or not happening that is expected and how is that observed?

@blake-regalia When I switch from the text version to the stream version, I can see the data is being processed/consumed, but no triples are being generated, so:

consumer.write(value) is being called and the data looks as expected (the Uint8Array dump above), but .on('data', console.log) is never called in consumeXmlStream(), which is what I mean by it being as if the input didn't contain any triples.

I'm sure there are triples there, or the text version would not generate any. So the confusion is why this same approach (also using readableStreamToConsumer()) works with graphy.js when parsing Turtle but isn't working here when parsing XML.

@rubensworks in constructor() and #import(), you are using objectMode: true but I think you only want readableObjectMode: true (see options here) because the writable side of the Transform (via Duplex) should accept Buffers/strings, otherwise it will not handle encodings. Am I evaluating this correctly?

If so, I'd be happy to add test case(s) and PR it.
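The difference between the two options can be sketched with a small Transform that reports what reaches transform(). makeChunkInspector is a hypothetical helper written for illustration; it is not the parser's code and makes no claim about how the parser handles its input internally.

```javascript
const { Transform } = require('stream');

// Hypothetical helper (not part of rdfxml-streaming-parser): builds a
// Transform that emits a description of each chunk it receives.
function makeChunkInspector(streamOptions) {
  return new Transform({
    ...streamOptions,
    transform(chunk, encoding, callback) {
      callback(null, {
        isBuffer: Buffer.isBuffer(chunk),
        asString: chunk.toString(),
      });
    },
  });
}

const bytes = new Uint8Array([60, 63, 120, 109, 108]); // "<?xml"

// objectMode: the writable side passes the Uint8Array through untouched,
// so toString() yields the comma-separated byte values "60,63,120,109,108".
// readableObjectMode: the writable side stays in byte mode and converts
// the Uint8Array to a Buffer, so toString() yields "<?xml".
for (const options of [{ objectMode: true }, { readableObjectMode: true }]) {
  const inspector = makeChunkInspector(options);
  inspector.on('data', (seen) => console.log(options, seen));
  inspector.end(bytes);
}
```

If a consumer stringifies whatever chunk it is given, the objectMode case would produce text that contains no XML at all, which would be consistent with "no triples, no error". That is an assumption, not something established in this thread.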

I think you only want readableObjectMode: true

TIL about this option!
This definitely makes sense. I'm surprised though that it took this long for issues with this to occur, as this library has been used quite often both within Node and browsers.

In any case, a PR is definitely welcome.

Note to self: apply this change to my other parsers as well.

Not everyone reports issues. It's a good feeling all round 😄 I'll test it out when ready. Thanks both of you.

Although the option should still be changed, I haven't been able to reproduce an error yet. @rubensworks have you had any luck?

I've created a branch and deployed it so you can see the error live and check the code on github.

To try it live:

  • go to http://vlab.happybeing.com
  • from the "Example SPARQL Queries" drop down select "Test Yago/XML-RDF - Alber Einstein" which loads a simple query which will return XML-RDF.
  • click "Run Query"
  • if you get an error notification, you may need to disable CORS in the browser for this to work - check the console if the query generates an error. I use a browser plugin (Moesif CORS in Chromium, CORS Everywhere in FF)

If you don't get an error then you can try inspecting what happens in the browser console and debugger. The last console output should be "All triples were parsed!" and if any were received from the output they should have been printed to the console - but none are printed. This is the problem. Passing the same query output as text rather than a stream does produce output using this parser, so the question is why this doesn't happen when I'm passing the response as a stream.

You can find the code for the parsing here:
https://github.com/theWebalyst/visualisation-lab/blob/rdfxml-error/src/interfaces/SourceInterface.js#L429-L453

If you want to see the response being parsed as text, change the true in the following line to false:
https://github.com/theWebalyst/visualisation-lab/blob/rdfxml-error/src/interfaces/SourceInterface.js#L857

Doing that will cause consumeXmlText() to be called with the RDF-XML response as text when you run the query again.

I haven't been able to reproduce an error yet. @rubensworks have you had any luck?

Not yet, unfortunately. I hope to look into this sometime next month; I currently don't have the bandwidth for it.

Thanks for the example @theWebalyst!

The suggestion from @blake-regalia fixed the problem!

Released as 1.3.6.

Confirmed. Thanks both of you 👍