assistunion/xml-stream

Problems with special encoded character

vincentsaluzzo opened this issue · 2 comments

I'm parsing a large XML file (700 MB), and one line contains a special encoded character.
When xml-stream reaches this line, it fails with this error:

events.js:141
      throw er; // Unhandled 'error' event
      ^

Error: reference to invalid character number in line 12482025
    at parseChunk (/Users/.../node_modules/xml-stream/lib/xml-stream.js:514:26)
    at ReadStream.<anonymous> (/Users/.../node_modules/xml-stream/lib/xml-stream.js:521:7)
    at emitOne (events.js:77:13)
    at ReadStream.emit (events.js:169:7)
    at readableAddChunk (_stream_readable.js:146:16)
    at ReadStream.Readable.push (_stream_readable.js:110:10)
    at onread (fs.js:1744:12)
    at FSReqWrap.wrapper [as oncomplete] (fs.js:576:17)

Any ideas?

I ran into the same problem. I have a line containing "2 & 3", and the parser fails with an error there. What can I do about this?
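One possible workaround for a bare ampersand like the one above is to escape it before the text reaches the parser. The following is a minimal sketch, not part of xml-stream: `escapeBareAmpersands` is a hypothetical helper name, and it assumes the input is a complete string or line (a chunk boundary that splits an entity such as `&amp;` in two would be mis-escaped) and that the data has no CDATA sections, where a raw `&` is legal and should be left alone.

```javascript
// Hypothetical helper (an assumption for illustration, not an xml-stream API).
// Replaces every "&" that does not begin a well-formed entity reference
// (&name;) or numeric character reference (&#123; or &#x1F;) with "&amp;".
function escapeBareAmpersands(text) {
  return text.replace(
    /&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/g,
    '&amp;'
  );
}

// Example: the bare ampersand is escaped, the existing entity is untouched.
console.log(escapeBareAmpersands('<a>2 & 3 &lt; 4</a>'));
```

This only checks that a reference is syntactically well formed. It does not verify that a named entity is actually declared, or that a numeric reference points at a character XML allows.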

I came here hoping to find an option for preserving the encoding of $text, but I'm interested in the above, too.

Similarly, I've used this to stream the contents of WordPress site backups, and I often run into errors like this at the beginning of that process. We've ended up chalking them up to data that needs cleaning. Microsoft's nonprinting control characters are probably the biggest problem, and we've encountered them often enough that we built a find/replace tool with a list of them. While you're seeing one seemingly random character, there are many, many more like it.
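A cleaning step of the kind described above could look roughly like the sketch below. It is not the tool mentioned in the comment and not part of xml-stream. `isValidXmlCodePoint` and `stripInvalidXmlChars` are hypothetical names, and the allowed ranges are taken from the `Char` production of the XML 1.0 specification. The sketch assumes the input is a complete string rather than an arbitrary stream chunk, and it does not handle lone UTF-16 surrogates.

```javascript
// Hypothetical helpers (assumptions for illustration, not an xml-stream API).

// True if the code point is allowed by the XML 1.0 "Char" production.
function isValidXmlCodePoint(cp) {
  return cp === 0x9 || cp === 0xA || cp === 0xD ||
    (cp >= 0x20 && cp <= 0xD7FF) ||
    (cp >= 0xE000 && cp <= 0xFFFD) ||
    (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Removes raw characters and numeric character references that XML 1.0
// forbids, such as the control character U+001F or the reference "&#x1F;".
function stripInvalidXmlChars(text) {
  // Step 1: drop raw control characters other than tab, LF, and CR,
  // plus the noncharacters U+FFFE and U+FFFF.
  const withoutRaw = text.replace(
    /[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]/g,
    ''
  );
  // Step 2: drop numeric references whose target is not a valid XML char.
  return withoutRaw.replace(/&#(x[0-9A-Fa-f]+|[0-9]+);/g, (match, body) => {
    const cp = body[0] === 'x'
      ? parseInt(body.slice(1), 16)
      : parseInt(body, 10);
    return isValidXmlCodePoint(cp) ? match : '';
  });
}

// Example: the invalid reference is removed, the valid one is kept.
console.log(stripInvalidXmlChars('<t>bad&#x1F;ref, good &#169; ref</t>'));
```

Deleting the offending characters is a design choice. Replacing them with a space or U+FFFD would keep a visible trace of where the data was altered.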

Control characters and encoded characters make a bit of sense to me. I imagine there's some sort of evaluation of the item to see if it contains an XML child, and that evaluation throws this error. The weirdest case to me, though, is when an item in the XML contains a run of spaces (as few as 10): xml-stream reports the same error. I wish I had the exact error lines so I could add them, but we ran across that case about this time last year, so they're lost in rather old logs. It could be unrelated, but if this does get fixed, I'd appreciate that case being addressed as well.