pl.readJSON Fails on JSON with Newline Characters Despite "lines" Format Setting
Closed this issue · 2 comments
denisgermano commented
Have you tried latest version of polars?
- [yes]
What version of polars are you using?
nodejs-polars@0.15.0
What operating system are you using polars on?
MacOS 14.6.1 M2 Max
What node version are you using?
Node v22.7.0
Describe your bug.
When using the pl.readJSON function to load NDJSON data, the function fails if any JSON string contains a newline character (\n). This issue is present even when the format option is set to "lines" as per the documentation.
What are the steps to reproduce the behavior?
const pl = require("nodejs-polars");
let jsonData = `
{"id":"2489651051","type":"PushEvent"}
{"id":"2489651045","type":"Create\nEvent"}
{"id":"2489651053","type":"PushEvent"}
`;
let df2 = pl.readJSON(jsonData, { format: "lines" })
console.log("FROM READ", df2);
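One detail of the snippet above may be worth checking independently of polars: inside a JavaScript template literal, `\n` is an escape sequence, so the second record is split across two physical lines before `pl.readJSON` ever sees it. The sketch below uses only plain Node.js. The helper name `isValidJson` is made up for illustration, and nothing here exercises nodejs-polars itself.

```javascript
// Plain Node.js, no polars involved: inspect what the literals really contain.
const cooked = `{"id":"2489651045","type":"Create\nEvent"}`;
const raw = String.raw`{"id":"2489651045","type":"Create\nEvent"}`;

// In `cooked`, \n has become a real line feed, so a line split yields two fragments.
const cookedLines = cooked.split("\n");
console.log(cookedLines.length); // 2

// Hypothetical helper: reports whether a string parses as JSON.
function isValidJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// A raw line feed inside a JSON string is not valid JSON.
console.log(isValidJson(cooked)); // false

// String.raw keeps the backslash and the n as two characters, which is valid JSON.
console.log(isValidJson(raw)); // true
console.log(JSON.parse(raw).type === "Create\nEvent"); // true
```

Whether the Rust reader should tolerate raw line feeds inside strings is a separate question; this only shows what the input string contains.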
What is the actual behavior?
A syntax error is raised when parsing the NDJSON:
/Users/denis.germano/node_modules/nodejs-polars/bin/io.js:137
return (0, dataframe_1._DataFrame)(method(Buffer.from(pathOrBody, "utf-8"), options));
^
Error: Syntax at character 0
at Object.readJSON (/Users/denis.germano/node_modules/nodejs-polars/bin/io.js:137:48)
at Object.<anonymous> (/Users/denis.germano/Downloads/example_polars/poc-wip.js:19:14)
at Module._compile (node:internal/modules/cjs/loader:1546:14)
at Module._extensions..js (node:internal/modules/cjs/loader:1691:10)
at Module.load (node:internal/modules/cjs/loader:1317:32)
at Module._load (node:internal/modules/cjs/loader:1127:12)
at TracingChannel.traceSync (node:diagnostics_channel:315:14)
at wrapModuleLoad (node:internal/modules/cjs/loader:217:24)
at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:166:5)
at node:internal/main/run_main_module:30:49 {
code: 'GenericFailure'
}
Node.js v22.7.0
What is the expected behavior?
The data should parse correctly, as it does when read from a stream:
const pl = require("nodejs-polars");
const Stream = require('stream');
const readStream = new Stream.Readable({ read() { } });
readStream.push(`${JSON.stringify({ "id": "2489651051", "type": "PushEvent" })} \n`);
readStream.push(`${JSON.stringify({ "id": "2489651045", "type": "Create\nEvent" })} \n`);
readStream.push(`${JSON.stringify({ "id": "2489651053", "type": "PushEvent" })} \n`);
readStream.push(null);
pl.readJSONStream(readStream, { format: "lines" }).then(
df1 => console.log("FROM STREAM", df1)
)
Results
FROM STREAM shape: (3, 2)
┌────────────┬───────────┐
│ id         ┆ type      │
│ ---        ┆ ---       │
│ str        ┆ str       │
╞════════════╪═══════════╡
│ 2489651051 ┆ PushEvent │
│ 2489651045 ┆ Create    │
│            ┆ Event     │
│ 2489651053 ┆ PushEvent │
└────────────┴───────────┘
What do you think polars should have done?
Escape the inner \n characters.
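As a rough illustration of what escaping the inner `\n` could look like on the caller's side, the sketch below builds the NDJSON payload with `JSON.stringify`, the same way the stream example does, so each record occupies exactly one physical line. The `records` array is sample data mirroring this report, and the final `pl.readJSON` call is left as a comment because it was not verified here.

```javascript
// Sample records mirroring the ones in this report.
const records = [
  { id: "2489651051", type: "PushEvent" },
  { id: "2489651045", type: "Create\nEvent" },
  { id: "2489651053", type: "PushEvent" },
];

// JSON.stringify writes the line feed as the two characters \ and n,
// so every record stays on a single physical line.
const ndjson = records.map((record) => JSON.stringify(record)).join("\n");

const lines = ndjson.split("\n");
console.log(lines.length); // 3

// Each line parses on its own and the embedded line feed survives the round trip.
const roundTripped = lines.map((line) => JSON.parse(line));
console.log(roundTripped[1].type === "Create\nEvent"); // true

// Assumption, not verified here: the string could then be passed to the
// call from the report, e.g. pl.readJSON(ndjson, { format: "lines" }).
```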
Bidek56 commented
This is an issue with the core Rust engine. I get the same error in py-polars:
import polars as pl
from io import StringIO
json_str = '[{"foo":"foo\nfoo","bar":6},{"foo":2,"bar":7},{"foo":3,"bar":"8\nfoo"}]'
pl.read_json(StringIO(json_str))
pydf = PyDataFrame.read_json(
^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: Syntax at character 0
Please raise this issue with the core team and close this ticket.
I wish I could transfer this ticket to the core team but I do not have the permission.
Thx
denisgermano commented
Thanks @Bidek56
Issue on core rust polars: pola-rs/polars#18535