⚠️ **NOTE:** This project is being merged into a monorepo. This repo will soon be archived.
This is a little helper library to complement @jrc03c/js-math-tools and @jrc03c/js-data-science-helpers. It's a relatively thin wrapper around `papaparse`. All it does is load CSV files or strings as `DataFrame` objects and save `DataFrame` objects as CSV files or strings.
Installation:

```bash
npm install --save https://github.com/jrc03c/js-csv-helpers
```
Node & bundlers:

```js
const { loadCSV, saveCSV } = require("@jrc03c/js-csv-helpers")

async function doStuff() {
  // load
  const df = await loadCSV("path/to/my-data.csv")

  // save
  await saveCSV("path/to/other-data.csv", df)
}

doStuff()
```
Browser:

```html
<script src="path/to/dist/js-csv-helpers.js"></script>
<script>
  const { loadCSV, saveCSV } = JSCSVHelpers

  async function doStuff() {
    // load
    const df = await loadCSV("path/to/my-data.csv")

    // save
    await saveCSV("path/to/other-data.csv", df)
  }

  doStuff()
</script>
```
**NOTE:** Usage in both environments is basically identical except for one thing: in the browser, `saveCSV` takes a filename and a `DataFrame`; but in Node, `saveCSV` takes a path and a `DataFrame`. That's because the browser can only download files; it can't (in JS) specify where the files ought to be saved.
`loadCSV`: Given a `path` (and optionally a `config` object), this function returns a `Promise` that resolves to a `DataFrame`.
Given a CSV string and a `config` object, this function returns a `DataFrame` (synchronously).
`saveCSV`: Given a `path` and a `DataFrame` (`data`) (and optionally a `config` object), this function returns a `Promise` that resolves to `undefined`.
`streamLoadCSV`: Given a `path` and a `config` object, this function returns chunks of a `DataFrame` asynchronously. A "chunk" just means a subset of the entire CSV file containing just a few rows. Chunks are returned in the same order in which they appear in the CSV file; i.e., if you're streaming the file 10 rows at a time, then the first chunk will contain rows 1-10, the second chunk will contain rows 11-20, and so on. The number of rows in each chunk can be set using a `rowsPerChunk` property on the `config` object.

**NOTE:** This function currently only works for streaming files from disk. I plan to add support for streaming files over the web but just haven't gotten to it yet.
For example, if I wanted to stream a large CSV 10 rows at a time, I'd do this:

```js
const { streamLoadCSV } = require("@jrc03c/js-csv-helpers")

!(async () => {
  const stream = streamLoadCSV("my-data.csv", {
    inferTypes: true,
    rowsPerChunk: 10,
  })

  for await (const chunk of stream) {
    chunk.print()
  }
})()
```
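Conceptually, chunked streaming boils down to an async generator that yields `rowsPerChunk` rows at a time, in file order. Here's a simplified, self-contained sketch of the mechanics (an illustration, not the library's actual implementation):

```javascript
// Yield successive slices of `rowsPerChunk` rows, preserving file order.
async function* streamRows(rows, rowsPerChunk) {
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    yield rows.slice(i, i + rowsPerChunk)
  }
}

async function demo() {
  // 25 fake rows -> chunks of 10, 10, and 5 rows
  const rows = Array.from({ length: 25 }, (_, i) => [i])
  const chunks = []

  for await (const chunk of streamRows(rows, 10)) {
    chunks.push(chunk)
  }

  return chunks
}
```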
Given a `DataFrame` and a `config` object, this function returns a CSV string (synchronously).
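"Unparsing" is just the reverse trip: joining columns and rows back into delimited text. A naive sketch (again, the real work is done by `papaparse`, which also handles quoting, escaping, and the config options described below):

```javascript
// Naive CSV unparsing: join the header and each row with a delimiter.
// Values containing the delimiter or quotes would need escaping in real use.
function naiveUnparseCSV(columns, rows, delimiter = ",") {
  const lines = [
    columns.join(delimiter),
    ...rows.map(row => row.join(delimiter)),
  ]

  return lines.join("\n")
}

console.log(naiveUnparseCSV(["a", "b"], [[1, 2], [3, 4]]))
// "a,b\n1,2\n3,4"
```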
This library is basically a thin wrapper around `papaparse`. Any configuration object you pass into this library's functions will be passed directly into `papaparse`'s functions. See their documentation for more info.
As of today, Papa's default configuration values for parsing are:

```js
{
  beforeFirstChunk: undefined,
  chunk: undefined,
  chunkSize: undefined,
  comments: false,
  complete: undefined,
  delimiter: "",
  delimitersToGuess: [",", "\t", "|", ";", papa.RECORD_SEP, papa.UNIT_SEP],
  download: false,
  downloadRequestBody: undefined,
  downloadRequestHeaders: undefined,
  dynamicTyping: false,
  encoding: "",
  error: undefined,
  escapeChar: '"',
  fastMode: undefined,
  newline: "",
  preview: 0,
  quoteChar: '"',
  skipEmptyLines: false,
  step: undefined,
  transform: undefined,
  transformHeader: undefined,
  withCredentials: undefined,
  worker: false,

  // I've changed this value from the Papa defaults because, at least for my
  // purposes, I anticipate that most datasets will include a header row.
  header: true,

  // I'm adding this option in case a dataset has (or should have) an index
  // column (i.e., a first column filled with row names).
  index: false,

  // I'm also adding my own option to infer types using my `inferType` function
  // in @jrc03c/js-math-tools. Papa offers a "dynamicTyping" option, but I
  // think maybe mine is a little more extensive (i.e., I think it infers more
  // data types, but may not necessarily be more robust). I'm willing to be
  // wrong about that, though. By default, this value is set to `false`, which
  // means that the returned `DataFrame` will only contain strings.
  inferTypes: false,
}
```
This library only adds one extra option to the configuration object in the `loadCSV` function: setting `inferTypes` to `true` or `false` enables or disables dynamic type inference. By default, `papaparse` doesn't try to figure out what kinds of data your CSV file contains; it merely returns a matrix of strings. It provides an option called `dynamicTyping` which (I think) asks `papaparse` to try to infer data types, but I don't think it's quite as extensive as the one I've written here.
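To illustrate what type inference means here, a rough, self-contained sketch in plain JS (my own illustration — not the actual `inferType` from @jrc03c/js-math-tools): map each CSV cell string to a richer JS value where possible.

```javascript
// Rough type inference: promote cell strings to null, booleans, or numbers
// where possible; otherwise, keep the original string.
function roughInferType(s) {
  if (s === "" || s === "null") return null
  if (s === "true") return true
  if (s === "false") return false

  const n = Number(s)
  if (!Number.isNaN(n)) return n

  return s
}

console.log(roughInferType("3.14")) // 3.14
console.log(roughInferType("true")) // true
console.log(roughInferType("hello")) // "hello"
```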
Here's an example of how to use it:

```js
// use this library's type inference
loadCSV("path/to/my-data.csv", { inferTypes: true })

// use papaparse's type inference
loadCSV("path/to/my-data.csv", { dynamicTyping: true })
```
As of today, Papa's default configuration values for unparsing are:

```js
{
  columns: null,
  delimiter: ",",
  escapeChar: '"',
  header: true,
  quoteChar: '"',
  quotes: false,
  skipEmptyLines: false,

  // This is the only value that's been changed from Papa's defaults.
  newline: "\n",
}
```
Here's an example of how to use it:

```js
saveCSV("path/to/my-data.csv", myDataFrame, { delimiter: "\t" })
```