lukaswagner/csv-parser

proposal for generic parser interface

Currently, the library supports CSV and TSV data, while external spreadsheet services (Google Sheets and Excel) deliver JSON data. In order to parse the JSON data, several TransformStreams are used: the raw bytes are decoded to text, the JSON format is transformed to CSV, and the text is encoded back to raw bytes. Finally, the transformed data is passed to the parser, which handles it using the same CSV/TSV parser logic. This is obviously not a sophisticated approach and could be improved.
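To make the current workaround concrete, a rough sketch (the identifiers are hypothetical, not the library's actual code):

declare const jsonBytes: ReadableStream<Uint8Array>; // raw service response (hypothetical)
declare const jsonToCsvTransform: TransformStream<string, string>; // JSON -> CSV rewriter (hypothetical)

// decode bytes to text, rewrite the JSON as CSV text, re-encode to bytes,
// then hand the result to the existing CSV/TSV parsing logic
const csvBytes = jsonBytes
    .pipeThrough(new TextDecoderStream())
    .pipeThrough(jsonToCsvTransform)
    .pipeThrough(new TextEncoderStream());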

Proposal

All logic related to parsing a specific format could be abstracted into a generic Parser interface. I think this would affect four functions that are currently unrelated: parse, splitLine, splitLines, and maybe parseLine.

interface Parser {
    parse(chunks: ArrayBufferLike[], start: Position, end: Position): string[];
    parseLine(line: string[], types: DataType[]): unknown[];
    splitLine(line: string, delimiter: string): string[];
    splitLines(chunk: string, lines: string[], remainder?: string): string;
}
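For illustration, a deliberately naive splitLine of a CsvParser could look like this (a sketch only; quoted fields and escaped delimiters are ignored, and the remaining methods are left out via Pick):

class CsvParser implements Pick<Parser, 'splitLine'> {
    splitLine(line: string, delimiter: string): string[] {
        // naive split; a real implementation must respect quoted fields
        return line.split(delimiter);
    }
}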

This would allow defining different parsers for different data sources: CsvParser implements Parser, JsonParser implements Parser, etc.
Additionally, we could expose the Parser interface, so users could define their own parser for tabular data with a custom format, e.g. parsing a table from a Markdown file. A possible usage could be:

import { Parser } from "@lukaswagner/csv-parser";

class MyCustomParser implements Parser {
    // implementations
}

// ...

const myCustomParser = new MyCustomParser();
await parser.open("[custom-file]", { parser: myCustomParser });

The custom parser could be passed as an instance or as a class. If the parser has no state, the methods could also be defined as static. But I'm open to opinions on which API would be best.
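To illustrate the instance-or-class question: the library could accept either and normalize internally (hypothetical helper, not part of the current API):

// accept either a ready-made instance or a no-argument constructor
type ParserInput = Parser | (new () => Parser);

function resolveParser(input: ParserInput): Parser {
    return typeof input === 'function' ? new input() : input;
}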

Considerations

While this abstraction would enable using the library in more use cases, it would give it more responsibilities than the name "CSV parser" suggests. It might be worth thinking about splitting the library into multiple packages, e.g. a main library data-parser plus csv-parser, json-parser, etc. that could be integrated like plugins. Since we're already using a monorepo, introducing more packages should not be a big deal. A bigger problem is that the csv-parser package, which currently contains the actual library, would become a plugin - so we would have to communicate that users need to switch packages. But maybe you have other thoughts/better ideas on how to handle this case.

Yes, that's probably the best way forward.

Regarding the package: How about a base parser, a specialized plugin and a package wrapping both for easier usage?

  • base-parser
  • plugin
    • csv-plugin
    • json-plugin
  • parser
    • data-parser -> bundles base-parser and all plugins
    • csv-parser -> bundles base-parser and csv-plugin
    • json-parser -> bundles base-parser and json-plugin

While this would increase the number of packages, users would keep the option to include just a single package.
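A wrapper package could then be a thin entry point that merely re-exports the pieces it bundles, e.g. (hypothetical layout, names from the list above):

// data-parser/index.ts (hypothetical): one install that provides
// the base parser together with all first-party plugins
export * from 'base-parser';
export * from 'csv-plugin';
export * from 'json-plugin';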

In general I agree with your proposed package structure. I also think that we should provide all packages for maximum flexibility and convenience.
But how would the following use case be handled? As a user, I want to parse CSV, JSON, and my own custom data format in my application. Should I install the packages base-parser, csv-plugin, and json-plugin and write my own custom-plugin? Or should I install the data-parser package and be able to inject my own custom-plugin? So the actual question is: should it be possible to extend data-parser and the other *-parser libs with a custom parser plugin, or should this only be possible for the base-parser?

Additionally, the complexity of the monorepo would increase. So maybe we should think about introducing more/better monorepo tooling, e.g. for better changelog management and publishing. I think np works for single packages only, so a different tool might be a better fit.
Furthermore, we should reconsider how/where to publish these packages. Should all of them be published under the "@lukaswagner" namespace? We could introduce an NPM organization that owns all of these packages (NPM organizations with only public packages are free). In that case we might also want to use a GitHub organization with the same name (I believe this would be free as well).

Sorry for the late reply, I thought I had already replied to this.

IMO every parser should be able to be extended by a plugin. I can't really think of a downside to this. The specialized parsers could be simple wrappers that subclass the base parser and register the plugin:

import { BaseParser, LoaderOptions } from 'base-parser';
import { JsonPlugin } from 'json-plugin';

class JsonParser extends BaseParser {
    public constructor(options: Partial<LoaderOptions>) {
        super(options);
        // pre-register the JSON plugin so the parser works out of the box
        this.registerPlugin(new JsonPlugin());
    }
}

This way, both approaches are supported: Starting with the base parser and adding all required plugins, or using a pre-configured parser and adding what's missing for your use case.
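Hypothetically, with the names from the sketch above (CsvPlugin and MyCustomPlugin are assumed), both paths would look like:

const options = {}; // whatever LoaderOptions the app needs

// option 1: start from the base parser and register exactly what you need
const parser = new BaseParser(options);
parser.registerPlugin(new CsvPlugin());
parser.registerPlugin(new MyCustomPlugin()); // hypothetical user plugin

// option 2: start from a pre-configured parser and add what's missing
const jsonParser = new JsonParser(options);
jsonParser.registerPlugin(new MyCustomPlugin());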

Regarding the monorepo structure: Yeah, I believe you're right. With the current setup, the increased number of packages would become quite annoying to handle.

Yes, these packages should be published under their own namespace. We should choose a name which is still available both as a package and namespace. This way, the full parser could be made available as a non-namespaced package. I'm not sure if we need to create a separate GitHub organization, though.

Another question then: how would you realize automatic format detection? Consider this case: I have a parser with multiple loader plugins (e.g. CSV and JSON). Would the parser try to load the data with every plugin and choose the successful one, or should the user pass an identifier for the plugin they want to use?

To give some pointers on useful monorepo tools: for general monorepo management, Nx or Turborepo are probably the best choices. For versioning and changelogs, changesets could be nice. I have no practical experience with these tools yet, but I would choose them for my next monorepo project.

Interesting question, hadn't thought about this thoroughly yet.

There should always be the option for the user to request a plugin when adding a data source. The least intrusive way of adding such an interface seems to be allowing _options.dataSources to store DataSources alongside a plugin choice:

type ExtendedDataSource = DataSource | { plugin: string; dataSource: DataSource };

This would allow keeping the simple interface, but also providing the option to choose a plugin.
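Hypothetical usage, assuming a DataSource can be a plain URL/file reference:

const sources: ExtendedDataSource[] = [
    'cities.csv',                                       // plugin picked by detection
    { plugin: 'json', dataSource: 'sheet-export.json' } // plugin requested explicitly
];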

For now, I wouldn't invest too much time into an elaborate format detection, as it is prone to becoming outdated as new plugins are added. Regardless, some thoughts on the options:

  • filename/URL regex check: The simplest option would be for each plugin to provide a regex (or an array of regexes) for the sources it supports; a small sketch follows this list. This works for file uploads and URLs.
    • /\.(c|t)sv$/ for CSV/TSV (or even /\..sv$/ or /\.txt$/, but these could lead to false positives)
    • /\.json$/, /docs\.google\.com/ and /onedrive\.live\.com/ for JSON
  • content-based check: Much more complicated. file-type seems to be a helpful lib that supports passing only the start of the file, but it supports neither CSV nor JSON.
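A minimal sketch of such a regex-based lookup, assuming each plugin advertises the patterns it supports (PluginMatcher and detectPlugin are hypothetical names):

// hypothetical shape: each plugin advertises regexes for sources it handles
interface PluginMatcher {
    name: string;
    patterns: RegExp[];
}

const matchers: PluginMatcher[] = [
    { name: 'csv', patterns: [/\.(c|t)sv$/] },
    { name: 'json', patterns: [/\.json$/, /docs\.google\.com/, /onedrive\.live\.com/] },
];

// returns the first plugin whose patterns match the filename/URL,
// or undefined if none matches (the user must then choose explicitly)
function detectPlugin(source: string): string | undefined {
    return matchers.find((m) => m.patterns.some((r) => r.test(source)))?.name;
}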

A decision on which plugin to use should be made before parsing begins. Parsing a data source with multiple plugins and choosing the successful one afterwards would defeat the library's focus on performance, IMO.

Regarding tooling: From a quick look, Nx seems to be the more mature of the two projects, and it claims to have more features, so I'd probably prefer it.
On changesets: Not sure if this is necessary. But I said the same about monorepo management tools earlier... so I guess we could try it.