digital-preservation/csv-schema

Infinite number of (unnamed) columns

adbre commented

We have CSV files in which each record consists of a "header" made up of a fixed number of columns, followed by a "body" made up of a variable number of columns.

Example below with 2 fixed columns (containing letters) followed by a variable number of columns (containing numbers).

A,B,1,2,3,4,5,6,7,8
C,D,9,10,11,12

The concept is very similar to the "varargs" notation in Java or the params keyword in C#.
I'm looking for a way to express this in the schema file.

The schema for this could be expressed as

version 1.1
@noHeader
fixed_column_1: notEmpty
fixed_column_2: notEmpty
variable_column: positiveInteger @infinite

RFC4180 states that each line of a CSV should have the same number of fields, so this is really outside the scope of what we were trying to do.

However, an approach that could work would be to treat the variable portion as a single quote-wrapped field with the commas inside it. So in your example it would become:

A,B,"1,2,3,4,5,6,7,8"
C,D,"9,10,11,12"

with schema

version 1.1
@noHeader
fixed_column_1: notEmpty
fixed_column_2: notEmpty
variable_column: regex("[1-9][0-9]*(,[1-9][0-9]*)*")
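
As a minimal illustration of how that quote-wrapped form could be produced, a small pre-processing sketch (assuming Python and its standard csv module; the file names and function name are purely illustrative and not part of this project):

import csv

# Sketch only: wrap everything after the fixed columns into one quoted field so
# the file has a constant number of columns and can be validated with the
# schema above. File names and the function name are illustrative.
def wrap_variable_tail(src_path, dst_path, fixed=2):
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
        for row in csv.reader(src):
            # Keep the fixed columns as-is and join the remainder into a single
            # field; the embedded commas force that field to be quoted.
            writer.writerow(row[:fixed] + [",".join(row[fixed:])])

# wrap_variable_tail("input.csv", "wrapped.csv")
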
adbre commented

Ah, thanks for that clarification.

Yes, I'm aware of the workaround.

If I modify the scenario a bit, though: the number of columns varies from one file to another, but is the same for every record within a given file.

Example

File A

A,B,1,2,3
C,D,4,5,6

File B

E,F,1,2,3,4,5
G,H,6,7,8,9,10

Would it still violate the RFC and thus be out of scope for this project?
Or would that be possible to support?

It wouldn't directly violate the RFC, but as it stands a schema expects a constant number of columns, so currently all you could do would be to have a different schema for each number of columns. It was a fairly fundamental part of our thinking (maybe it shouldn't have been) that the number of fields was a defining feature of a homogeneous set of CSV files. It would certainly not be something that would be a priority for us right now. (Of course it's not hard to include it in the schema language as such; it's developing the validator to implement the checks where most of the effort lies.) Thoughts @adamretter @rhubner ?

@adbre based on your example:

File A

A,B,1,2,3
C,D,4,5,6

File B

E,F,1,2,3,4,5
G,H,6,7,8,9,10

I am not at all clear what it is you are trying to achieve; both of those appear to me to be valid CSV files for which a CSV Schema could be created. What seems to be the issue?

adbre commented

I would like a single schema file for both file A and B.

The third column, and every column after it, follows the same rules. So I'm wondering if it would be possible to add support in the schema for specifying the last column to be repeated 1..N times (as long as the total number of columns for each record is still consistent for the entire file).

@adbre Okay so I think you are maybe talking about reuse of CSV Schema components.

If so, I would suggest that this is outside of CSV Schema, and you could easily create a tool that allows you to assemble fragments of CSV Schema.

e.g. for the above example, you would have 3 schema components:

  1. The schema component that is common to File A and File B
  2. The schema component that is unique to File A
  3. The schema component that is unique to File B

I would imagine that you could just concatenate (1) with either (2) or (3) to give you a complete CSV Schema. A simple command would do this, or if you wanted to be fancier and have some options, then maybe a short Python script.
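
As a sketch of that assembly step (the helper and file names below are hypothetical; the same effect could come from something like cat common.csvs file-a-columns.csvs > file-a.csvs):

# Hypothetical helper that concatenates CSV Schema fragments into a complete
# schema file; all paths are illustrative.
def assemble_schema(common_path, specific_path, out_path):
    with open(out_path, "w") as out:
        for fragment_path in (common_path, specific_path):
            with open(fragment_path) as fragment:
                out.write(fragment.read().rstrip("\n") + "\n")

# assemble_schema("common.csvs", "file-a-columns.csvs", "file-a.csvs")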

adbre commented

@adamretter I do not think you understood my question.
I understand how I can work around this. Dynamically generating new schemas is trivial.

I just had an idea of how to extend the CSV Schema in a (for me) not unreasonable way, and wanted to share it with the community.

It's fine by me if you do not feel this "feature" belongs in the CSV Schema for one reason or another, and I would respect your decision regardless.

@adbre I think you are correct that I do not understand what you are looking for :-/

Can you perhaps give me an example of what you think needs to be added to CSV Schema to support your needs along with example CSV files and CSV Schema?

adbre commented

Gladly!

file-a.csv

A,B,1,2,3
C,D,4,5,6

file-b.csv

E,F,1,2,3,4,5
G,H,6,7,8,9,10

schema.csvs

version 1.1
@noHeader
fixed_column_1: notEmpty
fixed_column_2: notEmpty
variable_column: positiveInteger @columns(1,*)

In my original post I suggested @infinite, but since then I've given it some thought, and in the example immediately above I have instead used the pattern from the length and range expressions.

Edit: As I also noted in my original post, this would behave very much like the "varargs" notation in Java or the params keyword in C#.

@adbre Okay, that makes more sense to me now. Describing it as varargs didn't convey to me what you wanted.

I think we would need to work on the syntax though; I don't like the syntax variable_column: as that implies a column named variable_column.

We would need something that we could disambiguate in the AST, which currently looks like:

BodyPart                ::=  Comment* ColumnDefinition Comment*
ColumnDefinition        ::=  (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier        ::=  PositiveNonZeroIntegerLiteral | Ident	
QuotedColumnIdentifier  ::=  StringLiteral

Perhaps something like:

BodyPart                ::=  Comment* (ColumnDefinition | ColumnGroup) Comment*
ColumnDefinition        ::=  (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier        ::=  PositiveNonZeroIntegerLiteral | Ident	
QuotedColumnIdentifier  ::=  StringLiteral
ColumnGroup             ::=  "group(" positiveInteger "," positiveIntegerOrAny "):" ColumnRule

The column group would allow you to specify the number of following columns (from the current index) in the form:

Exactly 4 columns - group(4)
Zero or more columns (i.e. up to infinite) - group(0, *)
One or more columns (i.e. up to infinite) - group(1, *)
Minimum of X and maximum of Y columns - group(2, 4)

Obviously we would need some rules about how these interact with each other; for example, you couldn't have two sequential lines like:

group(0, *): positiveInteger
group(1, *): regexp("[A-Z]+")

I think the rules would be that group(n) could be used anywhere, but that group(x, y) must be the final rule in a CSV Schema.

How does that sound? Obviously the syntax could be improved; the above was just off the top of my head ;-)
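
For concreteness (this is only a sketch of the proposed, hypothetical syntax, not part of any published grammar), a single schema covering both the earlier file-a.csv and file-b.csv might read:

version 1.1
@noHeader
fixed_column_1: notEmpty
fixed_column_2: notEmpty
group(1, *): positiveInteger
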

adbre commented

That sounds perfectly reasonable to me!

I realized something, though, when you mentioned named columns... I don't use them, but if someone did, these "group" columns would be unnamed, with no way to give them a name.

A solution could be a third argument (placed first) giving the column/group name, which could be used as a prefix followed by the numerical index of the column within that group.

BodyPart                ::=  Comment* (ColumnDefinition | ColumnGroup) Comment*
ColumnDefinition        ::=  (ColumnIdentifier | QuotedColumnIdentifier) ":" ColumnRule
ColumnIdentifier        ::=  PositiveNonZeroIntegerLiteral | Ident	
QuotedColumnIdentifier  ::=  StringLiteral
ColumnGroup             ::=  "group(" StringLiteral "," positiveInteger "," positiveIntegerOrAny "):" ColumnRule

Which would allow something similar to

group("groupName", 1, *): positiveInteger

This would match the CSV header (given zero-based indexes):

groupName0,groupName1,groupName2...

@adbre I did consider allowing you to name column groups, but then I thought that we could just use the underlying column name or index (if there is no column header) in any validation messages for the specific column.

adbre commented

That works too

@DavidUnderdown What do you reckon then?

Thank you for your suggestion. We have discussed this but have decided that, as this is not something we would use, it will not be prioritised for the 1.2 schema language update. Any pull requests, however, would be welcome.

@minColumns and @maxColumns from #17 would be useful additions to this enhancement.

Forgive my naiveté @adbre, but couldn't the CSV files be pre-processed to pad them with max - n blank columns before validation, where max = the most columns seen (this time) in a CSV, and n = the number of columns in a given CSV? Then update the schema with max columns and do the validation?
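
A rough sketch of that padding idea (assuming Python; the function name and file names are illustrative), noting that it has to read the whole file first to find max, which is essentially the objection raised further down the thread:

import csv

# Sketch only: pad every record out to the widest record seen in the file so
# that a fixed-column schema can then be applied. Names are illustrative.
def pad_to_max(src_path, dst_path, pad_value=""):
    with open(src_path, newline="") as src:
        rows = list(csv.reader(src))          # first pass: load every record
    max_cols = max(len(row) for row in rows)  # max = most columns seen in this file
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in rows:
            writer.writerow(row + [pad_value] * (max_cols - len(row)))

# pad_to_max("file-b.csv", "file-b-padded.csv")
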

The first sentence describing CSV Schema on https://digital-preservation.github.io/csv-schema reads:

A text based schema language (CSV Schema) for describing data in CSV files for the purposes of validation.

I would argue that if you need to pre-process CSV files before using CSV Schema, then the CSV Schema spec has already failed. Beyond its inability to actually describe the data in the CSV, if you are able to reliably perform pre-processing of a CSV without validation first, then the usefulness of many of the directives involving headers is largely diminished.

There are multiple instances in this project where the solution is to dynamically generate a schema to match the data in the CSV (1) (2). The fact that this is considered a valid solution by the authors only serves to make it clear that the goal of CSV Schema is not to describe data in CSV files for the purposes of validation. I can only speculate that the true goal of CSV Schema is to be a pre-processor for the limited and specific needs of the National Archives; other use cases that do not align specifically with the needs of the National Archives are considered out of scope.

With the current direction of CSV Schema, it will only ever be able to validate CSV files for those who have complete control of the CSV file generation process, making it useless for anyone who wishes to use it in an ETL process.

We don't have enough developers currently to implement additional features, I'm afraid, but we are certainly not opposed to the addition of further use cases in principle. If others wish to create appropriate EBNF for additional features, please do propose it as a pull request for inclusion in the 1.2 (or later) draft.

In effect we saw CSV Schema as something to be used after the transform stage (what else is preprocessing except a form of transformation?), to verify that your data will be OK to load. It seems you want to be able to use it earlier in the cycle, immediately after extraction?

To back up my claim, #16 laid out some very clear arguments for an addition to the 1.2 draft, and #17 created the appropriate EBNF changes. I no longer work with this tool, but if the request had been approved, I had a couple of developers who could have worked full time on implementing the features.

The use case was that the company was receiving CSV from multiple and very inconsistent sources over which we had no control. To make things worse, they periodically modified the columns they would include. If we had had a robust schema validator at the time, we could have linked various schemas with specific transformation logic. This matches the use case for schema validation perfectly: validating that a CSV matches the expected data type. A schema is only useful if it is actually able to describe the document. Missing simple features like @optionalColumn makes this an extremely limited schema definition when dealing with real-world CSV files.

adbre commented

Forgive my naiveté @adbre, but couldn't the CSV files be pre-processed to pad them with max - n blank columns before validation, where max = the most columns seen (this time) in a CSV, and n = the number of columns in a given CSV? Then update the schema with max columns and do the validation?

@mhGLEIF It cannot.
In my use case, the number of records is in the millions, and the variable number of columns (n) can be around 30-360 per record.
So I would need a pre-pre-processor to determine the maximum n before running the pre-processor you suggest, padding all those millions of records out to that maximum.