pkiraly/metadata-qa-api

Read column names from the CSV header

So far the Schema has had to list all the fields. The tool should let users select only a subset of fields in the Schema, and it should read the column names from the first row of the CSV file.

A new method has been added to CalculatorFacade:

    setCsvReader(boolean headerAware)

where true makes the reader treat the very first row as the header (column names) row, and false (the default) disables this. One can still use the existing API call setCsvReader(CsvReader csvReader).

This makes life a bit easier if you do not want to analyse all columns, because you can keep the Schema slimmer.
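To illustrate the idea only, here is a minimal standalone sketch of what "header-aware" reading means: the first row supplies the column names, and only the columns named in a slim field list are kept. It does not use metadata-qa-api and is not how the library implements it; the class `HeaderAwareCsvSketch` and the method `readSelected` are hypothetical names, and the comma splitting is deliberately naive (no quoted fields).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration; none of these names come from metadata-qa-api.
class HeaderAwareCsvSketch {

  /**
   * Treats the first line as the header row and keeps only the columns
   * named in schemaFields. Naive comma splitting: quoted fields are not handled.
   */
  static List<Map<String, String>> readSelected(List<String> lines, List<String> schemaFields) {
    List<Map<String, String>> records = new ArrayList<>();
    if (lines.isEmpty()) {
      return records;
    }
    // Column names come from the first row, not from the schema.
    List<String> header = Arrays.asList(lines.get(0).split(",", -1));
    for (String line : lines.subList(1, lines.size())) {
      String[] cells = line.split(",", -1);
      Map<String, String> record = new LinkedHashMap<>();
      for (String field : schemaFields) {
        int index = header.indexOf(field);
        // A field missing from the header, or a short row, yields null.
        record.put(field, (index >= 0 && index < cells.length) ? cells[index] : null);
      }
      records.add(record);
    }
    return records;
  }

  public static void main(String[] args) {
    List<String> csv = Arrays.asList(
      "url,name,alternateName,description,variablesMeasured",
      "http://example.com/1,First,,A description,temperature");
    // Only two of the five columns are listed, as in a slimmed-down Schema.
    System.out.println(readSelected(csv, Arrays.asList("url", "name")));
    // prints [{url=http://example.com/1, name=First}]
  }
}
```

The point of the sketch is that the field list no longer has to mirror the file layout: column positions are looked up by name in the header row, so unlisted columns are simply skipped.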

Previously:

    Schema schema = new BaseSchema()
      .setFormat(Format.CSV)
      .addField(new JsonBranch("url", Category.MANDATORY).setExtractable())
      .addField(new JsonBranch("name"))
      .addField(new JsonBranch("alternateName"))
      .addField(new JsonBranch("description"))
      .addField(new JsonBranch("variablesMeasured"))
     ... // all fields should be listed here
     ;

    CalculatorFacade facade = new CalculatorFacade()
      .setSchema(schema)
      .setCsvReader(
        new CsvReader()
          .setHeader(((CsvAwareSchema) schema).getHeader())) // read column names from the schema
      .enableCompletenessMeasurement()
      .enableFieldCardinalityMeasurement();

With this change:

    Schema schema = new BaseSchema()
      .setFormat(Format.CSV)
      .addField(new JsonBranch("url", Category.MANDATORY).setExtractable())
      .addField(new JsonBranch("name"))
      .addField(new JsonBranch("alternateName"))
      .addField(new JsonBranch("description")) // list only important fields
    ;

    CalculatorFacade facade = new CalculatorFacade()
      .setSchema(schema)
      .setCsvReader(true) // read column names from the first row
      .enableCompletenessMeasurement()
      .enableFieldCardinalityMeasurement();

Isn't it lovely? ;-)