fabrice-etanchaud/dbt-dremio

"file" materialization : add more export formats

Closed this issue · 4 comments

Currently, Dremio's documentation only mentions Parquet tables,
but according to the source code, CTAS accepts undocumented format options:

CREATE TABLE xxx STORE AS (format options) [WITH SINGLE WRITER] AS SELECT yyy

where format options can be :

- type => 'json', prettyPrint => false
- type => 'text', fieldDelimiter => ',', lineDelimiter => '\r\n'
- type => 'parquet', outputExtension => 'myparquet'
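A minimal sketch of how such a statement could be assembled from a plain mapping of format options. The function names (`format_value`, `render_format_options`, `render_ctas`) are hypothetical and not part of dbt-dremio. Doubling single quotes to escape them is an assumption based on standard SQL, not something confirmed against Dremio's parser.

```python
def format_value(value):
    """Render a Python value as a literal for the format options list."""
    # bool must be tested before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Assumption: single quotes are escaped by doubling them.
    return "'" + str(value).replace("'", "''") + "'"


def render_format_options(options):
    """Render a dict as "key => value, key => value"."""
    return ", ".join(f"{key} => {format_value(val)}" for key, val in options.items())


def render_ctas(table, select_sql, options, single_writer=False):
    """Build a CTAS statement in the shape quoted in this issue."""
    parts = [f"CREATE TABLE {table}", f"STORE AS ({render_format_options(options)})"]
    if single_writer:
        parts.append("WITH SINGLE WRITER")
    parts.append(f"AS {select_sql}")
    return " ".join(parts)


if __name__ == "__main__":
    # A raw string keeps the backslashes, so the SQL text contains '\r\n' literally.
    text_options = {"type": "text", "fieldDelimiter": ",", "lineDelimiter": r"\r\n"}
    print(render_ctas("xxx", "SELECT yyy", text_options, single_writer=True))
```

Keeping the options in a dict means the same mapping could later come from a model's configuration without changing the rendering code.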

to investigate : are these memory tables ? how long do they persist ?

- type => 'arrow'

to investigate :

SELECT * FROM TABLE(pds/vds path (type => 'excel', extractHeader => true, hasMergedCells => true, xls => true))

can we export as excel ?

  • arrow tables are dumps of arrow memory structures (input/output format).

  • excel is an input only format

  • input : excel

    public String sheet;
    public boolean extractHeader;
    public boolean hasMergedCells;
    public boolean xls; /** true if we are reading .xls, false if it's .xlsx */

  • in/output : text

    public String lineDelimiter = "\n";
    public char fieldDelimiter = '\u0000';
    public char quote = '"';
    public char escape = '"';
    public char comment = '#';
    public boolean skipFirstLine = false;
    public boolean extractHeader = false;
    public boolean autoGenerateColumnNames = false;
    public boolean trimHeader = true;
    public String outputExtension = "txt";

  • in/output : json

    public String outputExtension = "json";
    public boolean prettyPrint = true;

  • in/output : parquet

    public boolean autoCorrectCorruptDates = true;
    public String outputExtension = "parquet";

delta, avro, iceberg ???

So, let's go for text and json extra formats.

To stay consistent with dbt's sources/exposures, I think formatting should only be specified in sources and exposed datasets, so I will start with:

  • for sources : use the external map, and override builtins.source() to decorate the rendered relation (hoping that looking up the right source in the graph will stay lightweight)
  • for file materialization : add the same external map in the configuration (excluding excel parameters, as dremio can only read excel files)
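A rough sketch of how the external map could be checked for the file materialization, rejecting excel because it is input only. The option names come from the field lists above; the function name `validate_output_format` and the shape of the map (a flat dict with a `type` key) are assumptions. Whether Dremio honours every listed text option when writing has not been verified here.

```python
# Options allowed per output format, taken from the field lists in this issue.
OUTPUT_FORMAT_OPTIONS = {
    "text": {
        "lineDelimiter", "fieldDelimiter", "quote", "escape", "comment",
        "skipFirstLine", "extractHeader", "autoGenerateColumnNames",
        "trimHeader", "outputExtension",
    },
    "json": {"outputExtension", "prettyPrint"},
    "parquet": {"autoCorrectCorruptDates", "outputExtension"},
}

# Formats that can be read but not written.
INPUT_ONLY_FORMATS = {"excel"}


def validate_output_format(external):
    """Return the map unchanged if it is usable for export, else raise ValueError."""
    fmt = external.get("type")
    if fmt in INPUT_ONLY_FORMATS:
        raise ValueError(f"'{fmt}' is an input-only format and cannot be exported")
    if fmt not in OUTPUT_FORMAT_OPTIONS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    unknown = set(external) - {"type"} - OUTPUT_FORMAT_OPTIONS[fmt]
    if unknown:
        raise ValueError(f"unknown options for '{fmt}': {sorted(unknown)}")
    return external


if __name__ == "__main__":
    print(validate_output_format({"type": "json", "prettyPrint": False}))
```

Failing early on an unknown option gives a clearer error than letting the CTAS statement fail on the server side.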

Overriding builtins.xxx won't work, as they return a Relation rather than its rendered form.
I will have to override Relation.render() instead.
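To illustrate the idea, here is a standalone sketch of a relation whose render() wraps the path in TABLE(...) when format options are present. `FormattedRelation` is a hypothetical stand-in and does not subclass dbt's actual Relation class. Quoting each path part with double quotes is also an assumption.

```python
from dataclasses import dataclass, field


def _literal(value):
    """Render a Python value as a format option literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Assumption: single quotes are escaped by doubling them.
    return "'" + str(value).replace("'", "''") + "'"


@dataclass
class FormattedRelation:
    """Hypothetical relation carrying optional format options."""
    path: tuple
    format_options: dict = field(default_factory=dict)

    def render(self):
        # Assumption: identifiers are quoted with double quotes.
        rendered = ".".join(f'"{part}"' for part in self.path)
        if not self.format_options:
            return rendered
        options = ", ".join(
            f"{key} => {_literal(val)}" for key, val in self.format_options.items()
        )
        return f"TABLE({rendered} ({options}))"

    def __str__(self):
        # Templates interpolate the relation as a string, so delegate to render().
        return self.render()


if __name__ == "__main__":
    relation = FormattedRelation(
        ("my_source", "reports.xlsx"),
        {"type": "excel", "extractHeader": True},
    )
    print(f"SELECT * FROM {relation}")
```

Because the decoration happens in render(), anything that interpolates the relation into SQL picks up the format options without needing to know about them.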