"file" materialization : add more export formats
Currently, Dremio's documentation only mentions parquet tables,
but according to the source code, CTAS accepts undocumented options :
CREATE TABLE xxx STORE AS (format options) [WITH SINGLE WRITER] AS SELECT yyy
where format options can be :
- type => 'json', prettyPrint => false
- type => 'text', fieldDelimiter => ',', lineDelimiter => '\r\n'
- type => 'parquet', outputExtension => 'myparquet'
- type => 'arrow'
  to investigate : are these memory tables ? how long do they persist ?
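As a rough illustration of how such a statement could be assembled from a map of options, here is a minimal Python sketch. The helper names `render_option` and `render_ctas` are hypothetical and belong to neither dbt nor Dremio; the clause layout simply mirrors the syntax quoted above.

```python
def render_option(value):
    """Render a Python value as a SQL literal for a format option."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings are single-quoted; embedded single quotes are doubled.
    return "'" + str(value).replace("'", "''") + "'"


def render_ctas(table, select_sql, format_options, single_writer=False):
    """Build a CTAS statement with a STORE AS clause (hypothetical helper)."""
    options = ", ".join(
        f"{name} => {render_option(value)}"
        for name, value in format_options.items()
    )
    writer = " WITH SINGLE WRITER" if single_writer else ""
    return f"CREATE TABLE {table} STORE AS ({options}){writer} AS {select_sql}"


print(render_ctas("xxx", "SELECT yyy", {"type": "json", "prettyPrint": False}))
```

Booleans are checked before numbers because `bool` is a subclass of `int` in Python; otherwise `False` would render as `False` rather than `false`.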
to investigate :
SELECT * FROM TABLE(pds/vds path (type => 'excel', extractHeader => true, hasMergedCells => true, xls => true))
can we export as excel ?
- arrow tables are dumps of arrow memory structures (input/output format).
- excel is an input only format
- input : excel
  public String sheet;
  public boolean extractHeader;
  public boolean hasMergedCells;
  public boolean xls; /** true if we are reading .xls, false if it's .xlsx */
- in/output : text
  public String lineDelimiter = "\n";
  public char fieldDelimiter = '\u0000';
  public char quote = '"';
  public char escape = '"';
  public char comment = '#';
  public boolean skipFirstLine = false;
  public boolean extractHeader = false;
  public boolean autoGenerateColumnNames = false;
  public boolean trimHeader = true;
  public String outputExtension = "txt";
- in/output : json
  public String outputExtension = "json";
  public boolean prettyPrint = true;
- in/output : parquet
  public boolean autoCorrectCorruptDates = true;
  public String outputExtension = "parquet";
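The field lists above could be collected into a lookup table to validate user-supplied options before they reach the SQL. This is only a sketch: the names `KNOWN_OPTIONS`, `INPUT_ONLY` and `validate_options` are hypothetical, and the option names are transcribed from the fields quoted in this issue, not checked against a specific Dremio version.

```python
# Option names per format, transcribed from the field lists above.
KNOWN_OPTIONS = {
    "excel": {"sheet", "extractHeader", "hasMergedCells", "xls"},
    "text": {
        "lineDelimiter", "fieldDelimiter", "quote", "escape", "comment",
        "skipFirstLine", "extractHeader", "autoGenerateColumnNames",
        "trimHeader", "outputExtension",
    },
    "json": {"outputExtension", "prettyPrint"},
    "parquet": {"autoCorrectCorruptDates", "outputExtension"},
}

# Formats that can be read but not written (excel, per the notes above).
INPUT_ONLY = {"excel"}


def validate_options(fmt, options, for_output):
    """Return the options with 'type' first, or raise ValueError."""
    if fmt not in KNOWN_OPTIONS:
        raise ValueError(f"unknown format: {fmt}")
    if for_output and fmt in INPUT_ONLY:
        raise ValueError(f"{fmt} is an input only format")
    unknown = set(options) - KNOWN_OPTIONS[fmt]
    if unknown:
        raise ValueError(f"unknown options for {fmt}: {sorted(unknown)}")
    # Only explicit overrides are kept; defaults are left to the engine.
    return {"type": fmt, **options}


print(validate_options("json", {"prettyPrint": False}, for_output=True))
```

Passing only the explicit overrides keeps the generated SQL short and avoids hard-coding defaults that may change between versions.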
delta, avro, iceberg ???
So, let's go for text and json as extra formats.
To stay consistent with dbt's sources/exposures, I think formatting should only be specified in sources and exposed datasets, so I will start with :
- for sources : use the external map, and override builtins.source() to decorate the rendered relation (hoping that looking up the right source in the graph stays lightweight)
- for file materialization : add the same external map in the configuration (excluding excel parameters, as dremio can only read excel files)
overriding builtins.xxx won't work as they return a Relation, not a rendered form of it.
Will have to override Relation.render() instead.
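A minimal sketch of the idea, using a stand-in `Relation` class instead of the real dbt one so that it runs on its own. The names `FormattedRelation` and `format_options` are hypothetical, and the output shape follows the `TABLE(path (options))` form quoted earlier in this issue.

```python
from dataclasses import dataclass
from typing import Dict, Optional


def _literal(value):
    """Render a Python value as a SQL literal for a format option."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return "'" + str(value).replace("'", "''") + "'"


@dataclass
class Relation:
    """Stand-in for a dbt relation; NOT the real dbt class."""
    path: str

    def render(self) -> str:
        return self.path

    def __str__(self) -> str:
        # What a template sees when the relation is interpolated.
        return self.render()


@dataclass
class FormattedRelation(Relation):
    """Relation whose rendered form carries format options (hypothetical)."""
    format_options: Optional[Dict[str, object]] = None

    def render(self) -> str:
        base = super().render()
        if not self.format_options:
            return base
        opts = ", ".join(
            f"{name} => {_literal(value)}"
            for name, value in self.format_options.items()
        )
        return f"TABLE({base} ({opts}))"


rel = FormattedRelation('"src"."book.xlsx"', {"type": "excel", "extractHeader": True})
print(f"SELECT * FROM {rel}")
```

Because the decoration lives in `render()`, any code that only holds the relation object picks up the format options once the relation is turned into text, which is the point made in the comment above.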
done