The DAT project provides test cases to verify that different implementations of Delta Lake all behave consistently. The expected behavior is described in the Delta Lake Protocol.
The test cases are packaged into releases, which can be downloaded into CI jobs for automatic testing. The test cases in this repo are represented using a standard file structure, so they don't require any particular dependency or programming language.
To download and unpack:
VERSION=0.0.3
curl -OL https://github.com/delta-incubator/dat/releases/download/v$VERSION/deltalake-dat-v$VERSION.tar.gz
tar --no-same-permissions -xzf deltalake-dat-v$VERSION.tar.gz
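For CI jobs that would rather not shell out, the same download could be sketched in Python with only the standard library. The helper names `release_url` and `download_and_unpack` are illustrative and not part of DAT; the URL pattern is copied from the curl command above. This sketch does not try to reproduce the `--no-same-permissions` flag.

```python
import tarfile
import urllib.request
from pathlib import Path


def release_url(version: str) -> str:
    """Build the release tarball URL, following the pattern of the curl command."""
    return (
        "https://github.com/delta-incubator/dat/releases/download/"
        f"v{version}/deltalake-dat-v{version}.tar.gz"
    )


def download_and_unpack(version: str, dest: Path = Path(".")) -> Path:
    """Download the release tarball and extract it under dest (hypothetical helper)."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"deltalake-dat-v{version}.tar.gz"
    # urlretrieve follows the redirect that GitHub issues for release assets.
    urllib.request.urlretrieve(release_url(version), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest
```

Calling `download_and_unpack("0.0.3")` would fetch the same archive as the commands above and extract it into the current directory.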
All reader test cases are stored in the directory out/reader_tests/generated. They follow the directory structure:
|-- {table_name}
  |-- test_case_info.json
  |-- delta
  |  |-- _delta_log
  |     |-- ...
  |  |-- part-0000-dsfsadf-adsfsdaf-asdf.snappy.parquet
  |  |-- ...
  |-- expected
     |-- latest
        |-- table_version_metadata.json
        |-- table_content
     |-- v1
        |-- table_version_metadata.json
        |-- table_content
Each test case is a folder, named for its test. It contains:
- test_case_info.json: a document that provides the name and a human-friendly description of the test.
- delta: the root directory of the Delta Lake table.
- expected: a folder containing expected results, potentially for multiple versions of the table. At a minimum, there is a folder called latest containing the expected data for the current version of the table. There may be other folders such as v1, v2, and so on for other versions, used for testing time travel. There are two types of files in each version folder:
  - Parquet files, which contain the expected data.
  - A JSON file, table_version_metadata.json, which contains the metadata about that version of the table. For example, it contains the protocol versions min_reader_version and min_writer_version.
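As a rough illustration of this layout, the JSON documents of a single test case could be gathered as follows. This is a minimal sketch using only the standard library; `load_test_case` and the shape of the returned dictionary are hypothetical, and the keys inside each JSON file are left uninterpreted.

```python
import json
from pathlib import Path


def load_test_case(case_dir: Path) -> dict:
    """Collect the JSON documents of one test case folder (hypothetical helper)."""
    info = json.loads((case_dir / "test_case_info.json").read_text())
    versions = {}
    # Each subfolder of expected/ (latest, v1, v2, ...) describes one table version.
    for version_dir in sorted((case_dir / "expected").iterdir()):
        if not version_dir.is_dir():
            continue
        metadata_path = version_dir / "table_version_metadata.json"
        versions[version_dir.name] = json.loads(metadata_path.read_text())
    return {
        "info": info,
        "table_root": case_dir / "delta",
        "versions": versions,
    }
```

The returned `versions` mapping is keyed by folder name, so `versions["latest"]` holds the metadata for the current version of the table.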
To test a reader, first identify the test cases:
- List the out/reader_tests/generated/ directory to identify the root of each Delta table.
- List each subdirectory's expected directory; each of these folders represents a test case.
Then for each test case:
- Load the corresponding version of the Delta table.
- Verify the metadata read from the Delta table matches that in table_version_metadata.json. For example, verify that the connector parsed the correct min_reader_version from the Delta log. This step may be skipped if the reader connector does not expose such details in its public API.
- Attempt to read the Delta table's data:
  a. If the Delta table uses a version unsupported by the reader connector (as determined from table_version_metadata.json), verify an appropriate error is returned.
  b. If the Delta table is supported by the reader connector, assert that the read data is equal to the data read from table_content. To make it easy to sort the tables for comparison, some tables have a column pk, which is an ascending integer sequence.
For an example implementation of this, see the example PySpark tests in tests/pyspark_delta/.
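The procedure above can be sketched independently of any particular connector. The code below is an assumption-laden outline, not the PySpark implementation: `read_table` and `read_expected` are caller-supplied callables that return rows as lists of dicts, `max_reader_version` is whatever protocol version the connector under test supports, and `min_reader_version` is assumed to be a top-level key of table_version_metadata.json.

```python
import json
from pathlib import Path
from typing import Callable, Iterator, Tuple


def iter_test_cases(root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (table_root, expected_version_dir) for every test case under root."""
    for table_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        expected = table_dir / "expected"
        if not expected.is_dir():
            continue
        for version_dir in sorted(p for p in expected.iterdir() if p.is_dir()):
            yield table_dir / "delta", version_dir


def reader_supports(version_dir: Path, max_reader_version: int) -> bool:
    """Decide from table_version_metadata.json whether a read should succeed."""
    metadata = json.loads((version_dir / "table_version_metadata.json").read_text())
    # Assumption: min_reader_version is a top-level key of the metadata file.
    return metadata["min_reader_version"] <= max_reader_version


def sort_rows(rows: list, key: str = "pk") -> list:
    """Sort rows by the pk column when every row has it; otherwise leave them as-is."""
    if rows and all(key in row for row in rows):
        return sorted(rows, key=lambda row: row[key])
    return rows


def check_case(
    table_root: Path,
    version_dir: Path,
    max_reader_version: int,
    read_table: Callable[[Path, str], list],
    read_expected: Callable[[Path], list],
) -> None:
    """Run the read step for one test case using hypothetical reader callables."""
    version = version_dir.name  # "latest", "v1", ...; the reader maps this to a version
    if not reader_supports(version_dir, max_reader_version):
        try:
            read_table(table_root, version)
        except Exception:
            return  # an error is the expected outcome for unsupported tables
        raise AssertionError("expected an error for an unsupported table version")
    actual = sort_rows(read_table(table_root, version))
    expected = sort_rows(read_expected(version_dir / "table_content"))
    assert actual == expected, f"data mismatch for {table_root} at {version}"
```

A test runner would then loop over `iter_test_cases(Path("out/reader_tests/generated"))` and call `check_case` for each pair. Tables without a pk column need some other stable ordering before comparison, which this sketch leaves to the caller.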
TBD.
all_primitive_types
Table containing all non-nested types.
+----+-----+-----+-----+----+-------+-------+-----+-------------+-------+----------+-------------------+
|utf8|int64|int32|int16|int8|float32|float64| bool|       binary|decimal|    date32|          timestamp|
+----+-----+-----+-----+----+-------+-------+-----+-------------+-------+----------+-------------------+
|   0|    0|    0|    0|   0|    0.0|    0.0| true|           []| 10.000|1970-01-01|1970-01-01 00:00:00|
|   1|    1|    1|    1|   1|    1.0|    1.0|false|         [00]| 11.000|1970-01-02|1970-01-01 01:00:00|
|   2|    2|    2|    2|   2|    2.0|    2.0| true|      [00 00]| 12.000|1970-01-03|1970-01-01 02:00:00|
|   3|    3|    3|    3|   3|    3.0|    3.0|false|   [00 00 00]| 13.000|1970-01-04|1970-01-01 03:00:00|
|   4|    4|    4|    4|   4|    4.0|    4.0| true|[00 00 00 00]| 14.000|1970-01-05|1970-01-01 04:00:00|
+----+-----+-----+-----+----+-------+-------+-----+-------------+-------+----------+-------------------+
basic_append
A basic table with two append writes.
+------+------+-------+
|letter|number|a_float|
+------+------+-------+
|     a|     1|    1.1|
|     b|     2|    2.2|
|     c|     3|    3.3|
|     d|     4|    4.4|
|     e|     5|    5.5|
+------+------+-------+
basic_partitioned
A basic partitioned table.
+------+------+-------+
|letter|number|a_float|
+------+------+-------+
|     b|     2|    2.2|
|  NULL|     6|    6.6|
|     c|     3|    3.3|
|     a|     1|    1.1|
|     a|     4|    4.4|
|     e|     5|    5.5|
+------+------+-------+
multi_partitioned
Multiple levels of partitioning, with boolean, timestamp, and decimal partition columns.
+-----+-------------------+--------------------+---+
| bool|               time|              amount|int|
+-----+-------------------+--------------------+---+
|false|1970-01-02 08:45:00|12.00000000000000...|  3|
| true|1970-01-01 00:00:00|200.0000000000000...|  1|
| true|1970-01-01 12:30:00|200.0000000000000...|  2|
+-----+-------------------+--------------------+---+
multi_partitioned_2
Multiple levels of partitioning, with boolean, timestamp, and decimal partition columns.
+-----+-------------------+--------------------+---+
| bool|               time|              amount|int|
+-----+-------------------+--------------------+---+
|false|1970-01-02 08:45:00|12.00000000000000...|  3|
| true|1970-01-01 00:00:00|200.0000000000000...|  1|
| true|1970-01-01 12:30:00|200.0000000000000...|  2|
+-----+-------------------+--------------------+---+
nested_types
Table containing various nested types.
+---+------------+---------------+--------------------+
| pk|      struct|          array|                 map|
+---+------------+---------------+--------------------+
|  0| {0.0, true}|            [0]|                  {}|
|  1|{1.0, false}|         [0, 1]|            {0 -> 0}|
|  2| {2.0, true}|      [0, 1, 2]|    {0 -> 0, 1 -> 1}|
|  3|{3.0, false}|   [0, 1, 2, 3]|{0 -> 0, 1 -> 1, ...|
|  4| {4.0, true}|[0, 1, 2, 3, 4]|{0 -> 0, 1 -> 1, ...|
+---+------------+---------------+--------------------+
no_replay
Table with a checkpoint and prior commits cleaned up.
+------+---+----------+
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+
no_stats
Table with no stats.
+------+---+----------+
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+
stats_as_structs
Table with stats only written as struct (not JSON) with Checkpoint.
+------+---+----------+
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+
with_checkpoint
Table with a checkpoint.
+------+---+----------+
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+
with_schema_change
Table which has schema change using overwriteSchema=True.
+----+----+
|num1|num2|
+----+----+
|  22|  33|
|  44|  55|
|  66|  77|
+----+----+
The test cases contain several JSON files to be read by connector tests. To make it easier to read them, we provide JSON schemas for each of the file types in out/schemas/. They can be read to understand the expected structure, or even used to generate data structures in your preferred programming language.
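As a small illustration of putting a schema to use, the sketch below checks a JSON document against only the top-level `required` keyword of a JSON Schema. It is deliberately partial: complete validation (types, nested objects, and so on) would need a full JSON Schema validator library. The function names are hypothetical, and because the file names under out/schemas/ are not listed here, both paths are supplied by the caller.

```python
import json
from pathlib import Path


def missing_required_keys(document: dict, schema: dict) -> list:
    """Return top-level keys named in the schema's `required` list but absent from document."""
    return [key for key in schema.get("required", []) if key not in document]


def check_file(json_path: Path, schema_path: Path) -> list:
    """Load a JSON file and a schema file, then report missing required keys."""
    document = json.loads(json_path.read_text())
    schema = json.loads(schema_path.read_text())
    return missing_required_keys(document, schema)
```

An empty list from `check_file` means every key the schema requires at the top level is present; it says nothing about whether the values have the right types.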
See contributing.md.