Delta Acceptance Testing (DAT)

The DAT project provides test cases to verify different implementations of Delta Lake all behave consistently. The expected behavior is described in the Delta Lake Protocol.

The test cases are packaged into releases, which can be downloaded into CI jobs for automatic testing. The test cases in this repo are represented using a standard file structure, so they don't require any particular dependency or programming language.

To download and unpack:

VERSION=0.0.3
curl -OL https://github.com/delta-incubator/dat/releases/download/v$VERSION/deltalake-dat-v$VERSION.tar.gz
tar  --no-same-permissions -xzf deltalake-dat-v$VERSION.tar.gz

Testing Readers

All reader test cases are stored in the directory out/reader_tests/generated. They follow the directory structure:

|-- {table_name}
    |-- test_case_info.json
    |-- delta
    |   |-- _delta_log
    |   |   |-- ...
    |   |-- part-0000-dsfsadf-adsfsdaf-asdf.snappy.parquet
    |   |-- ...
    |-- expected
        |-- latest
        |   |-- table_version_metadata.json
        |   |-- table_content
        |-- v1
            |-- table_version_metadata.json
            |-- table_content

Each test case is a folder, named for its test. It contains:

  • test_case_info.json: a document providing the name and a human-friendly description of the test.
  • delta: the root directory of the Delta Lake table.
  • expected: a folder containing expected results, potentially for multiple versions of the table. At a minimum, there is a folder called latest containing the expected data for the current version of the table. There may be other folders such as v1, v2, and so on for other versions for testing time travel. There are two types of files in each version folder:
    • Parquet files containing the expected data.
    • A JSON file, table_version_metadata.json, which contains metadata about that version of the table. For example, it contains the protocol versions min_reader_version and min_writer_version.
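The metadata file can be parsed with any JSON library. A minimal sketch, using an inline sample document: min_reader_version and min_writer_version are the field names described above, and the values here are illustrative, not taken from a real test case.

```python
import json

# Parse a table_version_metadata.json document. Only the two protocol-version
# fields described above are shown; the values are illustrative.
sample = """
{
    "min_reader_version": 1,
    "min_writer_version": 2
}
"""

metadata = json.loads(sample)
print(metadata["min_reader_version"])  # protocol version the reader must support
print(metadata["min_writer_version"])  # protocol version a writer must support
```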

To test a reader, first identify the test cases:

  1. List the out/reader_tests/generated/ directory to identify the root of each Delta table.
  2. List each subdirectory's expected directory; each of these folders represents a test case.
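The two listing steps above can be sketched with the standard library alone. This builds a throwaway directory tree mimicking the layout described earlier (the table and version names are illustrative), then enumerates (table, version) pairs:

```python
import tempfile
from pathlib import Path

# Build a throwaway tree mimicking out/reader_tests/generated/.
# The table name and versions below are illustrative.
root = Path(tempfile.mkdtemp()) / "out" / "reader_tests" / "generated"
for table_name, versions in {"basic_append": ["latest", "v1"]}.items():
    (root / table_name / "delta" / "_delta_log").mkdir(parents=True)
    for version in versions:
        (root / table_name / "expected" / version).mkdir(parents=True)

# Step 1: each subdirectory of generated/ is the root of one Delta table.
# Step 2: each folder under its expected/ directory is one test case.
cases = [
    (table.name, version.name)
    for table in sorted(root.iterdir())
    for version in sorted((table / "expected").iterdir())
]
print(cases)  # [('basic_append', 'latest'), ('basic_append', 'v1')]
```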

Then for each test case:

  1. Load the corresponding version of the Delta table.
  2. Verify that the metadata read from the Delta table matches the values in table_version_metadata.json. For example, verify that the connector parsed the correct min_reader_version from the Delta log. This step may be skipped if the reader connector does not expose such details in its public API.
  3. Attempt to read the Delta table's data:
    a. If the Delta table uses a protocol version unsupported by the reader connector (as determined from table_version_metadata.json), verify that an appropriate error is returned.
    b. If the Delta table is supported by the reader connector, assert that the data read is equal to the data in table_content. To make it easy to sort the tables for comparison, some tables have a column pk which is an ascending integer sequence.
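The comparison in step 3b can be sketched as follows. In a real test the rows would come from the reader connector under test and from the Parquet files under table_content; plain dicts stand in here, and the helper name is hypothetical:

```python
def assert_tables_equal(actual_rows, expected_rows, sort_key="pk"):
    """Compare two row sets after sorting by an ascending integer column.

    Row order is not guaranteed by Delta readers, so both sides are sorted
    on the pk column before the element-wise comparison.
    """
    actual = sorted(actual_rows, key=lambda row: row[sort_key])
    expected = sorted(expected_rows, key=lambda row: row[sort_key])
    assert actual == expected, "table content mismatch"

# Rows read in a different order still compare equal after sorting by pk.
assert_tables_equal(
    [{"pk": 1, "value": "b"}, {"pk": 0, "value": "a"}],
    [{"pk": 0, "value": "a"}, {"pk": 1, "value": "b"}],
)
```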

For an example implementation of this, see the example PySpark tests in tests/pyspark_delta/.

Testing Writers

TBD.

Generated tables

all_primitive_types

Table containing all non-nested types.

+----+-----+-----+-----+----+-------+-------+-----+-------------+-------+----------+-------------------+
|utf8|int64|int32|int16|int8|float32|float64| bool|       binary|decimal|    date32|          timestamp|
+----+-----+-----+-----+----+-------+-------+-----+-------------+-------+----------+-------------------+
|   0|    0|    0|    0|   0|    0.0|    0.0| true|           []| 10.000|1970-01-01|1970-01-01 00:00:00|
|   1|    1|    1|    1|   1|    1.0|    1.0|false|         [00]| 11.000|1970-01-02|1970-01-01 01:00:00|
|   2|    2|    2|    2|   2|    2.0|    2.0| true|      [00 00]| 12.000|1970-01-03|1970-01-01 02:00:00|
|   3|    3|    3|    3|   3|    3.0|    3.0|false|   [00 00 00]| 13.000|1970-01-04|1970-01-01 03:00:00|
|   4|    4|    4|    4|   4|    4.0|    4.0| true|[00 00 00 00]| 14.000|1970-01-05|1970-01-01 04:00:00|
+----+-----+-----+-----+----+-------+-------+-----+-------------+-------+----------+-------------------+

basic_append

A basic table with two append writes.

+------+------+-------+                                                         
|letter|number|a_float|
+------+------+-------+
|     a|     1|    1.1|
|     b|     2|    2.2|
|     c|     3|    3.3|
|     d|     4|    4.4|
|     e|     5|    5.5|
+------+------+-------+

basic_partitioned

A basic partitioned table.

+------+------+-------+                                                         
|letter|number|a_float|
+------+------+-------+
|     b|     2|    2.2|
|  NULL|     6|    6.6|
|     c|     3|    3.3|
|     a|     1|    1.1|
|     a|     4|    4.4|
|     e|     5|    5.5|
+------+------+-------+

multi_partitioned

Multiple levels of partitioning, with boolean, timestamp, and decimal partition columns.

+-----+-------------------+--------------------+---+                            
| bool|               time|              amount|int|
+-----+-------------------+--------------------+---+
|false|1970-01-02 08:45:00|12.00000000000000...|  3|
| true|1970-01-01 00:00:00|200.0000000000000...|  1|
| true|1970-01-01 12:30:00|200.0000000000000...|  2|
+-----+-------------------+--------------------+---+

multi_partitioned_2

Multiple levels of partitioning, with boolean, timestamp, and decimal partition columns.

+-----+-------------------+--------------------+---+                            
| bool|               time|              amount|int|
+-----+-------------------+--------------------+---+
|false|1970-01-02 08:45:00|12.00000000000000...|  3|
| true|1970-01-01 00:00:00|200.0000000000000...|  1|
| true|1970-01-01 12:30:00|200.0000000000000...|  2|
+-----+-------------------+--------------------+---+

nested_types

Table containing various nested types.

+---+------------+---------------+--------------------+                         
| pk|      struct|          array|                 map|
+---+------------+---------------+--------------------+
|  0| {0.0, true}|            [0]|                  {}|
|  1|{1.0, false}|         [0, 1]|            {0 -> 0}|
|  2| {2.0, true}|      [0, 1, 2]|    {0 -> 0, 1 -> 1}|
|  3|{3.0, false}|   [0, 1, 2, 3]|{0 -> 0, 1 -> 1, ...|
|  4| {4.0, true}|[0, 1, 2, 3, 4]|{0 -> 0, 1 -> 1, ...|
+---+------------+---------------+--------------------+

no_replay

Table with a checkpoint and prior commits cleaned up.

+------+---+----------+                                                         
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+

no_stats

Table with no stats.

+------+---+----------+                                                         
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+

stats_as_structs

Table with stats written only as structs (not JSON), with a checkpoint.

+------+---+----------+                                                         
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+

with_checkpoint

Table with a checkpoint.

+------+---+----------+                                                         
|letter|int|      date|
+------+---+----------+
|     a| 93|1975-06-01|
|     b|753|2012-05-01|
|     c|620|1983-10-01|
|     a|595|2013-03-01|
|  NULL|653|1995-12-01|
+------+---+----------+

with_schema_change

Table whose schema was changed using overwriteSchema=True.

+----+----+                                                                     
|num1|num2|
+----+----+
|  22|  33|
|  44|  55|
|  66|  77|
+----+----+

Models

The test cases contain several JSON files to be read by connector tests. To make it easier to read them, we provide JSON schemas for each of the file types in out/schemas/. They can be read to understand the expected structure, or even used to generate data structures in your preferred programming language.
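As a sketch of reading one of these schemas, the snippet below loads a hypothetical, trimmed-down schema in the style of a JSON Schema document and inspects its required fields; consult the actual files under out/schemas/ for the real definitions:

```python
import json

# A hypothetical, trimmed-down schema; field definitions are illustrative.
schema_text = """
{
    "title": "TestCaseInfo",
    "type": "object",
    "required": ["name", "description"],
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"}
    }
}
"""

schema = json.loads(schema_text)
print(schema["title"], sorted(schema["required"]))
```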

Contributing

See contributing.md.