Git diff/show plugins for Parquet files, Bash scripts/aliases wrapping parquet2json

parquet-helpers

Bash scripts/aliases and git {diff,show} plugins for Parquet files.

parquet2json helpers

parquet-2-json.sh wraps parquet2json, but can read from stdin when no positional argument is provided:

cat foo.parquet | parquet2json - rowcount  # ❌ doesn't work, can't pipe, difficult to define partially-applied aliases
# thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /Users/ryan/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2json-2.0.1/src/main.rs:144:54
# stack backtrace:
#    0: _rust_begin_unwind
#    1: core::panicking::panic_fmt
#    2: core::result::unwrap_failed
#    3: tokio::runtime::park::CachedParkThread::block_on
#    4: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
#    5: tokio::runtime::runtime::Runtime::block_on
#    6: parquet2json::main
# note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
cat foo.parquet | parquet-2-json.sh rowcount  # ✅ works
# 4
cat foo.parquet | pqn  # 🎉 even easier
# 4
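
A wrapper like this generally has to buffer stdin first: a Parquet file keeps its metadata in a footer at the end of the file, so a reader needs a seekable file rather than a stream. The following is a minimal sketch of that idea, not the actual contents of parquet-2-json.sh; the function name `pq_from_stdin` and the `PQ2JSON` override variable are hypothetical.

```shell
# Hypothetical sketch: buffer stdin to a temp file, then hand that file to parquet2json.
# PQ2JSON is an assumed override for the binary name (useful for testing with a stub).
pq_from_stdin() {
  # usage: <producer> | pq_from_stdin <subcommand> [args...]
  pfs_tmp="$(mktemp)" || return 1
  cat > "$pfs_tmp"                              # drain stdin into a seekable file
  "${PQ2JSON:-parquet2json}" "$pfs_tmp" "$@"    # e.g. parquet2json <file> rowcount
  pfs_rc=$?
  rm -f "$pfs_tmp"                              # clean up before returning
  return "$pfs_rc"
}
```

With that in place, `cat foo.parquet | pq_from_stdin rowcount` would behave like the piped example above.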

.pqt-rc: Bash aliases/functions

.pqt-rc can be sourced from ~/.bashrc, and provides useful aliases:

  • pqn (parquet-2-json.sh rowcount): number of rows
  • pqs (parquet-2-json.sh schema): schema
  • pqc (parquet-2-json.sh cat): print all rows (as JSONL)
  • pql (parquet-2-json.sh cat -l <n>): print n rows
  • pqa (parquet2json-all): overview including:
    • MD5 sum
    • File size
    • Row count
    • First 10 rows
      • Configurable via -n <n>
      • -s: "compact" mode, one row per line (by default, rows are piped through jq, which pretty-prints them, one field per line)
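
For use inside scripts, the first four aliases can be approximated with shell functions, since Bash does not expand aliases in non-interactive shells by default. This is a sketch based only on the mappings listed above; the real .pqt-rc may define them differently, and it assumes parquet-2-json.sh is on your $PATH.

```shell
# Function equivalents of the aliases listed above (a sketch, not the real .pqt-rc).
pqn() { parquet-2-json.sh rowcount "$@"; }   # number of rows
pqs() { parquet-2-json.sh schema "$@"; }     # schema
pqc() { parquet-2-json.sh cat "$@"; }        # all rows, as JSONL
pql() { parquet-2-json.sh cat -l "$@"; }     # first <n> rows, e.g. `pql 3`
```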

Examples

Inspecting test.parquet@63dcdba:

pqn: row count

git show 63dcdba:test.parquet | pqn
# 20

pqs: schema

git show 63dcdba:test.parquet | pqs
# message schema {
#   OPTIONAL BYTE_ARRAY Ride ID (STRING);
#   OPTIONAL BYTE_ARRAY Rideable Type (STRING);
#   OPTIONAL INT64 Start Time (TIMESTAMP(MICROS,false));
#   OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
#   OPTIONAL BYTE_ARRAY Start Station Name (STRING);
#   OPTIONAL BYTE_ARRAY Start Station ID (STRING);
#   OPTIONAL BYTE_ARRAY End Station Name (STRING);
#   OPTIONAL BYTE_ARRAY End Station ID (STRING);
#   OPTIONAL DOUBLE Start Station Latitude;
#   OPTIONAL DOUBLE Start Station Longitude;
#   OPTIONAL DOUBLE End Station Latitude;
#   OPTIONAL DOUBLE End Station Longitude;
#   OPTIONAL INT32 Gender (INTEGER(8,true));
#   OPTIONAL BYTE_ARRAY User Type (STRING);
#   OPTIONAL BYTE_ARRAY Start Region (STRING);
#   OPTIONAL BYTE_ARRAY End Region (STRING);
# }

pqc / pql: print rows (as JSONL)

git show 63dcdba:test.parquet | pql 3
# {"Ride ID":"47D7696609CD77E4","Rideable Type":"classic_bike","Start Time":"2024-10-31T03:53:24.765","Stop Time":"2024-11-01T00:10:45.107","Start Station Name":"Cedar St & Myrtle Ave","Start Station ID":"4751.01","End Station Name":"Moffat St & Bushwick","End Station ID":"4357.01","Start Station Latitude":40.697842,"Start Station Longitude":-73.926241,"End Station Latitude":40.68458,"End Station Longitude":-73.90925,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
# {"Ride ID":"ADE40852FD10329E","Rideable Type":"classic_bike","Start Time":"2024-10-31T05:18:29.219","Stop Time":"2024-11-01T01:03:53.219","Start Station Name":"9 Ave & W 39 St","Start Station ID":"6644.08","End Station Name":"11 Ave & W 59 St","End Station ID":"7059.01","Start Station Latitude":40.756403523272496,"Start Station Longitude":-73.99410143494606,"End Station Latitude":40.77149671054441,"End Station Longitude":-73.99046033620834,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
# {"Ride ID":"9E5F3D963655B207","Rideable Type":"electric_bike","Start Time":"2024-10-31T13:19:29.118","Stop Time":"2024-11-01T05:15:23.984","Start Station Name":"Union Ave & E 169 St","Start Station ID":"8064.03","End Station Name":"Franklin Ave & E 169 St","End Station ID":"8118.02","Start Station Latitude":40.82995,"Start Station Longitude":-73.898802,"End Station Latitude":40.83171,"End Station Longitude":-73.90208,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}

git {diff,show} plugins

git-diff-parquet.sh wraps parquet2json-all for use as a Git diff driver:

Setup

# From a clone of this repo: ensure git-diff-parquet.sh is on your $PATH
echo "export PATH=\"\$PATH:$PWD\"" >> ~/.bashrc && . ~/.bashrc  # \$PATH stays literal, so .bashrc appends to PATH instead of freezing its current value

# Git configs
git config --global diff.parquet.command git-diff-parquet.sh      # For git diff
git config --global diff.parquet.textconv "parquet2json-all -n2"  # For git show, include 2 rows by default

# Git attributes (map globs/extensions to commands above):
git config --global core.attributesfile ~/.gitattributes
echo "*.parquet diff=parquet" >> ~/.gitattributes

# Or, initialize just one repo:
git config diff.parquet.command git-diff-parquet.sh      # For git diff
git config diff.parquet.textconv "parquet2json-all -n2"  # For git show, include 2 rows by default
echo "*.parquet diff=parquet" >> .gitattributes
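
To confirm the setup took effect, you can ask Git which diff driver it resolves for a Parquet path. The helper name `check_parquet_diff_setup` below is hypothetical; `git check-attr` and `git config --get` are standard Git commands.

```shell
# Run inside a Git repo. Prints what Git will use for a *.parquet path.
check_parquet_diff_setup() {
  # The path need not exist; check-attr only matches it against attribute patterns.
  git check-attr diff -- "${1:-test.parquet}"   # expect "<path>: diff: parquet"
  git config --get diff.parquet.command  || echo "diff.parquet.command is not set"
  git config --get diff.parquet.textconv || echo "diff.parquet.textconv is not set"
}
```

If the first line reports `diff: unspecified`, the `*.parquet diff=parquet` attribute line is not being picked up.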

Examples

Using commits from the @test branch:

Field dtype changed

63dcdba converted a field from int64 to int8 (test.py):

git diff '63dcdba^..63dcdba'
test.parquet (3a84f68..27fb7a1)
1,2c1,2
< MD5: 7957c8cc859f03517dcdac05dcdfee8a
< 13274 bytes
---
> MD5: 7c079c1420c5edffc54955a54ca38795
> 13245 bytes
17c17
<   OPTIONAL INT64 Gender;
---
>   OPTIONAL INT32 Gender (INTEGER(8,true));

We see diffs in the MD5, file size, and schema. Better than nothing!
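
For context on how such a driver plugs in: Git calls the configured `diff.<driver>.command` with seven arguments (path, old-file, old-hex, old-mode, new-file, new-hex, new-mode). The sketch below shows the general shape of a driver that renders both sides as text and diffs them. It is not the actual git-diff-parquet.sh; the function names and the `PQ_RENDER` override variable are hypothetical.

```shell
# Hypothetical sketch of an external diff driver (not the actual git-diff-parquet.sh).
# Git passes: path old-file old-hex old-mode new-file new-hex new-mode
# PQ_RENDER is an assumed override for the command that renders a file as text.
pdd_render() {
  # An added or deleted side arrives as /dev/null: render it as empty.
  if [ "$1" = /dev/null ]; then return 0; fi
  ${PQ_RENDER:-parquet2json-all -n2} "$1"
}

parquet_diff_driver() {
  pdd_path="$1"; pdd_old="$2"; pdd_old_hex="$3"; pdd_new="$5"; pdd_new_hex="$6"
  pdd_a="$(mktemp)" && pdd_b="$(mktemp)" || return 0
  # Header line: path plus abbreviated (7-character) blob hashes
  printf '%s (%.7s..%.7s)\n' "$pdd_path" "$pdd_old_hex" "$pdd_new_hex"
  pdd_render "$pdd_old" > "$pdd_a"
  pdd_render "$pdd_new" > "$pdd_b"
  diff "$pdd_a" "$pdd_b"
  rm -f "$pdd_a" "$pdd_b"
  return 0   # Git treats a non-zero exit from an external diff command as a failure
}
```

Returning 0 unconditionally matters here: plain `diff` exits 1 when the inputs differ, which Git would otherwise report as the external diff having died.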

Similarly, with git show:

git show 63dcdba
commit 63dcdbabf9c97833a11571f2bab65a487835a67d
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Sun Dec 22 20:30:03 2024 -0500

    `test.parquet`: "make Gender" an int8
    
    ran `test.py`

diff --git test.parquet test.parquet
index 3a84f68..27fb7a1 100644
--- test.parquet
+++ test.parquet
@@ -1,5 +1,5 @@
-MD5: 7957c8cc859f03517dcdac05dcdfee8a
-13274 bytes
+MD5: 7c079c1420c5edffc54955a54ca38795
+13245 bytes
 20 rows
 message schema {
   OPTIONAL BYTE_ARRAY Ride ID (STRING);
@@ -14,7 +14,7 @@ message schema {
   OPTIONAL DOUBLE Start Station Longitude;
   OPTIONAL DOUBLE End Station Latitude;
   OPTIONAL DOUBLE End Station Longitude;
-  OPTIONAL INT64 Gender;
+  OPTIONAL INT32 Gender (INTEGER(8,true));
   OPTIONAL BYTE_ARRAY User Type (STRING);
   OPTIONAL BYTE_ARRAY Start Region (STRING);
   OPTIONAL BYTE_ARRAY End Region (STRING);

Field values changed

34d2b1d changed the "Gender" field to a categorical string type:

git diff '34d2b1d^..34d2b1d' -- test.parquet
test.parquet (27fb7a1..5ca9743)
1,2c1,2
< MD5: 7c079c1420c5edffc54955a54ca38795
< 13245 bytes
---
> MD5: 0bf2c7f825a70660319e578201a04543
> 13343 bytes
17c17
<   OPTIONAL INT32 Gender (INTEGER(8,true));
---
>   OPTIONAL BYTE_ARRAY Gender (STRING);
35c35
<   "Gender": 0,
---
>   "Gender": "U",
53c53
<   "Gender": 0,
---
>   "Gender": "U",

Here we see diffs to the first two rows of data (in addition to the MD5, size, and schema).

Similarly, with git show:

git show 34d2b1d
commit 34d2b1ddc93f3a3cd04270268338c41309e41fa3
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Sun Dec 22 14:08:07 2024 -0500

    `test.parquet`: make "Gender" a categorical

diff --git test.parquet test.parquet
index 27fb7a1..5ca9743 100644
--- test.parquet
+++ test.parquet
@@ -1,5 +1,5 @@
-MD5: 7c079c1420c5edffc54955a54ca38795
-13245 bytes
+MD5: 0bf2c7f825a70660319e578201a04543
+13343 bytes
 20 rows
 message schema {
   OPTIONAL BYTE_ARRAY Ride ID (STRING);
@@ -14,7 +14,7 @@ message schema {
   OPTIONAL DOUBLE Start Station Longitude;
   OPTIONAL DOUBLE End Station Latitude;
   OPTIONAL DOUBLE End Station Longitude;
-  OPTIONAL INT32 Gender (INTEGER(8,true));
+  OPTIONAL BYTE_ARRAY Gender (STRING);
   OPTIONAL BYTE_ARRAY User Type (STRING);
   OPTIONAL BYTE_ARRAY Start Region (STRING);
   OPTIONAL BYTE_ARRAY End Region (STRING);
@@ -32,7 +32,7 @@ message schema {
   "Start Station Longitude": -73.926241,
   "End Station Latitude": 40.68458,
   "End Station Longitude": -73.90925,
-  "Gender": 0,
+  "Gender": "U",
   "User Type": "Customer",
   "Start Region": "NYC",
   "End Region": "NYC"
@@ -50,7 +50,7 @@ message schema {
   "Start Station Longitude": -73.99410143494606,
   "End Station Latitude": 40.77149671054441,
   "End Station Longitude": -73.99046033620834,
-  "Gender": 0,
+  "Gender": "U",
   "User Type": "Customer",
   "Start Region": "NYC",
   "End Region": "NYC"
diff --git test.py test.py
index b18c424..7f0177a 100644
--- test.py
+++ test.py
@@ -3,5 +3,6 @@
 import pandas as pd
 
 df = pd.read_parquet("test.parquet")
-df = df.astype({'Gender': 'Int8'})
+gender_map = { 0: "U", 1: "M", 2: "F" }
+df["Gender"] = df["Gender"].map(gender_map).astype("category")
 df.to_parquet('test.parquet')

File added

c232deb came before the 2 above, and added test.parquet:

PQT_TXT_OPTS=-s git diff 'c232deb^..c232deb'
test.parquet (000000..3a84f68, ..100644)
0a1,23
> MD5: 7957c8cc859f03517dcdac05dcdfee8a
> 13274 bytes
> 20 rows
> message schema {
>   OPTIONAL BYTE_ARRAY Ride ID (STRING);
>   OPTIONAL BYTE_ARRAY Rideable Type (STRING);
>   OPTIONAL INT64 Start Time (TIMESTAMP(MICROS,false));
>   OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
>   OPTIONAL BYTE_ARRAY Start Station Name (STRING);
>   OPTIONAL BYTE_ARRAY Start Station ID (STRING);
>   OPTIONAL BYTE_ARRAY End Station Name (STRING);
>   OPTIONAL BYTE_ARRAY End Station ID (STRING);
>   OPTIONAL DOUBLE Start Station Latitude;
>   OPTIONAL DOUBLE Start Station Longitude;
>   OPTIONAL DOUBLE End Station Latitude;
>   OPTIONAL DOUBLE End Station Longitude;
>   OPTIONAL INT64 Gender;
>   OPTIONAL BYTE_ARRAY User Type (STRING);
>   OPTIONAL BYTE_ARRAY Start Region (STRING);
>   OPTIONAL BYTE_ARRAY End Region (STRING);
> }
> {"Ride ID":"47D7696609CD77E4","Rideable Type":"classic_bike","Start Time":"2024-10-31T03:53:24.765","Stop Time":"2024-11-01T00:10:45.107","Start Station Name":"Cedar St & Myrtle Ave","Start Station ID":"4751.01","End Station Name":"Moffat St & Bushwick","End Station ID":"4357.01","Start Station Latitude":40.697842,"Start Station Longitude":-73.926241,"End Station Latitude":40.68458,"End Station Longitude":-73.90925,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
> {"Ride ID":"ADE40852FD10329E","Rideable Type":"classic_bike","Start Time":"2024-10-31T05:18:29.219","Stop Time":"2024-11-01T01:03:53.219","Start Station Name":"9 Ave & W 39 St","Start Station ID":"6644.08","End Station Name":"11 Ave & W 59 St","End Station ID":"7059.01","Start Station Latitude":40.756403523272496,"Start Station Longitude":-73.99410143494606,"End Station Latitude":40.77149671054441,"End Station Longitude":-73.99046033620834,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}

Similarly, with git show:

PQT_TXT_OPTS=-s git show c232deb
commit c232deb412dae45046da37f9680a08122073a641
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Sun Dec 22 13:51:08 2024 -0500

    initial `test.parquet`

diff --git test.parquet test.parquet
new file mode 100644
index 0000000..3a84f68
--- /dev/null
+++ test.parquet
@@ -0,0 +1,23 @@
+MD5: 7957c8cc859f03517dcdac05dcdfee8a
+13274 bytes
+20 rows
+message schema {
+  OPTIONAL BYTE_ARRAY Ride ID (STRING);
+  OPTIONAL BYTE_ARRAY Rideable Type (STRING);
+  OPTIONAL INT64 Start Time (TIMESTAMP(MICROS,false));
+  OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
+  OPTIONAL BYTE_ARRAY Start Station Name (STRING);
+  OPTIONAL BYTE_ARRAY Start Station ID (STRING);
+  OPTIONAL BYTE_ARRAY End Station Name (STRING);
+  OPTIONAL BYTE_ARRAY End Station ID (STRING);
+  OPTIONAL DOUBLE Start Station Latitude;
+  OPTIONAL DOUBLE Start Station Longitude;
+  OPTIONAL DOUBLE End Station Latitude;
+  OPTIONAL DOUBLE End Station Longitude;
+  OPTIONAL INT64 Gender;
+  OPTIONAL BYTE_ARRAY User Type (STRING);
+  OPTIONAL BYTE_ARRAY Start Region (STRING);
+  OPTIONAL BYTE_ARRAY End Region (STRING);
+}
+{"Ride ID":"47D7696609CD77E4","Rideable Type":"classic_bike","Start Time":"2024-10-31T03:53:24.765","Stop Time":"2024-11-01T00:10:45.107","Start Station Name":"Cedar St & Myrtle Ave","Start Station ID":"4751.01","End Station Name":"Moffat St & Bushwick","End Station ID":"4357.01","Start Station Latitude":40.697842,"Start Station Longitude":-73.926241,"End Station Latitude":40.68458,"End Station Longitude":-73.90925,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
+{"Ride ID":"ADE40852FD10329E","Rideable Type":"classic_bike","Start Time":"2024-10-31T05:18:29.219","Stop Time":"2024-11-01T01:03:53.219","Start Station Name":"9 Ave & W 39 St","Start Station ID":"6644.08","End Station Name":"11 Ave & W 59 St","End Station ID":"7059.01","Start Station Latitude":40.756403523272496,"Start Station Longitude":-73.99410143494606,"End Station Latitude":40.77149671054441,"End Station Longitude":-73.99046033620834,"Gender":0,"User Type":"Customer","Start Region":"NYC","End Region":"NYC"}

Customizing output with $PQT_TXT_OPTS

$PQT_TXT_OPTS can customize output formatting:

parquet2json-all -h
# Usage: parquet2json-all [-n <n_rows=10>] [-o <offset>] [-s] <path>
#   -n: number of rows to display (negative ⇒ all rows)
#   -o: offset (skip) rows; negative ⇒ last rows
#   -s: compact mode (one object per line)
#
# Opts passed via $PQT_TXT_OPTS will override those passed via CLI (to allow for configuring `git show`):
#
# The "opts var" itself ("PQT_TXT_OPTS" by default) can also be customized, by setting `$PQT_TXT_OPTS_VAR`, e.g.:
#
#   export PQT_TXT_OPTS_VAR=PQT  # This can be done once, e.g. in your .bashrc
#   PQT="-sn3" git show          # Shorter var name can then be used to configure diffs rendered by `git show` (in this case: compact output, 3 rows)
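
The "opts var" indirection described in the help text can be pictured as a two-step lookup: first find the name of the variable, then read that variable's value. The sketch below illustrates the idea in POSIX sh; it is an assumption about how such a lookup could work, not the script's actual code, and `pqt_resolve_opts` is a hypothetical name.

```shell
# Hypothetical sketch: read options from the variable named by $PQT_TXT_OPTS_VAR
# (defaulting to PQT_TXT_OPTS itself).
pqt_resolve_opts() {
  pro_var="${PQT_TXT_OPTS_VAR:-PQT_TXT_OPTS}"
  # Only accept a plain identifier before handing it to eval
  case "$pro_var" in
    ''|[0-9]*|*[!A-Za-z0-9_]*) echo "invalid variable name: $pro_var" >&2; return 1 ;;
  esac
  eval "printf '%s\n' \"\${$pro_var-}\""   # prints the value of the named variable
}
```

A caller could then append the printed options after the CLI arguments, so that they take precedence when options are parsed in order.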

Appending rows

69e8ea3 appends 5 rows to test.parquet; combining -n-1 (compare all rows) with -o20 (skip the first 20 rows) is a nice way to view this case:

PQT_TXT_OPTS="-sn-1 -o20" git diff '69e8ea3^..69e8ea3'
test.parquet (5ca9743..c621f0e)
1,3c1,3
< MD5: 0bf2c7f825a70660319e578201a04543
< 13343 bytes
< 20 rows
---
> MD5: 762aeca641059e0773382adab8d23fa5
> 13786 bytes
> 25 rows
21a22,26
> {"Ride ID":"A708CB5F5B9B0A0A","Rideable Type":"classic_bike","Start Time":"2024-10-31T18:24:32.978","Stop Time":"2024-11-01T01:00:53.858","Start Station Name":"4 Ave & E 12 St","Start Station ID":"5788.15","End Station Name":"8 Ave & W 31 St","End Station ID":"6450.05","Start Station Latitude":40.732647,"Start Station Longitude":-73.99011,"End Station Latitude":40.7505853470215,"End Station Longitude":-73.9946848154068,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
> {"Ride ID":"AF7B0AA23EA2BEEA","Rideable Type":"electric_bike","Start Time":"2024-10-31T18:30:18.577","Stop Time":"2024-11-01T00:19:32.156","Start Station Name":"Columbus Ave & W 95 St","Start Station ID":"7520.07","End Station Name":"Freeman St & Reverend James A Polite Ave","End Station ID":"8080.01","Start Station Latitude":40.7919557,"Start Station Longitude":-73.968087,"End Station Latitude":40.830529,"End Station Longitude":-73.894717,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
> {"Ride ID":"7D719878E8164589","Rideable Type":"electric_bike","Start Time":"2024-10-31T18:30:29.155","Stop Time":"2024-11-01T00:19:43.550","Start Station Name":"Columbus Ave & W 95 St","Start Station ID":"7520.07","End Station Name":"Freeman St & Reverend James A Polite Ave","End Station ID":"8080.01","Start Station Latitude":40.7919557,"Start Station Longitude":-73.968087,"End Station Latitude":40.830529,"End Station Longitude":-73.894717,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
> {"Ride ID":"BE959FD40D19CB5B","Rideable Type":"classic_bike","Start Time":"2024-10-31T18:41:57.297","Stop Time":"2024-11-01T03:28:43.499","Start Station Name":"W 34 St & 11 Ave","Start Station ID":"6578.01","End Station Name":"Broadway & E 21 St","End Station ID":"6098.1","Start Station Latitude":40.75594159,"Start Station Longitude":-74.0021163,"End Station Latitude":40.739888408589955,"End Station Longitude":-73.98958593606949,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
> {"Ride ID":"A1EB017D7CB1A09F","Rideable Type":"classic_bike","Start Time":"2024-10-31T18:46:42.479","Stop Time":"2024-11-01T17:28:56.677","Start Station Name":"W 34 St & 11 Ave","Start Station ID":"6578.01","End Station Name":"E 13 St & Ave A","End Station ID":"5779.09","Start Station Latitude":40.75594159,"Start Station Longitude":-74.0021163,"End Station Latitude":40.72966729392978,"End Station Longitude":-73.98067966103554,"Gender":"U","User Type":"Subscriber","Start Region":"NYC","End Region":"NYC"}

-o<offset> can also be negative, printing the last <offset> rows of the file (though in this case it would make for a noisier diff, since the "before" side's last rows are expected to be different from the "after" side's).
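
To make the -n / -o semantics concrete, here is an illustration of the same windowing applied to a JSONL stream with `tail` and `head`. It follows the help text above (negative -n means all rows, negative -o means the last rows); it is not how parquet2json-all is implemented, and `rows_window` is a hypothetical name.

```shell
# Hypothetical illustration of the -n/-o semantics on a JSONL stream (one row per line).
rows_window() {
  # usage: rows_window <n> <offset> < rows.jsonl
  rw_n="$1"; rw_off="$2"
  if [ "$rw_off" -ge 0 ]; then
    tail -n "+$((rw_off + 1))"     # skip the first <offset> rows
  else
    tail -n "$((0 - rw_off))"      # keep only the last |offset| rows
  fi | if [ "$rw_n" -ge 0 ]; then
    head -n "$rw_n"                # at most <n> rows
  else
    cat                            # negative <n>: all remaining rows
  fi
}
```

For a 25-row file, `rows_window -1 20` keeps rows 21 through 25, which is the window used in the diff above.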

And with git show:

PQT_TXT_OPTS="-sn-1 -o20" git show 69e8ea3
commit 69e8ea39952a90a0313506dba649d789837936f2
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Mon Dec 23 11:13:48 2024 -0500

    append 5 rows

diff --git test.parquet test.parquet
index 5ca9743..c621f0e 100644
--- test.parquet
+++ test.parquet
@@ -1,6 +1,6 @@
-MD5: 0bf2c7f825a70660319e578201a04543
-13343 bytes
-20 rows
+MD5: 762aeca641059e0773382adab8d23fa5
+13786 bytes
+25 rows
 message schema {
   OPTIONAL BYTE_ARRAY Ride ID (STRING);
   OPTIONAL BYTE_ARRAY Rideable Type (STRING);
@@ -19,3 +19,8 @@ message schema {
   OPTIONAL BYTE_ARRAY Start Region (STRING);
   OPTIONAL BYTE_ARRAY End Region (STRING);
 }
+{"Ride ID":"A708CB5F5B9B0A0A","Rideable Type":"classic_bike","Start Time":"2024-10-31T18:24:32.978","Stop Time":"2024-11-01T01:00:53.858","Start Station Name":"4 Ave & E 12 St","Start Station ID":"5788.15","End Station Name":"8 Ave & W 31 St","End Station ID":"6450.05","Start Station Latitude":40.732647,"Start Station Longitude":-73.99011,"End Station Latitude":40.7505853470215,"End Station Longitude":-73.9946848154068,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
+{"Ride ID":"AF7B0AA23EA2BEEA","Rideable Type":"electric_bike","Start Time":"2024-10-31T18:30:18.577","Stop Time":"2024-11-01T00:19:32.156","Start Station Name":"Columbus Ave & W 95 St","Start Station ID":"7520.07","End Station Name":"Freeman St & Reverend James A Polite Ave","End Station ID":"8080.01","Start Station Latitude":40.7919557,"Start Station Longitude":-73.968087,"End Station Latitude":40.830529,"End Station Longitude":-73.894717,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
+{"Ride ID":"7D719878E8164589","Rideable Type":"electric_bike","Start Time":"2024-10-31T18:30:29.155","Stop Time":"2024-11-01T00:19:43.550","Start Station Name":"Columbus Ave & W 95 St","Start Station ID":"7520.07","End Station Name":"Freeman St & Reverend James A Polite Ave","End Station ID":"8080.01","Start Station Latitude":40.7919557,"Start Station Longitude":-73.968087,"End Station Latitude":40.830529,"End Station Longitude":-73.894717,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
+{"Ride ID":"BE959FD40D19CB5B","Rideable Type":"classic_bike","Start Time":"2024-10-31T18:41:57.297","Stop Time":"2024-11-01T03:28:43.499","Start Station Name":"W 34 St & 11 Ave","Start Station ID":"6578.01","End Station Name":"Broadway & E 21 St","End Station ID":"6098.1","Start Station Latitude":40.75594159,"Start Station Longitude":-74.0021163,"End Station Latitude":40.739888408589955,"End Station Longitude":-73.98958593606949,"Gender":"U","User Type":"Customer","Start Region":"NYC","End Region":"NYC"}
+{"Ride ID":"A1EB017D7CB1A09F","Rideable Type":"classic_bike","Start Time":"2024-10-31T18:46:42.479","Stop Time":"2024-11-01T17:28:56.677","Start Station Name":"W 34 St & 11 Ave","Start Station ID":"6578.01","End Station Name":"E 13 St & Ave A","End Station ID":"5779.09","Start Station Latitude":40.75594159,"Start Station Longitude":-74.0021163,"End Station Latitude":40.72966729392978,"End Station Longitude":-73.98067966103554,"Gender":"U","User Type":"Subscriber","Start Region":"NYC","End Region":"NYC"}

File move

07f2234 moved test.parquet to test2.parquet:

git diff '07f2234^..07f2234'
# test.parquet..test2.parquet (14a2491..14a2491)
#
git show 07f2234
# commit 07f2234ea762caff378b55e2a8829b2d495cdc4c
# Author: Ryan Williams <ryan@runsascoded.com>
# Date:   Wed Dec 25 13:08:52 2024 -0500
#
#     `mv test{,2}.parquet`
#
# diff --git test.parquet test2.parquet
# similarity index 100%
# rename from test.parquet
# rename to test2.parquet

File move with modifications

cb6d349 moved test2.parquet back to test.parquet, and renamed "Stop Time" to "End Time":

git diff 'cb6d349^..cb6d349'
test2.parquet..test.parquet (14a2491..6aff192)
1,2c1,2
< MD5: dcc622c03f1164196dcd4a9583ba2651
< 12736 bytes
---
> MD5: f94c17ff76a51cf1acb370d065da190d
> 12732 bytes
18c18
<   OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
---
>   OPTIONAL INT64 End Time (TIMESTAMP(MICROS,false));
35c35
<   "Stop Time": "2024-11-01T03:01:00.430",
---
>   "End Time": "2024-11-01T03:01:00.430",
52c52
<   "Stop Time": "2024-11-01T00:36:34.579",
---
>   "End Time": "2024-11-01T00:36:34.579",

Similarly, with git show:

git show cb6d349
commit cb6d34907b430e6df598de650c5ab625479d6e18
Author: Ryan Williams <ryan@runsascoded.com>
Date:   Wed Dec 25 15:07:16 2024 -0500

    `mv test{2,}.parquet`, rename "Stop Time" to "End Time"

diff --git test2.parquet test.parquet
similarity index 59%
rename from test2.parquet
rename to test.parquet
index 14a2491..6aff192 100644
--- test2.parquet
+++ test.parquet
@@ -1,5 +1,5 @@
-MD5: dcc622c03f1164196dcd4a9583ba2651
-12736 bytes
+MD5: f94c17ff76a51cf1acb370d065da190d
+12732 bytes
 25 rows
 message schema {
   OPTIONAL BYTE_ARRAY End Region (STRING);
@@ -15,7 +15,7 @@ message schema {
   OPTIONAL DOUBLE Start Station Longitude;
   OPTIONAL BYTE_ARRAY Start Station Name (STRING);
   OPTIONAL INT64 Start Time (TIMESTAMP(MICROS,false));
-  OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
+  OPTIONAL INT64 End Time (TIMESTAMP(MICROS,false));
   OPTIONAL BYTE_ARRAY User Type (STRING);
 }
 {
@@ -32,7 +32,7 @@ message schema {
   "Start Station Longitude": -73.954295,
   "Start Station Name": "Amsterdam Ave & W 131 St",
   "Start Time": "2024-10-31T17:24:06.707",
-  "Stop Time": "2024-11-01T03:01:00.430",
+  "End Time": "2024-11-01T03:01:00.430",
   "User Type": "Customer"
 }
 {
@@ -49,6 +49,6 @@ message schema {
   "Start Station Longitude": -73.918316,
   "Start Station Name": "Walton Ave & E 168 St",
   "Start Time": "2024-10-31T16:42:08.174",
-  "Stop Time": "2024-11-01T00:36:34.579",
+  "End Time": "2024-11-01T00:36:34.579",
   "User Type": "Subscriber"
 }

Advanced Parquet diffing with git-diff-x

Scripts in this repo can be used with git-diff-x (from the qmdx PyPI package) for even more powerful Parquet-file diffing.

For example, git {diff,show} above (even with $PQT_TXT_OPTS) aren't much help in inferring what happened to test.parquet in 9a9370c:

git diff '9a9370c^..9a9370c' -- test.parquet
test.parquet (c621f0e..14a2491)
1,2c1,2
< MD5: 762aeca641059e0773382adab8d23fa5
< 13786 bytes
---
> MD5: dcc622c03f1164196dcd4a9583ba2651
> 12736 bytes
4a5,9
>   OPTIONAL BYTE_ARRAY End Region (STRING);
>   OPTIONAL BYTE_ARRAY End Station ID (STRING);
>   OPTIONAL DOUBLE End Station Latitude;
>   OPTIONAL DOUBLE End Station Longitude;
>   OPTIONAL BYTE_ARRAY End Station Name (STRING);
7,9c12
<   OPTIONAL INT64 Start Time (TIMESTAMP(MICROS,false));
<   OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
<   OPTIONAL BYTE_ARRAY Start Station Name (STRING);
---
>   OPTIONAL BYTE_ARRAY Start Region (STRING);
11,12d13
<   OPTIONAL BYTE_ARRAY End Station Name (STRING);
<   OPTIONAL BYTE_ARRAY End Station ID (STRING);
15,17c16,18
<   OPTIONAL DOUBLE End Station Latitude;
<   OPTIONAL DOUBLE End Station Longitude;
<   OPTIONAL BYTE_ARRAY Gender (STRING);
---
>   OPTIONAL BYTE_ARRAY Start Station Name (STRING);
>   OPTIONAL INT64 Start Time (TIMESTAMP(MICROS,false));
>   OPTIONAL INT64 Stop Time (TIMESTAMP(MICROS,false));
19,20d19
<   OPTIONAL BYTE_ARRAY Start Region (STRING);
<   OPTIONAL BYTE_ARRAY End Region (STRING);
23,36c22,28
<   "Ride ID": "47D7696609CD77E4",
<   "Rideable Type": "classic_bike",
<   "Start Time": "2024-10-31T03:53:24.765",
<   "Stop Time": "2024-11-01T00:10:45.107",
<   "Start Station Name": "Cedar St & Myrtle Ave",
<   "Start Station ID": "4751.01",
<   "End Station Name": "Moffat St & Bushwick",
<   "End Station ID": "4357.01",
<   "Start Station Latitude": 40.697842,
<   "Start Station Longitude": -73.926241,
<   "End Station Latitude": 40.68458,
<   "End Station Longitude": -73.90925,
<   "Gender": "U",
<   "User Type": "Customer",
---
>   "End Region": "NYC",
>   "End Station ID": "7338.02",
>   "End Station Latitude": 40.7839636,
>   "End Station Longitude": -73.9471673,
>   "End Station Name": "2 Ave & E 96 St",
>   "Ride ID": "03F9A0B025966750",
>   "Rideable Type": "electric_bike",
38c30,36
<   "End Region": "NYC"
---
>   "Start Station ID": "7842.16",
>   "Start Station Latitude": 40.816355,
>   "Start Station Longitude": -73.954295,
>   "Start Station Name": "Amsterdam Ave & W 131 St",
>   "Start Time": "2024-10-31T17:24:06.707",
>   "Stop Time": "2024-11-01T03:01:00.430",
>   "User Type": "Customer"
41,54c39,45
<   "Ride ID": "ADE40852FD10329E",
<   "Rideable Type": "classic_bike",
<   "Start Time": "2024-10-31T05:18:29.219",
<   "Stop Time": "2024-11-01T01:03:53.219",
<   "Start Station Name": "9 Ave & W 39 St",
<   "Start Station ID": "6644.08",
<   "End Station Name": "11 Ave & W 59 St",
<   "End Station ID": "7059.01",
<   "Start Station Latitude": 40.756403523272496,
<   "Start Station Longitude": -73.99410143494606,
<   "End Station Latitude": 40.77149671054441,
<   "End Station Longitude": -73.99046033620834,
<   "Gender": "U",
<   "User Type": "Customer",
---
>   "End Region": "NYC",
>   "End Station ID": "7979.17",
>   "End Station Latitude": 40.824811,
>   "End Station Longitude": -73.916407,
>   "End Station Name": "E 161 St & Park Ave",
>   "Ride ID": "08D7AFEB94079985",
>   "Rideable Type": "electric_bike",
56c47,53
<   "End Region": "NYC"
---
>   "Start Station ID": "8179.03",
>   "Start Station Latitude": 40.83649,
>   "Start Station Longitude": -73.918316,
>   "Start Station Name": "Walton Ave & E 168 St",
>   "Start Time": "2024-10-31T16:42:08.174",
>   "Stop Time": "2024-11-01T00:36:34.579",
>   "User Type": "Subscriber"

The number of rows evidently stayed the same, but the schema and first 2 previewed rows seem pretty scrambled.

Comparing sorted schemas

git-diff-x -R <commit> pqs sort is useful for inspecting schema changes: it renders the "before" and "after" schemas as text, and sorts them:

git dxr 9a9370c pqs sort test.parquet
4d3
<   OPTIONAL BYTE_ARRAY Gender (STRING);

(dxr is an alias for diff-x -R)

This immediately makes clear that:

  1. The "Gender" field was dropped, and
  2. The remaining fields were merely reordered.
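
Conceptually, this kind of comparison pipes the "before" and "after" versions of a file through the same command and diffs the results. The sketch below approximates that with plain Git and a temp file per side. It is not git-diff-x's implementation, and `diff_through` is a hypothetical helper.

```shell
# Hypothetical helper: diff a file's parent-commit and commit versions after piping
# each through the same shell pipeline. Approximates the idea only.
diff_through() {
  # usage: diff_through <commit> <path> '<pipeline>'
  dt_commit="$1"; dt_path="$2"; dt_cmd="$3"
  dt_before="$(mktemp)" && dt_after="$(mktemp)" || return 2
  git show "$dt_commit^:$dt_path" | eval "$dt_cmd" > "$dt_before"   # "before" side
  git show "$dt_commit:$dt_path"  | eval "$dt_cmd" > "$dt_after"    # "after" side
  diff "$dt_before" "$dt_after"
  dt_rc=$?
  rm -f "$dt_before" "$dt_after"
  return "$dt_rc"
}
```

Under these assumptions, `diff_through 9a9370c test.parquet 'pqs | sort'` would play the same role as the `git dxr` command above.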

Comparing rows sorted by primary key

The rows above have an (apparently unique) "Ride ID" column; we can use that to check whether rows were added/deleted or just rearranged:

git dxr 9a9370c pqc 'jq ".\"Ride ID\""' sort test.parquet

Empty diff here implies the rows were just reordered. Viewing the first 10 "Ride ID"s from the "after" version:

git show 9a9370c:test.parquet | pqh | jq -r ".\"Ride ID\""
03F9A0B025966750
08D7AFEB94079985
0C6AC59991FDA228
1D1C1A99053BD6B2
203BC6AB04336C9E
2357DBB7281E26E8
2ECB677DB071F76A
35AD489DAF340A5A
47D7696609CD77E4
4B6716B2215DEC6D

implies that 9a9370c sorted rows by "Ride ID". Let's check that…
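
Two quick checks on a stream of keys (one per line) can back up both hunches: that the keys are unique, and that they are already in sorted order. The helper names below are hypothetical; `sort -c` and `uniq -d` are standard POSIX options.

```shell
# Both helpers read one key per line on stdin.
keys_sorted() { LC_ALL=C sort -c 2>/dev/null; }   # exit status 0 iff already sorted (bytewise)
keys_dupes()  { LC_ALL=C sort | uniq -d; }        # prints any key that appears more than once
```

For example, `git show 9a9370c:test.parquet | pqc | jq -r ".\"Ride ID\"" | keys_sorted && echo sorted` would confirm the ordering across all rows, not just the first 10.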

Comparing sorted rows and columns

Diffing again, but sorting the rows by "Ride ID", and only comparing the first row:

git dxr 9a9370c pqc 'jq -s "sort_by(.[\"Ride ID\"])[0]"' test.parquet
1a2,6
>   "End Region": "NYC",
>   "End Station ID": "7338.02",
>   "End Station Latitude": 40.7839636,
>   "End Station Longitude": -73.9471673,
>   "End Station Name": "2 Ave & E 96 St",
4,6c9
<   "Start Time": "2024-10-31T17:24:06.707",
<   "Stop Time": "2024-11-01T03:01:00.430",
<   "Start Station Name": "Amsterdam Ave & W 131 St",
---
>   "Start Region": "NYC",
8,9d10
<   "End Station Name": "2 Ave & E 96 St",
<   "End Station ID": "7338.02",
12,17c13,16
<   "End Station Latitude": 40.7839636,
<   "End Station Longitude": -73.9471673,
<   "Gender": "U",
<   "User Type": "Customer",
<   "Start Region": "NYC",
<   "End Region": "NYC"
---
>   "Start Station Name": "Amsterdam Ave & W 131 St",
>   "Start Time": "2024-10-31T17:24:06.707",
>   "Stop Time": "2024-11-01T03:01:00.430",
>   "User Type": "Customer"

It seems to be the same object, but with the keys reordered. Here we check by sorting the keys within each row (after sorting the rows by "Ride ID"), examining just the first 5 rows:

git dxr 9a9370c pqc 'jq -s "sort_by(.[\"Ride ID\"])[:5][] | to_entries | sort_by(.key) | from_entries"' test.parquet
# 7d6
# <   "Gender": "U",
# 25d23
# <   "Gender": "U",
# 43d40
# <   "Gender": "U",
# 61d57
# <   "Gender": "U",
# 79d74
# <   "Gender": "U",
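
For flat rows like these, jq's `-S` (`--sort-keys`) flag gives the same key ordering as the `to_entries | sort_by(.key) | from_entries` chain. A small sketch of a row canonicalizer built on it follows; `canon_rows` is a hypothetical name, and the function requires jq.

```shell
# Hypothetical helper: sort rows by a key column, and sort the keys within each row.
canon_rows() {
  # usage: canon_rows <key> < rows.jsonl   (prints compact JSONL)
  jq -S -c -s --arg k "$1" 'sort_by(.[$k])[]'
}
```

Piping both versions of a file through `canon_rows "Ride ID"` before diffing leaves only real differences in values and columns.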

Putting it all together, we can see that 9a9370c changed test.parquet by:

  • Dropping the "Gender" column
  • Sorting the rows by "Ride ID"
  • Sorting the columns in the schema / within each row.

It's a contrived example, but based on real comparisons I did on Parquet files in ctbk.dev. See also this similar example, from dvc-utils, dealing with gzipped CSVs of the same Citi Bike data.