For development, we use overcommit to manage git hooks. It is installed through Ruby; see the installation instructions here. Once installed, run `overcommit --install` in this directory. This installs the hooks, including a pre-commit hook that runs scalastyle. A commit will then look like this:
base ❯ git commit -m 'added overcommit for hooks'
Running pre-commit hooks
✓ All pre-commit hooks passed
Running commit-msg hooks
Check subject line................................[SingleLineSubject] OK
Check subject capitalization.....................[CapitalizedSubject] WARNING
Subject should start with a capital letter
Check for trailing periods in subject................[TrailingPeriod] OK
Check text width..........................................[TextWidth] OK
⚠ All commit-msg hooks passed, but with warnings
Space-filling curves allow us to map n-dimensional data onto one dimension while preserving locality. Techniques such as
z-ordering
allow big data platforms to store and process large chunks of data efficiently.
- Processing Petabytes of Data in Seconds with Databricks Delta
- Z-order curve
- Z-order indexing for multifaceted queries in Amazon DynamoDB: Part 1
- Z-order indexing for multifaceted queries in Amazon DynamoDB: Part 2
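To make z-ordering concrete, here is a minimal, library-independent sketch of Morton (Z) encoding for two small non-negative integer coordinates. The `interleaveBits` helper is illustrative only and is not part of this project's API.

```scala
// Interleave the bits of two coordinates so that points close together in 2-D
// tend to be close in the resulting 1-D index. Illustrative helper only.
def interleaveBits(x: Int, y: Int): Long = {
  var z = 0L
  for (i <- 0 until 16) {
    z |= ((x >> i) & 1L) << (2 * i)       // bit i of x lands on an even position
    z |= ((y >> i) & 1L) << (2 * i + 1)   // bit i of y lands on an odd position
  }
  z
}

interleaveBits(3, 5) // 39: bits of 3 (011) and 5 (101) interleave to 100111
```

Sorting or partitioning rows by such an index is what lets columnar formats keep spatially close records in the same files.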
- Spark-2.3.1 on Scala 2.11.12
- Spark-2.4.7 on Scala 2.11.12 and Scala 2.12.13
- Spark-3.1.0 on Scala 2.12.13 and Java 11 (library versions 0.1.0 and 0.2.0)
How to determine Morton (Z) or Hilbert Ordering.
Given the dataframe below, we want to Morton (Z) order our data by `id`, `x`, and `y`.
// Currently, this isn't set up to be pulled from Maven.
// For now, publish locally or build an assembly jar and use that.
val orderingCols: Array[String] = Array("id", "x", "y")
val df: DataFrame = Seq(
(1, 1, 12.23, "a", "m"),
(4, 9, 5.05, "b", "m"),
(3, 0, 1.23, "c", "f"),
(2, 2, 100.4, "d", "f"),
(1, 25, 3.25, "a", "m")
).toDF("x", "y", "amnt", "id", "sex")
val mortonOrdering: Morton = new Morton(df, orderingCols)
// This orders the whole dataframe by the generated z_index column.
val zIndexedDF: DataFrame = mortonOrdering.mortonIndex.sort("z_index")
Hilbert is only available in version 0.2.0 on Spark 3.
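Usage presumably mirrors the Morton example above. The names in this sketch (`Hilbert`, `hilbertIndex`, and the `h_index` column) are assumptions by analogy, so check the library source for the actual API.

```scala
// Assumed, Morton-analogous API; the class, method, and column names are guesses.
val hilbertOrdering = new Hilbert(df, orderingCols)
val hilbertIndexedDF: DataFrame = hilbertOrdering.hilbertIndex.sort("h_index")
```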
How do space-filling curves help in practice? Let's consider the Chicago crime data set available at Crimes - 2001 to Present. The data was pulled on 8 August 2021. The downloaded csv file is 1.74 GB and contains 7,374,374 records. First, I converted the csv to parquet with the default compression of snappy.
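A conversion along these lines produces the parquet variants compared in the table below. The sketch uses plain Spark with a SparkSession named `spark` (as in spark-shell); the paths and csv read options are assumptions, not a record of the exact commands used.

```scala
// Read the downloaded csv and rewrite it as parquet. Swap "snappy" for "gzip"
// to get the smaller variants shown in the table.
val crimes = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/Crimes_-_2001_to_Present.csv")

crimes.write
  .option("compression", "snappy")
  .parquet("/data/crimes_snappy.parquet")
```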
| File Type | Compression | Number of Leaf Files | Optimization | Size (MB) |
|---|---|---|---|---|
| CSV | None | 1 | None | 1781.76 |
| Parquet | Snappy | 13 | None | 470.02 |
| Parquet | gzip | 13 | None | 315.22 |
| Parquet | gzip | 1 | Semi-linear | 269.81 |
| Parquet | gzip | 1 | Z-order | |
| Parquet | gzip | 1 | Hilbert | |
The snappy conversion resulted in 13 leaf files of approximately 38 MB each, for a total size of 0.459 GB.
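For reference, one way a single-file, z-ordered, gzip-compressed variant like the Z-order row above could be produced is sketched below, combining the Morton example with standard Spark write options. The ordering columns and paths are assumptions, and this is not necessarily how the numbers in the table were generated.

```scala
// Sketch only: order the crimes dataframe by its Morton index, collapse to a
// single leaf file, and write gzip-compressed parquet. Column names are assumed.
val crimeOrderingCols = Array("Latitude", "Longitude")
val crimeZIndexed = new Morton(crimes, crimeOrderingCols).mortonIndex.sort("z_index")

crimeZIndexed
  .coalesce(1)                       // one leaf file, as in the table
  .write
  .option("compression", "gzip")
  .parquet("/data/crimes_zorder_gzip.parquet")
```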
- README
- Better organization
Looking for help from those experienced with creating decent READMEs and publishing code to Maven.