apache/arrow-rs

Parquet file cannot be deduped via arrow-rs, complains about DECIMAL precision?

EMCP opened this issue · 9 comments

EMCP commented

Describe the bug

I am trying out arrow-rs for the first time, with the eventual goal of migrating off the Python implementation. One of the newest files to come across my bench started throwing an exception during this routine to dedupe data, and I am unsure why.

Here's the routine:


use std::fs;
use std::path::Path;

use polars::prelude::*;

fn example_get_frame(some_file_path: &str) -> PolarsResult<DataFrame> {
    let r = fs::File::open(some_file_path).unwrap();
    // Read the whole parquet file into a DataFrame
    ParquetReader::new(r).finish()
}

fn dedupe_parquet_file(entry: walkdir::DirEntry, output_dir: String) {

    println!("modifying!");
    let df = example_get_frame(entry.path().to_str().unwrap());

    // Drop duplicate rows, keeping the first occurrence
    let mut new_df = df.expect("").unique(None, UniqueKeepStrategy::First).expect("");

    // TODO: build and verify a proper path
    let new_output_filepath = Path::new(output_dir.as_str()).join(entry.file_name().to_str().unwrap());
    println!("{}", new_output_filepath.to_str().unwrap());
    let mut file = fs::File::create(new_output_filepath).unwrap();
    ParquetWriter::new(&mut file).finish(&mut new_df).unwrap();

    println!();

}

The Error

thread 'main' panicked at ': ArrowError(ExternalFormat("File out of specification: Invalid DECIMAL: scale (1) cannot be greater than or equal to precision (1)"))', src/main.rs:21:25
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::result::unwrap_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
   3: core::result::Result<T,E>::expect
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1047:23
   4: parquet_dedupe_data::dedupe_parquet_file
             at ./src/main.rs:21:22
   5: parquet_dedupe_data::main
             at ./src/main.rs:53:13
   6: core::ops::function::FnOnce::call_once
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

To Reproduce
As you can see, I walk the input directory, find parquet files, and attempt to dedupe them.

Expected behavior

I am thinking either there's an error in my data, or this decimal case is not well supported by arrow-rs.

Additional context

Here's the schema of the offending file

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "category",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "maturity",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "liquid_hours",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "long_name",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "contract_month",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "real_expiration_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "under_sec_type",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "trading_hours",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "ev_rule",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "time_zone_id",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "next_option_partial",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "next_option_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "price_magnifier",
    "type" : [ "null", {
      "type" : "fixed",
      "name" : "price_magnifier",
      "size" : 2,
      "logicalType" : "decimal",
      "precision" : 4,
      "scale" : 1
    } ],
    "default" : null
  }, {
    "name" : "agg_group",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "stock_type",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "under_symbol",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "market_rule_ids",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "query_start_time",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "last_trade_time",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "convertible",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "coupon",
    "type" : [ "null", {
      "type" : "fixed",
      "name" : "coupon",
      "size" : 1,
      "logicalType" : "decimal",
      "precision" : 1,
      "scale" : 1
    } ],
    "default" : null
  }, {
    "name" : "cusip_check_digit",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "callable",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "isin",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "issue_date",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "ratings",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "putable",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "min_tick",
    "type" : [ "null", {
      "type" : "fixed",
      "name" : "min_tick",
      "size" : 2,
      "logicalType" : "decimal",
      "precision" : 4,
      "scale" : 4
    } ],
    "default" : null
  }, {
    "name" : "market_name",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "order_types",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "next_option_type",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "suggested_size_increment",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "bond_type",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "industry",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "contract_id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "ev_multiplier",
    "type" : [ "null", {
      "type" : "fixed",
      "name" : "ev_multiplier",
      "size" : 1,
      "logicalType" : "decimal",
      "precision" : 1,
      "scale" : 1
    } ],
    "default" : null
  }, {
    "name" : "subcategory",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "min_size",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "under_contract_id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "cusip",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "coupon_type",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "desc_append",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "size_increment",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "notes",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
EMCP commented

seems perhaps related to #2852

This is a bug in whatever produced your data. A scale of 1 implies that the data is stored multiplied by 10, but it only has a precision of a single decimal digit. Scale must be strictly less than the precision, so the parquet data is invalid.

EMCP commented

Ah hah, thank you! I was thrown off because it was working with the pyarrow implementation without warning. Will close this and look into the upstream data creation in target-parquet.

Actually, coming back to this, I may have misled you. The precision and the scale can be equal; it merely implies a value less than 1. However, this was fixed in #1607. Is it possible you are using a very old arrow version?
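To make the corrected rule concrete, here is a small sketch in plain Rust (the helper `decimal_value` is hypothetical, not an arrow-rs API): a decimal's logical value is its unscaled integer divided by 10^scale, and precision only bounds the number of significant digits, so precision == scale simply means every representable value has magnitude below 1.

```rust
// Hypothetical helper: map a decimal's unscaled integer to its logical value.
// precision bounds the significant digits; scale is the digits after the
// decimal point. precision == scale (as in the `coupon` field above) is legal
// and just means |value| < 1.
fn decimal_value(unscaled: i64, scale: u32) -> f64 {
    unscaled as f64 / 10f64.powi(scale as i32)
}

fn main() {
    // like `price_magnifier` above: precision 4, scale 1 -> 1234 encodes 123.4
    assert_eq!(decimal_value(1234, 1), 123.4);
    // like `min_tick` above: precision 4, scale 4 -> 1234 encodes 0.1234
    assert_eq!(decimal_value(1234, 4), 0.1234);
    // like `coupon` above: precision 1, scale 1 -> 7 encodes 0.7
    assert_eq!(decimal_value(7, 1), 0.7);
}
```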

EMCP commented

I personally pushed an update to the library https://github.com/estrategiahq/target-parquet, but it still defaults to writing parquet spec 1.x files for backwards compatibility.

If I bump it explicitly to output parquet 2.4+, will this perhaps get fixed?

https://github.com/estrategiahq/target-parquet/blob/master/setup.py#L16 — here you can see it calls for pyarrow 10.x

I was referring to a very old arrow-rs version being used to read it; recent versions of the Rust library shouldn't produce the linked panic.

EMCP commented

Ah, my bad.

Checking my Cargo.lock, I am seeing:


[[package]]
name = "polars-arrow"
version = "0.27.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "06e57a7b929edf6c73475dbc3f63d35152f14f4a9455476acc6127d770daa0f6"
dependencies = [
 "arrow2",
 "hashbrown 0.13.2",
 "num",
 "thiserror",
]


Aah, it would appear you are using https://github.com/jorgecarleitao/arrow2, not this repo. Arrow2 forked large portions of arrow-rs and appears to have copied across a bug that has since been fixed in arrow-rs.

There have been discussions about polars migrating off arrow2, but they appear to have stalled, so I suspect you should file an issue on the polars and/or arrow2 repositories.
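As a quick way to confirm which arrow implementation a build actually pulls in, one can scan the lockfile text for the crate names mentioned in this thread. A minimal sketch (the helper `arrow_crates` is hypothetical, and the inline lockfile contents are a trimmed placeholder, not a real Cargo.lock):

```rust
// Sketch: check Cargo.lock text for the arrow implementations named in this
// thread. The crate names are real; the sample lockfile below is a placeholder
// with version/checksum lines omitted for brevity.
fn arrow_crates(lockfile: &str) -> Vec<&'static str> {
    ["arrow", "arrow2", "polars-arrow"]
        .into_iter()
        .filter(|name| lockfile.contains(&format!("name = \"{name}\"")))
        .collect()
}

fn main() {
    let lock = "[[package]]\nname = \"polars-arrow\"\n\n[[package]]\nname = \"arrow2\"\n";
    // polars 0.27 links arrow2, not arrow-rs's `arrow` crate
    assert_eq!(arrow_crates(lock), vec!["arrow2", "polars-arrow"]);
}
```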

alamb commented

FYI @ritchie46 (this bug was reported against arrow-rs but is actually a bug in polars)