manojkarthick/pqrs

pqrs fails to read valid parquet file

Hoeze opened this issue · 1 comments

Hoeze commented

Reading the schema works:

#> RUST_BACKTRACE=full pqrs schema example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
Metadata for file: example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet

version: 2
num of rows: 4770
created by: Arrow2 - Native Rust implementation of Arrow
metadata:
  ARROW:schema: /////+8DAAAEAAAA8v///xQAAAAEAAEAAAAKAAsACAAKAAQA+P///wwAAAAIAAgAAAAEAAoAAACAAwAAMAMAALACAABsAgAA5AEAAKABAAAgAQAA0AAAAEgAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACwAAAGluZm9fU1ZUWVBFAOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAOz///84AAAAIAAAABgAAAACAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD0////IAAAAAEAAAAIAAkABAAIAAgAAABpbmZvX0VORAAAAADs////bAAAAGAAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAYAAABmaWx0ZXIAAPz///8EAAQABgAAAGZpbHRlcgAA7P///zAAAAAgAAAAGAAAAAEDAAAQABIABAAQABEACAAAAAwAAAAAAPr///8BAAYABgAEAAcAAABxdWFsaXR5AOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAkAAAByZWZlcmVuY2UAAADs////aAAAAFwAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAIAAABpZAAA/P///wQABAAKAAAAaWRlbnRpZmllcgAA7P///zgAAAAgAAAAGAAAAAIAAAAQABEABAAAABAACAAAAAwAAAAAAPT///8gAAAAAQAAAAgACQAEAAgACAAAAHBvc2l0aW9uAAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAoAAABjaHJvbW9zb21lAA==
message root {
  REQUIRED BYTE_ARRAY chromosome (STRING);
  REQUIRED INT32 position;
  REQUIRED group identifier (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY id (STRING);
    }
  }
  REQUIRED BYTE_ARRAY reference (STRING);
  REQUIRED group alternate (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY alternate (STRING);
    }
  }
  OPTIONAL FLOAT quality;
  REQUIRED group filter (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY filter (STRING);
    }
  }
  REQUIRED INT32 info_END;
  REQUIRED group info_TYPE (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY info_TYPE (STRING);
    }
  }
  REQUIRED BYTE_ARRAY info_SVTYPE (STRING);
}
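
For anyone wondering about the opaque ARROW:schema value: as far as I understand, it is the base64-encoded Arrow IPC schema message that Arrow-based writers embed in the footer. A minimal sketch, using only the first 16 characters of the value printed above, that decodes the IPC prefix (continuation marker plus metadata length). The helper name `ipc_prefix` is made up for illustration.

```python
import base64
import struct

# First 16 base64 characters of the ARROW:schema value shown above.
PREFIX = "/////+8DAAAEAAAA"

def ipc_prefix(b64_text):
    """Decode the leading 8 bytes of an encapsulated Arrow IPC message.

    Returns (continuation_marker, metadata_length), both read as
    little-endian 32-bit integers.
    """
    raw = base64.b64decode(b64_text)
    marker, length = struct.unpack("<Ii", raw[:8])
    return marker, length

marker, length = ipc_prefix(PREFIX)
print(hex(marker), length)  # prints: 0xffffffff 1007
```

This only shows that the embedded Arrow schema is well-formed at the framing level; it says nothing about the data pages that `pqrs head` fails on.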

Printing the rows with pqrs head does not; it panics:

#> RUST_BACKTRACE=full pqrs head example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("insufficient values read from column - expected: 1024, got: 0")', /data/ouga/home/ag_gagneur/hoelzlwi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-40.0.0/src/record/reader.rs:577:36
stack backtrace:
   0:     0x55c1eab8c3a1 - std::backtrace_rs::backtrace::libunwind::trace::h6aeaf83abc038fe6
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55c1eab8c3a1 - std::backtrace_rs::backtrace::trace_unsynchronized::h4f9875212db0ad97
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55c1eab8c3a1 - std::sys_common::backtrace::_print_fmt::h3f820027e9c39d3b
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55c1eab8c3a1 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hded4932df41373b3
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55c1eabb114f - core::fmt::rt::Argument::fmt::hc8ead7746b2406d6
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/rt.rs:138:9
   5:     0x55c1eabb114f - core::fmt::write::hb1cb56105a082ad9
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/mod.rs:1094:21
   6:     0x55c1eab8a071 - std::io::Write::write_fmt::h797fda7085c97e57
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/io/mod.rs:1713:15
   7:     0x55c1eab8c1b5 - std::sys_common::backtrace::_print::h492d3c92d7400346
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x55c1eab8c1b5 - std::sys_common::backtrace::print::hf74aa2eef05af215
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x55c1eab8d537 - std::panicking::default_hook::{{closure}}::h8cad394227ea3de8
  10:     0x55c1eab8d324 - std::panicking::default_hook::h249cc184fec99a8a
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:288:9
  11:     0x55c1eab8d9ec - std::panicking::rust_panic_with_hook::h82ebcd5d5ed2fad4
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:705:13
  12:     0x55c1eab8d8e7 - std::panicking::begin_panic_handler::{{closure}}::h810bed8ecbe66f1a
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:597:13
  13:     0x55c1eab8c7d6 - std::sys_common::backtrace::__rust_end_short_backtrace::h1410008071796261
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:151:18
  14:     0x55c1eab8d632 - rust_begin_unwind
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
  15:     0x55c1ea3efef3 - core::panicking::panic_fmt::ha0a42a25e0cf258d
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
  16:     0x55c1ea3f0393 - core::result::unwrap_failed::h100c4d67576990cf
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/result.rs:1651:5
  17:     0x55c1ea58111c - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
  18:     0x55c1ea581179 - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
  19:     0x55c1ea581971 - <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next::h612da20bf81bedfa
  20:     0x55c1ea40307f - pqrs::utils::print_rows::h9bf7a7f08e6bc5ee
  21:     0x55c1ea3f9ec3 - pqrs::commands::head::execute::h2058003142e3c2ac
  22:     0x55c1ea427b06 - pqrs::main::h38253338d29d66ac
  23:     0x55c1ea3fea3d - std::sys_common::backtrace::__rust_begin_short_backtrace::h2f1f623026f1777f
  24:     0x55c1ea41a5b8 - std::rt::lang_start::{{closure}}::hb53e3cd4c57743d8
  25:     0x55c1eab84755 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h5ce27e764c284c0a
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:284:13
  26:     0x55c1eab84755 - std::panicking::try::do_call::h4c1fc390ae241991
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
  27:     0x55c1eab84755 - std::panicking::try::h4d36e7eaed86af72
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
  28:     0x55c1eab84755 - std::panic::catch_unwind::h41cfb4dd65282b1e
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
  29:     0x55c1eab84755 - std::rt::lang_start_internal::{{closure}}::hfed411c1c5fdb925
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:48
  30:     0x55c1eab84755 - std::panicking::try::do_call::h6893f6f32a464342
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
  31:     0x55c1eab84755 - std::panicking::try::h52b7102f469a0567
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
  32:     0x55c1eab84755 - std::panic::catch_unwind::h62120054677916b5
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
  33:     0x55c1eab84755 - std::rt::lang_start_internal::hd66bf6b7da144005
                               at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:20
  34:     0x55c1ea428fa5 - main
  35:     0x7ffbc5575d85 - __libc_start_main
  36:     0x55c1ea3f065e - _start
  37:                0x0 - <unknown>

Here is the (zipped) file:
clinvar_chr1_pathogenic.vcf.gz.parquet.zip
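
A quick, dependency-free way to confirm that the file is at least framed like a Parquet file: the format starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length (little-endian). This is only a rough sketch; `parquet_framing` is a hypothetical helper name, and passing this check does not prove the pages inside are readable.

```python
import struct

MAGIC = b"PAR1"

def parquet_framing(path):
    """Return (has_header_magic, has_footer_magic, footer_length).

    footer_length is None when the file is too short to hold
    the header magic, the footer length and the trailing magic.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)            # jump to end of file to learn its size
        size = f.tell()
        if size < 12:
            return head == MAGIC, False, None
        f.seek(-8, 2)           # last 8 bytes: footer length + magic
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    return head == MAGIC, tail[4:] == MAGIC, footer_len
```

For a well-formed file, `parquet_framing("example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet")` should return `(True, True, <footer length>)`.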

Hoeze commented

FYI, pandas reads the file without any problems:

In [1]: import pandas as pd

In [2]: df = pd.read_parquet("example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet")

In [3]: df
Out[3]: 
     chromosome   position identifier reference alternate  quality filter   info_END info_TYPE info_SVTYPE
0             1     949523         []         C       [T]      NaN     []     949523     [SNP]            
1             1     949696         []         C      [CG]      NaN     []     949696   [INDEL]            
2             1     949739         []         G       [T]      NaN     []     949739     [SNP]            
3             1     957605         []         G       [A]      NaN     []     957605     [SNP]            
4             1     957693         []         A       [T]      NaN     []     957693     [SNP]            
...         ...        ...        ...       ...       ...      ...    ...        ...       ...         ...
4765          1  247588456         []         G       [A]      NaN     []  247588456     [SNP]            
4766          1  247588456         []         G       [C]      NaN     []  247588456     [SNP]            
4767          1  247588469         []         T       [C]      NaN     []  247588469     [SNP]            
4768          1  247588631         []         A       [G]      NaN     []  247588631     [SNP]            
4769          1  247599355         []         A       [G]      NaN     []  247599355     [SNP]            

[4770 rows x 10 columns]