pqrs fails to read valid parquet file
Hoeze opened this issue · 1 comment
Hoeze commented
Reading the schema works:
#> RUST_BACKTRACE=full pqrs schema example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
Metadata for file: example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
version: 2
num of rows: 4770
created by: Arrow2 - Native Rust implementation of Arrow
metadata:
ARROW:schema: /////+8DAAAEAAAA8v///xQAAAAEAAEAAAAKAAsACAAKAAQA+P///wwAAAAIAAgAAAAEAAoAAACAAwAAMAMAALACAABsAgAA5AEAAKABAAAgAQAA0AAAAEgAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACwAAAGluZm9fU1ZUWVBFAOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAPz///8EAAQACQAAAGluZm9fVFlQRQAAAOz///84AAAAIAAAABgAAAACAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD0////IAAAAAEAAAAIAAkABAAIAAgAAABpbmZvX0VORAAAAADs////bAAAAGAAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAYAAABmaWx0ZXIAAPz///8EAAQABgAAAGZpbHRlcgAA7P///zAAAAAgAAAAGAAAAAEDAAAQABIABAAQABEACAAAAAwAAAAAAPr///8BAAYABgAEAAcAAABxdWFsaXR5AOz///9wAAAAZAAAABgAAAAMAAAAEAARAAQAAAAQAAgAAAAMAAEAAAAEAAAA7P///ywAAAAgAAAAGAAAAAUAAAAQABEABAAAABAACAAAAAwAAAAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAPz///8EAAQACQAAAGFsdGVybmF0ZQAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAkAAAByZWZlcmVuY2UAAADs////aAAAAFwAAAAYAAAADAAAABAAEQAEAAAAEAAIAAAADAABAAAABAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAIAAABpZAAA/P///wQABAAKAAAAaWRlbnRpZmllcgAA7P///zgAAAAgAAAAGAAAAAIAAAAQABEABAAAABAACAAAAAwAAAAAAPT///8gAAAAAQAAAAgACQAEAAgACAAAAHBvc2l0aW9uAAAAAOz///8sAAAAIAAAABgAAAAFAAAAEAARAAQAAAAQAAgAAAAMAAAAAAD8////BAAEAAoAAABjaHJvbW9zb21lAA==
message root {
  REQUIRED BYTE_ARRAY chromosome (STRING);
  REQUIRED INT32 position;
  REQUIRED group identifier (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY id (STRING);
    }
  }
  REQUIRED BYTE_ARRAY reference (STRING);
  REQUIRED group alternate (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY alternate (STRING);
    }
  }
  OPTIONAL FLOAT quality;
  REQUIRED group filter (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY filter (STRING);
    }
  }
  REQUIRED INT32 info_END;
  REQUIRED group info_TYPE (LIST) {
    REPEATED group list {
      REQUIRED BYTE_ARRAY info_TYPE (STRING);
    }
  }
  REQUIRED BYTE_ARRAY info_SVTYPE (STRING);
}
Printing the rows with `pqrs head` does not:
#> RUST_BACKTRACE=full pqrs head example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("insufficient values read from column - expected: 1024, got: 0")', /data/ouga/home/ag_gagneur/hoelzlwi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-40.0.0/src/record/reader.rs:577:36
stack backtrace:
0: 0x55c1eab8c3a1 - std::backtrace_rs::backtrace::libunwind::trace::h6aeaf83abc038fe6
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x55c1eab8c3a1 - std::backtrace_rs::backtrace::trace_unsynchronized::h4f9875212db0ad97
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x55c1eab8c3a1 - std::sys_common::backtrace::_print_fmt::h3f820027e9c39d3b
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:65:5
3: 0x55c1eab8c3a1 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hded4932df41373b3
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:44:22
4: 0x55c1eabb114f - core::fmt::rt::Argument::fmt::hc8ead7746b2406d6
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/rt.rs:138:9
5: 0x55c1eabb114f - core::fmt::write::hb1cb56105a082ad9
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/fmt/mod.rs:1094:21
6: 0x55c1eab8a071 - std::io::Write::write_fmt::h797fda7085c97e57
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/io/mod.rs:1713:15
7: 0x55c1eab8c1b5 - std::sys_common::backtrace::_print::h492d3c92d7400346
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:47:5
8: 0x55c1eab8c1b5 - std::sys_common::backtrace::print::hf74aa2eef05af215
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:34:9
9: 0x55c1eab8d537 - std::panicking::default_hook::{{closure}}::h8cad394227ea3de8
10: 0x55c1eab8d324 - std::panicking::default_hook::h249cc184fec99a8a
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:288:9
11: 0x55c1eab8d9ec - std::panicking::rust_panic_with_hook::h82ebcd5d5ed2fad4
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:705:13
12: 0x55c1eab8d8e7 - std::panicking::begin_panic_handler::{{closure}}::h810bed8ecbe66f1a
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:597:13
13: 0x55c1eab8c7d6 - std::sys_common::backtrace::__rust_end_short_backtrace::h1410008071796261
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/sys_common/backtrace.rs:151:18
14: 0x55c1eab8d632 - rust_begin_unwind
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5
15: 0x55c1ea3efef3 - core::panicking::panic_fmt::ha0a42a25e0cf258d
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14
16: 0x55c1ea3f0393 - core::result::unwrap_failed::h100c4d67576990cf
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/result.rs:1651:5
17: 0x55c1ea58111c - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
18: 0x55c1ea581179 - parquet::record::reader::Reader::advance_columns::he78d66a8310bbc6d
19: 0x55c1ea581971 - <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next::h612da20bf81bedfa
20: 0x55c1ea40307f - pqrs::utils::print_rows::h9bf7a7f08e6bc5ee
21: 0x55c1ea3f9ec3 - pqrs::commands::head::execute::h2058003142e3c2ac
22: 0x55c1ea427b06 - pqrs::main::h38253338d29d66ac
23: 0x55c1ea3fea3d - std::sys_common::backtrace::__rust_begin_short_backtrace::h2f1f623026f1777f
24: 0x55c1ea41a5b8 - std::rt::lang_start::{{closure}}::hb53e3cd4c57743d8
25: 0x55c1eab84755 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h5ce27e764c284c0a
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:284:13
26: 0x55c1eab84755 - std::panicking::try::do_call::h4c1fc390ae241991
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
27: 0x55c1eab84755 - std::panicking::try::h4d36e7eaed86af72
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
28: 0x55c1eab84755 - std::panic::catch_unwind::h41cfb4dd65282b1e
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
29: 0x55c1eab84755 - std::rt::lang_start_internal::{{closure}}::hfed411c1c5fdb925
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:48
30: 0x55c1eab84755 - std::panicking::try::do_call::h6893f6f32a464342
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:500:40
31: 0x55c1eab84755 - std::panicking::try::h52b7102f469a0567
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:464:19
32: 0x55c1eab84755 - std::panic::catch_unwind::h62120054677916b5
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panic.rs:142:14
33: 0x55c1eab84755 - std::rt::lang_start_internal::hd66bf6b7da144005
at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/rt.rs:148:20
34: 0x55c1ea428fa5 - main
35: 0x7ffbc5575d85 - __libc_start_main
36: 0x55c1ea3f065e - _start
37: 0x0 - <unknown>
Here is the (zipped) file:
clinvar_chr1_pathogenic.vcf.gz.parquet.zip
Hoeze commented
FYI, pandas reads the file without any problems:
In [1]: import pandas as pd
In [2]: df = pd.read_parquet("example/output/vcf.parquet/clinvar_chr1_pathogenic.vcf.gz.parquet")
In [3]: df
Out[3]:
      chromosome   position  identifier  reference  alternate  quality  filter   info_END  info_TYPE  info_SVTYPE
0              1     949523          []          C        [T]      NaN      []     949523      [SNP]
1              1     949696          []          C       [CG]      NaN      []     949696    [INDEL]
2              1     949739          []          G        [T]      NaN      []     949739      [SNP]
3              1     957605          []          G        [A]      NaN      []     957605      [SNP]
4              1     957693          []          A        [T]      NaN      []     957693      [SNP]
...          ...        ...         ...        ...        ...      ...     ...        ...        ...          ...
4765           1  247588456          []          G        [A]      NaN      []  247588456      [SNP]
4766           1  247588456          []          G        [C]      NaN      []  247588456      [SNP]
4767           1  247588469          []          T        [C]      NaN      []  247588469      [SNP]
4768           1  247588631          []          A        [G]      NaN      []  247588631      [SNP]
4769           1  247599355          []          A        [G]      NaN      []  247599355      [SNP]
[4770 rows x 10 columns]
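A possible explanation, offered as an unverified guess and not a confirmed diagnosis: in the pandas output, `identifier` and `filter` are empty lists in every row shown. For a REQUIRED list with a REQUIRED element, the Parquet nested encoding stores an empty list as a single definition level of 0 with no value, so a batch of 1024 empty lists carries 1024 levels but 0 values. That would match the "expected: 1024, got: 0" in the panic message. The sketch below illustrates this encoding in plain Python. `shred_required_list` is a hypothetical helper written only for this illustration; it is not part of pqrs, pandas, or the parquet crate, and it does not read the attached file.

```python
def shred_required_list(rows):
    """Shred rows of a column shaped like
        REQUIRED group x (LIST) { REPEATED group list { REQUIRED BYTE_ARRAY e; } }
    into definition levels, repetition levels, and values.

    Only the REPEATED node adds a level here, so both the maximum
    definition level and the maximum repetition level are 1.
    """
    def_levels, rep_levels, values = [], [], []
    for row in rows:
        if not row:
            # Empty list: one level entry, but no value is stored.
            def_levels.append(0)
            rep_levels.append(0)
            continue
        for i, item in enumerate(row):
            def_levels.append(1)                   # element is present
            rep_levels.append(0 if i == 0 else 1)  # 0 starts a new row
            values.append(item)
    return def_levels, rep_levels, values


if __name__ == "__main__":
    # Assumption: a batch of 1024 rows that are all empty lists,
    # like the `identifier` and `filter` columns appear to be.
    batch = [[] for _ in range(1024)]
    defs, reps, vals = shred_required_list(batch)
    print(len(defs), len(vals))
```

If this guess is right, a reader that expects one value per level entry would come up short on such a column, while readers that account for definition levels (as pandas evidently does here) would not.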