tidyverse/vroom

segfault error reading large compressed file

n3ssuno opened this issue · 2 comments

Hi,

I cannot read a large zip file with vroom (v. 1.6.5) but I can read it once unzipped.

I saw that there are similar closed issues, but none of the suggested solutions work for me, and, if I understand correctly, they should already be fixed in the version I have installed.

r$> download.file("https://s3.amazonaws.com/data.patentsview.org/download/g_inventor_disambiguated.tsv.zip", destfile = "g_inventor_disambiguated.tsv.zip")

r$> df <- "g_inventor_disambiguated.tsv.zip" |>
  vroom::vroom(
    col_select = c(inventor_id),
    col_types = vroom::cols(inventor_id = vroom::col_character())
  )

r$> df

 *** caught segfault ***
address 0x7f4dafee9009, cause 'memory not mapped'

Traceback:
 1: vec_slice(x, seq_len(n))
 2: vec_head(as.data.frame(x), n)
 3: df_head(x, n)
 4: tbl_format_setup.tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines, focus = focus)
 5: tbl_format_setup_dispatch(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines, focus = focus)
 6: tbl_format_setup(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines, focus = attr(x, "pillar_focus"))
 7: format_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
 8: format.tbl(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
 9: format(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
10: writeLines(format(x, width = width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines))
11: print_tbl(x, width, ..., n = n, max_extra_cols = max_extra_cols,     max_footer_lines = max_footer_lines)
12: print.tbl(x)
13: (function (x, ...) UseMethod("print"))(x)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Here is some information about my environment:

r$> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS/LAPACK: /home/******/miniforge3/envs/******/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Rome
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.3.1

Please let me know if you need any extra information.
I hope this helps to solve the issue.

I second this

This can be fixed by setting TMPDIR to a directory on a disk with enough free space.
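For anyone else landing here, a minimal sketch of that workaround. R chooses its session temporary directory from TMPDIR at startup, so the variable has to be exported before R is launched; changing it from inside a running session does not move tempdir(). The directory below (and the BIG_TMP variable name) is a hypothetical placeholder, so point it at whichever disk actually has room for the decompressed file.

```shell
# BIG_TMP is a hypothetical location; replace it with a directory on a
# disk that has enough free space for the decompressed data.
BIG_TMP="${BIG_TMP:-$HOME/vroom-tmp}"
mkdir -p "$BIG_TMP"

# Show how much space is available on that disk.
df -h "$BIG_TMP"

# Export TMPDIR so that any R process started from this shell uses it.
export TMPDIR="$BIG_TMP"

# If R is installed, print the session temp directory to confirm that it
# now lives under $BIG_TMP.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'cat(tempdir(), "\n")'
fi
```

To make the setting permanent, the same TMPDIR=... line can go in ~/.Renviron, which R reads before it creates the session temporary directory.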