Pre-filtering through pipe connections does not work on remote files
DiabbZegpi opened this issue · 1 comments
Trying to use the pre-filtering capacity of awk
or grep
through pipe()
connections works in a flawless fashion on local text files, BUT this functionality does not seem to work on remote files.
For reproducibility, this is an example with the UK's cats dataset from the Tidy Tuesday Project. The code below attempts to filter for those rows in which the first field has the string "Ares", but returns an empty tibble.
library(vroom)
cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"
# Example with `awk`
vroom(
file = pipe(paste("awk -F ',' '$1 ~ /Ares/'", cats_file)),
delim = ","
)
# Same filter but with `grep`
vroom(
file = pipe(paste("grep '^Ares,'", cats_file)),
delim = ","
)
Downloading the csv and performing the same operation, but with the local path, works smoothly.
Am I making a mistake? What is the correct way of performing pre-filtering of remote text files with vroom?
Yes, you can do this by using curl
to stream the data then (unix) pipe that into grep
, awk
, or (see below) perl
and put all of that into (R) pipe()
.
Here I'm using the techniques shown by @jimhester in this YouTube video: https://youtu.be/RYhwZW6ofbI?si=4uRdHcgsSBckd8n4&t=825
Note the last example, which gives a nicer result than the first two, because it also gets the column names.
library(vroom)
library(glue)
cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"
cmd <- glue("curl -sL {cats_file} | grep '^Ares,'")
vroom(pipe(cmd), col_names = FALSE)
#> Rows: 103 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): X1, X11
#> dbl (5): X2, X5, X6, X7, X8
#> lgl (3): X3, X9, X10
#> dttm (1): X4
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 103 × 11
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <lgl> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 Ares 3.40e9 TRUE 2017-06-24 01:03:57 -5.11 50.2 684 155. FALSE FALSE
#> 2 Ares 3.40e9 TRUE 2017-06-24 01:11:20 -5.11 50.2 936 155. FALSE FALSE
#> 3 Ares 3.40e9 TRUE 2017-06-24 02:58:16 -5.11 50.2 2340 81.4 FALSE FALSE
#> 4 Ares 3.40e9 TRUE 2017-06-24 03:01:26 -5.11 50.2 0 67.8 FALSE FALSE
#> 5 Ares 3.40e9 TRUE 2017-06-24 03:51:58 -5.11 50.2 4896 118. FALSE FALSE
#> 6 Ares 3.40e9 TRUE 2017-06-24 03:55:03 -5.11 50.2 504 123. FALSE FALSE
#> 7 Ares 3.40e9 TRUE 2017-06-24 05:24:17 -5.11 50.2 108 27.1 FALSE FALSE
#> 8 Ares 3.40e9 TRUE 2017-06-24 05:27:23 -5.11 50.2 504 36.3 FALSE FALSE
#> 9 Ares 3.40e9 TRUE 2017-06-24 09:14:37 -5.11 50.2 252 67.4 FALSE FALSE
#> 10 Ares 3.40e9 TRUE 2017-06-24 09:17:42 -5.11 50.2 4068 78.9 FALSE FALSE
#> # ℹ 93 more rows
#> # ℹ 1 more variable: X11 <chr>
cmd <- glue("curl -sL {cats_file} | awk -F ',' '$1 ~ /Ares/'")
vroom(pipe(cmd), col_names = FALSE)
#> Rows: 103 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): X1, X11
#> dbl (5): X2, X5, X6, X7, X8
#> lgl (3): X3, X9, X10
#> dttm (1): X4
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 103 × 11
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <lgl> <dttm> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 Ares 3.40e9 TRUE 2017-06-24 01:03:57 -5.11 50.2 684 155. FALSE FALSE
#> 2 Ares 3.40e9 TRUE 2017-06-24 01:11:20 -5.11 50.2 936 155. FALSE FALSE
#> 3 Ares 3.40e9 TRUE 2017-06-24 02:58:16 -5.11 50.2 2340 81.4 FALSE FALSE
#> 4 Ares 3.40e9 TRUE 2017-06-24 03:01:26 -5.11 50.2 0 67.8 FALSE FALSE
#> 5 Ares 3.40e9 TRUE 2017-06-24 03:51:58 -5.11 50.2 4896 118. FALSE FALSE
#> 6 Ares 3.40e9 TRUE 2017-06-24 03:55:03 -5.11 50.2 504 123. FALSE FALSE
#> 7 Ares 3.40e9 TRUE 2017-06-24 05:24:17 -5.11 50.2 108 27.1 FALSE FALSE
#> 8 Ares 3.40e9 TRUE 2017-06-24 05:27:23 -5.11 50.2 504 36.3 FALSE FALSE
#> 9 Ares 3.40e9 TRUE 2017-06-24 09:14:37 -5.11 50.2 252 67.4 FALSE FALSE
#> 10 Ares 3.40e9 TRUE 2017-06-24 09:17:42 -5.11 50.2 4068 78.9 FALSE FALSE
#> # ℹ 93 more rows
#> # ℹ 1 more variable: X11 <chr>
cmd <- glue("curl -sL {cats_file} | perl -ne 'use English; print if $INPUT_LINE_NUMBER == 1 || /^Ares,/'")
vroom(pipe(cmd))
#> Rows: 103 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): tag_id, study_name
#> dbl (5): event_id, location_long, location_lat, ground_speed, height_above_...
#> lgl (3): visible, algorithm_marked_outlier, manually_marked_outlier
#> dttm (1): timestamp
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 103 × 11
#> tag_id event_id visible timestamp location_long location_lat
#> <chr> <dbl> <lgl> <dttm> <dbl> <dbl>
#> 1 Ares 3395610551 TRUE 2017-06-24 01:03:57 -5.11 50.2
#> 2 Ares 3395610552 TRUE 2017-06-24 01:11:20 -5.11 50.2
#> 3 Ares 3395610553 TRUE 2017-06-24 02:58:16 -5.11 50.2
#> 4 Ares 3395610554 TRUE 2017-06-24 03:01:26 -5.11 50.2
#> 5 Ares 3395610555 TRUE 2017-06-24 03:51:58 -5.11 50.2
#> 6 Ares 3395610556 TRUE 2017-06-24 03:55:03 -5.11 50.2
#> 7 Ares 3395610557 TRUE 2017-06-24 05:24:17 -5.11 50.2
#> 8 Ares 3395610558 TRUE 2017-06-24 05:27:23 -5.11 50.2
#> 9 Ares 3395610559 TRUE 2017-06-24 09:14:37 -5.11 50.2
#> 10 Ares 3395610560 TRUE 2017-06-24 09:17:42 -5.11 50.2
#> # ℹ 93 more rows
#> # ℹ 5 more variables: ground_speed <dbl>, height_above_ellipsoid <dbl>,
#> # algorithm_marked_outlier <lgl>, manually_marked_outlier <lgl>,
#> # study_name <chr>
Created on 2023-09-28 with reprex v2.0.2.9000