`rm_between` with the same left and right boundaries gives undesired output
trinker opened this issue · 4 comments
trinker commented
Determine if the following is a bug and if so how to fix:
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'
rm_between(
    x,
    left = '"', right = '"',
    extract = TRUE
)
## [[1]]
## [1] "Salmo salar" "and Danube salmon" "Hucho hucho"
When we expect:
## [[1]]
## [1] "\"Salmo salar\"" "\"Hucho hucho\""
trinker commented
I think this is because `rm_between` by default does not include the left/right bounds in the match. It uses the regex `"(?<=\").*?(?=\")"` (from `S("@rm_between2", '"')`). The lookarounds do not consume the left/right bounds, so the closing quotation mark of one match is still available as the opening mark of the next, which lets `" and Danube salmon "` match as well. This is (IMO) a bug that I will address, but I am unsure how yet.
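The effect can be reproduced without `rm_between`. The following is a minimal base R sketch (it uses `regmatches`/`gregexpr` rather than the package internals, and the names `lookaround` and `consuming` are only illustrative) that contrasts the lookaround regex with one that consumes the quotation marks:

```r
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

# Lookarounds only assert that a quote is present; they do not consume it.
# The closing quote of one match can therefore open the next match.
lookaround <- regmatches(x, gregexpr('(?<=").*?(?=")', x, perl = TRUE))[[1]]

# Here the quotes are part of the match, so each quote is used exactly once.
consuming <- regmatches(x, gregexpr('".*?"', x, perl = TRUE))[[1]]

print(lookaround)
print(consuming)
```

With the lookaround version the text between the two quoted names is returned as a third match; with the consuming version only the two quoted names are returned, quotation marks included.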
trinker commented
@hwnd you suggested:
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'
rm_default(
    x,
    pattern = '(?<=")[^"]*',
    extract = TRUE
)
But this gives:
## [[1]]
## [1] "Salmo salar" " and Danube salmon " "Hucho hucho" ""
Not:
## [[1]]
## [1] "Salmo salar" "Hucho hucho"
hwndx commented
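In the case of quotes, where the left and right bounds are the same character, lookarounds should be avoided: they do not consume the quotes, so the text "in between" two quoted spans is matched as well.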
One possible workaround would be:
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'
gsub('^"|"$', '',
    rm_default(
        x,
        pattern = '"[^"]*"',
        extract = TRUE
    )[[1]]
)
Output:
## [1] "Salmo salar" "Hucho hucho"
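The same idea (match the whole quoted span, then strip the quotes) can be wrapped in a small helper. This is only a sketch in base R: `extract_quoted` and its `include_markers` argument are hypothetical names, not part of the package, and it handles only the double-quote case discussed here.

```r
# Hypothetical helper (not part of any package): extract double-quoted spans.
extract_quoted <- function(text, include_markers = FALSE) {
  # Match whole quoted spans; the quotes are consumed, so a closing quote
  # cannot be reused as the opening quote of the next match.
  hits <- regmatches(text, gregexpr('"[^"]*"', text))
  if (include_markers) {
    return(hits)
  }
  # Strip the leading and trailing quote from every match.
  lapply(hits, function(m) gsub('^"|"$', "", m))
}

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'
res_plain   <- extract_quoted(x)
res_markers <- extract_quoted(x, include_markers = TRUE)
print(res_plain)
print(res_markers)
```

Like `extract=TRUE` in the examples above, the helper returns a list with one character vector per input element, so it can be applied to a vector of strings.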