trinker/qdapRegex

`rm_between` same left right boundaries gives undesired output

trinker opened this issue · 4 comments

Determine if the following is a bug and if so how to fix:

library(qdapRegex)

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

rm_between(
    x, 
    left = '"', right = '"',
    extract=TRUE
)

## [[1]]
## [1] "Salmo salar"       "and Danube salmon" "Hucho hucho"  

When we expect:

## [[1]]
## [1] "\"Salmo salar\"" "\"Hucho hucho\""

I think this is because rm_between, by default, does not include the left/right bounds in the match. It uses the following regex: "(?<=\").*?(?=\")" (S("@rm_between2", '"')). Because lookarounds do not consume the left/right bounds, the closing quotation mark of one match stays available as the opening mark of the next, which is how " and Danube salmon " gets matched. This is (IMO) a bug that I will address, but I am unsure how yet.
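The non-consuming behavior can be reproduced without qdapRegex. The following is a minimal sketch in base R, assuming the same lookaround pattern is run through `gregexpr()` with `perl = TRUE`; the variable names are illustrative and this is not how rm_between is implemented internally.

```r
x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

# Lookarounds assert that a quote is present but do not consume it, so each
# closing quote can also act as the opening quote of the following match.
m <- gregexpr('(?<=").*?(?=")', x, perl = TRUE)
lookaround_hits <- regmatches(x, m)[[1]]

lookaround_hits
## [1] "Salmo salar"         " and Danube salmon " "Hucho hucho"
```

The middle element is the text between the second and third quotation marks, which is exactly the unwanted "in between" piece.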

@hwnd you suggested:

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

rm_default(
    x, 
    pattern = '(?<=")[^"]*',
    extract=TRUE
)

But this gives:

## [[1]]
## [1] "Salmo salar"         " and Danube salmon " "Hucho hucho"         ""

Not:

## [[1]]
## [1] "Salmo salar" "Hucho hucho"

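When the left and right boundaries are the same character, as with quotes, lookarounds should be avoided: they leave the boundary unconsumed, so the text "in between" two quoted strings matches as well.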

One possible workaround would be:

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'

gsub('^"|"$', '', 
   rm_default(
       x, 
       pattern = '"[^"]*"', 
       extract=TRUE)[[1]]
   )

Output

## [1] "Salmo salar" "Hucho hucho"
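The same consume-then-strip idea can be sketched in base R alone. The helper name `extract_between_same` is hypothetical (it is not part of qdapRegex), and it assumes the delimiter is a single character with no special meaning in a regular expression, such as a double quote.

```r
# Hypothetical helper, not part of qdapRegex: extract the text between pairs
# of the same single-character delimiter.
# Assumption: `delim` is one character that is not a regex metacharacter.
extract_between_same <- function(text, delim = '"') {
    # Consume delimiter, a run of non-delimiter characters, then delimiter,
    # so a closing delimiter cannot be reused as the next opening one.
    pattern <- paste0(delim, "[^", delim, "]*", delim)
    hits <- regmatches(text, gregexpr(pattern, text))
    # Strip the leading and trailing delimiter from every hit.
    strip <- paste0("^", delim, "|", delim, "$")
    lapply(hits, function(h) gsub(strip, "", h))
}

x <- 'Fresh or chilled Atlantic salmon "Salmo salar" and Danube salmon "Hucho hucho"'
result <- extract_between_same(x)[[1]]

result
## [1] "Salmo salar" "Hucho hucho"
```

Like `rm_default(..., extract=TRUE)`, the sketch returns a list with one element per input string, so `[[1]]` pulls out the matches for `x`.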

@hwnd I incorporated your idea into rm_between. Thanks for the help.