[Bug]test-utils/ts-match-string doesn't work with trailing lookaheads.

Question

PaulJPhilp opened this issue 8 months ago · 14 comments

PaulJPhilp commented 8 months ago

Describe the bug
A regular expression that ends with a lookahead is not supported by ts-match-string.

To Reproduce
Steps to reproduce the behaviour:

Building a regex to match the Scheme ("ftp:") of a URL.
Use a positive lookahead to match the trailing ':'.

Create a regex that has a trailing positive lookahead:
const urlScheme = buildRegExp(
[choiceOf(schemePlural, schemeSingular), lookahead(':')],
);
Create tests using the character from the positive lookahead as the final character.
1) expect(urlSchemeValidator).toMatchString('ftp:');
2) expect("ftp:").toMatch(urlSchemeValidator);
3) expect('ftp').toMatch(urlSchemeValidator); // for completeness
4) expect(urlSchemeValidator).toMatchString('ftp'); // for completeness

Expected behaviour
Test 1) and 2) should pass and 3) and 4) should not pass.
Instead, all 4 tests fail.

Package version
ts-regex-builder:

Additional context

I have a solution to propose in an upcoming PR.

Answer 1 · 2024-04-14T14:17:21.000Z

@PaulJPhilp thank you for reporting it. Does it also occur when you write the regex literal by hand? Is it a library issue or a JS regex limitation?

Answer 2 · 2024-04-16T23:50:25.000Z

No. It is just using the library. I understand the issue better now.

The problem is in the testing utility toMatchString(). I have come to think the issue is missing functionality:

I created a regex builder (urlScheme) to recognize a URL scheme, e.g. "http:". I used a positive lookahead to match the ':'.
Now I want to test it. I can:

expect(urlScheme).toMatchString("http") - the result is false because 'http' does not have the ':' character to match the lookahead.
expect(urlScheme).toMatchString("http:") - the result is false because there is no character after the ':'. It's a subtle issue. In the end, I had to manually walk through the state machine to understand what is happening. The positive lookahead matches the ':' but does not consume it. So there is one more character left in the string but the pattern is exhausted, so the match fails.

Conceptually, you need a toMatchSubstring('http', 'http:') which matches because 'http' is a substring of 'http:'.
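A minimal sketch of the behaviour described above, using a plain regex literal rather than the library. The pattern `/[a-z]{2,6}(?=:)/` is only a stand-in for `urlScheme`, and the helper names `matchedText` and `matchesWholeString` are hypothetical; they are not part of ts-regex-builder or its test utilities.

```typescript
// Hypothetical stand-in for the urlScheme pattern discussed above.
const schemeBeforeColon = /[a-z]{2,6}(?=:)/;

// Returns the text the regex matched, or null when there is no match.
function matchedText(regex: RegExp, text: string): string | null {
  const found = text.match(regex);
  return found === null ? null : found[0];
}

// A "whole string" check: the match must cover every character of the input.
function matchesWholeString(regex: RegExp, text: string): boolean {
  return matchedText(regex, text) === text;
}

// No ':' for the lookahead to see, so there is no match at all.
const noColon = matchedText(schemeBeforeColon, 'http');

// The ':' is checked by the lookahead but is not part of the matched text.
const withColon = matchedText(schemeBeforeColon, 'http:');

// The matched text 'http' is shorter than the input 'http:'.
const wholeString = matchesWholeString(schemeBeforeColon, 'http:');

console.log(noColon, withColon, wholeString);
```

Under these assumptions, a substring-style assertion would compare against `withColon`, while a whole-string assertion corresponds to `wholeString`.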

I tried a few APIs and settled on:
`interface MatchTypeOptions {
exactString: boolean;
substring?: string;
}

export function toMatchString(
this: jest.MatcherContext,
received: RegExp | RegexSequence,
expected: string,
matchType?: MatchTypeOptions,
)` in my PR.

I changed a few existing tests (currency, filename, suffixes) to use the new API. I think that those tests were getting the correct result for the wrong reason. Here's how it looks:

 expect(currencyRegex).toMatchString('$10', { exactString: false, substring: '10' });
 expect(filenameRegex).toMatchString('index.ts', { exactString: false });
 expect(regex).toMatchString('democracy', { exactString: false });

This is the biggest reason my PR got so large. Take a look.

I'm not married to the API, so feel free to suggest an alternative.
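To make the proposed options easier to discuss, here is a rough standalone sketch of how they might behave. The semantics below are an assumption and are not taken from the PR; `matchesWithOptions` is a hypothetical plain function, not the Jest matcher, and the regex is a stand-in for `urlScheme`.

```typescript
interface MatchTypeOptions {
  exactString: boolean;
  substring?: string;
}

// Assumed semantics (not confirmed by the PR):
//  - exactString: true              -> the match must cover the whole input
//  - exactString: false + substring -> the matched text must equal `substring`
//  - exactString: false only        -> any match anywhere in the input passes
function matchesWithOptions(
  regex: RegExp,
  expected: string,
  matchType: MatchTypeOptions = { exactString: true },
): boolean {
  const found = expected.match(regex);
  if (found === null) {
    return false;
  }
  if (matchType.exactString) {
    return found[0] === expected;
  }
  if (matchType.substring !== undefined) {
    return found[0] === matchType.substring;
  }
  return true;
}

// Stand-in for the urlScheme pattern with a trailing lookahead.
const schemePattern = /[a-z]{2,6}(?=:)/;

const exactResult = matchesWithOptions(schemePattern, 'http:');
const substringResult = matchesWithOptions(schemePattern, 'http:', {
  exactString: false,
  substring: 'http',
});
const missingColon = matchesWithOptions(schemePattern, 'http', { exactString: false });

console.log(exactResult, substringResult, missingColon);
```

With these assumed semantics, the default call keeps the strict whole-string behaviour, and the relaxed modes are opt-in per assertion.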

Answer 3 · 2024-04-17T07:02:24.000Z

As far as I understand, the issue is that you use a lookahead followed immediately by end of string, without any other pattern that would consume the ":" character.

This should be easy to fix by just swapping lookahead(":") with ":". Otherwise your pattern requires that both of these hold at the same time:

there is a : char right after ftp
there is end of string right after ftp

By using plain ":", you reconcile these two by matching the colon as well as consuming it before matching end of string.
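The conflict between the two requirements can be checked with plain regex literals. This is a small sketch that assumes the pattern is wrapped in `^` and `$`; the constant names are illustrative only.

```typescript
// Lookahead for ':' followed directly by end of string: both conditions
// are checked at the same position, so they can never hold together.
const lookaheadThenEnd = /^ftp(?=:)$/;

// Plain ':' consumes the colon, so end of string is checked after it.
const colonThenEnd = /^ftp:$/;

const lookaheadWithColon = lookaheadThenEnd.test('ftp:'); // '$' fails: ':' is still ahead
const lookaheadNoColon = lookaheadThenEnd.test('ftp');    // lookahead fails: no ':' ahead
const consumedColon = colonThenEnd.test('ftp:');          // ':' consumed, then end of string

console.log(lookaheadWithColon, lookaheadNoColon, consumedColon);
```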

Answer 4 · 2024-04-17T17:50:15.000Z

At first I thought it was that easy ... but I spent 4 days on it. I found this because I found a list of valid URLs for my test suite. It proved to be good stress testing. If matching the scheme is the only job, sure, but I need the ':' later on.

For the finder and validator I do that:

export const urlSchemeFinder = buildRegExp([urlScheme, schemeSeperator], {
  ignoreCase: false,
  global: true,
});

export const urlSchemeValidator = buildRegExp(
  [startOfString, urlScheme, schemeSeperator, endOfString],
  {
    ignoreCase: false,
    global: false,
  },
);

but for the component used to match the entire URL, it's problematic because "http:/" is a valid path.

//
// These two patterns are needed to disambiguate between "http:/path" and "http://authority/path".
// "http://" is technically a valid URL: urlScheme = http, urlAuthority = null, urlPath = /
// By convention, an empty path is considered invalid, if it follows an empty authority.
//
const noAuthority = regex([pathSeparator, negativeLookahead(pathSeparator), urlPath]);
const hasAuthority = regex([authoritySeperator, urlAuthority, optional(urlPath)]);

export const url = buildRegExp([
  urlScheme,
  schemeSeperator,
  choiceOf(noAuthority, hasAuthority),
  optional(urlQuery),
  optional(urlFragment),
]);

BTW, the same issue applies to initial lookbehinds (currencies). There are related but different issues with mid-pattern lookarounds as well. My conclusion is that toMatchString() isn't powerful enough to handle many (all?) types of lookarounds. Thus I proposed some additions to its API.
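The disambiguation between "http:/path" and "http://authority/path" described in this comment can be sketched with plain regular expressions. Everything here is a simplified assumption: the sub-patterns for scheme, authority and path are hypothetical stand-ins for the real `urlScheme`, `urlAuthority` and `urlPath` components, and query and fragment are left out.

```typescript
// Simplified, hypothetical stand-ins for the components in the comment:
//   noAuthority  -> "/" not followed by another "/", then the rest of the path
//   hasAuthority -> "//", a non-empty authority, then an optional path
const noAuthoritySource = '/(?!/)\\S*';
const hasAuthoritySource = '//[^/\\s]+(?:/\\S*)?';

const simpleUrl = new RegExp(
  '^[a-z]{2,6}:(?:' + noAuthoritySource + '|' + hasAuthoritySource + ')$',
);

// Same pattern without the negative lookahead, for comparison.
const looseUrl = new RegExp('^[a-z]{2,6}:(?:/\\S*|' + hasAuthoritySource + ')$');

const pathOnly = simpleUrl.test('http:/path');                   // noAuthority branch
const withAuthority = simpleUrl.test('http://example.com/path'); // hasAuthority branch
const emptyAuthority = simpleUrl.test('http://');                // rejected by the lookahead
const emptyAuthorityLoose = looseUrl.test('http://');            // accepted: second "/" read as the path

console.log(pathOnly, withAuthority, emptyAuthority, emptyAuthorityLoose);
```

In this sketch the negative lookahead is what keeps the no-authority branch from swallowing the second slash of "http://".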


Answer 5 · 2024-04-17T21:21:26.000Z

toMatchString is just an internal testing helper, it's not part of our public API. It's basically a wrapper around String.match method.

Could you verify the issues you are observing using plain RegExp (String) methods like RegExp.test, RegExp.exec or String.match?

Answer 6 · 2024-04-17T22:03:13.000Z

Yes, I understand what toMatchString() is. It wasn't always clear that the issue was in the testing helper. Blaming the test tools first is not a great idea. I've used regex testing tools (https://regex101.com/) and visualizers (https://www.debuggex.com/).

Here's a tester with a simple regex that demonstrates the issue.

```ts
const regex = /[a-zA-Z]{2,4}(?=:)/

const result1 = "http".match(regex)
const result2 = "http:".match(regex)
const result3 = regex.test("http")
const result4 = regex.test("http:")
const result5 = regex.exec("http")
const result6 = regex.exec("http:")

console.log("RESULT 1\n", result1)
console.log("RESULT 2\n", result2)
console.log("RESULT 3\n", result3)
console.log("RESULT 4\n", result4)
console.log("RESULT 5\n", result5)
console.log("RESULT 6\n", result6)
```

Here are the results. Notice that for 'http:' the regex matches only the 'http' substring.

```
ts-regex-builder % bun issue.ts
RESULT 1
 null
RESULT 2
 [ "http" ]
RESULT 3
 false
RESULT 4
 true
RESULT 5
 null
RESULT 6
 [ "http" ]
```


Answer 7 · 2024-04-18T20:25:56.000Z

Not sure how relevant it is, but your first example used startOfString and endOfString to wrap the pattern, while your last example does not use them (nor ^ and $).

Answer 8 · 2024-04-18T21:06:15.000Z

Yes. The first 2 are the 'finder' and 'validator' patterns. They can use the start-of-string and end-of-string anchors. The reusable 'urlScheme' cannot use them.


Answer 9 · 2024-04-18T21:15:06.000Z

I have an idea. In my most recent PR, I have a proposed solution to this issue. I'd be happy to pull it out of that PR and create a PR for just that solution. That would give you a chance to see it and play with it. The current API is my 3rd version, so I'm sure that there is a 4th that would improve it.


Answer 10 · 2024-04-19T10:24:04.000Z

Please do so, it will be easier to focus on that issue, as for now I don't feel like I understand it exactly.

Answer 11 · 2024-04-19T22:50:40.000Z

I have submitted the PR with only the changes in the ts-match-string API, and the tests that were impacted by the issue or the fix.


Answer 12 · 2024-04-20T17:42:43.000Z

OK, after dealing with this issue for a couple of weeks now, I think I finally understand it well enough to explain it simply. It has to do with non-consuming (zero-width) operations in patterns.

Example: expect(/[a-z]{2,6}(?=:)/).toMatchString("http:").

1) The pattern has a non-consuming operation `(?=:)`.
2) The test string has that character (toMatchString("http:")).
3) The non-consuming op matches the character (:).
4) The character is not consumed.
5) The ':' is left in the test string.
6) The regex state machine has finished.

Therefore:

1) The full string match fails.
2) In exec(), the substring "http" is matched, so the exec() passes.

In the general case, where the non-consumed character is in the middle of the test string, the issue is that when the non-consumed character is matched, that character is left in the test string but the state machine has moved on to the next character. In this case, there will be too many characters remaining in the test string. The state machine will finish before all the characters in the test string have been exhausted. So, only a substring of the test string can be a possible match.

*Conclusion* toMatchString(testString) cannot be used to test patterns with non-consuming ops in the pattern being matched against.
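The steps above can be observed directly by asking the regex engine where it stopped. This sketch uses the global flag only so that `lastIndex` records the end of the match; the variable names are illustrative.

```typescript
// Global flag so that exec() records where the match ended in lastIndex.
const schemeLookahead = /[a-z]{2,6}(?=:)/g;
const testString = 'http:';

const execResult = schemeLookahead.exec(testString);

// Text that was actually consumed by the pattern.
const consumed = execResult === null ? null : execResult[0];

// Position in the test string where matching stopped.
const stoppedAt = schemeLookahead.lastIndex;

// Whatever the pattern looked at but did not consume.
const leftover = testString.slice(stoppedAt);

console.log(consumed, stoppedAt, leftover);
```

Here `consumed` is "http", matching stops at index 4, and the ":" remains as `leftover`, which is why a comparison against the full test string fails.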


Answer 13 · 2024-04-20T20:46:06.000Z

Let's gather the findings so far:

the pattern /[a-z]{2,6}(?=:)/ will match "http:"
but pattern /^[a-z]{2,6}(?=:)$/ will not match "http:"
the toMatchString() test utility is a simple wrapper over String.match, and the behavior you observed with expect(/^[a-z]{2,6}(?=:)$/).toMatchString("http:") would also be observed with "http:".match(/^[a-z]{2,6}(?=:)$/)
lookaheads (?=...) do not consume input characters. As the name suggests, they "look ahead" without moving the current matching position
using $ (end of string) would only work if the : got consumed, which we can get by replacing lookahead with regular character match /^[a-z]{2,6}:$/

As far as I understand your issue, and I am not sure whether I understand it correctly, you would like to have regex with both lookahead for : character (for nesting regex expressions) and be able to use end of string anchor (for validation).

If so, then I think this is not achievable using one regex pattern. There would have to be two:

validation (consuming :, end of string assertion) /^[a-z]{2,6}:$/
nesting (not consuming :, no end of string assertion) /[a-z]{2,6}(?=:)/

Let me know if I've got that right or did I miss something.
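A short sketch of the two-pattern approach, assuming both variants are built from one shared scheme fragment. The names and the string-concatenation style are illustrative only and do not use the library's builder API.

```typescript
// Shared fragment for the scheme itself (a stand-in for urlScheme).
const schemeSource = '[a-z]{2,6}';

// Validation: consumes ':' and asserts end of string.
const schemeValidation = new RegExp('^' + schemeSource + ':$');

// Nesting: only looks ahead for ':', leaving it for the enclosing pattern.
const schemeNesting = new RegExp(schemeSource + '(?=:)');

// An enclosing pattern that reuses the nesting variant and consumes ':' itself.
const schemeAndSlashes = new RegExp('^' + schemeNesting.source + '://');

const validatesScheme = schemeValidation.test('http:');
const rejectsBareScheme = schemeValidation.test('http');
const nestedMatch = schemeNesting.exec('http://example.com');
const nestedText = nestedMatch === null ? null : nestedMatch[0];
const enclosingMatches = schemeAndSlashes.test('http://example.com');

console.log(validatesScheme, rejectsBareScheme, nestedText, enclosingMatches);
```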

Answer 14 · 2024-04-20T20:58:31.000Z

Nope. The issue is that you can't use toMatchString() as it is to test patterns with lookaheads in them. At all. That's the only issue I am talking about.
