datafaker-net/datafaker

Using metacharacters on a character set with regexify does not produce the expected text

Description

Using \d instead of 0-9, or \w instead of a-zA-Z0-9_, inside a character set passed to the regexify method produces two characters instead of one: the first is randomly either the expected character or a [, and the second is always a ].

Sample code

import net.datafaker.Faker;

class Scratch {
    public static void main(String[] args) {
        var faker = new Faker();

        faker.regexify("[0-9]"); // a random digit
        faker.regexify("\\d");   // also a random digit
        faker.regexify("[\\d]"); // 2 characters: a digit or `[` (at random), then `]`

        faker.regexify("[a-zA-Z0-9_]"); // a random word character
        faker.regexify("\\w");          // also a random word character
        faker.regexify("[\\w]");        // 2 characters: a word character or `[` (at random), then `]`
    }
}

My guess is that \d and \w are replaced directly with their respective character sets without checking whether they are already inside one. That would turn [\d] into [[0-9]] and [\w] into [[a-zA-Z0-9_]], which would explain the actual behavior.
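To illustrate the guess, here is a minimal sketch of that kind of substitution. The class and method names are hypothetical, and this is not Generex's actual code, only an assumption about what it might be doing.

```java
public class NaiveExpansion {
    // Hypothetical helper: expands shorthand classes by plain text substitution,
    // without checking whether the shorthand already sits inside [...].
    public static String expandNaively(String regex) {
        return regex
                .replace("\\d", "[0-9]")
                .replace("\\w", "[a-zA-Z0-9_]");
    }

    public static void main(String[] args) {
        System.out.println(expandNaively("\\d"));   // [0-9]
        System.out.println(expandNaively("[\\d]")); // [[0-9]]
        System.out.println(expandNaively("[\\w]")); // [[a-zA-Z0-9_]]
    }
}
```

If the engine then closes the set at the first ], [[0-9]] reads as the set "[ or a digit" followed by a literal ], which matches the output described above.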

Expected behavior

Using \d and \w should behave the same as using 0-9 and a-zA-Z0-9_ respectively, whether or not they appear inside a character set.

Versions

  • OS: Linux
  • JDK: zulu 17.44.17
  • Faker Version: 2.1.0

Workaround

As shown in the sample code, using 0-9 instead of \d and a-zA-Z0-9_ instead of \w when inside a character set works normally.
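For patterns that can't easily be edited by hand, the same workaround could be applied automatically before calling regexify. The following is only a sketch: CharSetShorthand and expandInsideSets are hypothetical names, not part of datafaker, and it only handles \d and \w in simple, non-nested character sets.

```java
public class CharSetShorthand {
    // Hypothetical pre-processing step: rewrites \d and \w to explicit ranges,
    // but only when they appear inside a [...] character set.
    public static String expandInsideSets(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inSet = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Consume the escaped character together with its backslash.
                char next = regex.charAt(++i);
                if (inSet && next == 'd') {
                    out.append("0-9");
                } else if (inSet && next == 'w') {
                    out.append("a-zA-Z0-9_");
                } else {
                    out.append(c).append(next); // any other escape stays as-is
                }
            } else {
                if (c == '[' && !inSet) {
                    inSet = true;
                } else if (c == ']' && inSet) {
                    inSet = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, faker.regexify(CharSetShorthand.expandInsideSets("[\\d]")) would receive [0-9], while a bare \d outside a character set is passed through unchanged.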

Additional context

I've also tested on the main branch: after the Generex dependency was replaced with RgxGen in this commit, it seems to work normally.

The root problem is in a dependency that the project is already switching away from, and there's a reasonable workaround in the meantime. So maybe it's better to just add a test case for this instead of fixing it in Generex (which doesn't look maintained anymore anyway) or working around it here in the repo.
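A test case along these lines could look roughly like the sketch below. It is written without a test framework so it stands alone. RegexifyCheck and assertAlwaysMatches are hypothetical names, and it assumes java.util.regex is an acceptable reference for what the pattern should match (there, [\d] matches exactly one digit).

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

public class RegexifyCheck {
    // Hypothetical helper: calls the generator repeatedly and fails on the
    // first output that does not fully match the pattern it was generated from.
    public static void assertAlwaysMatches(UnaryOperator<String> generator, String regex, int runs) {
        Pattern pattern = Pattern.compile(regex);
        for (int i = 0; i < runs; i++) {
            String generated = generator.apply(regex);
            if (!pattern.matcher(generated).matches()) {
                throw new AssertionError(
                        "regexify(\"" + regex + "\") produced \"" + generated + "\"");
            }
        }
    }
}
```

With datafaker on the classpath, this could be called as RegexifyCheck.assertAlwaysMatches(faker::regexify, "[\\d]", 100) and likewise for "[\\w]". Several runs are needed because the wrong output only shows up at random.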