Using metacharacters on a character set with regexify does not produce the expected text
Description
Using `\d` instead of `0-9`, or `\w` instead of `a-zA-Z0-9_`, inside a character set in the `regexify` method generates two characters: the first is randomly either the expected character or a `[`, and the second is always a `]`.
Sample code
import net.datafaker.Faker;

class Scratch {
    public static void main(String[] args) {
        var faker = new Faker();

        faker.regexify("[0-9]"); // a random digit
        faker.regexify("\\d");   // also a random digit
        faker.regexify("[\\d]"); // 2 characters: the first is randomly a digit or `[`, the second is `]`

        faker.regexify("[a-zA-Z0-9_]"); // a random word character
        faker.regexify("\\w");          // also a random word character
        faker.regexify("[\\w]");        // 2 characters: the first is randomly a word character or `[`, the second is `]`
    }
}
My guess is that `\d` and `\w` are replaced with their respective character sets directly, without checking whether they are already inside a character set. That would generate a regex like `[[0-9]]` for `[\d]` and `[[a-zA-Z0-9_]]` for `[\w]`, which would explain the actual behavior.
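The guess above can be illustrated with a minimal sketch. The `NaiveShorthandExpansion` class and its `expand` method are hypothetical and are not taken from Generex or Datafaker; they only show what a purely textual substitution would produce.

```java
class NaiveShorthandExpansion {
    // Hypothetical expansion step: replaces shorthand classes textually,
    // without checking whether they already sit inside a [...] set.
    static String expand(String regex) {
        return regex
                .replace("\\d", "[0-9]")
                .replace("\\w", "[a-zA-Z0-9_]");
    }

    public static void main(String[] args) {
        System.out.println(expand("\\d"));   // [0-9]
        System.out.println(expand("[\\d]")); // [[0-9]]
        System.out.println(expand("[\\w]")); // [[a-zA-Z0-9_]]
    }
}
```

A regex parser that does not support nested character sets would read `[[0-9]]` as the set `[[0-9]` (a `[` or a digit) followed by a literal `]`, which is consistent with the two-character output described above.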
Expected behavior
Using `\d` and `\w` should have the same behavior as using `0-9` and `a-zA-Z0-9_` respectively, regardless of being in a character set.
Versions
- OS: Linux
- JDK: zulu 17.44.17
- Faker Version: 2.1.0
Workaround
As shown in the sample code, using `0-9` instead of `\d` and `a-zA-Z0-9_` instead of `\w` when inside a character set works normally.
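If rewriting patterns by hand is inconvenient, the same substitution could be applied automatically before calling `regexify`. The following is only a sketch: `ShorthandNormalizer` and `normalize` are hypothetical names, not part of Datafaker, and the scan is deliberately simplified (it handles only `\d` and `\w`, and does not treat a `]` placed first in a set as a literal).

```java
class ShorthandNormalizer {
    // Rewrites \d and \w into explicit ranges, adding brackets only
    // when the shorthand is NOT already inside a [...] set.
    static String normalize(String regex) {
        StringBuilder out = new StringBuilder();
        boolean inSet = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                String range = next == 'd' ? "0-9"
                             : next == 'w' ? "a-zA-Z0-9_"
                             : null;
                if (range == null) {
                    out.append(c).append(next); // keep other escapes untouched
                } else {
                    out.append(inSet ? range : "[" + range + "]");
                }
                continue;
            }
            if (c == '[') {
                inSet = true;
            } else if (c == ']') {
                inSet = false;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("[\\d]"));  // [0-9]
        System.out.println(normalize("\\d"));    // [0-9]
        System.out.println(normalize("[\\w-]")); // [a-zA-Z0-9_-]
    }
}
```

With this helper, a call such as `faker.regexify(ShorthandNormalizer.normalize("[\\d]"))` would pass `[0-9]` to the generator, which is the form reported to work.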
Additional context
I've also tested on the `main` branch, and after replacing the Generex dependency with RgxGen in this commit, it seems to be working normally.
The root problem is in a dependency that the project is already switching away from, and there's a reasonable workaround in the meantime. So maybe it's better to just add a test case for this instead of fixing it in Generex (which looks like it's not maintained anymore anyway) or working around it here in the repo.
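A test case along those lines might look like the sketch below. It is not the project's actual test: `RegexifyCharacterSetCheck` and `allOutputsMatch` are hypothetical names, the generator is passed in as a function so the block stays self-contained, and it assumes `java.util.regex` semantics are the expected ones for these six patterns.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

class RegexifyCharacterSetCheck {
    // The six patterns from the sample code above.
    static final String[] PATTERNS = {
        "[0-9]", "\\d", "[\\d]",
        "[a-zA-Z0-9_]", "\\w", "[\\w]"
    };

    // Returns true only if every generated sample fully matches the
    // pattern it was generated from.
    static boolean allOutputsMatch(UnaryOperator<String> generator, int samplesPerPattern) {
        for (String regex : PATTERNS) {
            Pattern expected = Pattern.compile(regex);
            for (int i = 0; i < samplesPerPattern; i++) {
                String generated = generator.apply(regex);
                if (!expected.matcher(generated).matches()) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

With Datafaker on the classpath, the check could be called as `RegexifyCharacterSetCheck.allOutputsMatch(faker::regexify, 100)`. Since the faulty output is random, several samples per pattern make it more likely that a regression is caught.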