flink-faker is an Apache Flink table source that generates fake data based on the Java Faker expression provided for each column.
Check out this demo web application for some example Java Faker expressions.
This project is inspired by voluble.
mvn clean package
- Download Flink from the Apache Flink website.
- Download the flink-faker JAR from the Releases page (or build it yourself).
- Put the downloaded jars under FLINK_HOME/lib/.
- (Re)Start a Flink cluster.
- (Re)Start the Flink CLI.
- Set up Ververica Platform.
- Get the link to the flink-faker JAR from the Releases.
- In Ververica Platform, go to SQL > Connectors > Create Connector, provide the external URL from step 2, and finish the setup.
CREATE TEMPORARY TABLE heros (
`name` STRING,
`power` STRING,
`age` INT
) WITH (
'connector' = 'faker',
'fields.name.expression' = '#{superhero.name}',
'fields.power.expression' = '#{superhero.power}',
'fields.power.null-rate' = '0.05',
'fields.age.expression' = '#{number.numberBetween ''0'',''1000''}'
);
SELECT * FROM heros;
CREATE TEMPORARY TABLE location_updates (
`character_id` INT,
`location` STRING,
`proctime` AS PROCTIME()
)
WITH (
'connector' = 'faker',
'fields.character_id.expression' = '#{number.numberBetween ''0'',''100''}',
'fields.location.expression' = '#{harry_potter.location}'
);
CREATE TEMPORARY TABLE characters (
`character_id` INT,
`name` STRING
)
WITH (
'connector' = 'faker',
'fields.character_id.expression' = '#{number.numberBetween ''0'',''100''}',
'fields.name.expression' = '#{harry_potter.characters}'
);
SELECT
c.character_id,
l.location,
c.name
FROM location_updates AS l
JOIN characters FOR SYSTEM_TIME AS OF proctime AS c
ON l.character_id = c.character_id;
Currently, the faker source supports the following data types:
CHAR
VARCHAR
STRING
TINYINT
SMALLINT
INTEGER
BIGINT
FLOAT
DOUBLE
DECIMAL
BOOLEAN
TIMESTAMP
ARRAY
MAP
MULTISET
ROW
Connector Option | Default | Description |
---|---|---|
number-of-rows | None | The number of rows to produce. If this option is set, the source is bounded; otherwise it is unbounded and runs indefinitely. |
rows-per-second | 10000 | The maximum rate at which the source produces records. |
fields.<field>.expression | None | The Java Faker expression to generate the values for this field. |
fields.<field>.null-rate | 0.0 | Fraction of rows for which this field is null. |
fields.<field>.length | 1 | Size of array, map or multiset. |
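As a rough sketch of how the table-level options combine with the per-field ones, the following bounded table produces 1000 rows at no more than 100 rows per second. The table name heros_bounded and the option values are assumptions chosen for illustration; only the option keys come from the table above.

```sql
-- Hypothetical table; 'number-of-rows' makes the source bounded,
-- 'rows-per-second' caps the rate at which rows are emitted.
CREATE TEMPORARY TABLE heros_bounded (
  `name` STRING,
  `power` STRING
) WITH (
  'connector' = 'faker',
  'number-of-rows' = '1000',
  'rows-per-second' = '100',
  'fields.name.expression' = '#{superhero.name}',
  'fields.power.expression' = '#{superhero.power}',
  'fields.power.null-rate' = '0.05'
);

-- COUNT(`power`) skips NULLs, so the two counts differ by the number of NULL powers.
SELECT
  COUNT(*) AS total_rows,
  COUNT(`power`) AS non_null_powers
FROM heros_bounded;
```

Once the source has finished, total_rows should be 1000. Since null-rate is a fraction applied at random, non_null_powers should land near 950 rather than on an exact value.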
For columns of type TIMESTAMP, the corresponding Java Faker expression needs to return a timestamp formatted as EEE MMM dd HH:mm:ss zzz yyyy.
Typically, you would use one of the following expressions:
CREATE TEMPORARY TABLE timestamp_example (
`timestamp1` TIMESTAMP(3),
`timestamp2` TIMESTAMP(3)
)
WITH (
'connector' = 'faker',
'fields.timestamp1.expression' = '#{date.past ''15'',''SECONDS''}',
'fields.timestamp2.expression' = '#{date.past ''15'',''5'',''SECONDS''}'
);
SELECT * FROM timestamp_example;
For timestamp1, Java Faker will generate a random timestamp that lies at most 15 seconds in the past.
For timestamp2, Java Faker will generate a random timestamp that lies at most 15 seconds, but at least 5 seconds, in the past.
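Because such timestamps lag the current time by a bounded amount, they can also serve as an event-time attribute. The sketch below is not part of the original examples: the table name sensor_readings and its columns are hypothetical, and it assumes the planner accepts a WATERMARK clause on a faker-backed table. The 15-second watermark delay mirrors the maximum lag of the date.past expression.

```sql
-- Hypothetical table; the watermark delay matches the 15-second bound of date.past.
CREATE TEMPORARY TABLE sensor_readings (
  `sensor_id` INT,
  `event_time` TIMESTAMP(3),
  WATERMARK FOR `event_time` AS `event_time` - INTERVAL '15' SECOND
) WITH (
  'connector' = 'faker',
  'fields.sensor_id.expression' = '#{number.numberBetween ''0'',''10''}',
  'fields.event_time.expression' = '#{date.past ''15'',''SECONDS''}'
);

-- Count readings per 10-second event-time tumbling window.
SELECT
  TUMBLE_START(`event_time`, INTERVAL '10' SECOND) AS window_start,
  COUNT(*) AS readings
FROM sensor_readings
GROUP BY TUMBLE(`event_time`, INTERVAL '10' SECOND);
```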
The usage of ARRAY, MULTISET, MAP and ROW types is shown in the following example.
CREATE TEMPORARY TABLE hp (
`character-with-age` MAP<STRING,INT>,
`spells` MULTISET<STRING>,
`locations` ARRAY<STRING>,
`house-points` ROW<`house` STRING, `points` INT>
) WITH (
'connector' = 'faker',
'fields.character-with-age.key.expression' = '#{harry_potter.character}',
'fields.character-with-age.value.expression' = '#{number.numberBetween ''10'',''100''}',
'fields.character-with-age.length' = '2',
'fields.spells.expression' = '#{harry_potter.spell}',
'fields.spells.length' = '5',
'fields.locations.expression' = '#{harry_potter.location}',
'fields.locations.length' = '3',
'fields.house-points.house.expression' = '#{harry_potter.house}',
'fields.house-points.points.expression' = '#{number.numberBetween ''10'',''100''}'
);
SELECT * FROM hp;
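To inspect individual parts of the generated values, a query along the following lines may help. It is a sketch that relies on standard Flink SQL features (dot access for ROW fields, 1-based array indexing, CARDINALITY); the alias h and the output column names are chosen here for illustration.

```sql
SELECT
  h.`house-points`.`house` AS house,               -- field of the ROW column
  h.`house-points`.`points` AS points,
  h.`locations`[1] AS first_location,              -- arrays are 1-indexed in Flink SQL
  CARDINALITY(h.`locations`) AS location_count,    -- expected to match 'fields.locations.length'
  CARDINALITY(h.`character-with-age`) AS character_count
FROM hp AS h;
```

Note that character_count can be smaller than the configured length whenever the same key is generated more than once for a map.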
The Java Faker expression to pick a random value from a list of options is not straightforward to get right.
Actually, I did not manage to get Options.option to work at all.
As a workaround, I recommend using regexify for this use case.
CREATE TEMPORARY TABLE orders (
`order_id` INT,
`order_status` STRING
)
WITH (
'connector' = 'faker',
'fields.order_id.expression' = '#{number.numberBetween ''0'',''100''}',
'fields.order_status.expression' = '#{regexify ''(RECEIVED|SHIPPED|CANCELLED){1}''}'
);
SELECT * FROM orders;
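The same regexify approach should extend beyond fixed lists to values with a given shape. The sketch below assumes that regexify accepts the usual character-class and quantifier syntax; the table name shipments, its columns, and the carrier names are hypothetical.

```sql
-- Hypothetical table combining a patterned string and a fixed list of options.
CREATE TEMPORARY TABLE shipments (
  `tracking_code` STRING,
  `carrier` STRING
) WITH (
  'connector' = 'faker',
  -- two uppercase letters followed by six digits
  'fields.tracking_code.expression' = '#{regexify ''[A-Z]{2}[0-9]{6}''}',
  -- one value out of a fixed list, as in the orders example
  'fields.carrier.expression' = '#{regexify ''(DHL|UPS|FEDEX)''}'
);

SELECT * FROM shipments;
```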
Copyright © 2020-2021 Konstantin Knauf
Distributed under Apache License, Version 2.0.