SFaker is a fake data generator implemented on top of the Spark DataSourceV2 API. It generates rows according to a user-specified schema.
| Feature | Supported |
| --- | --- |
| Batch | ✅ |
| Stream | TBD |
| DataFrameReader API | ✅ |
| Spark SQL Create Statement | ✅ |
| UnsafeRow | ✅ |
| Codegen | ✅ |
| Limit Push Down | ✅ |
| Column Pruning | ✅ |
Supported Spark SQL types (see the Spark SQL data types documentation for details):
| Spark Type | Supported |
| --- | --- |
| Byte | ✅ |
| Short | ✅ |
| Integer | ✅ |
| Long | ✅ |
| Float | ✅ |
| Double | ✅ |
| Decimal | TBD |
| String | ✅ |
| Varchar | TBD |
| Char | TBD |
| Binary | TBD |
| Boolean | ✅ |
| Date | TBD |
| Timestamp | TBD |
| TimestampNTZ | TBD |
| YearMonthInterval | TBD |
| DayTimeInterval | TBD |
| Array | ✅ |
| Map | ✅ |
| Struct | ✅ |
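Since `Array`, `Map`, and `Struct` are all supported, nested schemas can be composed freely. A sketch (field names here are illustrative, not part of the source's API):

```scala
import org.apache.spark.sql.types._

// A nested schema mixing the supported complex types.
val nested = new StructType()
  .add("id", IntegerType)
  .add("tags", ArrayType(StringType))
  .add("scores", MapType(StringType, DoubleType))
  .add("profile", new StructType()
    .add("age", IntegerType)
    .add("active", BooleanType))
```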
| Conf | Type | Default | Description |
| --- | --- | --- | --- |
| spark.sql.fake.source.unsafe.row.enable | Boolean | false | If true, generated rows are stored as `UnsafeRow`. |
| spark.sql.fake.source.unsafe.codegen.enable | Boolean | false | If true, the row-generation code that produces rows from the schema is compiled via codegen (JIT). |
| spark.sql.fake.source.partitions | Integer | 1 | Number of source partitions. |
| spark.sql.fake.source.rowsTotalSize | Integer | 8 | Total number of rows to generate according to the schema. |
DataFrameReader API

```scala
import org.apache.spark.sql.types.{DataTypes, StructType}

val schema = new StructType()
  .add("id", DataTypes.IntegerType)
  .add("sex", DataTypes.BooleanType)
  .add("roles", DataTypes.createArrayType(DataTypes.StringType))

val df = spark.read
  .format("FakeSource")
  .schema(schema)
  .option(FakeSourceProps.CONF_ROWS_TOTAL_SIZE, 100)
  .option(FakeSourceProps.CONF_PARTITIONS, 1)
  .option(FakeSourceProps.CONF_UNSAFE_ROW_ENABLE, true)
  .option(FakeSourceProps.CONF_UNSAFE_CODEGEN_ENABLE, true)
  .load()
```
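Once loaded, the resulting DataFrame behaves like any other source; a short usage sketch (the exact generated values are random):

```scala
// Inspect the generated schema and a sample of rows.
df.printSchema()
df.show(10, truncate = false)

// With column pruning supported, selecting a subset of
// columns lets the source generate only those columns.
df.select("id").limit(5).show()
```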
Spark SQL Create Statement
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Case0")
  .config(
    "spark.sql.catalog.spark_catalog",
    classOf[FakeSourceCatalog].getName
  )
  .getOrCreate()

val df = spark.sql("""
  |create table fake (
  |  id int,
  |  sex boolean
  |)
  |using FakeSource
  |tblproperties (
  |  spark.sql.fake.source.rowsTotalSize = 10000000,
  |  spark.sql.fake.source.partitions = 1,
  |  spark.sql.fake.source.unsafe.row.enable = true,
  |  spark.sql.fake.source.unsafe.codegen.enable = true
  |)
  |""".stripMargin)

spark.sql("select id from fake limit 10").explain(true)
```
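With limit push down supported, the `explain(true)` output should show the limit handled at the source rather than as a separate collect-then-truncate step (the exact plan text depends on the Spark version). A hedged sanity check:

```scala
// The source should generate at most 10 rows for this query,
// regardless of the configured rowsTotalSize.
val rows = spark.sql("select id from fake limit 10").collect()
assert(rows.length == 10)
```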
Licensed under the Apache License 2.0.