To add the Quenya DSL dependency to your sbt build:
libraryDependencies += "com.github.music-of-the-ainur" %% "quenya-dsl" % "1.1.4-$SPARK_VERSION"
To run in spark-shell:
spark-shell --packages "com.github.music-of-the-ainur:quenya-dsl_2.11:1.1.4-$SPARK_VERSION"
Quenya DSL (Domain Specific Language) is a language that simplifies the task of parsing complex semi-structured data.
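import org.apache.spark.sql.DataFrame
import com.github.music.of.the.ainur.quenya.QuenyaDSL

val inputDf: DataFrame = ...
val quenyaDsl = QuenyaDSL
// Compile the DSL. Lines indented under names@name apply to each exploded element.
val dsl = quenyaDsl.compile("""
|uuid$uuid:StringType
|id$id:LongType
|code$area_code:LongType
|names@name
| name.firstName$first_name:StringType
| name.secondName$second_name:StringType
| name.lastName$last_name:StringType
|source_id$source_id:LongType
|address[3]$zipcode:StringType""".stripMargin)
// Apply the compiled DSL to the input DataFrame.
val df: DataFrame = quenyaDsl.execute(dsl, inputDf)
df.show(false)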
The $ (dollar) operator selects a field and gives it an alias and a data type.
Example:
DSL
name.nameOne$firstName:StringType
name.nickNames[0]$firstNickName:StringType
JSON
{
"name":{
"nameOne":"Mithrandir",
"LastName":"Olórin",
"nickNames":[
"Gandalf the Grey",
"Gandalf the White"
]
},
"race":"Maiar",
"age":"immortal",
"weapons":[
"Glamdring",
"Narya",
"Wizard Staff"
]
}
Output:
+----------+----------------+
|firstName |firstNickName |
+----------+----------------+
|Mithrandir|Gandalf the Grey|
+----------+----------------+
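For orientation, the two DSL lines above correspond roughly to a hand-written Spark select. This is a sketch, not Quenya's implementation: it assumes spark-sql 2.4 is on the classpath, and the object name DollarEquivalent, the helper select, and the local SparkSession are illustrative choices that do not come from the library.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

object DollarEquivalent {
  // Hypothetical helper: plain Spark counterpart of the two `$` lines.
  def select(df: DataFrame): DataFrame =
    df.select(
      // name.nameOne$firstName:StringType
      col("name.nameOne").cast(StringType).as("firstName"),
      // name.nickNames[0]$firstNickName:StringType
      col("name.nickNames").getItem(0).cast(StringType).as("firstNickName")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dollar-sketch")
      .getOrCreate()
    import spark.implicits._

    // Trimmed version of the JSON document above, as one record.
    val json =
      """{"name":{"nameOne":"Mithrandir","LastName":"Olórin","nickNames":["Gandalf the Grey","Gandalf the White"]},"race":"Maiar"}"""
    val df = spark.read.json(Seq(json).toDS())

    select(df).show(false)
    spark.stop()
  }
}
```

The DSL version is shorter because the path, the alias, and the type sit on one line instead of being spread over three method calls.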
The @ (at) operator explodes an array into one row per element. Indentation with spaces or tabs defines precedence: lines indented under an @ line operate on the exploded element.
Example:
DSL
weapons@weapon
    weapon$weapon:StringType
JSON
{
"name":{
"nameOne":"Mithrandir",
"LastName":"Olórin",
"nickNames":[
"Gandalf the Grey",
"Gandalf the White"
]
},
"race":"Maiar",
"age":"immortal",
"weapons":[
"Glamdring",
"Narya",
"Wizard Staff"
]
}
Output:
+------------+
|weapon |
+------------+
|Glamdring |
|Narya |
|Wizard Staff|
+------------+
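The same result can be approximated in plain Spark with explode followed by a select. This is a sketch for comparison only, assuming spark-sql 2.4 on the classpath; AtEquivalent and explodeWeapons are hypothetical names, and the code is not taken from Quenya.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StringType

object AtEquivalent {
  // Hypothetical helper: plain Spark counterpart of
  //   weapons@weapon
  //       weapon$weapon:StringType
  def explodeWeapons(df: DataFrame): DataFrame =
    df.select(explode(col("weapons")).as("weapon"))        // one row per array element
      .select(col("weapon").cast(StringType).as("weapon")) // select on the exploded element

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("at-sketch")
      .getOrCreate()
    import spark.implicits._

    // Trimmed version of the JSON document above, as one record.
    val json = """{"race":"Maiar","weapons":["Glamdring","Narya","Wizard Staff"]}"""
    val df = spark.read.json(Seq(json).toDS())

    explodeWeapons(df).show(false)
    spark.stop()
  }
}
```

The second select plays the role of the indented DSL line: it only sees the column produced by the explode step.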
Supported data types:
- FloatType
- BinaryType
- ByteType
- BooleanType
- StringType
- TimestampType
- DecimalType
- DoubleType
- IntegerType
- LongType
- ShortType
You can generate a DSL based on a DataFrame:
import org.apache.spark.sql.DataFrame
import com.github.music.of.the.ainur.quenya.QuenyaDSL
val df:DataFrame = ...
val quenyaDsl = QuenyaDSL
quenyaDsl.printDsl(df)
JSON:
{
"name":{
"nameOne":"Mithrandir",
"LastName":"Olórin",
"nickNames":[
"Gandalf the Grey",
"Gandalf the White"
]
},
"race":"Maiar",
"age":"immortal",
"weapon":[
"Glamdring",
"Narya",
"Wizard Staff"
]
}
Output:
age$age:StringType
name.LastName$name_LastName:StringType
name.nameOne$name_nameOne:StringType
name.nickNames@name_nickNames
    name_nickNames$name_nickNames:StringType
race$race:StringType
weapon@weapon
    weapon$weapon:StringType
To alias columns with their fully qualified names, call printDsl(df, true). Turn this on when column names would otherwise conflict.
Grammar in Backus-Naur form:
<dsl> ::= { "[\r\n]*".r <precedence> <col> <operator> }
<precedence> ::= "[\s\t]*".r
<col> ::= "[a-zA-Z0-9_.]+".r [ <element> ]
<element> ::= "[" "\d".r "]"
<operator> ::= <@> | <$>
<@> ::= @ <alias>
<$> ::= $ <alias> : <datatype>
<alias> ::= "[0-9a-zA-Z_]+".r
<datatype> ::= BinaryType | BooleanType | StringType | TimestampType | DecimalType
| DoubleType | FloatType | ByteType | IntegerType | LongType | ShortType
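To make the grammar concrete, here is a minimal sketch of a parser for a single DSL line, written in plain Scala with one regular expression. It is an independent reading of the grammar, not Quenya's parser: the names DslLineParser, Statement, Explode, and Select are hypothetical, the depth field is simply the count of leading whitespace characters, and accepting multi-digit indexes is an assumption.

```scala
object DslLineParser {
  sealed trait Operator
  final case class Explode(alias: String) extends Operator                  // <@>
  final case class Select(alias: String, dataType: String) extends Operator // <$>

  // depth: number of leading spaces/tabs; index: the optional [n] element.
  final case class Statement(depth: Int, col: String, index: Option[Int], op: Operator)

  private val DataTypes = Set(
    "BinaryType", "BooleanType", "StringType", "TimestampType", "DecimalType",
    "DoubleType", "FloatType", "ByteType", "IntegerType", "LongType", "ShortType")

  // <precedence> <col> [ <element> ] ( @ <alias> | $ <alias> : <datatype> )
  private val Line =
    """^([ \t]*)([a-zA-Z0-9_.]+)(?:\[(\d+)\])?(?:@([0-9a-zA-Z_]+)|\$([0-9a-zA-Z_]+):([A-Za-z]+))$""".r

  def parse(line: String): Option[Statement] = line match {
    case Line(indent, column, idx, atAlias, dollarAlias, dataType) =>
      val index = Option(idx).map(_.toInt) // unmatched groups are null
      if (atAlias != null)
        Some(Statement(indent.length, column, index, Explode(atAlias)))
      else if (DataTypes.contains(dataType))
        Some(Statement(indent.length, column, index, Select(dollarAlias, dataType)))
      else
        None // unknown data type
    case _ => None // line does not fit the grammar
  }
}
```

For example, DslLineParser.parse("address[3]$zipcode:StringType") returns Some(Statement(0, "address", Some(3), Select("zipcode", "StringType"))), while a line with an unsupported data type returns None.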
Software | Version |
---|---|
Java | 8 |
Scala | 2.11/2.12 |
Apache Spark | 2.4 |
Daniel Mantovani daniel.mantovani@modak.com