This is a first-pass prototype that generates a Hive schema from JSON files (one JSON object per line).
There are two other tools that generate Hive schemas from JSON:
- https://github.com/hortonworks/hive-json - Owen O'Malley of Hortonworks - c. 2013
- https://github.com/quux00/hive-json-schema - Michael Peterson - c. 2013. I added many of my features in a fork: https://github.com/amesar/hive-json.
When processing deeply nested and complex JSON such as Twitter tweets and Github Archive formats, both of these tools had hiccups. Since I had previously done work with JSON schema discovery, I wanted to give Hive schema generation a shot.
Though I addressed some the hiccups of the other tools, there are my hiccups too which need to be looked at. Inferring schema from deeply nested and dirty real-world JSON feeds at scale is not an easy task.
- Expects JSON files to have one JSON object per line
- Escapes Hive reserved keywords with back ticks
- Escapes JSON attributes that start with underscore since Hive doesn't allow such column names
- Escapes JSON attribute names that only contain numeric characters (yep, tweets have them!)
- Can specify table name
- Can specify SERDE and LOCATION
- Rudimentary handling of polympormism. When an attribute's value appears as both Int and Bigint (Long) in the data feed, its type becomes a Bigint. If we find Int and String, we generate a Union.
- Empty arrays and null values appear as comments in the DDL
- Handles .gz files
You need to make a choice as to which JSON SerDe to use:
- Hcatlog SerDe - org.apache.hcatalog.data.JsonSerDe. Comes with the Hive distribution.
- https://github.com/rcongiu/Hive-JSON-Serde
- https://github.com/cloudera/cdh-twitter-example. There is a SerDe baked into this example that is quite robust.
All had problems processing complex JSON files. It seems the Cloudera SerDe had the least issues.
Sample JSON feeds and schemas:
Tweet DDL Snippet
CREATE EXTERNAL TABLE tweets (
createdAt string,
currentUserRetweetId tinyint,
id bigint,
userMentionEntities array <
struct <
name: string,
screenName: string,
id: bigint,
start: int,
`end`: int
>
>,
mediaEntities array <
struct <
id: bigint,
url: string,
mediaURL: string,
mediaURLHttps: string,
expandedURL: string,
displayURL: string,
sizes: struct <
`0`: struct <
width: int,
height: int,
resize: int
>,
-- coordinates null,
-- hashtags array <>,
...
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/tables/tweets'
java -cp target/hive-json-schema-gen-1.0-SNAPSHOT.jar \
org.amm.hiveschema.HiveJsonSchemaDriver \
--table tweets \
--location /tables/tweets \
--serde com.cloudera.hive.serde.JSONSerDe \
--isExternalTable \
--escapeReservedKeywords \
json/tweets.json
Convenience script:
build-schema.sh tweets /tables/tweets data/tweets.json
build-schema.sh tweets_v1 /tables/tweets_v1 data/tweets_v1.json
mvn package
- --table: table name in schema. Required.
- --isExternalTable: Create external table.
- --location: HDFS path.
- --serde: Desired SerDe class path.
- --reservedKeywordsFile: Override default Hive reserved keyword list is in src/main/resources/reservedKeywords.txt.
- --escapeReservedKeywords: Escape Hive reserved keywords with a back tick. Default is not to escape.
- --output: Override output file name (table name plus ".ddl")
JSON
"user" : {
Schema
`user`: struct <
JSON
"_name" : "hello"
Schema
`_name`: string
JSON
"sizes" : {
"0" : {
"width" : 150,
"height" : 150,
"resize" : 101
},
Schema
sizes: struct <
`0`: struct <
width: int,
height: int,
resize: int
>,
JSON
{ "retweet_count" : "100+", }
{ "retweet_count" : 1, }
Schema
retweet_count uniontype<string,int>,
In Twitter API 1 the retweet_count attribute was deliberately polymorphic - its value could be either an int or string! In API 1.1 it is fortunately only an int.
From https://dev.twitter.com/overview/api/tweets:
Number of times this Tweet has been retweeted. This field is no longer capped at 99 and will not turn into a String for "100+"
- Read in reserved keyword file as a resource
- Option to lexically sort DDL attributes
- Pluggable polymorphism handlers
- Unit tests