csvtojson will generate jsonschema from a python DictReader instance. Furthermore we implement a generator to generate JSON records.
The schema of the famous Iris dataset.
>>> from csv import DictReader
>>> from pprint import pprint as pp
>>> from csvtojson import jsonschema
>>> d = DictReader(open("iris.csv"))
>>> pp(jsonschema(d))
{'properties': {'petal_length': {'maximum': 6.9,
'minimum': 1.0,
'type': 'number'},
'petal_width': {'maximum': 2.5,
'minimum': 0.1,
'type': 'number'},
'sepal_length': {'maximum': 7.9,
'minimum': 4.3,
'type': 'number'},
'sepal_width': {'maximum': 4.4,
'minimum': 2.0,
'type': 'number'},
'species': {'enum': ['setosa', 'versicolor', 'virginica'],
'type': 'string'}},
'required': ['sepal_length',
'sepal_width',
'petal_length',
'petal_width',
'species'],
'type': 'object'}
We note for this example:
- if the number of strings is large then we don't treat it as an enum
- if there are missing values in the columns we don't add it to our set of required keys
>>> from csv import DictReader
>>> from csvtojson import jsonschema
>>> from pprint import pprint as pp
>>> d = DictReader(open("titanic.csv"))
>>> pp(jsonschema(d))
{'properties': {'age': {'maximum': 80.0, 'minimum': 0.1667, 'type': 'number'},
'boat': {'maxLength': 7, 'minLength': 1, 'type': 'string'},
'body': {'maximum': 328.0, 'minimum': 4.0, 'type': 'number'},
'cabin': {'maxLength': 15, 'minLength': 1, 'type': 'string'},
'embarked': {'enum': ['C', 'S', 'Q'], 'type': 'string'},
'fare': {'maximum': 512.3292, 'minimum': 0.0, 'type': 'number'},
'home.dest': {'maxLength': 50,
'minLength': 5,
'type': 'string'},
'name': {'maxLength': 82, 'minLength': 12, 'type': 'string'},
'parch': {'maximum': 9.0, 'minimum': 0.0, 'type': 'number'},
'passenger_id': {'maximum': 1307.0,
'minimum': 1.0,
'type': 'number'},
'pclass': {'maximum': 3.0, 'minimum': 1.0, 'type': 'number'},
'sex': {'enum': ['male', 'female'], 'type': 'string'},
'sibsp': {'maximum': 8.0, 'minimum': 0.0, 'type': 'number'},
'survived': {'enum': [0.0, 1.0], 'type': 'number'},
'ticket': {'maxLength': 18, 'minLength': 3, 'type': 'string'}},
'required': ['passenger_id',
'pclass',
'name',
'sex',
'sibsp',
'parch',
'ticket',
'survived'],
'type': 'object'}
To generate json we first generate a schema and then generate JSON records based on that.
>>> from csv import DictReader
>>> from csvtojson import jsonschema, gen_json_records
>>> from pprint import pprint as pp
>>> d1 = DictReader(open("iris.csv"))
>>> d2 = DictReader(open("iris.csv"))
>>> schema = jsonschema(d1)
>>> gen = gen_json_records(d2, schema)
>>> pp(next(gen))
{'petal_length': 1.4,
'petal_width': 0.2,
'sepal_length': 5.1,
'sepal_width': 3.5,
'species': 'setosa'}
>>> pp(next(gen))
{'petal_length': 1.4,
'petal_width': 0.2,
'sepal_length': 4.9,
'sepal_width': 3.0,
'species': 'setosa'}