Datagen is a command-line utility and Python library for generating data from arbitrary JSON schemas. Use it to populate a test database, benchmark parsers, mock up a frontend, and so on.
Use it as follows:
$ datagen schema.json > output.json
To install, clone the repo and run
$ python setup.py install
from the top-level directory.
Feel free to open an issue if you have any ideas to share!
A schema is just a JSON string that uses special tags to specify what should be filled in. You can think of it like a GraphQL query.
For example, let's say you want to fill in a table of users. For simplicity's sake, say each user has a first name, a last name, an email, and an age. We could write a schema like this:
{
"first" : "firstName",
"last" : "lastName",
"email" : "eMail",
"age" : "personAge"
}
Passing this string to datagen will result in something like this:
{
"first": "Stanley",
"last": "Brocious",
"email": "Justin640@verizon.net",
"age": 32
}
Alternatively, you could pass
["firstName", "lastName", "eMail", "personAge"]
and get back
["Stanley", "Brocious", "Justin640@verizon.net", 32]
In the last example, "firstName", "lastName", "eMail", and "personAge" are all examples of what we call generators. In their simplest form, generators are passed as just a name. However, some generators can take arguments.
For example, there is a numberInt generator that, when passed without arguments, produces a random signed 32-bit integer. However, let's say we want a three-digit positive integer. All we need to get one of these is numberInt|100|999.
As you can see, we use the | character to separate arguments in generators.
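To make the argument syntax concrete, here is a minimal sketch of how a tag string could be split into a generator name and its arguments. The parse_tag helper is hypothetical and is not part of Datagen's API; it only illustrates the | convention described above.

```python
def parse_tag(tag):
    """Split a tag such as 'numberInt|100|999' into (name, args).

    Hypothetical helper for illustration; Datagen's own parsing may differ.
    """
    name, *args = tag.split("|")
    return name, args


print(parse_tag("firstName"))          # ('firstName', [])
print(parse_tag("numberInt|100|999"))  # ('numberInt', ['100', '999'])
```

Note that the arguments come back as strings, not numbers.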
Generators are very easy to write, so if you need something custom, don't be afraid to write it!
Of course, producing one record isn't much fun. Datagen provides two ways to produce multiple records.
The first and simplest way is to pass the -n flag followed by a positive integer. For example, saving the previous schema to schema.json and running
$ datagen schema.json -n 5
produces
[
["Barbara", "Bogue", "Erik728@comcast.net", 51],
["Angie", "Bogue", "Faith352@hotmail.com", 26],
["Leon", "Kriz", "Jenny971@gmail.com", 49],
["Simon", "Gressett", "Lula258@hotmail.com", 33],
["Annette", "Kellough", "Cody423@hotmail.com", 76]
]
This is handy, but not very flexible.
The second method is much more powerful. Let's say we want our schema to represent a grade school class. We could write this:
{
"teacher_name": "fullName",
"room_number": "numberInt|100|200",
"students": {
"_n": 10,
"obj": ["firstName", "lastName"]
}
}
What we did here was wrap our students object in another object that specifies an _n and an obj. Whenever Datagen sees an object of this form, it expands it into a list of _n copies of obj, each generated independently. Here's what we get back:
{
"teacher_name": "Ms. Angelica Elsa Then",
"room_number": 162,
"students": [
["Dorothy", "Kardos"],
["Carlton", "Brodt"],
["Harriet", "Kriz"],
["Elsa", "Jerkins"],
["Della", "Bombardier"],
["Clayton", "Bissette"],
["Johnnie", "Witherite"],
["Morris", "Strayhorn"],
["Susie", "Pullin"],
["Laurence", "Geise"]
]
}
Internally, Datagen wraps the top-level object like this when the -n flag is passed.
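The expansion rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: expand, wrap, and placeholder are hypothetical names, the check for "exactly the keys _n and obj" is a guess at what "an object of this form" means, and Datagen's real implementation may differ.

```python
def expand(schema, fill):
    """Recursively expand a schema, calling fill(tag) for each string tag.

    Illustrative only: expand and fill are hypothetical, not Datagen's API.
    """
    if isinstance(schema, dict):
        # Assumption: an object with exactly the keys "_n" and "obj"
        # is replaced by a list of _n independently expanded copies of obj.
        if set(schema) == {"_n", "obj"}:
            return [expand(schema["obj"], fill) for _ in range(schema["_n"])]
        return {key: expand(value, fill) for key, value in schema.items()}
    if isinstance(schema, list):
        return [expand(item, fill) for item in schema]
    if isinstance(schema, str):
        return fill(schema)
    return schema  # numbers, booleans, and null pass through unchanged


def wrap(schema, n):
    """Wrap a top-level schema the way the -n flag is described as doing."""
    return {"_n": n, "obj": schema}


def placeholder(tag):
    """Stand-in for real generators: echo the tag back in angle brackets."""
    return "<" + tag + ">"


classroom = {
    "teacher_name": "fullName",
    "students": {"_n": 2, "obj": ["firstName", "lastName"]},
}
result = expand(classroom, placeholder)
print(result["students"])
# [['<firstName>', '<lastName>'], ['<firstName>', '<lastName>']]
print(len(expand(wrap(["firstName"], 3), placeholder)))  # 3
```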
The -n flag, followed by an integer n, generates n records from the schema:
$ datagen schema.json -n 10
Passing this flag pretty prints the output.
Passing the -s flag tells Datagen to interpret the first argument as the schema itself. For example, one could do
$ datagen -s "[\"firstName\", \"lastName\"]"
Datagen is more flexible when used as a library. Here's a simple example:
from datagen import Datagen
dg = Datagen()
schema = {
"name" : {"first": "firstName", "last": "lastName"},
}
output = dg(schema, native=True)
By passing the native keyword argument, output is returned as a Python dictionary (or a list if the schema was given as a list, and so on). This allows you, for instance, to pass the resulting data directly to a database client.
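As a rough illustration of the database use case, the sketch below inserts a native-style dictionary into an in-memory SQLite table using Python's standard sqlite3 module. The record is written by hand as a stand-in for Datagen output (its values are copied from the sample output earlier in this README), and the users table is an assumption made for the example.

```python
import sqlite3

# Stand-in for what dg(schema, native=True) might return for the users schema.
record = {
    "first": "Stanley",
    "last": "Brocious",
    "email": "Justin640@verizon.net",
    "age": 32,
}

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE users (first TEXT, last TEXT, email TEXT, age INTEGER)"
)
# Named placeholders let the dictionary be passed to the client unchanged.
connection.execute(
    "INSERT INTO users VALUES (:first, :last, :email, :age)", record
)
row = connection.execute("SELECT first, age FROM users").fetchone()
print(row)  # ('Stanley', 32)
connection.close()
```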
Writing generators is very simple. First off, we should note that when we say "generator", we are not speaking of Python generators, which are a built-in feature of Python. Our generators are just functions wrapped in classes. Below is an example with one argument:
# helloWorld.py
class helloWorld:
    def __call__(self, you):
        # Note the trailing space so the greeting and the name don't run together.
        return 'hello world from ' + you
Once it is registered, you can call the generator from your schema just like you'd expect:
{"test": "helloWorld|me"}
Arguments are passed to the generator as strings; it is the generator's responsibility to do type-checking.
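Because arguments arrive as strings, a generator that needs numbers has to convert and validate them itself. The sketch below shows one way to do that. The randomInRange generator is a hypothetical example written for this README, not one that ships with Datagen.

```python
# randomInRange.py
import random


class randomInRange:
    """Hypothetical generator: a random integer between low and high, inclusive."""

    def __call__(self, low, high):
        # Arguments come in as strings, so convert them before use.
        try:
            low, high = int(low), int(high)
        except ValueError:
            raise ValueError("randomInRange expects two integer arguments")
        if low > high:
            raise ValueError("low must not be greater than high")
        return random.randint(low, high)


# Called directly here; from a schema this would be "randomInRange|100|999".
value = randomInRange()("100", "999")
print(100 <= value <= 999)  # True
```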
To use this generator, we put it in a module named helloWorld.py. Then we put it somewhere, say in /Users/James/project/generators/.
To use the generator, we pass the path where it is located to Datagen like so:
from datagen import Datagen
paths = ['/Users/James/project/generators']
dg = Datagen(paths)
You can pass arbitrarily many paths to Datagen. Datagen will automatically register any generators it finds within them. Datagen does not search sub-directories.
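For intuition, registration along these lines could be implemented with the standard importlib machinery. The following is a sketch only: discover_generators is a hypothetical function, and Datagen's actual discovery code may work differently. It looks at the top level of each path, loads every .py file, and keeps classes that share the module's name and produce callable instances.

```python
import importlib.util
import os
import tempfile


def discover_generators(paths):
    """Return {name: instance} for generator classes found directly in paths.

    Hypothetical sketch; Datagen's actual registration code may differ.
    """
    found = {}
    for path in paths:
        for filename in sorted(os.listdir(path)):  # top level only, no recursion
            name, ext = os.path.splitext(filename)
            full_path = os.path.join(path, filename)
            if ext != ".py" or not os.path.isfile(full_path):
                continue
            spec = importlib.util.spec_from_file_location(name, full_path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            # The class must share the module's name...
            cls = getattr(module, name, None)
            if isinstance(cls, type):
                instance = cls()
                # ...and its instances must be callable (define __call__).
                if callable(instance):
                    found[name] = instance
    return found


# Demonstration with a throwaway directory standing in for a generators folder.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "helloWorld.py"), "w") as handle:
        handle.write(
            "class helloWorld:\n"
            "    def __call__(self, you):\n"
            "        return 'hello world from ' + you\n"
        )
    generators = discover_generators([folder])
    print(generators["helloWorld"]("me"))  # hello world from me
```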
Note: Make sure you don't place code that imports the Datagen class in the same directory as your generators. Doing so will result in a circular import.
To summarize, there are three requirements for a generator:
- It must be a class with a __call__ method
- It must be within a module sharing the name of the class
- Its enclosing path must be passed to Datagen
As you may have noted at this point, custom generators can only be used when using Datagen as a library. This could change pretty easily,