Project description:
Imagine that you have a data pipeline and you need some test data to check the correctness of the data transformations and validations in that pipeline, so you need to generate different input data. In our Capstone project we will solve this by creating a universal console utility that generates the data for us. The output format is JSON only.
Create all descriptions, names for commands, etc by yourself.
You need to implement:
- Name, description, help. The Console Utility (CU) must have a name. You need to set a name for it and the name must be shown in "help". Each command must have a help text. Use argparse to get all those features out of the box (see the sketch after this list).
- All params could be set up from cmd (command line). CU must take all params needed to generate data from the command line, but most of them must have default values.
- Default values. All params for CU must have default values. All those values must be shown in CU help. All default values must be stored in "default.ini" file and read from it with configparser. Names of options in default.ini must be identical to those options in console utility help.
- Logging. All steps must produce output in the console. If you get an error, you must print it to the console with logging.error and only then exit. When you start generating data, write about it in the console, and do the same when you finish. All logs must be printed to the console; if you want, you can also duplicate them to a log file, but that is optional for this project.
- Random data generation dependent on field type and data schema. Described in detail in the "Data Schema Parse" point.
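A minimal sketch of how the name, help, default values, and logging pieces could fit together, assuming the utility is called testgen (as in the launch examples below) and that default.ini keeps its values in a [DEFAULT] section; the option names mirror the table of input params further down, while function names and help texts are illustrative:

```python
import argparse
import configparser
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


def load_defaults(path="default.ini"):
    # Read default values for every CLI option from default.ini with configparser.
    config = configparser.ConfigParser()
    config.read(path)
    return config["DEFAULT"]


def build_parser(defaults):
    parser = argparse.ArgumentParser(
        prog="testgen",
        description="Console utility that generates test JSON data from a data schema.",
    )
    parser.add_argument("path_to_save_files", nargs="?",
                        default=defaults.get("path_to_save_files"),
                        help="Where to save generated files, relative or absolute (default: %(default)s)")
    parser.add_argument("--files_count", type=int,
                        default=defaults.getint("files_count"),
                        help="How many JSON files to generate (default: %(default)s)")
    parser.add_argument("--file_name",
                        default=defaults.get("file_name"),
                        help="Base file name (default: %(default)s)")
    parser.add_argument("--file_prefix", choices=["count", "random", "uuid"],
                        default=defaults.get("file_prefix"),
                        help="Prefix for file names when more than one file is generated (default: %(default)s)")
    parser.add_argument("--data_schema",
                        default=defaults.get("data_schema"),
                        help="JSON schema string or path to a file with the schema (default: %(default)s)")
    parser.add_argument("--data_lines", type=int,
                        default=defaults.getint("data_lines"),
                        help="Number of data lines per file (default: %(default)s)")
    parser.add_argument("--clear_path", action="store_true",
                        help="Delete existing files that match file_name before generating new ones")
    return parser
```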
Data Schema Parse:
Your script must parse data schema and generate data based on it. All keys in data schema == keys in the final data event line.
Example of data schema:
{"date":"timestamp:", "name": "str:rand", "type":"str:['client', 'partner', 'government']", "age": "int:rand(1, 90)"}
All values support the special notation "type:what_to_generate":
":" in a value indicates that the left part of the value is a type.
The type can be timestamp, str, or int.
If the schema contains any other type in the left part of a ":" value, write an error. For example, "str:rand" means that the value of this key must be of str type and is generated randomly.
For the right part of a value with ":" notation, the following kinds are possible:
- rand - random generation. If the left part is the "str" type, generate the value with str(uuid.uuid4()) (you need to use only this one function to generate the uuid - https://docs.python.org/3.6/library/uuid.html#uuid.uuid4). If the left part is the "int" type, use random.randint(0, 10000).
- A list with values []. For example, "str:['client', 'partner', 'government']" or "int:[0, 9, 10, 4]". The generator must take a random value from the list for each generated data line. Use random.choice.
- rand(from, to) - random generation of int values in the prescribed range. Possible to use only with the "int" type. If the left part is "str" or "timestamp", write an error.
- A stand-alone value. If the value written after ":" has a type corresponding to the left part and is not the reserved word "rand", use it on each line of generated data. For example, "name": "str:cat" makes your script generate data where every line contains "name": "cat". But "age": "int:head" in the schema is an error and you must write about it in the console, because "head" cannot be converted to the int type.
- An empty value. This is normal for any type. If the "int" type has an empty value, use None as the value; if the "str" type, use an empty string - "".
For the timestamp type, ignore everything after ":". If a value is present in the schema, write a logging.warning noting that timestamp does not support any values and the value will be ignored, then continue normal script work.
The value for timestamp is always the current Unix timestamp - using time.time() is enough.
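One possible shape of the per-field generation logic described above; the function name generate_value and the exact parsing approach are illustrative assumptions, not requirements:

```python
import ast
import logging
import random
import re
import sys
import time
import uuid


def generate_value(field_type, spec):
    # field_type is the part left of ":", spec is everything right of it.
    if field_type == "timestamp":
        if spec:
            logging.warning("timestamp does not support any values; '%s' will be ignored", spec)
        return time.time()
    if field_type not in ("str", "int"):
        logging.error("Unsupported type '%s' in data schema", field_type)
        sys.exit(1)
    if spec == "rand":
        return str(uuid.uuid4()) if field_type == "str" else random.randint(0, 10000)
    range_match = re.fullmatch(r"rand\((-?\d+),\s*(-?\d+)\)", spec)
    if range_match:
        if field_type != "int":
            logging.error("rand(from, to) is allowed only for the int type")
            sys.exit(1)
        return random.randint(int(range_match.group(1)), int(range_match.group(2)))
    if spec.startswith("[") and spec.endswith("]"):
        return random.choice(ast.literal_eval(spec))
    if spec == "":
        return None if field_type == "int" else ""
    if field_type == "int":
        try:
            return int(spec)
        except ValueError:
            logging.error("Value '%s' cannot be converted to int", spec)
            sys.exit(1)
    return spec  # stand-alone str value, e.g. "str:cat"


# Example of splitting one schema entry and generating a value:
# field_type, _, spec = "int:rand(1, 90)".partition(":")
# generate_value(field_type, spec)  -> e.g. 42
```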
{"date":"timestamp:", "name": "str:rand", "type":"str:['client', 'partner', 'government']", "age": "int:rand(1, 90)"}
List of input params for CU:
You can freely change the names and descriptions.
Name | Description | Behavior |
---|---|---|
path_to_save_files | Where all generated files need to be saved | The user can define the path in two ways: relative to cwd (current working directory) or absolute. Your CU must work correctly with both and check whether such a path exists. If the path exists and is not a directory, exit with an error log. |
files_count | How many JSON files to generate | If files_count < 0, exit with an error. If files_count == 0, print all output to the console instead of writing files. |
file_name | Base file name. With no prefix, the final file name will be file_name.json. With a prefix, the full file name will be file_name_file_prefix.json. | |
file_prefix | What prefix to use for the file name if more than one file needs to be generated | The prefix is a set of possible choices (https://docs.python.org/3.6/library/argparse.html#choices): count, random, uuid (need to use only one function to generate the uuid prefix - https://docs.python.org/3.6/library/uuid.html#uuid.uuid4, str(uuid.uuid4())). |
data_schema | A string with the JSON schema. It can be provided in two ways: as the schema itself directly on the command line, or as a path to a file that contains the schema. | Read the schema and check whether it is correct (see "Data Schema Parse" for details). |
data_lines | Count of lines for each file. Default, for example: 1000. | |
clear_path | If this flag is on, all files in path_to_save_files that match file_name will be deleted before the script starts creating new data files. | Implement as a flag with action='store_true' (https://docs.python.org/3.6/library/argparse.html#action). |
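The path handling, prefixed file naming, and clear_path behavior from the table could look roughly like the sketch below; all function names are illustrative assumptions:

```python
import glob
import logging
import os
import random
import sys
import uuid


def check_save_path(path_to_save_files):
    # Accept both relative (from cwd) and absolute paths.
    path = os.path.abspath(path_to_save_files)
    if os.path.exists(path):
        if not os.path.isdir(path):
            logging.error("'%s' exists and is not a directory", path)
            sys.exit(1)
    else:
        # Assumption: create the directory if it does not exist yet;
        # the task text only requires checking for existence.
        os.makedirs(path)
    return path


def build_file_name(file_name, file_prefix, index):
    # Without a prefix: file_name.json; with a prefix: file_name_<prefix>.json.
    if file_prefix == "count":
        prefix = str(index)
    elif file_prefix == "random":
        prefix = str(random.randint(0, 10000))
    elif file_prefix == "uuid":
        prefix = str(uuid.uuid4())
    else:
        return f"{file_name}.json"
    return f"{file_name}_{prefix}.json"


def clear_existing_files(path, file_name):
    # clear_path behavior: delete files in the target directory matching file_name.
    for old_file in glob.glob(os.path.join(path, f"{file_name}*.json")):
        logging.info("Removing %s", old_file)
        os.remove(old_file)
```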
Example of a CU launch with the data_schema provided directly from cmd:
$ testgen.py . --files_count 3 --file_name super_data --file_prefix count --data_schema "{\"date\": \"timestamp:\", \"name\": \"str:rand\", \"type\": \"str:['client', 'partner', 'government']\", \"age\": \"int:rand(1, 90)\"}"
Data schema from file:
$ testgen.py . --files_count 3 --file_name super_data --file_prefix count --data_schema ./path/to/schema.json
Do not use Traceback errors and raise:
Since it's a console utility, it must behave like one. You do not need to show the interpreter traceback or raise errors to the user. All errors that cause an exit from the script must be processed with "sys.exit(1)" or "exit(1)", not with raise.
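A minimal sketch of this style, using an illustrative validation helper: log the problem, then exit(1), instead of letting an exception traceback reach the user.

```python
import logging
import sys


def validate_files_count(files_count):
    # Log the problem and exit(1) instead of raising and showing a traceback.
    if files_count < 0:
        logging.error("files_count must not be negative, got %d", files_count)
        sys.exit(1)
    return files_count
```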
Acceptance Criteria:
- All features are implemented and work;
- All possible errors are processed, logged, and end with a correct exit (impossible to create a file, incorrect values, etc.);
- Code must be divided into functions/classes/logic blocks. It must not be a monolithic sheet of code;
- Unit tests exist (at least 3 tests; one possible shape is sketched after this list);
- All steps from the Project Checklist module are done.
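One possible shape of the required unit tests, assuming the generate_value helper from the earlier sketch; the import path and test names are illustrative, so adapt them to your actual module layout:

```python
import unittest

from testgen import generate_value  # illustrative import; adjust to your module layout


class GenerateValueTest(unittest.TestCase):
    def test_int_rand_range(self):
        self.assertTrue(1 <= generate_value("int", "rand(1, 90)") <= 90)

    def test_str_stand_alone_value(self):
        self.assertEqual(generate_value("str", "cat"), "cat")

    def test_int_empty_value_is_none(self):
        self.assertIsNone(generate_value("int", ""))


if __name__ == "__main__":
    unittest.main()
```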