Datafaker - Tool for faking data
1. Introduction
Datafaker is a tool for generating large volumes of test data, both batch data and stream data. It is compatible with Python 2.7 and Python 3.4+. You are welcome to download and use it. The GitHub address is:
https://github.com/gangly/datafaker
The documentation is kept in sync on GitHub.
2. Background
Test data is often needed during software development and testing. Typical scenarios include:
- Backend development. After creating a new table, you need to populate it with test data and generate interface data for the front end to use.
- Database performance testing. A large amount of test data must be generated to test database performance.
- Stream data testing. For Kafka streaming data, test data must be generated continuously and written to Kafka.
At the time of the author's research, no open source test data generation tool was found that generates data matching the structure of a MySQL table. The common approach is to insert a few rows into the database by hand. This approach has several disadvantages:
- Wasted work hours. Different data has to be constructed for each field of the table, according to its data type.
- Small amount of data. If you need a lot of data, constructing it by hand is not feasible.
- Not accurate enough. For example, you may need to construct an email address (matching a certain format), a phone number (a fixed number of digits), an IP address (a fixed format), or an age (non-negative and within a range). Such test data is subject to restrictions or rules, and data built by hand may violate the required range or format, causing errors in the backend program.
- Multi-table association. Because so little data is created by hand, keys in different tables may not match each other, so joins return no data.
- Dynamic random writes. For streaming data, for example, you need to write random records to Kafka every few seconds, or insert random rows into MySQL dynamically. Doing this by hand is cumbersome, and it is hard to keep count of the records written.
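To make the "not accurate enough" point concrete, here is a minimal sketch of generating rule-constrained values with only the Python standard library. It is not datafaker's implementation; the function names, the `example.com` domain, the 11-digit phone length, and the 0 to 120 age range are all assumptions chosen for illustration.

```python
import random
import string


def fake_email(rng):
    # Hypothetical rule: 5-10 lowercase letters followed by a fixed domain.
    length = rng.randint(5, 10)
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return local + "@example.com"


def fake_phone(rng, digits=11):
    # Fixed number of digits; the first digit is never zero.
    first = str(rng.randint(1, 9))
    rest = "".join(str(rng.randint(0, 9)) for _ in range(digits - 1))
    return first + rest


def fake_ipv4(rng):
    # Four dot-separated octets, each in 0..255.
    return ".".join(str(rng.randint(0, 255)) for _ in range(4))


def fake_age(rng, low=0, high=120):
    # Never negative, always inside the configured range.
    return rng.randint(low, high)


def fake_row(rng):
    return {
        "email": fake_email(rng),
        "phone": fake_phone(rng),
        "ip": fake_ipv4(rng),
        "age": fake_age(rng),
    }


rng = random.Random(42)  # seeded so that runs are repeatable
rows = [fake_row(rng) for _ in range(1000)]
print(rows[0])
```

Because every value comes from a rule rather than from manual typing, the same code produces one row or a million rows with the same format guarantees.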
Datafaker was created in response to these pain points. It is a test data construction tool for multiple data sources that can simulate most common data types and addresses the pain points above. Datafaker has the following features:
- Multiple data types. Includes common database field types (integer, float, character) and custom types (IP address, email address, ID number, etc.).
- Simulated multi-table association data. By declaring some fields as enumerated types (values randomly selected from a specified list), a large enough amount of data ensures that multiple tables can be joined with each other and that the queries return data.
- Batch data and stream data generation, with a configurable interval for stream data.
- Multiple data output methods, including printing to the screen, files, and remote data sources.
- Multiple data sources. Relational databases, Hive, and Kafka are currently supported. Support will be extended to Mongo, ES, and other data sources.
- Configurable output format. Text and JSON are currently supported.
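The enumerated-type idea behind multi-table association can be sketched as follows. This is an illustration only, not datafaker's rule syntax or code: the table layouts, the `USER_IDS` list, and the helper names are hypothetical. Both tables draw `user_id` from the same list, so a join on that column is very likely to return rows once enough data is generated.

```python
import random

# Shared enumeration: both tables pick user_id from this one list.
USER_IDS = list(range(1, 21))


def make_users(rng, n):
    # As with an enumerated field, the same user_id may appear more than once.
    return [
        {"user_id": rng.choice(USER_IDS), "age": rng.randint(18, 80)}
        for _ in range(n)
    ]


def make_orders(rng, n):
    return [
        {
            "order_id": i + 1,
            "user_id": rng.choice(USER_IDS),
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(n)
    ]


def joined_orders(users, orders):
    # Equivalent in spirit to keeping the orders that join to some user row.
    known = {u["user_id"] for u in users}
    return [o for o in orders if o["user_id"] in known]


rng = random.Random(1)
users = make_users(rng, 1000)
orders = make_orders(rng, 1000)
print(len(joined_orders(users, orders)), "of", len(orders), "orders match a user")
```

If each table instead generated `user_id` as an unconstrained random integer, most orders would reference users that do not exist, which is the manual-data problem described in the background section.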
3. Architecture
Datafaker is written in Python and supports Python 2.7 and Python 3.4+. The current version has been released on PyPI.
The architecture diagram shows the complete execution flow of the tool. As the diagram shows, execution passes through five modules:
- Parameter parser. Parses the command that the user enters on the terminal command line.
- Metadata parser. Users can specify metadata from local files or from remote data source tables. After the parser obtains the content, it parses the text into table field metadata and data construction rules.
- Data construction engine. Based on the construction rules produced by the metadata parser, the engine simulates the generation of different types of data.
- Data routing. Depending on the output type, generation is split into batch data and stream data. For stream data, the generation frequency can be specified. The data is then converted to the user-specified format for output to the chosen data source.
- Data source adapter. Adapts to the different data sources and imports the data into the target data source.
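The five modules above can be sketched as a toy pipeline. This is a simplified illustration under stated assumptions, not datafaker's actual code or command line: the flags `--rows`, `--format`, and `--interval`, the `name:kind` metadata format, and every function name here are hypothetical.

```python
import argparse
import json
import random
import sys
import time


def parse_args(argv):
    # 1. Parameter parser (flags are hypothetical, not datafaker's real options).
    parser = argparse.ArgumentParser(description="toy data faker")
    parser.add_argument("--rows", type=int, default=3)
    parser.add_argument("--format", choices=["text", "json"], default="text")
    parser.add_argument("--interval", type=float, default=0.0)
    return parser.parse_args(argv)


def parse_metadata(text):
    # 2. Metadata parser: one "name:kind" pair per line (assumed format).
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, kind = line.split(":", 1)
            fields.append((name.strip(), kind.strip()))
    return fields


# 3. Data construction engine: one generator per supported kind.
GENERATORS = {
    "int": lambda rng: rng.randint(0, 1000),
    "float": lambda rng: round(rng.uniform(0, 100), 2),
    "ipv4": lambda rng: ".".join(str(rng.randint(0, 255)) for _ in range(4)),
}


def build_row(fields, rng):
    return [(name, GENERATORS[kind](rng)) for name, kind in fields]


def route(fields, rng, rows, fmt, interval):
    # 4. Data routing: format each row; a positive interval simulates a stream.
    for _ in range(rows):
        row = build_row(fields, rng)
        if fmt == "json":
            yield json.dumps(dict(row))
        else:
            yield ",".join(str(value) for _, value in row)
        if interval > 0:
            time.sleep(interval)


def write_to(stream, lines):
    # 5. Data source adapter: here just a file-like object such as stdout.
    count = 0
    for line in lines:
        stream.write(line + "\n")
        count += 1
    return count


def main(argv, meta_text, out=sys.stdout):
    args = parse_args(argv)
    fields = parse_metadata(meta_text)
    rng = random.Random(0)
    lines = route(fields, rng, args.rows, args.format, args.interval)
    return write_to(out, lines)


main(["--rows", "2", "--format", "json"], "id:int\nscore:float\nip:ipv4")
```

Keeping the adapter behind a plain `write` call is one way to let the same routing code target the screen, a file, or a remote data source; a real adapter for a database or Kafka would replace `write_to`.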
4. Installation
Method 1, install from source code:
Download the source code, unzip it, and install:
python setup.py install
Method 2, use pip:
pip install datafaker
Upgrade the tool:
pip install datafaker --upgrade
Uninstall the tool:
pip uninstall datafaker
Install the package required for your data source:
| data source | package | note |
|---|---|---|
| mysql/tidb | mysql-python/mysqlclient | use mysqlclient on Windows with Python 3 |
| oracle | cx-Oracle | requires some Oracle libraries |
| postgresql/redshift | psycopg2 | |
| sqlserver | pyodbc | mssql+pyodbc://mssql-v |
| hbase | happybase, thrift | |
| es | elasticsearch | |
| hive | pyhive | |
| kafka | kafka-python | |
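Before pointing the tool at a data source, it can help to check which of these optional driver packages are importable. The sketch below is not part of datafaker. The import names are assumptions based on how these packages are usually imported (for example, mysqlclient and mysql-python as `MySQLdb`, kafka-python as `kafka`), and `importlib.util.find_spec` requires Python 3.4+.

```python
import importlib.util

# Data source -> top-level modules its driver packages are assumed to provide.
DRIVERS = {
    "mysql/tidb": ["MySQLdb"],
    "oracle": ["cx_Oracle"],
    "postgresql/redshift": ["psycopg2"],
    "sqlserver": ["pyodbc"],
    "hbase": ["happybase", "thrift"],
    "es": ["elasticsearch"],
    "hive": ["pyhive"],
    "kafka": ["kafka"],
}


def missing_modules(modules):
    # find_spec returns None when a top-level module cannot be located.
    return [m for m in modules if importlib.util.find_spec(m) is None]


def report():
    return {source: missing_modules(mods) for source, mods in DRIVERS.items()}


for source, missing in sorted(report().items()):
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print("{:<22}{}".format(source, status))
```

Only the drivers for the data sources you actually use need to be installed; a "missing" line for an unused source can be ignored.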
5. Examples
6. Command parameters
7. Construction rules
8. Notes
9. Release notes
Give the project a star or buy the author a coffee