The Schematizer is a schema store service that tracks and manages all the schemas used in the Data Pipeline and provides features like automatic documentation support. We use Apache Avro to represent our schemas.
git clone git@github.com:Yelp/schematizer.git
Running unit tests
make -f Makefile-opensource test
Running integration tests
make -f Makefile-opensource itest
- Create a MySQL database for the Schematizer service::
CREATE DATABASE <db_name> DEFAULT CHARACTER SET utf8;
- Create the MySQL tables in the <db_name> database for the Schematizer service::
cat schema/tables/*.sql | mysql <db_name>
- Create a topology.yaml file::
topology:
- cluster: <schematizer_cluster_name>
  replica: master
  entries:
  - charset: utf8
    use_unicode: true
    host: <db_ip>
    db: <db_name>
    user: <db_user>
    passwd: <db_password>
    port: <db_port>
- In config.yaml, assign values to the following configs::
schematizer_cluster: <schematizer_cluster_name>
topology_path: /path/to/topology.yaml
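The two files above must agree on the cluster name, so it can help to generate them from one set of values. The following is a minimal Python 3 sketch, not part of Schematizer: ``render_topology`` and ``render_config`` are hypothetical helpers, and the YAML nesting (``entries`` as a list under each cluster) is an assumption inferred from the key order shown above. Values are inserted verbatim, so a password containing YAML special characters would need quoting by hand.

```python
def render_topology(cluster, host, db, user, passwd, port):
    """Return topology.yaml text for a single master replica (hypothetical helper)."""
    lines = [
        "topology:",
        "- cluster: {}".format(cluster),
        "  replica: master",
        "  entries:",
        "  - charset: utf8",
        "    use_unicode: true",
        "    host: {}".format(host),
        "    db: {}".format(db),
        "    user: {}".format(user),
        "    passwd: {}".format(passwd),
        "    port: {}".format(port),
    ]
    return "\n".join(lines) + "\n"


def render_config(cluster, topology_path):
    """Return the two config.yaml settings that point at the topology file."""
    return "schematizer_cluster: {}\ntopology_path: {}\n".format(
        cluster, topology_path
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cluster_name = "schematizer"
    print(render_topology(cluster_name, "127.0.0.1", "schematizer_db",
                          "db_user", "db_password", 3306))
    print(render_config(cluster_name, "/path/to/topology.yaml"))
```

Passing the same ``cluster_name`` to both helpers keeps ``schematizer_cluster`` in config.yaml consistent with the ``cluster`` entry in topology.yaml.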
Use serviceinitd/schematizer.py to start the Schematizer service.
Registering a schema::
curl -X POST --header 'Content-Type: application/json' --header 'Accept: text/plain' -d '{
"namespace": "test_namespace",
"source_owner_email": "test@test.com",
"source": "test_source",
"contains_pii": false,
"schema": "{\"type\":\"record\",\"namespace\":\"test_namespace\",\"source\":\"test_source\",\"name\":\"test_name\",\"doc\":\"test_doc\",\"fields\":[{\"type\":\"string\",\"doc\":\"test_doc1\",\"name\":\"key1\"},{\"type\":\"string\",\"doc\":\"test_doc2\",\"name\":\"key2\"}]}"
}' 'http://127.0.0.1:8888/v1/schemas/avro'
Getting a schema by ID::
curl -X GET --header 'Accept: text/plain' 'http://127.0.0.1:8888/v1/schemas/<schema_id>'
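In the registration request, the ``schema`` field is itself a JSON-encoded string nested inside the JSON body, which is why the curl example escapes every inner quote. The sketch below shows one way to build the same two requests from Python 3 using only the standard library, so the escaping is handled by ``json.dumps``. The helper names ``build_register_request`` and ``build_get_request`` are hypothetical. The host, port, paths, and headers are copied from the curl examples above.

```python
import json
import urllib.request

# Same host and port as the curl examples; adjust for your deployment.
BASE_URL = "http://127.0.0.1:8888"


def build_register_request(avro_schema, namespace, source, owner_email,
                           contains_pii=False, base_url=BASE_URL):
    """Build (but do not send) the POST request that registers an Avro schema."""
    payload = {
        "namespace": namespace,
        "source": source,
        "source_owner_email": owner_email,
        "contains_pii": contains_pii,
        # The Avro schema is sent as a JSON string, not as a nested object.
        "schema": json.dumps(avro_schema),
    }
    return urllib.request.Request(
        base_url + "/v1/schemas/avro",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "text/plain"},
        method="POST",
    )


def build_get_request(schema_id, base_url=BASE_URL):
    """Build (but do not send) the GET request that fetches a schema by ID."""
    return urllib.request.Request(
        "{}/v1/schemas/{}".format(base_url, schema_id),
        headers={"Accept": "text/plain"},
        method="GET",
    )
```

Passing either request to ``urllib.request.urlopen`` sends it to a running Schematizer instance. The response format is not described here, so inspect the returned body before relying on any particular field.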
Registering a schema::
from data_pipeline.schematizer_clientlib.schematizer import get_schematizer
test_avro_schema_json = {
    "type": "record",
    "namespace": "test_namespace",
    "source": "test_source",
    "name": "test_name",
    "doc": "test_doc",
    "fields": [
        {"type": "string", "doc": "test_doc1", "name": "key1"},
        {"type": "string", "doc": "test_doc2", "name": "key2"}
    ]
}
schema_info = get_schematizer().register_schema_from_schema_json(
    namespace="test_namespace",
    source="test_source",
    schema_json=test_avro_schema_json,
    source_owner_email="test@test.com",
    contains_pii=False
)
Getting a schema by ID::
from data_pipeline.schematizer_clientlib.schematizer import get_schematizer
schema_info = get_schematizer().get_schema_by_id(
    schema_id=schema_info.schema_id
)
We're still in the process of setting up this service as a stand-alone service. Additional work may be required to run a Schematizer instance and integrate it with other applications.
Schematizer is licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
Everyone is encouraged to contribute to Schematizer by forking the GitHub repository and making a pull request or opening an issue.