A metadata toolkit written in Python
About
Recap reads and converts schemas in dozens of formats, including Parquet, Protocol Buffers, Avro, JSON schema, BigQuery, Snowflake, and PostgreSQL.
Features
- Read schemas from filesystems, object stores, and databases.
- Convert schemas between Parquet, Protocol Buffers, Avro, and JSON schema.
- Generate CREATE TABLE DDL from schemas for popular database SQL dialects.
- Infer schemas from unstructured data like CSV, TSV, and JSON.
Compatibility
- Any SQLAlchemy-compatible database
- Any fsspec-compatible filesystem
- Parquet, Protocol Buffers, Avro, and JSON schema
- CSV, TSV, and JSON files
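All of these backends are addressed with URLs, so presumably the reader is selected from the URL scheme. A standard-library sketch (not Recap's internals) of what the URLs used in the examples below decompose into:

```python
from urllib.parse import urlparse

urls = [
    "s3://corp-logs/2022-03-01/0.json",                                 # object store
    "snowflake://ycbjbzl-ib10693/TEST_DB/PUBLIC/311_service_requests",  # database
]
for url in urls:
    parts = urlparse(url)
    # scheme picks the backend; netloc and path locate the object or table
    print(parts.scheme, parts.netloc, parts.path)
```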
Installation
pip install recap-core
Examples
Read schemas from objects:
s = from_proto(message)
Or files:
s = schema("s3://corp-logs/2022-03-01/0.json")
Or databases:
s = schema("snowflake://ycbjbzl-ib10693/TEST_DB/PUBLIC/311_service_requests")
And convert them to other formats:
to_json_schema(s)
{
"type": "object",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "string"
}
},
"required": [
"id"
]
}
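Since the output is standard JSON schema, it can be fed to any JSON Schema validator. As a minimal standard-library sketch (not Recap's API), here is a required-field check against the schema emitted above; the `record` value is made up for illustration:

```python
import json

# The JSON schema emitted above.
schema = json.loads("""
{
  "type": "object",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"}
  },
  "required": ["id"]
}
""")

record = {"id": 1, "name": "example"}
# Collect any required properties missing from the record.
missing = [f for f in schema.get("required", []) if f not in record]
print(missing)  # []
```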
Or even CREATE TABLE statements:
s = schema("/tmp/data/file.json")
to_ddl(s, "my_table", dialect="snowflake")
CREATE TABLE "my_table" (
"col1" BIGINT,
"col2" STRUCT<"col3" VARCHAR>
)
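Conceptually, DDL generation maps each schema field type onto the target dialect's SQL column type. A hypothetical, simplified sketch of that idea (`TYPE_MAP` and `to_ddl_sketch` are illustrative names, not Recap's API):

```python
# Illustrative only: a toy mapping from schema types to SQL column types.
TYPE_MAP = {"integer": "BIGINT", "string": "VARCHAR"}

def to_ddl_sketch(fields, table):
    """Render a CREATE TABLE statement from (name, type) pairs."""
    cols = ", ".join(f'"{name}" {TYPE_MAP[ftype]}' for name, ftype in fields)
    return f'CREATE TABLE "{table}" ({cols})'

print(to_ddl_sketch([("id", "integer"), ("name", "string")], "my_table"))
# CREATE TABLE "my_table" ("id" BIGINT, "name" VARCHAR)
```

A real implementation would also handle nested types (like the STRUCT column above) and per-dialect quoting rules.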
Getting Started
See the Quickstart page to get started.