Schema Analyzer

An Open Source joint by Dan Levy ✨

Analyze column type & size summary from any input JSON array

Schema Analyzer is the core library behind Dan's Schema Generator.

Features

The primary goal is to support any input JSON/CSV and infer as much as possible. More data will generally yield better results.

Getting Started

npm install schema-analyzer

import { schemaBuilder } from 'schema-builder'

schemaBuilder(schemaName: String, data: Array<Object>): TypeSummary

Preview Analysis Results

What does this library's analysis look like?

It consists of 3 key top-level properties:

totalRows - # of rows analyzed.
fields: FieldTypeSummary - a map of field names with all detected types (includes meta-data for each type detected, with possible overlaps. e.g. an Email is also a String, "42" is a String and Number)

Review the raw results below

Details about each field can be found below.

{
  "totalRows": 5,
  "fields": {
    "id": {
      "types": {
        "Number": {
          "rank": 8,
          "count": 5,
          "value": { "min": 1, "mean": 3, "max": 5, "p25": 2, "p33": 2, "p50": 3, "p66": 4, "p75": 4, "p99": 5 }
        },
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 1, "mean": 1, "max": 1, "p25": 1, "p33": 1, "p50": 1, "p66": 1, "p75": 1, "p99": 1 }
        }
      }
    },
    "name": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 3, "mean": 7.2, "max": 15, "p25": 3, "p33": 3, "p50": 5, "p66": 10, "p75": 10, "p99": 15 }
        }
      }
    },
    "role": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 4, "mean": 5.4, "max": 9, "p25": 4, "p33": 4, "p50": 5, "p66": 5, "p75": 5, "p99": 9 }
        }
      }
    },
    "email": {
      "types": {
        "Email": {
          "rank": 11,
          "count": 5,
          "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
        }
      }
    },
    "createdAt": {
      "types": {
        "Date": {
          "rank": 4,
          "count": 4,
          "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 6, "mean": 6, "max": 6, "p25": 6, "p33": 6, "p50": 6, "p66": 6, "p75": 6, "p99": 6 }
        }
      }
    },
    "accountConfirmed": {
      "types": {
        "Unknown": {
          "rank": -1,
          "count": 1
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 9, "mean": 9, "max": 9, "p25": 9, "p33": 9, "p50": 9, "p66": 9, "p75": 9, "p99": 9 }
        },
        "Boolean": {
          "rank": 3,
          "count": 4
        }
      }
    }
  }
}

Sample input dataset for the example results above

id	name	role	email	createdAt	accountConfirmed
1	Eve	poweruser	`eve@example.com`	01/20/2020	undefined
2	Alice	user	`ali@example.com`	02/02/2020	true
3	Bob	user	`robert@example.com`	12/31/2019	true
4	Elliot Alderson	admin	`falkensmaze@protonmail.com`	01/01/2001	false
5	Sam Sepiol	admin	`falkensmaze@hotmail.com`	9/9/99	true

`AggregateSummary`

Numeric and String types include a summary of the observed field sizes:

Number & String Range Object Details

Properties

min the minimum number or string length
max the maximum number or string length
mean the average number or string length
percentiles[25th, 33th, 50th, 66th, 75th, 99th] values from the Nth percentile number or string length

Percentile is based on input data, as-is with out sorting.

Length Range Data

Range data for the length of a String field type:

{
  "rank": 11,
  "count": 5,
  "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
}

This is useful for defining strict length limits or minimums, for example as SQL servers often require..

Range data for a Date fields value:

{
  "rank": 4,
  "count": 4,
  "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
}

Notes

We recommend you provide at least 100+ rows. Accuracy increases greatly with 1,000 rows.

The following features require a certain minimum # of records:

Enumeration detection.
- 100+ Rows Required.
- Number of unique values must not exceed 20 or 5% of the total number of records. (100 records will identify as Enum w/ 5 values. Up to 20 are possible given 400 or 1,000+.)
Not Null detection.
- where rowCount === field count

Full List of Detected Types

Unknown
ObjectId
UUID
Boolean
Date
Timestamp
Currency
Float
Number
Email
String
Array
Object
Null

justsml/schema-analyzer