rnag/dataclass-wizard

[wiz-cli] duplicate dataclass schemas should be replaced with just one

rnag opened this issue · 0 comments

rnag commented
  • Dataclass Wizard version: 0.22.1
  • Python version: 3.10
  • Operating System: Mac OS

Description

In certain cases, especially in API responses such as those from AWS Rekognition, the input JSON object can contain multiple definitions for the same field (for example, "element"), all of which share an identical schema.

I'd like to eliminate those duplicate dataclass definitions in the output, so that the generated schema is a bit less verbose and we only have the data we care about.

For example, note the below sample input and output.

What I Did

I ran the following command from my Mac terminal:

echo '{
    "element": {
        "my_str": "string",
        "my_int": 3
    },
    "Elements": [
        {
            "my_str": "hello",
            "my_int": 5
        },
        {
            "myStr": "world",
            "MyInt": 7
        }
    ],
    "other_field": {
        "element": {
            "my_str": "other string",
            "my_int": 42
        }
    }
}' | wiz gs

The generated output is a bit noisy in this scenario, as it contains duplicate definitions of the dataclass Element:

from dataclasses import dataclass
from typing import List

from dataclass_wizard import JSONWizard


@dataclass
class Data(JSONWizard):
    """
    Data dataclass

    """
    element: 'Element'
    elements: List['Element']
    other_field: 'OtherField'


@dataclass
class Element:
    """
    Element dataclass

    """
    my_str: str
    my_int: int


@dataclass
class Element:
    """
    Element dataclass

    """
    my_str: str
    my_int: int


@dataclass
class OtherField:
    """
    OtherField dataclass

    """
    element: 'Element'


@dataclass
class Element:
    """
    Element dataclass

    """
    my_str: str
    my_int: int

I'd like to eliminate all the duplicate definitions, preferably by trimming any duplicates that appear after the first dataclass schema for Element.

Resolution

There are multiple ways to achieve this, but I think the easiest might be to store the generated string or __repr__ for each schema in a dict keyed by class name, and then look up the class name and compare whether the string definitions are the same. If they are, we just continue and return an empty __repr__ for every occurrence after the first. If not, we generate all the field names and types for the dataclass as normal.
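
As a rough illustration of the idea above, here is a minimal standalone sketch. It does not use the actual wiz-cli internals: the function name `dedupe_schemas` and the (class name, definition string) input shape are hypothetical, chosen only to show the lookup-and-compare step.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dedupe_schemas(schemas: Iterable[Tuple[str, str]]) -> List[str]:
    """Return definitions in order, skipping exact repeats for a class name.

    `schemas` yields (class_name, definition) pairs in generation order.
    This is a hypothetical helper, not part of dataclass-wizard.
    """
    # class name -> set of definition strings already emitted for that name
    seen: Dict[str, Set[str]] = {}
    output: List[str] = []

    for name, definition in schemas:
        emitted = seen.setdefault(name, set())
        if definition in emitted:
            # Same class name and identical definition: skip the duplicate.
            continue
        # First time this exact definition is seen for this name: keep it.
        emitted.add(definition)
        output.append(definition)

    return output


element = "@dataclass\nclass Element:\n    my_str: str\n    my_int: int\n"
other_field = "@dataclass\nclass OtherField:\n    element: 'Element'\n"

result = dedupe_schemas([
    ("Element", element),
    ("Element", element),
    ("OtherField", other_field),
    ("Element", element),
])
```

With the sample input from this issue, `result` would hold one Element definition followed by OtherField. A class that reuses a name but has different fields is still emitted, since only exact string matches are treated as duplicates.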