instill-ai/instill-core

[JSON] Support Rename Fields for JSON operator

Closed this issue · 13 comments

Issue Description

Current State

  • It is very difficult to manipulate JSON data with JSON operator.

Proposed Change

  • Please fetch this JSON Schema to implement the functions.
  • Manipulating JSON data

JSON schema pseudo code

JsonOperator:
  Task: Rename fields
  
  Input:
    data: 
      type: object
      description: Original data, which can be a JSON object or array of objects.
    fields: 
      type: array
      description: An array of objects specifying the fields to be renamed.
      items:
        type: object
        properties:
          currentField: 
            type: string
            description: The field name in the original data to be replaced, supports nested paths if "supportDotNotation" is true.
          newField: 
            type: string
            description: The new field name that will replace the currentField, supports nested paths if "supportDotNotation" is true.
#    supportDotNotation:
#      type: boolean
#      default: true
#      description: Determines whether to interpret field names as paths using dot notation. If false, fields are treated as literal keys.
    conflictResolution:
      type: string
      enum: [overwrite, skip, error]
      default: overwrite
      description: Defines how conflicts are handled when the newField already exists in the data.
  
  Output:
    data:
      type: object
      description: The modified data with the specified fields renamed.
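
To make the input contract above concrete, here is a minimal Python sketch that checks a payload against the pseudo schema and applies the conflictResolution default. The function name validate_rename_input and the error messages are illustrative assumptions, not part of the task definition, and "data" is accepted as either an object or an array, following the description rather than the declared type.

```python
from typing import Any

# Values taken from the conflictResolution enum in the pseudo schema.
ALLOWED_CONFLICT_RESOLUTIONS = ("overwrite", "skip", "error")


def validate_rename_input(payload: dict[str, Any]) -> dict[str, Any]:
    """Check a payload against the pseudo schema and fill in its default."""
    data = payload.get("data")
    if not isinstance(data, (dict, list)):
        raise ValueError("'data' must be a JSON object or an array of objects")

    fields = payload.get("fields")
    if not isinstance(fields, list):
        raise ValueError("'fields' must be an array")
    for i, field in enumerate(fields):
        if not isinstance(field, dict):
            raise ValueError(f"fields[{i}] must be an object")
        for key in ("currentField", "newField"):
            if not isinstance(field.get(key), str):
                raise ValueError(f"fields[{i}].{key} must be a string")

    # The schema declares "overwrite" as the default strategy.
    conflict = payload.get("conflictResolution", "overwrite")
    if conflict not in ALLOWED_CONFLICT_RESOLUTIONS:
        raise ValueError(f"Unsupported conflictResolution: {conflict!r}")

    return {"data": data, "fields": fields, "conflictResolution": conflict}
```

Validating up front keeps an unsupported strategy name from surfacing only when the first conflict is hit.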

Key Features:
conflictResolution: Handling conflicts when renaming fields in JSON is critical to avoid data loss or unexpected behavior, especially with nested objects and dot notation. Letting users specify how conflicts are resolved (e.g., via a parameter such as conflictResolution: 'overwrite'|'skip'|'error'):

  • Provides flexibility and control to the user.
  • Adapts to different use cases.

Here are different strategies to manage conflicts and some considerations for each.

1. Overwrite the Existing Field (Default Behavior)

Description: If the newField already exists in the object, overwrite its value with the value from currentField.
Pros:

  • Simple and straightforward.
  • Useful when the intention is to replace the existing value.

Cons:

  • Can lead to data loss if not used carefully.

Implementation:

# Overwrite: any value already stored under new_key is replaced.
obj[new_key] = obj.pop(current_key)

2. Skip the Renaming Operation

Description: If the newField already exists, skip the renaming operation for that particular field.
Pros:

  • Prevents accidental overwriting of data.
  • Safeguards against potential conflicts without altering the existing data.

Cons:

  • The currentField remains unchanged, which might not be the desired outcome.

Implementation:

if new_key not in obj:
    obj[new_key] = obj.pop(current_key)
# Otherwise skip: new_key already exists, so obj is left unchanged.

3. Merge Values

Description: If both currentField and newField exist and contain objects or arrays, merge the two values. This approach is more complex but can be very powerful.
Pros:

  • Preserves both sets of data.
  • Useful for combining information rather than choosing one over the other.

Cons:

  • Can be complex to implement, especially if the data types of currentField and newField differ.
  • May require custom logic depending on how you want to merge the data (e.g., combining arrays, merging objects, etc.).

Implementation:

if new_key in obj:
    if isinstance(obj[new_key], dict) and isinstance(obj[current_key], dict):
        # Merge dictionaries
        obj[new_key].update(obj.pop(current_key))
    elif isinstance(obj[new_key], list) and isinstance(obj[current_key], list):
        # Merge lists
        obj[new_key].extend(obj.pop(current_key))
    else:
        # Handle other types (overwrite, append, etc.)
        obj[new_key] = obj.pop(current_key)
else:
    obj[new_key] = obj.pop(current_key)

4. Rename with a Suffix or Prefix

Description: If the newField already exists, rename the new field by appending a suffix or prefix (e.g., _1, _conflict) to avoid conflicts.
Pros:

  • Both original and new data are preserved.
  • Easy to track conflicts.

Cons:

  • The resulting data structure may become less predictable or harder to work with if many conflicts occur.

Implementation:

suffix = 1
original_new_key = new_key
while new_key in obj:
    new_key = f"{original_new_key}_{suffix}"
    suffix += 1
obj[new_key] = obj.pop(current_key)

5. Return an Error or Warning

Description: If a conflict is detected, stop the operation and return an error or warning to the user. This forces the user to address the conflict before proceeding.
Pros:

  • Prevents accidental data overwriting.
  • Makes the user aware of potential issues immediately.

Cons:

  • Halts the process, which might be undesirable in automated workflows.

Implementation:

if new_key in obj:
    raise ValueError(f"Conflict detected: '{new_key}' already exists.")
else:
    obj[new_key] = obj.pop(current_key)

Summary:

  • Overwrite: Simple and effective, but can lead to data loss.
  • Skip: Safe but may leave data unchanged.
  • Error/Warning: Forces user intervention; best for critical operations.

Choose the strategy that best aligns with your application's needs and the user's expectations. Implementing a combination of these strategies, such as providing a default behavior with options for customization, can offer the best balance between usability and robustness.
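
The three strategies exposed by the conflictResolution enum (overwrite, skip, error) can be folded into one helper. The sketch below works on a single flat object only; the name rename_key and its boolean return value are assumptions made for illustration.

```python
def rename_key(obj: dict, current_key: str, new_key: str,
               conflict_resolution: str = "overwrite") -> bool:
    """Rename current_key to new_key in place.

    Returns True if the key was renamed, False if nothing changed.
    """
    if conflict_resolution not in ("overwrite", "skip", "error"):
        raise ValueError(f"Unsupported conflictResolution: {conflict_resolution!r}")

    # Nothing to do when the source is missing or the names are identical.
    if current_key not in obj or current_key == new_key:
        return False

    if new_key in obj:
        if conflict_resolution == "skip":
            return False  # keep both fields as they are
        if conflict_resolution == "error":
            raise ValueError(f"Conflict detected: '{new_key}' already exists.")
        # "overwrite" falls through and replaces the existing value.

    obj[new_key] = obj.pop(current_key)
    return True
```

Returning a boolean lets a caller report which renames were skipped without treating a skip as a failure.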

Example Usage:

Scenario: Input data as JSON object

// input
{
  "data": {
    "name": "John Doe",
    "age": 30,
    "address": {
      "street": "123 Main St",
      "city": "Anytown",
      "state": "CA"
    },
    "state": "conflict"
  },
  "fields": [
    {"currentField": "address.street", "newField": "address.road"},
    {"currentField": "state", "newField": "address.state"}
  ],
  // "supportDotNotation": true,
  "conflictResolution": "overwrite"
}

Conflict Resolution Scenarios:
1. Overwrite (Default):

  • The state field in data would be moved to address.state, overwriting the existing address.state field.
  • Final output:
{
  "data": {
    "name": "John Doe",
    "age": 30,
    "address": {
      "road": "123 Main St",
      "city": "Anytown",
      "state": "conflict"
    }
  }
}

2. Skip:

  • The renaming of state to address.state would be skipped, so both state and address.state remain unchanged.
  • Final output:
{
  "data": {
    "name": "John Doe",
    "age": 30,
    "address": {
      "road": "123 Main St",
      "city": "Anytown",
      "state": "CA"
    },
    "state": "conflict"
  }
}

3. Error:

  • The process would raise an error, stopping execution, because address.state already exists.

ValueError: Conflict detected: 'address.state' already exists.
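
The three scenarios above can be reproduced with a short Python sketch that resolves dot-notation paths through nested objects. This is an illustration under stated assumptions, not the operator's actual implementation: the names rename_fields and _parent_and_key are hypothetical, the input is copied rather than modified, a missing source path is silently ignored, and missing parents of the destination path are created.

```python
import copy


def _parent_and_key(root, path, create=False):
    """Follow a dot-separated path and return (parent dict, final key).

    Returns (None, key) when an intermediate segment is missing or not a dict.
    """
    *parents, key = path.split(".")
    node = root
    for part in parents:
        if not isinstance(node, dict):
            return None, key
        if part not in node:
            if not create:
                return None, key
            node[part] = {}  # assumption: missing destination parents are created
        node = node[part]
    return (node, key) if isinstance(node, dict) else (None, key)


def rename_fields(data, fields, conflict_resolution="overwrite"):
    """Return a copy of data with each currentField moved to newField."""
    result = copy.deepcopy(data)  # leave the caller's object untouched
    for field in fields:
        current_path, new_path = field["currentField"], field["newField"]
        if current_path == new_path:
            continue
        src, src_key = _parent_and_key(result, current_path)
        if src is None or src_key not in src:
            continue  # assumption: a missing source field is not an error
        dst, dst_key = _parent_and_key(result, new_path, create=True)
        if dst is None:
            raise ValueError(f"Cannot create '{new_path}': a parent is not an object.")
        if dst_key in dst:
            if conflict_resolution == "skip":
                continue
            if conflict_resolution == "error":
                raise ValueError(f"Conflict detected: '{new_path}' already exists.")
        dst[dst_key] = src.pop(src_key)
    return result
```

Called with the "data" and "fields" values from the input above, rename_fields yields the overwrite and skip outputs shown, and raises the ValueError from the error scenario when conflict_resolution is "error".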

Scenario: Input Data as an Array of Objects

If the input data is an array of objects, the logic needs to be adapted to handle each object in the array individually. The schema and the function would process each object within the array according to the specified fields and conflictResolution rules.

Below is an example demonstrating how the "Rename Fields" operation would work with input data that is an array of objects.

Input

{
  "data": [
    {
      "name": "John Doe",
      "age": 30,
      "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "state": "CA"
      },
      "contacts": [
        {
          "type": "email",
          "value": "john.doe@example.com"
        }
      ]
    },
    {
      "name": "Jane Smith",
      "age": 28,
      "address": {
        "street": "456 Oak St",
        "city": "Othertown",
        "state": "NY"
      }
      // Note: Jane Smith does not have a "contacts" field
    }
  ],
  "fields": [
    {"currentField": "name", "newField": "fullName"},
    {"currentField": "address.street", "newField": "address.road"},
    {"currentField": "contacts.0.value", "newField": "contacts.0.contactInfo"},
    {"currentField": "age", "newField": "yearsOld"}
  ],
//  "supportDotNotation": true,
  "conflictResolution": "skip"
}

Explanation:

  • Field "name": The "name" field will be renamed to "fullName" for each object in the array.
  • Field "address.street": The "street" field inside the "address" object will be renamed to "road" for each object.
  • Field "contacts.0.value": The "value" field inside the first element of the "contacts" array will be renamed to "contactInfo" for the first object, but this step will be skipped for the second object because the "contacts" field does not exist.
  • Field "age": The "age" field will be renamed to "yearsOld" for each object.

Output:

{
  "data": [
    {
      "fullName": "John Doe",
      "yearsOld": 30,
      "address": {
        "road": "123 Main St",
        "city": "Anytown",
        "state": "CA"
      },
      "contacts": [
        {
          "type": "email",
          "contactInfo": "john.doe@example.com"
        }
      ]
    },
    {
      "fullName": "Jane Smith",
      "yearsOld": 28,
      "address": {
        "road": "456 Oak St",
        "city": "Othertown",
        "state": "NY"
      }
      // The "contacts" field is not present, so no renaming occurs for "contacts.0.value"
    }
  ]
}
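
The array case can be sketched in Python by applying every rename to each element in turn, with numeric path segments such as the 0 in contacts.0.value treated as list indices. The names rename_in_each, _step, and _resolve_parent are hypothetical, and the sketch assumes that a path which cannot be followed in a given element is skipped for that element and that the parent of the new field must already exist.

```python
import copy


def _step(node, part):
    """Return the child of node at part, or None if the path cannot be followed."""
    if isinstance(node, dict):
        return node.get(part)
    if isinstance(node, list) and part.isdigit() and int(part) < len(node):
        return node[int(part)]  # numeric segment used as a list index
    return None


def _resolve_parent(record, path):
    """Return (parent dict, final key), or (None, key) if the parent is unreachable."""
    *parents, key = path.split(".")
    node = record
    for part in parents:
        node = _step(node, part)
        if node is None:
            return None, key
    return (node, key) if isinstance(node, dict) else (None, key)


def rename_in_each(data, fields, conflict_resolution="skip"):
    """Apply the renames to one object or to every object in an array."""
    result = copy.deepcopy(data)
    records = result if isinstance(result, list) else [result]
    for record in records:
        for field in fields:
            src, src_key = _resolve_parent(record, field["currentField"])
            dst, dst_key = _resolve_parent(record, field["newField"])
            if src is None or dst is None or src_key not in src:
                continue  # e.g. an element without a "contacts" field
            if dst_key in dst:
                if conflict_resolution == "skip":
                    continue
                if conflict_resolution == "error":
                    raise ValueError(
                        f"Conflict detected: '{field['newField']}' already exists."
                    )
            dst[dst_key] = src.pop(src_key)
    return result
```

Because each element is processed independently, the missing "contacts" field in the second object only suppresses that one rename there; the other three renames still apply to it.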

Rules for the Component Hackathon

  • Each issue will only be assigned to one person/team at a time.
  • You can only work on one issue at a time.
  • To express interest in an issue, please comment on it and tag @kuroxx, allowing the Instill AI team to assign it to you.
  • Ensure you address all feedback and suggestions provided by the Instill AI team.
  • If no commits are made within five days, the issue may be reassigned to another contributor.
  • Join our Discord to engage in discussions and seek assistance in the #hackathon channel. For technical queries, you can tag @chuang8511.

Component Contribution Guideline | Documentation | Official Go Tutorial

I am interested. Can I work on this issue?

Hello @Danbaba1, sure I have assigned this ticket to you! 🙌

Hey @Danbaba1, I have removed you as an assignee because there has been no activity for the past 2 weeks 🙏 Please comment again if you are still working on it, thanks

I would like to give this issue a try. Can you please assign it to me?

@AkashJana18 Sounds good, I have assigned it to you!

Hey @chuang8511 @ShihChun-H Could you please guide me on where to make the changes for implementing JSON manipulation with the JsonOperator schema? I haven’t worked with this tech stack before, so any pointers on relevant files, modules, or general structure would be very helpful. Thanks in advance!

I would like to work on this issue. Can you please assign it to me?

Hey @gagan-bhullar-tech, I am already working on it. Would you like to collaborate?

@AkashJana18
Sorry, I posted the wrong JSON schema.

Could you take a look at this?

We have already built the task definition, so all you have to do is work on the Golang implementation.

@chuang8511 So the Golang implementation needs to be done in the pipeline-backend repo?

@AkashJana18
Yes, please check the guideline.

Hey @AkashJana18, how's it going?

I wanted to let you know that we will need a PR by the end of this week (8th Nov) since we are closing this event.

Please submit:

to ensure your contribution is counted!

Alternatively, if you cannot complete this within the time frame but would still like to contribute, you are more than welcome to, but please note that it would not be within the scope of Hacktoberfest 2024.

Thank you, and we look forward to your contribution! ✨