kubeflow/katib

Unable to keep the data type intact when passing data from one component to another

mayankagarwal19911 opened this issue · 1 comment

/kind bug

What steps did you take and what happened:
When running the notebook below, I cannot preserve the data type of one component's output when passing it as input to another component.

from typing import NamedTuple

from kfp import dsl
from kfp.dsl import Dataset, Input

@dsl.component(
    base_image="python image with required libraries"
)
def load_and_preprocess_data(file_path: str) -> NamedTuple('Outputs', [
    ('X_data', Dataset),
    ('Y_target', Dataset),
]):

    import pandas as pd
    import seaborn as sns
    import numpy as np
    
    sns.set(style="whitegrid")
    np.random.seed(203)

    data = pd.read_csv(file_path)
    data["Time"] = data["Time"].apply(lambda x: x / 3600 % 24)

    X_data = data.drop(['Class'], axis=1).values  # type numpy.ndarray
    Y_target = data["Class"].values  # type numpy.ndarray
   
    from collections import namedtuple
    output = namedtuple('Outputs', ['X_data', 'Y_target'])
    
    return output(X_data, Y_target)

@dsl.component(
    base_image="python image with required libraries"
)
def create_outlier(
    X_data: Input[Dataset],
    Y_target: Input[Dataset], 
    n_samples: int=400000, perc_outlier: int=0):
    import numpy as np
    np.random.seed(0)

    with open(X_data.path) as f:
        x_data = np.array(f.read())  # type numpy.ndarray, but it has lost its shape

    with open(Y_target.path) as f:
        y_target = np.array(f.read())  # type numpy.ndarray, but it has lost its shape, i.e. it is no longer an N-dimensional array

@dsl.pipeline(
    name="Credit Card Fraud Detection Pipeline",
    description="A pipeline for credit card fraud detection using OutlierVAE."
)
def credit_card_fraud_detection_pipeline(
    load_outlier_detector: bool = False,
    perc_outlier: int = 5
):  
    load_and_preprocess_data_op = load_and_preprocess_data(file_path=file_path)

    create_outlier(X_data=load_and_preprocess_data_op.outputs['X_data'], 
                   Y_target=load_and_preprocess_data_op.outputs['Y_target'])

What did you expect to happen:

In the create_outlier component, f.read() returns the dataset as a string, so extra code is needed to convert it back to an ndarray.
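
To make the symptom concrete, here is a minimal sketch that runs outside Kubeflow. It assumes the artifact file holds some text rendering of the array; the `text` variable is a hypothetical stand-in, since the report does not show what is actually written to the file. Wrapping a string in `np.array` produces a zero-dimensional string array and does not parse the numbers.

```python
import numpy as np

# Hypothetical stand-in for the text that f.read() returns from the artifact file.
text = str(np.arange(6).reshape(2, 3))

# np.array() does not parse the text; it wraps the whole string as one element.
wrapped = np.array(text)

print(wrapped.ndim)        # number of dimensions of the wrapped value
print(wrapped.shape)       # shape of the wrapped value
print(wrapped.dtype.kind)  # 'U' means a unicode string dtype
```

The original 2 x 3 layout and the numeric dtype are not recoverable from `wrapped` without extra parsing code, which matches the behaviour described above.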

The data type should stay the same from output to input, so that the data does not have to be reconstructed when another Kubeflow component reads it as an input.

Is there any way to store the output in ndarray format and retrieve it in the same format as an input?
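
One possible approach, sketched here outside Kubeflow, is to write the array to the artifact path in NumPy's binary `.npy` format and read it back with `np.load`, which keeps shape and dtype. The helper names `save_array` and `load_array` and the temporary path are hypothetical. In a pipeline this would probably mean that the producing component declares `Output[Dataset]` parameters and writes to their `.path` instead of returning the arrays, but that wiring is an assumption and is not shown.

```python
import os
import tempfile

import numpy as np


def save_array(path: str, array: np.ndarray) -> None:
    """Write the array in .npy format to exactly `path`."""
    # Passing an open file object stops np.save from appending ".npy" to the name.
    with open(path, "wb") as f:
        np.save(f, array)


def load_array(path: str) -> np.ndarray:
    """Read back an array written by save_array, with shape and dtype intact."""
    return np.load(path)


# Hypothetical stand-in for an artifact path such as X_data.path.
artifact_path = os.path.join(tempfile.mkdtemp(), "X_data")

original = np.arange(12, dtype=np.float64).reshape(4, 3)
save_array(artifact_path, original)
restored = load_array(artifact_path)

print(restored.shape, restored.dtype)
```

Opening the file in binary mode matters here: when `np.save` is given a plain string path without the `.npy` suffix it adds one, so the consumer would not find the file at the path it was given.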



@mayankagarwal19911 This issue is not related to Katib. Please re-open it in the Kubeflow Pipelines repository: https://github.com/kubeflow/pipelines/issues