Unable to keep the data type intact while passing data from one component to another
mayankagarwal19911 opened this issue · 1 comments
/kind bug
What steps did you take and what happened:
While running the notebook below, the required data type cannot be passed from the output of one component to the input of another component.
@dsl.component(
    base_image="python image with required libraries"
)
def load_and_preprocess_data(file_path: str) -> NamedTuple('Outputs', [
    ('X_data', Dataset),
    ('Y_target', Dataset),
]):
    import pandas as pd
    import seaborn as sns
    import numpy as np
    sns.set(style="whitegrid")
    np.random.seed(203)
    data = pd.read_csv(file_path)
    data["Time"] = data["Time"].apply(lambda x: x / 3600 % 24)
    X_data = data.drop(['Class'], axis=1).values  # type numpy.ndarray
    Y_target = data["Class"].values  # type numpy.ndarray
    from collections import namedtuple
    output = namedtuple('Outputs', ['X_data', 'Y_target'])
    return output(X_data, Y_target)
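One way to keep an ndarray's shape and dtype across the component boundary is to serialize it with np.save instead of letting it be written out as text. A minimal sketch outside KFP, where the hypothetical `artifact_path` stands in for what would be an `Output[Dataset]` artifact's `.path` inside the component body:

```python
import os
import tempfile
import numpy as np

# Stand-in for an Output[Dataset] artifact path (hypothetical name;
# inside a KFP component this would be e.g. x_data_out.path).
artifact_path = os.path.join(tempfile.mkdtemp(), "X_data.npy")

X_data = np.random.rand(100, 30)   # pretend preprocessed feature matrix
np.save(artifact_path, X_data)     # binary .npy format records shape and dtype

restored = np.load(artifact_path)  # round-trips as a true 2-D ndarray
print(restored.shape)              # (100, 30)
```

The .npy format stores the array's shape and dtype in its header, so the consumer side gets back the exact N-dimensional array rather than a flat text dump.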
@dsl.component(
    base_image="python image with required libraries"
)
def create_outlier(
    X_data: Input[Dataset],
    Y_target: Input[Dataset],
    n_samples: int = 400000,
    perc_outlier: int = 0,
):
    import numpy as np
    np.random.seed(0)
    with open(X_data.path) as f:
        x_data = np.array(f.read())  # type numpy.ndarray, but it has lost its shape
    with open(Y_target.path) as f:
        y_target = np.array(f.read())  # type numpy.ndarray, but it has lost its shape, i.e. it is no longer an N-dimensional array
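The shape is lost because np.array(f.read()) wraps the entire file contents as one Python string, producing a 0-dimensional string array. If the artifact really is a text dump, it has to be parsed back (e.g. with np.loadtxt) rather than wrapped. A small sketch of both behaviors, using a temp file in place of the artifact path:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "arr.txt")
arr = np.arange(6).reshape(2, 3)
np.savetxt(path, arr)              # text serialization loses the binary layout

with open(path) as f:
    lost = np.array(f.read())      # whole file becomes ONE string element
print(lost.shape)                  # () -- 0-d string array, shape gone

recovered = np.loadtxt(path)       # parse the text back into a 2-D float array
print(recovered.shape)             # (2, 3)
```

This is only a workaround for text artifacts; the cleaner fix is to write the array in a binary format (np.save / np.load) so no re-parsing is needed at all.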
@dsl.pipeline(
    name="Credit Card Fraud Detection Pipeline",
    description="A pipeline for credit card fraud detection using OutlierVAE."
)
def credit_card_fraud_detection_pipeline(
    load_outlier_detector: bool = False,
    perc_outlier: int = 5
):
    load_and_preprocess_data_op = load_and_preprocess_data(file_path=file_path)
    create_outlier(X_data=load_and_preprocess_data_op.outputs['X_data'],
                   Y_target=load_and_preprocess_data_op.outputs['Y_target'])
What did you expect to happen:
In the create_outlier component, f.read() returns the dataset as a string, so extra conversion code is needed to reconstruct the ndarray.
The data type should stay the same from output to input, so the data does not have to be re-constructed when it is read as an input in another Kubeflow component.
Is there any way to store the output in ndarray format and retrieve it as an input in the same format?
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
@mayankagarwal19911 Please re-open this issue in the Kubeflow Pipelines repository: https://github.com/kubeflow/pipelines/issues since this issue is not related to Katib.