tensorflow/tfx

Enhance BigQueryExampleGen to handle Array columns of Arrays

AlexanderLavelle opened this issue · 8 comments

System information

  • TFX Version (you are using): master
  • Environment in which you plan to use the feature (e.g., Local
    (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc..): VertexAI Pipelines
  • Are you willing to contribute it (Yes/No): Maybe

Describe the feature and the current behavior/state.
Currently, the BigQueryExampleGen executor ingests BigQuery rows in a very basic way that appears limited and does not support array columns. By contrast, a Dataflow template from GCP looks more complete and potentially simple to adopt.

Will this change the current API? How?
No

Who will benefit with this feature?
Users with Array columns (i.e. SequenceExamples)

Do you have a workaround or are completely blocked by this? :
workaround

Name of your Organization (Optional)

Any Other info.

@AlexanderLavelle,

I saw this pretty old PR to add support for BigQuery arrays in BigQueryExampleGen. Looking at the current row_to_example function, I can still see the same changes.

BigQueryExampleGen should support array columns. Can you try it out and let us know if it doesn't work? Thank you!

@singhniraj08 I think you're correct. Do you know if either code path can handle an array column of arrays? That was the underlying intention of this issue.

@AlexanderLavelle,

Are you looking for array column support as shown in utils_test.py? This is part of the test case for the row_to_example function currently in the master branch, so the current code should support columns with array inputs.

If I am not correct, please share example data that you want supported. Thanks.

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

@singhniraj08

I would be looking for support of array of arrays:

Primary Key       | Array_of_Arrays
------------------|-----------------------
1                 | np.array([*range(10)])
                  | np.array([*range(10)])
                  | np.array([*range(10)])
-----------------------------------------
2                 | np.array([*range(10)])
                  | np.array([*range(10)])
                  | np.array([*range(10)])

If the Array_of_Arrays column only had one array component, this would be encoded by
tf.train.Feature(int64_list=tf.train.Int64List(value=value_list))

In the event that there are multiple arrays in the array column, each one needs the function above:

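import numpy as np
import tensorflow as tf

feature = {}

# Scalar key column: a single-element Int64List wrapped in a Feature.
feature['Primary Key'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[1]))

# Each inner array becomes one Int64List, and each Int64List becomes one Feature.
int64_lists = [tf.train.Int64List(value=np.array([*range(10)]).tolist()) for _ in range(3)]
int_to_features = [tf.train.Feature(int64_list=int64_list) for int64_list in int64_lists]
# The outer array becomes a FeatureList holding the per-array Features.
output_feature = tf.train.FeatureList(feature=int_to_features)
feature['Array_of_Arrays'] = output_feature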

In theory, you would cut the middle step of the array-to-outputs to avoid looping twice, but I left it for clarity.

Is this not the correct setup for this transformation?
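
To make the intended split concrete, here is a minimal sketch in plain Python, without TensorFlow, of how a row could be partitioned into flat columns and array-of-arrays columns. The names split_row and is_array_of_arrays are hypothetical and are not part of TFX. The two resulting dicts are meant to line up with the context and feature_lists fields of tf.train.SequenceExample; the row is assumed to arrive as a dict of Python scalars and lists.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def is_array_of_arrays(value: Any) -> bool:
    """True when value is a non-empty list whose elements are all lists."""
    return (
        isinstance(value, list)
        and bool(value)
        and all(isinstance(inner, list) for inner in value)
    )


def split_row(row: Row) -> Tuple[Dict[str, List[Any]], Dict[str, List[List[Any]]]]:
    """Hypothetical helper: separate flat columns from nested ones.

    Flat values (scalars and one-level arrays) go to `context`;
    arrays of arrays go to `feature_lists`, one inner list per step.
    """
    context: Dict[str, List[Any]] = {}
    feature_lists: Dict[str, List[List[Any]]] = {}
    for key, value in row.items():
        if is_array_of_arrays(value):
            feature_lists[key] = [list(inner) for inner in value]
        elif isinstance(value, list):
            context[key] = list(value)
        else:
            # Scalars are wrapped so every context entry is a list.
            context[key] = [value]
    return context, feature_lists


# Row 1 from the table above: three inner arrays of range(10).
row = {"Primary Key": 1, "Array_of_Arrays": [list(range(10)) for _ in range(3)]}
context, feature_lists = split_row(row)
```

Each inner list in feature_lists would then map to one tf.train.Feature, and each outer list to one tf.train.FeatureList, as in the snippet above.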

@AlexanderLavelle, if you are interested in contributing this feature, please go ahead and submit a PR for it. Thanks.

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

This issue was closed due to lack of activity after being marked stale for past 7 days.