bodywork-ml/bodywork-core

Passing data between Bodywork stages locally

mtszkw opened this issue · 2 comments

Hi guys,

I set up my own project to test Bodywork (started with pipelines). Now I have three simple stages:

  • downloading data (sklearn.datasets.load),
  • preprocessing (this one is a dummy stage, empty),
  • and training (also sklearn).

Step no. 1 downloads data and saves it to a file, which should then be used in step no. 3, i.e. training.
Unfortunately, step 3 cannot find the data (and neither can I). Logs below.

Is it because each stage is executed in its own container and everything gets wiped out after stage execution? Is there any recommended way of sharing data between stages so that I don't have to upload data to S3 and then download it back?

Code: https://github.com/mtszkw/turbo_waffle

PS. When executed manually (with python), step 1 does leave a data_files directory behind, so let's assume the script works as intended.

Thanks in advance!


---- pod logs for turbo-waffle--3-train-random-forest

2021-07-12 11:58:16,798 - INFO - stage_execution.run_stage - attempting to run stage=3_train_random_forest from main branch of repo at https://github.com/mtszkw/turbo_waffle
2021-07-12 11:58:16,801 - WARNING - git.download_project_code_from_repo - Not configured for use with private GitHub repos
2021-07-12 11:58:28.952 | INFO | __main__:_load_preprocessed_data:9 - Reading training data from /tmp/data_files/breast_cancer_data.npy and /tmp/data_files/breast_cancer_target.npy...
Traceback (most recent call last):
  File "train_random_forest.py", line 32, in <module>
    X_train, y_train = _load_preprocessed_data(X_train_full_path, y_train_full_path)
  File "train_random_forest.py", line 10, in _load_preprocessed_data
    X_train = np.load(X_train_full_path)
  File "/usr/local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/data_files/breast_cancer_data.npy'
2021-07-12 11:58:29,025 - ERROR - stage_execution.run_stage - Stage 3_train_random_forest failed - CalledProcessError(1, ['python', 'train_random_forest.py', '/tmp/data_files/breast_cancer_data.npy', '/tmp/data_files/breast_cancer_target.npy'])

Yo!

"Is it because each stage is executed in its own container and everything gets wiped out after stage execution?"

Yes.

I use S3 for storing interim results and we're currently working on a separate library of 'pipeline tools' to handle these sorts of tedious tasks.
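To make that concrete, here is a minimal sketch of the interim-results-via-S3 approach using plain boto3 (the bucket name and object keys are placeholders I've made up, and this is not an API that Bodywork itself ships):

```python
# Sketch: persist interim results to S3 between stages.
# Assumes boto3 is installed and the pods have AWS credentials;
# the bucket name and keys below are placeholders.
import io

import boto3
import numpy as np

BUCKET = "my-pipeline-bucket"  # hypothetical bucket


def save_array(array: np.ndarray, key: str) -> None:
    """Serialise a NumPy array in-memory and upload it to S3."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    buffer.seek(0)
    boto3.client("s3").upload_fileobj(buffer, BUCKET, key)


def load_array(key: str) -> np.ndarray:
    """Download an object from S3 and deserialise it as a NumPy array."""
    buffer = io.BytesIO()
    boto3.client("s3").download_fileobj(BUCKET, key, buffer)
    buffer.seek(0)
    return np.load(buffer)


# stage 1 (download) would call, e.g.:
#   save_array(X, "breast_cancer_data.npy")
# stage 3 (training) would call:
#   X_train = load_array("breast_cancer_data.npy")
```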

That said, you don't have to run each module in its own container. You could easily create a run.py module that is executed by a single stage; within run.py you are then free to run the other modules however you see fit.
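A hedged sketch of what such a run.py might look like, running the stage scripts in sequence inside one container so that files written to the local filesystem survive between steps (the script names here are assumptions based on the logs and repo layout):

```python
# Sketch: a single-stage orchestrator that runs the step scripts in order,
# so interim files on the local filesystem persist between them.
# Script names and arguments are assumptions, not taken from Bodywork.
import subprocess

DATA_PATH = "/tmp/data_files/breast_cancer_data.npy"
TARGET_PATH = "/tmp/data_files/breast_cancer_target.npy"

steps = [
    ["python", "download_data.py"],
    ["python", "preprocess_data.py"],
    ["python", "train_random_forest.py", DATA_PATH, TARGET_PATH],
]

for step in steps:
    # check=True stops the pipeline as soon as any step fails
    subprocess.run(step, check=True)
```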

Please let me know what you think - very keen to get your opinion(s) 🙂

My intention was to run each stage in a separate container, just like Kubeflow Pipelines does; that was the idea.

I'd prefer to keep that separation, so I will store interim results in S3 as you do. Thanks, Alex!