insitro/redun

Staging files


Hi, I'm having trouble accessing files when running tasks in different containers (AWS Batch / local Docker).
For example, if one task outputs a file and a following task consumes that file, but the two tasks are not executed on the same host, a file-not-found error is raised.

I've looked through the docs, but the examples seem to use S3 paths, so reading from S3 paths works for all tasks. I'm wondering whether redun has any mechanism to automatically stage the files for each process?


Yes, redun has a mechanism for staging remote files (such as S3 or GCS) to/from local files. See these two examples:

    script(
        ...,  # shell command omitted in this excerpt
        # Stage the input file to a local file called `data`.
        inputs=[data.stage("data")],
        # Unstage the output files to our project directory.
        # The final return value of script() takes the shape of `outputs`, but
        # with each StagingFile replaced by `File(remote_path)`.
        outputs={
            "colors-counts": output.stage("color-counts.txt"),
            "log": log_file.stage("log.txt"),
        },
    )

    script(
        ...,  # alignment command omitted in this excerpt
        inputs=[
            reads1.stage("reads1.fastq.gz"),
            reads2.stage("reads2.fastq.gz"),
            # stage() with no argument defaults the local path to the
            # remote file's basename.
            genome_ref_file.stage(),
            [file.stage() for file in genome_ref_index.values()],
        ],
        outputs={
            "bam": File(output_bam_path).stage("sample.aln.sorted.settag.bam"),
            # File("-") refers to the command's stdout.
            "stdout": File("-"),
        },
    )

We also describe the file staging feature in the design docs:
https://github.com/insitro/redun/blob/main/docs/source/design.md#file-staging

You are correct that when running code in Docker containers, you will likely need to use file staging to pass data from one process to the next. Any local files will be lost when the Docker container terminates.
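
To make that concrete, here is a minimal sketch of two chained tasks that pass data through a remote file rather than the container's local disk. The bucket name `s3://my-bucket`, the paths, and the task names are placeholders, not from this thread:

    from redun import File, task

    redun_namespace = "staging_example"

    @task()
    def write_data(out_path: str) -> File:
        # Write directly to a remote path so the data survives
        # after the container that produced it terminates.
        out = File(out_path)
        with out.open("w") as fh:
            fh.write("red\nblue\nred\n")
        return out

    @task()
    def read_data(data: File) -> str:
        # A downstream task, possibly running on a different host,
        # reads the same remote file.
        with data.open("r") as fh:
            return fh.read()

    @task()
    def main() -> str:
        data = write_data("s3://my-bucket/tmp/data.txt")
        return read_data(data)

Combined with `File.stage()` inside `script()` tasks, as in the excerpts above, this pattern lets each container pull its inputs down and push its outputs back up without sharing a filesystem.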

Let me know if those links are helpful. I'm happy to answer any other questions you have.

@mattrasmus that's really useful. Thanks!