The Chan Zuckerberg Biohub provides an amazing service to affiliated institutions: offering to run their sequencing jobs! You provide a library, and they provide a runfolder. But now you have to get the runfolder onto your computer.
With earlier generations of sequencing, runfolder sizes were measured in tens of gigabytes. But modern sequencing platforms are delivering runfolders measured in the terabytes. Successfully downloading that much data requires disk space, a stable platform (for example, not a laptop), and bandwidth. You may already have all of this, in the form of a compute environment or cluster!
This repository contains a script, `s3sync.sh`. The script is designed to run in a compute environment that uses SLURM for job scheduling.
- The script has minimal requirements: it is written in Bash, and only requires the AWS CLI.
- The script is fairly smart: instead of having to parse the seqbot `.sh` file yourself, you just copy-paste the entire thing when the script asks for it. The script will read the seqbot `.sh` file, and extract the information needed to do the download.
- The script is automatic: it initially asks for a four-hour runtime from SLURM. If that is not enough time, it will re-submit itself. Once the job is submitted, you only need to get involved when something breaks.
The script was originally written for use in the Stanford Research Computing Center's Sherlock computing environment, but it should work in other SLURM-based clusters. If this is of interest to you, read on for information on what is needed, and how to use it!
This script has just a few requirements.
To start, you need a compute environment! This script is written for use with the SLURM job scheduler. It will not work out-of-the-box with other schedulers, but could possibly be made to work with them (doing so is left as an exercise to the reader).
You will also need the AWS CLI installed. This also means that you will need some sort of Python installation.
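If you are unsure whether the AWS CLI is present, a quick check looks like this. The `pip` install shown is just one possible route, assuming Python 3 and pip are available; your site may provide the CLI some other way (such as a module):

```bash
# Check whether the AWS CLI is already installed and on your PATH.
aws --version

# If not, one user-level install route (assumes a working Python 3 with pip;
# your site may prefer a module or a system package instead).
python3 -m pip install --user awscli
```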
The script itself is written in Bash. The default Bash shell for your OS should be sufficient.
You should have received a `.sh` file from seqbot. It is a text file, and you will need it before you can start. Remember, these files are time-limited! Once a download has been made available to you, the access will expire after 36 hours.
To use the script, you must first ensure that the AWS and SLURM commands `aws`, `scontrol`, and `sbatch` are in your default path. You can use the `which` command to check this, like so:
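For example, a single `which` invocation can check all three commands at once. Each one should print a path (such as `/usr/bin/sbatch`); a missing command produces an error instead:

```bash
# Each of these should resolve to a path; an error means the command is not
# on your PATH yet (see below for one way to fix that).
which aws scontrol sbatch
```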
If the `which` command returns an error, then you will need to do something to make the command accessible. That might mean loading a module, or changing your `PATH` environment variable. Here is an example:
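What follows is only a sketch: the module name and the directory added to `PATH` are placeholders, since both depend entirely on how your cluster is set up:

```bash
# Option 1: load a site-provided module (the module name here is a placeholder;
# check `module avail` or `module spider` for what your cluster actually calls it).
module load aws-cli

# Option 2: put the directory containing the missing command onto your PATH
# (the directory below is only an example).
export PATH="$HOME/.local/bin:$PATH"
```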
Pick a place for the files to live. The script will download all of the files into your chosen directory, so you might need to make a new directory to hold the files. Also, make sure your chosen download location has enough room!
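For example, to make a fresh directory and confirm how much space is free on that filesystem (the path is a placeholder; use whatever scratch or project space your site provides):

```bash
# Make a dedicated directory for this runfolder (placeholder path).
mkdir -p /scratch/users/$USER/runfolder-download

# Check free space on that filesystem; a modern runfolder can need terabytes.
df -h /scratch/users/$USER/runfolder-download
```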
Use the `cd` command to move into the chosen directory, and then run the script:
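For example, assuming the repository was cloned to `~/s3sync` (adjust both paths to match your setup):

```bash
# Move into the download directory chosen above (placeholder path)...
cd /scratch/users/$USER/runfolder-download

# ...and run the script from wherever this repository was cloned.
~/s3sync/s3sync.sh
```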
When prompted, copy-paste the complete seqbot `.sh` file into the program. When done pasting, press the <Return> (or <Enter>) key, followed by an EOF character (which is <Control-D>). The script will parse the `.sh` file, extract the download instructions, check that they are valid, and submit the download job to SLURM!
At the end, you will get a job ID number. You can use the number to track the status of your job.
SLURM partition note: By default, the script will submit the job into whatever is your default SLURM partition. If you need to change that, you can add `sbatch` command-line options to the end of the script's command line. So, instead of running `s3sync.sh`, you could run `s3sync.sh -p special` to submit the job to the 'special' SLURM partition.
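For example (both the partition name and the email address are placeholders; any other `sbatch` option should pass through the same way, per the note above):

```bash
# Submit the download job to the 'special' partition...
~/s3sync/s3sync.sh -p special

# ...or append other sbatch options, such as where to send status emails.
~/s3sync/s3sync.sh -p special --mail-user=you@example.com
```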
Once the job begins, you will see files appearing in the directory where you initially ran the script. There will also be a `.out` file, which logs any messages generated by the download program (if there was nothing to log, the file will be empty).
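If you want to watch the download's progress, something like the following works (the exact name of the `.out` file depends on your SLURM output settings, so the wildcard is just a convenience):

```bash
ls -lh            # see which runfolder files have arrived so far
tail -f ./*.out   # follow the job's log messages; Ctrl-C stops watching
```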
After submission, you will (eventually) receive a number of emails from SLURM:
- The first email will have "Begin" in the subject line, telling you that your download has started. After this point, you should quickly start to see files appearing in your download directory.
- You may receive an email with "Queued" in the subject line. This tells you that the download exceeded the four-hour time allocation. The download has been paused, and the job re-queued for another four-hour allocation.
- The last email will have "TBD" in the subject line, telling you that your download is complete!
At any time, you can run the following commands to check your job's status:
- `squeue -j JOBID` (where `JOBID` is your batch job ID number) to see your job's status.
- `scontrol show job JOBID` to see details of the job, like the expected start time.
- If the job is complete (or failed), `sacct -j JOBID` will show you recorded statistics for the job (see the example after this list).
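For example, with a hypothetical job ID of `1234567`:

```bash
squeue -j 1234567          # current state: PD (pending), R (running), etc.
scontrol show job 1234567  # full job details, including the estimated start time
sacct -j 1234567           # accounting statistics, once the job has finished
```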
If the download takes so long that your four-hour time allocation is exceeded, SLURM will notify the script of its impending termination. The script will then immediately resubmit itself, asking for another four hours of time.
If the script does have to run multiple times, on each new execution the AWS CLI will check the already-downloaded files, and will only download files that are missing or incomplete.
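The script's name suggests the transfer is driven by `aws s3 sync`, which only copies objects that are missing locally or whose size or timestamp differs from the remote copy; the sketch below (with a placeholder bucket and prefix, not the real seqbot values) shows why a re-run resumes rather than restarts:

```bash
# On a repeat run, `aws s3 sync` compares each remote object against any local
# copy and skips files that already match, so only missing or incomplete files
# are transferred again. The bucket and prefix here are placeholders.
aws s3 sync s3://example-bucket/runfolders/191231_EXAMPLE/ .
```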
After your first successful run, if you plan on using the script in the future, you should go into the script and adjust the `#SBATCH --time` line. By default, this script requests four hours of runtime. Depending on your environment, that might be too short (meaning your job had to resubmit) or too long (your job completed well under its runtime limit). Either adjustment has its pros and cons.
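In the script, the relevant line looks something like this (a sketch; check your copy for the exact default), and can be raised or lowered as needed:

```bash
# Default request of four hours of runtime...
#SBATCH --time=4:00:00

# ...which could, for example, be raised to eight hours if your first run
# had to requeue itself partway through.
#SBATCH --time=8:00:00
```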
If your job completed quickly, reducing the requested runtime may allow your job to run sooner. Shorter jobs can normally be scheduled sooner, as SLURM fills the "holes" between larger jobs.
If your job took so long that it had to requeue itself, you should consider increasing the requested time for the job. The "pro" of this is that the `aws` command will not have to waste time checking over existing files (to build a fresh list of what to transfer). The "con" is that longer jobs often take longer to schedule.
The contents of this repository are © 2019 The Board of Trustees of the Leland Stanford Jr. University. It is made available under the MIT License.
Terminal captures were obtained using asciinema, and converted into animated GIF format by asciicast2gif.
Contributions are welcome, if they will fix bugs or improve the clarity of the scripts. If you would like to customize these scripts for your own environment, you should fork this repository, and then commit changes there. You should also update this README, particularly the Parameters and Customization section, to reflect the changes you made.