ukbm: A Shell repository from NeuroDataScience

UKB Manager

About

ukbm is a collection of code for facillitating data management of the UK Biobank bulk data. The code allows users to manage downloads by splitting by modality, limiting downloads to new data, and parallelize fetching. Data structure conversion to BIDS is supported for NIFTI data (e.g. T1 images). High-level SLURM submission scripts are included for deploying extraction and conversion scripts on computing clusters. Finally, SquashFS-related utilities are provided to help converting the data into SquashFS and handling common tasks (e.g. subject removal, patching). ukbm relies on the UKB's download utilities for fetching the data.

Data Fetching

Fetching from the UKB

Data fetching is done via transfertools/parallel_fetch.sh, which in turn relies on ukbfetch or gfetch. parallel_fetch splits a bulk download evenly into parallel processes. Assuming that the transfer speed is not saturated, the parallel download can speed up transfer by up to a factor of 10x. The code is called via:

bash parallel_fetch.sh [-n numjobs] [-s] [-f field_id] [-b bulkfile] [-p blocklist] fetch_utility keyfile

where:

Argument name/flag	Required? : Default	Description
-n, --numjobs NUM	No : 10	Number of parallel processes to start.
-s, --skip	No : 0	If set, check whether files are already present at output.
-f, --field FIELD	Maybe : None.	Required for genetics data; optional for bulk data. Limit fetching to a particular datafield.
-b, --bulkfile BULK	Maybe : -	Required if downloading via a bulkfile; not needed if downloading genetics data.
-p, --blocklist BLOCK	Maybe : -	Required if downlaoding genetics via by blocks; not needed otherwise. E.g. pvcf_blocks.txt
fetch_utility	Yes : -	Path to the UKB's fetch utility to use. Should be `ukbfetch` for bulk imaging data, `gfetch` for genetics
keyfile	Yes : -	Path the the keyfile supplied by the UKB once a basket is made available.

Example uses

bash parallel_fetch.sh -n 6 -f 20252_2_0 -b dataset.bulk -s ukbfetch k12345r000000.key  # Bulk data; downloads only 20252_2_0
bash parallel_fetch.sh -f 23156 -p pvcf_blocks.txt gfetch k12345r000000.key  # Exome pVCF file blocks
bash parallel_fetch.sh -f 22418 gfetch k12345r000000.key  # Genotype calls

Resuming

For resuming partially-completed downloads and avoiding previously-fetched data, we can use transfertools/reduce_bulkfile.py to reduce the initial bulkfile to only the missing data:

python reduce_bulkfile.py [-v] [-f FIELD] [-d DATA] bulkfile output

where:

Argument name/flag	Required? : Default	Description
-f, --field FIELD	No : None	Limit reduction to a particular datafield.
-v, --verbose	No : False	If True, print out how much the bulkfile was reduced by.
-d, --datadir DIR	No : ./	Directory where data was stored. Defaults to current directory.
bulkfile	Yes : -	bulkfile to reduce.
output	Yes : -	Name of the output reduced bulkfile.

Example

# First download:
bash parallel_fetch.sh -b dataset.bulk ukbfetch k12345r0000000.key
# [download gets interrupted]
python reduce_bulkfile.py -d [data_dir] dataset.bulk dataset_reduced.bulk
# resume using reduced bulkfile:
bash parallel_fetch.sh -b dataset_reduced.bulk ukbfetch k12345r000000.key

Data Conversion

Tabular

The tabular data is downloaded from the UKB as an encoded file. It must first be decoded, then converted into the desired format. We can do this using the wrapper functions in convert/tabular.py:

python tabular.py ukbfile [-a authkey] [-p ukbunpack] [-c ukbconv] [-o output] [-e encoding] [-f format_list]

where:

Argument name/flag	Required? : Default	Description
ukbfile	Yes : -	File to be processed. If `-p ukbunpack` is set, the encoded file is expected. Otherwise, the decoded file is expected.
-a, --authkey keyfile	Maybe : None	Required if `-p ukbunpack` is set. Authentication key provided by the UKB.
-p, --ukbunpack PATH	No : -	Path to the `ukbunpack` utility.
-c, --ukbconv PATH	No : -	Path to the `ukbconv` utility.
-o, --output NAME	No : tabular	Output filename prefix for converted data; file extension is determined by the data format.
-e, --encoding PATH	Maybe : ./encoding.ukb	Datafield encoding (encoding.ukb) supplied by the UKB.
-f, --format LIST	No : csv bulk r docs sas stata	Output format for converted data. Valid: csv, docs, sas, stata, r, lims, bulk, txt. For multiple formats, enter the formats as a space-delimited list. Must be the last supplied argument.

Example

python tabular.py dataset.enc -a k12345r000000.key -p ./ukbunpack -c ./ukbconv -o converted_output -e encoding.ukb -f bulk csv  # Do unpacking followed by conversion.
python tabular.py dataset.enc_ukb -c ./ukbconv -o converted_output -e encoding.ukb -f bulk  # Only do conversion.
python tabular.py dataset.enc -a k12345r000000.key -p ./ukbunpack  # Only do unpacking

Conversion to BIDS

The bulk imaging data is supplied as NIFTI files in UKB's custom format. We can convert the raw data into BIDS using convert/bids.py:

python bids.py --zip_filepath ZIP [--raw_out DIR] [--source_out DIR] [--derivatives_out DIR] [--zip_filelist FILE]

where:

Argument name/flag	Required? : Default	Description
--zip_filepath FILE	Maybe : -	Zip file for data to convert to BIDS. Not required if `--zip_filelist` is defined
--raw_out DIR	No : -	Output directory for raw data. Ignored if undefined.
--source_out DIR	No : -	Output directory for source data. Ignored if undefined.
--derivatives_out DIR	No : -	Output directory for derivatives data. Ignored if undefined.
--zip_filelist FILE	Maybe : -	Text file containing a newline-delimited list of zip files to convert.

Example

python bids.py --zip_filepath 123456_20252_2_0.zip --raw_out t1/raw/ --derivatives_out t1/derivs/
python bids.py --zip_filelist datalist.txt --raw_out t1/raw/

SLURM Support

For clusters supporting SLURM, some batch submission scripts are provided for deploying to multiple workers. Some modification may be required to fit your specific server (e.g., removing 'module' and replacing it with the corresponding structure). The code was developed for Compute Canada clusters, which uses Lmod to manage the software environment; if your cluster uses the same tools, they should work as described here.

Conversion to BIDS

Distributed conversion to BIDS can be done using slurm/convert_bids.sh, which simply needs to be pointed to a directory of .zip files. Its interface is similar to convert/bids.py, but instead assumes that you want to convert everything in a particular directory.

sbatch --account=ACCOUNT convert_bids.sh [--raw_out DIR] [--source_out DIR] [--derivatives_out DIR] zipdir

The inputs are the same as with bids.py. zipdir is simply the path to the directory containing the .zip files to be converted to BIDS. The default SBATCH settings use an array of 4 workers; you can increase the number of workers, but using more than 4 workers can cause issues when reading/writing from the same disk spaces. We recommend keeping it at 4 workers and waiting a little longer.

Creating SquashFS images

Before beginning this step make sure that the permissions for all directories and files in the bids-ified data has proper permissions for reading once the squashfs filesystem is mounted. For example, do a chmod -R 555 DIR on the resulting directory from the previous step.

SquashFS is a read-only filesystem image that is useful for limiting the inode footprint associated with the UKB, and has the side-benefit of making data access significantly faster (see publication for more information). For a walkthrough for how to use SquashFS combined with Singularity, see the Neurohub Wiki. We provide a SLURM batch script for creating multiple SquashFS images from a data directory, slurm/squashdir.sh:

sbatch --account=ACCOUNT --array=0-5 squashdir.sh [-d DEPTH] [-f FORMAT] [-n NUMIM] [--dry] directory_to_squash

where:

Argument name/flag	Required? : Default	Description
-d, --depth DEPTH	No : 0	Depth of directory to use for splitting. E.g. /neurohub/ukbb/imaging/sub-* would require `-d 3` to split across subjects. This allows for leading directories for when the SquashFS images are overlaid.
-f, --format FORMAT	No : 'neurohub_ukbb_data_${SLURM_ARRAY_TASK_ID}'	Formatting string to use for naming output. It should change with '${SLURM_ARRAY_TASK_ID}' to prevent collisions. NOTE: input should be specified using single quotes to avoid variable expansion.
-n, --numim NUM	No : ${SLURM_ARRAY_TASK_COUNT}	Only used when a subset needs to be re-squashed. Number of images that would be produced. A job that was previously-submitted with --array=0-3 would produce 4 images. If you need to re-squash image 2, you would use: sbatch --array=2 [...] squashdir.sh -n 4 [...].
--dry	No : -	If set, will not squash but will instead print the list of files that would be excluded by each worker.
directory_to_squash	Yes : -	Directory contaiing the data that needs to be squashed.

When UKB subjects signal that they wish to be removed from the UKB, the SquashFS images need to be re-squashed. This can be done be done easily via slurm/remove_subject.sh, which will check specified SquashFS files against the UKB-provided subject withdrawal list and only submit SLURM jobs for images that have at least one withdrawn subject.

Example:

# Split the data into 6 SquashFS images with the specified output name:
sbatch --account=rpp-account-aa --array=0-5 $(basename $0) -d 3 -f 'neurohub_ukbb_rfmri_ses2_${SLURM_ARRAY_TASK_ID}_bids.squashfs' data/

Modality-specific extraction from DICOMs

Some modalities (DWI, rfMRI) need their raw data to be extracted from their DICOMs due to missing information or errors in the original conversion from DICOM to NIFTI. slurm/extract_dwi_dicom.sh performs DICOM -> BIDS conversion for the DWI data. If the DICOMs are unavailable, you may still need to fix the DWI .bvec and .bval files. The files are tab-delimited, while BIDS expects space-delimited files; dwi_fix_bv.sh can do that for you.
Similarly, some of the rfMRI files have either incorrect values in their .json files or the .json files are missing. slurm/extract_rfmri_json.sh extracts the .json files for each subject using dcm2niix and puts it in the expected BIDS-compliant path.

Feedback / Issues

Feedback is welcome via the issues tab on GitHub.

neurodatascience/ukbm

UKB Manager

About

Data Fetching

Fetching from the UKB

Example uses

Resuming

Example

Data Conversion

Tabular

Example

Conversion to BIDS

Example

SLURM Support

Conversion to BIDS

Creating SquashFS images

Example:

Modality-specific extraction from DICOMs

Feedback / Issues