Authors: Isabell Kiral, Nathalie Willems, Benjamin Goudey
Tool for curation of UK Biobank data to generate cohorts. The tool can filter the main and associated datasets (e.g. general practitioner clinical data) based on search terms provided by the user. It can be used interactively through a web-based interface, or imported as a module and integrated into a broader pipeline. Additional functionality, such as automatically downloading large data files (bulk data), is also supported.
- Python 3.8
Installing using pip (or pip3):
$ pip3 install ukbcc
Or clone the repo:
$ git clone https://github.com/tool-bin/ukbcc.git
$ cd ukbcc
$ python3 setup.py install
NB: We strongly recommend using a virtual environment when installing this package and its dependencies. Please see this link for further information: https://docs.python.org/3/tutorial/venv.html
Run the following command in the terminal:
$ ukbcc
The above command sets up the web-based interface and prints the web address where it can be accessed.
Follow the instructions on the website to proceed with cohort generation.
There are two ways to use this module:
- Running the module from the command line and leveraging the web mode features to dynamically generate cohorts.
- Importing the module into an existing pipeline, and using the functions developed to interact with the UKBB databases.
There is more detailed information in our paper.
In order to make full use of this module, you will need to download the following files:
main_dataset.csv
: The main dataset as downloaded from UK Biobank. Please follow UKBB instructions to obtain this file. Utilities to download this file can be found here: http://biobank.ctsu.ox.ac.uk/crystal/download.cgi. Please note that the module assumes that the main dataset is generated as a CSV file.
Showcase_Data_Dictionary.csv
: This file encodes all the different datafields and their values within the UK Biobank Showcase. The file can be downloaded here: https://raw.githubusercontent.com/tool-bin/ukbcc/master/data_files/showcase.csv
Showcase_Codings.csv
: This file contains all the coding schemes used in the UK Biobank Showcase. This file can be downloaded here: https://raw.githubusercontent.com/tool-bin/ukbcc/master/data_files/codings.csv
readcodes.csv
: A file linking read codes to descriptions for the GP clinical data, available from the UKBB data portal. This file can be found in the data_files directory within this repo, or downloaded here: https://raw.githubusercontent.com/tool-bin/ukbcc/master/data_files/readcodes.csv
gp_clinical.txt
: The full general practitioner (GP) clinical data that forms part of the primary care dataset. The full table (gp_clinical) can be downloaded from the UKBB data portal website. Instructions to download this table are provided below.
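Before starting a session it can help to confirm that all of the files listed above are in place. The following is a minimal sketch, not part of the ukbcc package: the helper name `find_missing_files` and the idea of keeping every file in a single directory are assumptions made for illustration, and the file names should be adjusted if your downloads are named differently.

```python
from pathlib import Path

# File names as listed in this README. Keeping them all in one
# directory is an assumption made for this sketch.
REQUIRED_FILES = [
    "main_dataset.csv",
    "Showcase_Data_Dictionary.csv",
    "Showcase_Codings.csv",
    "readcodes.csv",
    "gp_clinical.txt",
]


def find_missing_files(data_dir):
    """Return the required file names that are not present in data_dir.

    Hypothetical helper for illustration; it is not provided by ukbcc.
    """
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

Calling `find_missing_files("path/to/data")` returns an empty list when every file is present, and otherwise the names that still need to be downloaded.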
The GP clinical dataset can be downloaded directly from the UK Biobank showcase website, through the Data Portal webpage. Instructions for how to download this file are provided below:
- Log into the UK Biobank showcase website (https://bbams.ndph.ox.ac.uk/ams/resApplications)
- Navigate to your Project by clicking the "Project" button on the left-hand side of the page
- Click on the "Data" tab on the right of the page
- Click on the "Go to Showcase" tab - this will take you to the UK Biobank Showcase landing page
- Click the "Data Portal" tab and click on the "Connect" button. Note: you need access permissions for the Data Portal webpage. If you do not see this button, you do not have access to this page.
- Click on the "Table Download" button
- Type "gp_clinical" into the search bar and click the "Fetch Table" button
- Click on the generated link. This will automatically start downloading the gp_clinical table as a tab-separated plain-text file.
- Provide the path and name of the GP clinical file to the main.py module in order to use this dataset within the web mode of the UKBCC module
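Once downloaded, the tab-separated table can also be inspected outside the tool. The sketch below streams the file with the standard library and keeps only rows that match a set of read codes. It is an illustration rather than ukbcc's own loader: the function name `iter_gp_rows`, the column names `eid`, `read_2` and `read_3`, and the default encoding are all assumptions, so check them against the header and encoding of your own download.

```python
import csv


def iter_gp_rows(path, read_codes, code_columns=("read_2", "read_3"),
                 encoding="utf-8"):
    """Yield rows of a tab-separated GP clinical file that match read_codes.

    Hypothetical helper. The column names and the encoding are assumptions
    about the downloaded table, not something ukbcc guarantees.
    """
    wanted = set(read_codes)
    with open(path, newline="", encoding=encoding) as handle:
        # DictReader takes the column names from the first line of the file.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Keep the row if any of the code columns holds a wanted code.
            if any(row.get(column) in wanted for column in code_columns):
                yield row
```

Because the function is a generator, the file is read one row at a time, which matters for a table of this size. For example, `{row["eid"] for row in iter_gp_rows("gp_clinical.txt", codes)}` collects the participant IDs that have at least one matching record.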
In order to use the web-based interface, please run the following command from the command line:
$ ukbcc
The above command sets up the web-based interface and prints the web address where it can be accessed.
Follow the instructions on the website to proceed with cohort generation.
NB: The web-based interface is built using Plotly Dash, which uses Flask to serve the web application. By default, Flask runs on the Werkzeug development server, which is not designed for production-level security or performance. Consequently, you will see the following warning when running this command:
"Warning: This is a development server. Do not use app.run_server in production, use a production WSGI server like gunicorn instead."
We recommend using a production WSGI server if you would like to run the UKBCC tool in a production environment. Popular choices include:
- gunicorn: https://gunicorn.org/
- uWSGI: https://uwsgi-docs.readthedocs.io/en/latest/
The ukbcc module uses dictionaries in order to represent the various datafield:code combinations and conditional logic to be applied in generating a cohort.
This dictionary will be automatically generated through the web mode.
Alternatively, the user can write this dictionary themselves, and use the query submodule to further interact with UKBB databases.
Further information about the expected structure of the dictionary is provided in the docstrings of the functions within this module.
We recommend using the web mode if you are using the ukbcc module for the first time.
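To make the idea of a dictionary of datafield:code combinations with conditional logic concrete, here is a small self-contained sketch. It does not reproduce the structure that ukbcc expects, which is documented in the docstrings of the query submodule: the keys `all_of`, `any_of` and `none_of`, the `(datafield, code)` tuple layout, the placeholder field names and the `matches` function are all hypothetical.

```python
# Hypothetical criteria dictionary. The keys, the tuple layout and the
# field/code values are placeholders, not the structure ukbcc requires.
criteria = {
    "all_of": [("FIELD_A", "1")],
    "any_of": [("FIELD_B", "10"), ("FIELD_B", "11")],
    "none_of": [("FIELD_C", "0")],
}


def matches(participant, criteria):
    """Return True if a participant satisfies the criteria.

    participant is a set of (datafield, code) pairs recorded for one person.
    """
    all_ok = all(pair in participant for pair in criteria.get("all_of", []))
    any_list = criteria.get("any_of", [])
    # An empty any_of list places no restriction on the participant.
    any_ok = not any_list or any(pair in participant for pair in any_list)
    none_ok = not any(pair in participant for pair in criteria.get("none_of", []))
    return all_ok and any_ok and none_ok
```

With these placeholder criteria, a participant recorded as `{("FIELD_A", "1"), ("FIELD_B", "11")}` is included, while adding `("FIELD_C", "0")` to the same participant excludes them.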
To learn how to use the modules in this package in your existing pipeline, see the example-module notebook in the examples directory of this repo.
As a collaborator, please make a branch and create a pull request when ready. To contribute otherwise, please fork the repository and create a pull request. GitHub issues are also welcome.
If you found this tool useful in your work, please use the following citation:
UKBCC: a cohort curation package for UK Biobank
Isabell Kiral, Nathalie Willems, Benjamin Goudey
bioRxiv 2020.07.12.199810; doi: https://doi.org/10.1101/2020.07.12.199810