DBFE is a Python library with feature extraction methods that facilitate classifier learning from distributions of genomic variants.
To install dbfe, just execute:
pip install dbfe
Afterwards you can import dbfe and use all the classes and functions. To run all the tests and experiments you will need additional packages, which can be installed using the requirements.txt file found in this repository:
pip install -r requirements.txt
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import dbfe
# sample data: one row per variant, indexed by sample ID
stat_vals = pd.read_csv("../experiments/data/ovarian/ovarian_cnv.csv.gz", index_col='SAMPLEID')
# keep deletions only and collect each sample's variant lengths into a list
stat_vals = stat_vals.loc[stat_vals.SVCLASS == "DEL", :]
stat_vals = stat_vals.groupby(stat_vals.index)['LEN'].apply(list).to_frame()
labels = pd.read_csv("../experiments/data/ovarian/labels.tsv", sep='\t', index_col=0)
# encode the "RES" class as 1 and every other label as 0
labels = (labels == "RES") * 1
stat_df = stat_vals.join(labels.CLASS_LABEL, how='inner')
# splitting into training and testing data
X = stat_df.loc[:, "LEN"]
y = stat_df.loc[:, "CLASS_LABEL"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23, stratify=y)
# DBFE in a classification pipeline
extractor = dbfe.DistributionBasedFeatureExtractor(breakpoint_type='supervised', n_bins='auto', cv=10)
pipe = make_pipeline(extractor, StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
extractor.plot_data_with_breaks(X_train, y_train, plot_type='kde')
y_prob = pipe.predict_proba(X_test)
print("AUC on test data: {:.3}".format(roc_auc_score(y_test, y_prob[:, 1])))
More code examples can be found in the examples folder.
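To make the idea behind the pipeline above more concrete, here is a minimal sketch of turning per-sample lists of variant lengths into fixed-size count features. It is not dbfe's implementation: the helper names `bin_counts` and `to_feature_matrix` are hypothetical, and the breakpoints are hand-picked for illustration, whereas the extractor in the example above is configured to choose them from the training data.

```python
from bisect import bisect_right


def bin_counts(lengths, breakpoints):
    """Count how many values fall into each interval defined by sorted breakpoints.

    With k breakpoints there are k + 1 intervals; a value equal to a
    breakpoint is counted in the interval to its right.
    """
    counts = [0] * (len(breakpoints) + 1)
    for value in lengths:
        counts[bisect_right(breakpoints, value)] += 1
    return counts


def to_feature_matrix(samples, breakpoints):
    """Map each sample (a list of variant lengths) to one row of bin counts."""
    return [bin_counts(lengths, breakpoints) for lengths in samples]


# Toy data (made up): three samples with different numbers of variants.
samples = [
    [120, 450, 3_000, 85_000],
    [90, 110, 130, 2_500_000],
    [],  # a sample without variants still yields a full-length row
]
breakpoints = [1_000, 100_000]  # hand-picked for illustration only

features = to_feature_matrix(samples, breakpoints)
print(features)  # [[2, 2, 0], [3, 0, 1], [0, 0, 0]]
```

Each sample ends up with the same number of features regardless of how many variants it has, which is what allows a standard classifier such as logistic regression to be trained on the result.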
The repository contains reproducible experiment source code in the form of a Jupyter notebook and detailed results of the analyses discussed in "DBFE: Distribution-based feature extraction from copy number and structural variants in whole-genome data" by Piernik et al. To re-run the experiments or analyze the results go to the experiments folder. The code there is organized as follows:
- the root of the folder contains the experiment source code; to start the analysis run the experiments.ipynb notebook (remember to install requirements.txt beforehand);
- the data folder contains variant length and gene amplification datasets; there you will also find the get_variant_lengths.py utility script for extracting variant lengths from Manta, Strelka, Brass, and Ascat VCF files;
- the results folder contains plots and tabular results (in CSV format) corresponding to different parts of the analysis.
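For readers unfamiliar with how a variant length can be read from a VCF record, the following sketch shows one common approach based on the standard SVLEN and END INFO keys. It is not the get_variant_lengths.py script from this repository: the function names are hypothetical, the example records are made up, and individual callers may encode lengths differently, so treat this as an illustration under those assumptions.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag entries without '=' map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def variant_length(vcf_line):
    """Return the length of a structural variant, or None if it cannot be derived.

    Assumes a tab-separated VCF data line with POS in column 2 and INFO in
    column 8. SVLEN is preferred when present (its sign is dropped);
    otherwise the length is taken as END - POS.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    pos = int(fields[1])
    info = parse_info(fields[7])
    if "SVLEN" in info:
        return abs(int(info["SVLEN"].split(",")[0]))
    if "END" in info:
        return int(info["END"]) - pos
    return None


# Made-up records for illustration.
deletion = "chr1\t1000\tdel_1\tN\t<DEL>\t.\tPASS\tEND=1500;SVTYPE=DEL;SVLEN=-500"
duplication = "chr2\t200\tdup_1\tN\t<DUP>\t.\tPASS\tIMPRECISE;END=950;SVTYPE=DUP"
snv = "chr3\t10\t.\tA\tT\t.\tPASS\tDP=30"

print(variant_length(deletion))     # 500
print(variant_length(duplication))  # 750
print(variant_length(snv))          # None
```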
- This project is released under a permissive new BSD open source license (LICENSE-BSD3.txt) and is commercially usable. There is no warranty; not even for merchantability or fitness for a particular purpose.
- In addition, you may use, copy, modify and redistribute all artistic creative works (figures and images) included in this distribution under the directory according to the terms and conditions of the Creative Commons Attribution 4.0 International License. See the file LICENSE-CC-BY.txt for details. (Computer-generated graphics such as the plots produced by seaborn/matplotlib fall under the BSD license mentioned above).
If you use dbfe as part of your workflow in a scientific publication, please consider citing the associated paper:
@article{piernik_2022_dbfe,
author = {Piernik, Maciej and Brzezinski, Dariusz and Sztromwasser, Pawel and Pacewicz, Klaudia and
Majer-Burman, Weronika and Gniot, Michal and Sielski, Dawid and Bryzghalov, Oleksii and
Wozna, Alicja and Zawadzki, Pawel},
title = {DBFE: Distribution-based feature extraction from
copy number and structural variants in whole-genome data},
journal = {Bioinformatics},
year = 2022,
doi = {10.1093/bioinformatics/btac513}
}
The best way to ask questions is via the GitHub Discussions channel. If you encounter usage bugs, please don't hesitate to use GitHub's issue tracker directly.