This Python package provides a series of tools to integrate and query genomics, transcriptomics, proteomics, and clinical data (a.k.a. multi-omics data). With scalable data-frame manipulation tools, OpenOmics simplifies the common coding tasks involved in preparing data for bioinformatics analysis.
OpenOmics at a glance
OpenOmics assists in the integration of heterogeneous multi-omics bioinformatics data. The library provides a Python API as well as an interactive Dash web interface. It features support for:
- Genomics, Transcriptomics, Proteomics, and Clinical data.
- Harmonization with 20+ popular annotation, interaction, and disease-association databases.
OpenOmics also has an efficient data pipeline that bridges the popular pandas data-manipulation library and Dask distributed processing to address the following use cases:
- A standard pipeline for dataset indexing, table joining, and querying that is transparent and customizable for end users.
- Efficient disk storage for large multi-omics datasets using the Parquet format.
- Multiple data types that support both interaction and sequence data and allow users to export to NetworkX graphs or downstream machine learning.
- An easy-to-use API that works with the external Galaxy tool interface or the built-in Dash web interface (WIP).
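As a rough illustration of the indexing, joining, and querying steps listed above, the following sketch uses plain pandas rather than the OpenOmics API. The tiny tables, gene names, and column names are made up for the example.

```python
import pandas as pd

# Hypothetical expression table: rows are samples, columns are genes.
expression = pd.DataFrame(
    {"GENE_A": [1.0, 2.0], "GENE_B": [0.0, 4.0]},
    index=pd.Index(["sample_1", "sample_2"], name="sample"),
)

# Hypothetical annotation table indexed by gene name.
annotations = pd.DataFrame(
    {"chromosome": ["chr1", "chr2"], "strand": ["+", "-"]},
    index=pd.Index(["GENE_A", "GENE_B"], name="gene_name"),
)

# Join: attach per-gene mean expression to the annotations via the shared gene index.
mean_expression = expression.mean(axis=0).rename("mean_expression")
gene_table = annotations.join(mean_expression)

# Query: select genes on the plus strand.
plus_strand = gene_table.query("strand == '+'")
print(plus_strand)
```

Keeping every table indexed by the same gene identifier is what makes the join a one-liner; the same pattern should carry over to Dask DataFrames for larger tables.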
pip install openomics
from openomics import MultiOmics
Import the TCGA LUAD data included in the test dataset (preprocessed from TCGA-Assembler). It is located at tests/data/TCGA_LUAD.
folder_path = "tests/data/TCGA_LUAD/"
Load the multi-omics data: gene expression, microRNA expression, lncRNA expression, somatic mutation, and protein expression.
from openomics import MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein
# Load each expression dataframe
mRNA = MessengerRNA(data=folder_path + "LUAD__geneExp.txt", transpose=True,
                    usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
miRNA = MicroRNA(data=folder_path + "LUAD__miRNAExp__RPM.txt", transpose=True,
                 usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
lncRNA = LncRNA(data=folder_path + "TCGA-rnaexpr.tsv", transpose=True,
                usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
som = SomaticMutation(data=folder_path + "LUAD__somaticMutation_geneLevel.txt",
                      transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
pro = Protein(data=folder_path + "protein_RPPA.txt", transpose=True,
              usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")
# Create an integrated MultiOmics dataset
luad_data = MultiOmics(cohort_name="LUAD")
luad_data.add_clinical_data(
    clinical_data=folder_path + "nationwidechildrens.org_clinical_patient_luad.txt")
luad_data.add_omic(mRNA)
luad_data.add_omic(miRNA)
luad_data.add_omic(lncRNA)
luad_data.add_omic(som)
luad_data.add_omic(pro)
luad_data.build_samples()
Each dataset is stored as a pandas DataFrame. Below are all the datasets imported for TCGA LUAD. For each one, the first number is the number of samples and the second is the number of features.
PATIENTS (522, 5)
SAMPLES (1160, 6)
DRUGS (461, 4)
MessengerRNA (576, 20472)
SomaticMutation (587, 21070)
MicroRNA (494, 1870)
LncRNA (546, 12727)
Protein (364, 154)
# Import GENCODE database (from URL)
from openomics.database import GENCODE
gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
                  file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                                  "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                                  "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
                                  "transcripts.fa": "gencode.v32.transcripts.fa.gz"},
                  remove_version_num=True,
                  npartitions=5)
# Annotate LncRNAs with GENCODE by gene_id
luad_data.LncRNA.annotate_genomics(gencode, index="gene_id",
                                   columns=['feature', 'start', 'end', 'strand', 'tag', 'havana_gene'])
luad_data.LncRNA.annotations.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13729 entries, ENSG00000082929 to ENSG00000284600
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 feature 13729 non-null object
1 start 13729 non-null object
2 end 13729 non-null object
3 strand 13729 non-null object
4 tag 13729 non-null object
5 havana_gene 13729 non-null object
dtypes: object(6)
memory usage: 1.4+ MB
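The `remove_version_num=True` option above suggests that version suffixes on Ensembl identifiers are dropped, which is consistent with the unversioned IDs (such as ENSG00000082929) in the annotation index. A minimal sketch of that kind of normalization follows, assuming GENCODE-style IDs of the form `ENSG00000082929.8`. The helper name and the version numbers are hypothetical, and this is not OpenOmics' own code.

```python
def strip_version(ensembl_id: str) -> str:
    """Drop a trailing '.<version>' from an Ensembl-style ID (hypothetical helper)."""
    base, sep, version = ensembl_id.rpartition(".")
    # Only strip when the part after the last dot is purely numeric.
    return base if sep and version.isdigit() else ensembl_id


# Version numbers below are made up for illustration.
ids = ["ENSG00000082929.8", "ENSG00000284600.1", "ENSG00000082929"]
print([strip_version(i) for i in ids])
```

Normalizing IDs this way lets annotation tables from different database releases be joined on the same gene index.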
Each multi-omics and clinical dataset can be accessed through luad_data.data[], for example:
luad_data.data["PATIENTS"]
bcr_patient_barcode (index) | bcr_patient_barcode | gender | race | histologic_subtype | pathologic_stage |
---|---|---|---|---|---|
TCGA-05-4244 | TCGA-05-4244 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage IV |
TCGA-05-4245 | TCGA-05-4245 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage III |
TCGA-05-4249 | TCGA-05-4249 | MALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage I |
TCGA-05-4250 | TCGA-05-4250 | FEMALE | NaN | Lung Adenocarcinoma- Not Otherwise Specified (... | Stage III |
TCGA-05-4382 | TCGA-05-4382 | MALE | NaN | Lung Adenocarcinoma Mixed Subtype | Stage I |
522 rows × 5 columns
luad_data.data["MessengerRNA"]
gene_name | A1BG | A1BG-AS1 | A1CF | A2M | A2ML1 | A4GALT | A4GNT | AAAS | AACS | AACSP1 | ... | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 | psiTPTE22 | tAKR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-05-4244-01A | 4.756500 | 5.239211 | 0.000000 | 13.265291 | 0.431997 | 7.043317 | 1.033652 | 9.348765 | 9.652057 | 0.763921 | ... | 5.350285 | 8.197321 | 9.907260 | 0.763921 | 10.088859 | 11.471139 | 9.768648 | 9.170597 | 2.932118 | 0.000000 |
TCGA-05-4249-01A | 6.920471 | 7.056843 | 0.402722 | 14.650247 | 1.383939 | 9.178805 | 0.717123 | 9.241537 | 9.967223 | 0.000000 | ... | 5.980428 | 8.950001 | 10.204971 | 4.411650 | 9.622978 | 11.199826 | 10.153700 | 9.433116 | 7.499637 | 0.000000 |
TCGA-05-4250-01A | 5.696542 | 6.136327 | 0.000000 | 14.048541 | 0.000000 | 8.481646 | 0.996244 | 9.203535 | 9.560412 | 0.733962 | ... | 5.931168 | 8.517334 | 9.722642 | 4.782796 | 8.895339 | 12.408981 | 10.194168 | 9.060342 | 2.867956 | 0.000000 |
TCGA-05-4382-01A | 7.198727 | 6.809804 | 0.000000 | 14.509730 | 2.532591 | 9.117559 | 1.657045 | 9.251035 | 10.078124 | 1.860883 | ... | 5.373036 | 8.441914 | 9.888267 | 6.041142 | 9.828389 | 12.725186 | 10.192589 | 9.376841 | 5.177029 | 0.000000 |
576 rows × 20472 columns
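The sample IDs in the expression table (for example TCGA-05-4244-01A) extend the patient barcodes in the PATIENTS table (TCGA-05-4244) with a sample suffix. The sketch below shows how sample rows could be linked back to clinical rows with plain pandas, assuming the patient barcode is the first three hyphen-separated fields of the sample barcode. The values are copied from the tables above; the variable names are illustrative and this is not the OpenOmics implementation.

```python
import pandas as pd

# A few clinical rows, indexed by patient barcode (values from the PATIENTS table above).
patients = pd.DataFrame(
    {"gender": ["MALE", "MALE", "FEMALE"],
     "pathologic_stage": ["Stage IV", "Stage I", "Stage III"]},
    index=pd.Index(["TCGA-05-4244", "TCGA-05-4249", "TCGA-05-4250"],
                   name="bcr_patient_barcode"),
)

# Sample barcodes as they appear in the MessengerRNA index.
samples = pd.DataFrame(index=pd.Index(
    ["TCGA-05-4244-01A", "TCGA-05-4249-01A", "TCGA-05-4250-01A"], name="sample"))

# Patient barcode = first three hyphen-separated fields of the sample barcode.
samples["bcr_patient_barcode"] = ["-".join(s.split("-")[:3]) for s in samples.index]

# Attach clinical attributes to each sample row.
sample_clinical = samples.join(patients, on="bcr_patient_barcode")
print(sample_clinical)
```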
luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])
Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
'TCGA-05-4427-01A',
...
'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
'TCGA-S2-AA1A-01A'],
dtype='object', length=465)
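Judging from its output, match_samples returns the sample IDs shared by the requested modalities. Conceptually this is an index intersection, which can be sketched with pandas as follows. The sample IDs are toy values and the code is an illustration, not necessarily how OpenOmics implements the method.

```python
from functools import reduce

import pandas as pd

# Toy sample indexes for two modalities (IDs are illustrative).
sample_ids = {
    "MicroRNA": pd.Index(["S1", "S2", "S3"]),
    "MessengerRNA": pd.Index(["S2", "S3", "S4"]),
}

# Keep only the samples present in every requested modality.
matched = reduce(lambda a, b: a.intersection(b), sample_ids.values())
print(matched)
```

Using reduce means the same expression works for any number of modalities.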
# Select only patients with pathologic stages "Stage I" and "Stage II"
X_multiomics, y = luad_data.load_dataframe(
    modalities=["MessengerRNA", "MicroRNA", "LncRNA"],
    target=['pathologic_stage'],
    pathologic_stages=['Stage I', 'Stage II'])
print(X_multiomics['MessengerRNA'].shape, X_multiomics['MicroRNA'].shape, X_multiomics['LncRNA'].shape, y.shape)
(336, 20472) (336, 1870) (336, 12727) (336, 1)
y
 | pathologic_stage |
---|---|
TCGA-05-4390-01A | Stage I |
TCGA-05-4405-01A | Stage I |
TCGA-05-4410-01A | Stage I |
TCGA-05-4417-01A | Stage I |
TCGA-05-4424-01A | Stage II |
TCGA-05-4427-01A | Stage II |
TCGA-05-4433-01A | Stage I |
TCGA-05-5423-01A | Stage II |
TCGA-05-5425-01A | Stage II |
TCGA-05-5428-01A | Stage II |
TCGA-05-5715-01A | Stage I |
TCGA-38-4631-01A | Stage I |
TCGA-38-7271-01A | Stage I |
TCGA-38-A44F-01A | Stage I |
TCGA-44-2655-11A | Stage I |
336 rows × 1 columns
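The stage filter applied by load_dataframe can be pictured as a boolean mask on the target column followed by aligning the feature tables to the surviving samples. A small pandas sketch under that assumption, with made-up sample IDs and values:

```python
import pandas as pd

# Toy target table indexed by sample ID (stage values are illustrative).
stages = pd.DataFrame(
    {"pathologic_stage": ["Stage I", "Stage III", "Stage II", "Stage IV"]},
    index=["S1", "S2", "S3", "S4"],
)
features = pd.DataFrame({"GENE_A": [1.0, 2.0, 3.0, 4.0]}, index=stages.index)

# Keep only samples whose stage is in the requested set, then align the features.
keep = stages["pathologic_stage"].isin(["Stage I", "Stage II"])
y_toy = stages[keep]
X_toy = features.loc[y_toy.index]
print(X_toy.shape, y_toy.shape)
```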
import numpy as np

def expression_val_transform(x):
    return np.log2(x + 1)

# Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map
X_multiomics['MessengerRNA'] = X_multiomics['MessengerRNA'].applymap(expression_val_transform)
X_multiomics['MicroRNA'] = X_multiomics['MicroRNA'].applymap(expression_val_transform)
# X_multiomics['LncRNA'] = X_multiomics['LncRNA'].applymap(expression_val_transform)
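The transform maps an expression value x to log2(x + 1), so zero expression stays at zero while large values are compressed. Because NumPy functions operate on whole DataFrames, the same result can be obtained without an element-wise Python call. A small check on made-up values chosen so that the results are whole numbers:

```python
import numpy as np
import pandas as pd

# Made-up expression values: x + 1 is a power of two in every cell.
toy = pd.DataFrame({"GENE_A": [0.0, 1.0], "GENE_B": [3.0, 7.0]}, index=["S1", "S2"])

# Vectorized log2(x + 1) over the whole table.
transformed = np.log2(toy + 1)
print(transformed)
```

On tables with tens of thousands of columns the vectorized form is typically much faster than applying a Python function per cell.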
from sklearn import preprocessing
from sklearn import metrics
from sklearn.svm import SVC, LinearSVC
import sklearn.linear_model
from sklearn.model_selection import train_test_split
binarizer = preprocessing.LabelEncoder()
binarizer.fit(y)
binarizer.transform(y)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])
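LabelEncoder assigns integer codes to the sorted unique labels, so "Stage I" becomes 0 and "Stage II" becomes 1, matching the first rows of y shown earlier. The mapping can be reproduced with NumPy alone, as in this sketch on a short made-up label sequence:

```python
import numpy as np

labels = np.array(["Stage I", "Stage I", "Stage II", "Stage I", "Stage II"])

# Sorted unique classes and the integer code of each label.
classes, codes = np.unique(labels, return_inverse=True)
print(classes)
print(codes)
```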
for omic in ["MessengerRNA", "MicroRNA"]:
    print(omic)
    # Center each feature (no variance scaling)
    scaler = sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=False)
    scaler.fit(X_multiomics[omic])

    X_train, X_test, Y_train, Y_test = train_test_split(
        X_multiomics[omic], y, test_size=0.3,
        random_state=np.random.randint(0, 10000), stratify=y)
    print(X_train.shape, X_test.shape)

    # Apply the same centering to both splits before fitting and predicting
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    model = LinearSVC(C=1e-2, penalty='l1', class_weight='balanced', dual=False, multi_class="ovr")
    # model = sklearn.linear_model.LogisticRegression(C=1e-0, penalty='l1', fit_intercept=False, class_weight="balanced")
    # model = SVC(C=1e0, kernel="rbf", class_weight="balanced", decision_function_shape="ovo")
    model.fit(X=X_train, y=Y_train)

    print("NONZERO", len(np.nonzero(model.coef_)[0]))
    print("Training accuracy", metrics.accuracy_score(y_true=Y_train, y_pred=model.predict(X_train)))
    print(metrics.classification_report(y_pred=model.predict(X_test), y_true=Y_test))
MessengerRNA
(254, 20472) (109, 20472)
NONZERO 0
Training accuracy 0.6929133858267716
precision recall f1-score support
Stage I 0.69 1.00 0.82 75
Stage II 0.00 0.00 0.00 34
avg / total 0.47 0.69 0.56 109
MicroRNA
(254, 1870) (109, 1870)
NONZERO 0
Training accuracy 0.6929133858267716
precision recall f1-score support
Stage I 0.69 1.00 0.82 75
Stage II 0.00 0.00 0.00 34
avg / total 0.47 0.69 0.56 109
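Both reports above look like the output of a constant classifier: with no nonzero coefficients (NONZERO 0), every test sample appears to be predicted as Stage I. The arithmetic can be checked from the support counts alone. This sketch assumes the "avg / total" row is a support-weighted average.

```python
# Support counts from the report above.
n_stage_1, n_stage_2 = 75, 34
n_total = n_stage_1 + n_stage_2

# A classifier that always predicts "Stage I":
precision_1 = n_stage_1 / n_total   # all 109 predictions are Stage I, 75 are correct
recall_1 = 1.0                      # every true Stage I sample is recovered
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Stage II is never predicted, so its precision, recall, and F1 are reported as 0.
precision_2 = recall_2 = f1_2 = 0.0


def weighted(a, b):
    """Support-weighted average of a Stage I and a Stage II score."""
    return (a * n_stage_1 + b * n_stage_2) / n_total


print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(weighted(precision_1, precision_2), 2),
      round(weighted(recall_1, recall_2), 2),
      round(weighted(f1_1, f1_2), 2))
```

The rounded values reproduce the 0.69 / 1.00 / 0.82 row and the 0.47 / 0.69 / 0.56 averages, so the reported accuracy reflects the class imbalance rather than a learned signal.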
This package was created with Cookiecutter (https://github.com/audreyr/cookiecutter) and the pyOpenSci/cookiecutter-pyopensci (https://github.com/pyOpenSci/cookiecutter-pyopensci) project template, based on audreyr/cookiecutter-pypackage (https://github.com/audreyr/cookiecutter-pypackage).