Sequence not featurized
fbenedett opened this issue · 7 comments
Dear all,
I am trying to analyze a dataset but, for unknown reasons, some of the sequences are not considered.
My code is the follow
%%capture
import sys
import pandas as pd
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U
# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')
a_target="MA0"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)
#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
DTCRU.Cluster(clustering_method='phenograph', sample=500)
DFs = DTCRU.Cluster_DFs
r_df=pd.DataFrame()
for i in range(0, len(DFs)):
tdf=DFs[i]
tdf["cluster_index"]=i
r_df=r_df.append(tdf)
fn="result_clustering_MA0_clean.txt"
r_df.to_csv(fn)
In the directory "data_deep_tcr/MA0 I have two subfolders. Each of these subfolders contains one TSV file with the following format:
aminoAcid counts v_beta j_beta
CASTHLDPPGEQYFG 571795 hTRBV28 hTRBJ02-7
CASSPLGASGEQFFG 317906 hTRBV28 hTRBJ02-1
CASGGGEQFFG 104692 hTRBV12-3 hTRBJ02-1
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-1
CASSFFPFNEQFFG 74908 hTRBV12-3 hTRBJ02-1
For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not clustered.
Whereas, the same sequence with a different count, v_beta and j_beta is clustered:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-17
Any idea why is this happening?
So the latent representations that are learned are a joint representation of sequence + V/D/J gene usage if you supply this information. So the latent representations of those two sequences will actually be different because they use different V and J-genes. If you don't want to use V/D/J gene information, do not provide it to the model. The clustering algorithms provided in DeepTCR are "off the shelf" algorithms that I have not personally developed. I cannot speak to how they cluster data. One can always extract the learned features from the DeepTCR object after fitting the autoencoder and apply your own clustering algorithms in python. Learned features can be found after training under DTCR.features.
I made a mistake when I was explaining the problem.
The sequences are not clustered because they are not even featurized.
In the following case I provide a single file where I checked that there are no double entries
%%capture
import sys
import numpy as np
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U
import pandas as pd
# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')
a_target="MA0_alltogether"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)
Here I check the number of unique sequences
f=open("data_deep_tcr/MA0_alltogether/all_together.tsv", "r")
data=f.readlines()
f.close()
len(data)
27768
#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
len(DTCRU.features)
21910
About 6000 sequences are not featurized.
And this happens also if I try to load the sequences in other ways.
DeepTCR removes certain sequences. In general, if the sequence is greater than the max_length parameter defined when instatiating the object (default is 40) or the sequences uses non-IUPAC letters, it is removed. You can find this logic below.
def Process_Seq(df,col):
#Drop null values
df = df.dropna(subset=[col])
#strip any white space and remove non-IUPAC characters
df[col] = df[col].str.strip()
df = df[~df[col].str.contains(r'[^A-Z]')]
iupac_c = set((list(IUPAC.IUPACProtein.letters)))
all_c = set(''.join(list(df[col])))
searchfor = list(all_c.difference(iupac_c))
if len(searchfor) != 0:
df = df[~df[col].str.contains('|'.join(searchfor))]
return df
But something else is going on here...
For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not featurized.
Whereas, the same sequence with a different count, v_beta and j_beta is featurized:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-1
It can't be the sequence. it should be something regarding the count or the v_beta or j_beta.
But, if I understand correctly, v_beta and j_beta are just labels.
Please, reopen the issue. The fact that the same sequence with different v_beta j_beta is not featurize is still a problem.
I think I know what your issue is. If you pass the parameter (aggregate_by_aa = True), DeepTCR collapses all TCR's with the same amino acid sequence. In this case, the only one of the v/d/j labels is taken in this collapse. Try setting the parameter to aggregate_by_aa = False and it should keep both those sequences separate in the featurization. Let me know if this works.
Thank you. That was the problem.