sidhomj/DeepTCR

Sequence not featurized

fbenedett opened this issue · 7 comments

Dear all,

I am trying to analyze a dataset but, for unknown reasons, some of the sequences are not considered.

My code is the following:

%%capture
import sys
import pandas as pd
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

a_target="MA0"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
DTCRU.Cluster(clustering_method='phenograph', sample=500)
DFs = DTCRU.Cluster_DFs

r_df=pd.DataFrame()

for i in range(0, len(DFs)):
    tdf=DFs[i]
    tdf["cluster_index"]=i
    r_df=r_df.append(tdf)

fn="result_clustering_MA0_clean.txt" 
r_df.to_csv(fn)

In the directory "data_deep_tcr/MA0" I have two subfolders. Each of these subfolders contains one TSV file with the following format:

aminoAcid	counts	v_beta	j_beta
CASTHLDPPGEQYFG	571795	hTRBV28	hTRBJ02-7
CASSPLGASGEQFFG	317906	hTRBV28	hTRBJ02-1
CASGGGEQFFG	104692	hTRBV12-3	hTRBJ02-1
CANEGASENTEAFFG	86447	hTRBV06-1	hTRBJ01-1
CASSFFPFNEQFFG	74908	hTRBV12-3	hTRBJ02-1

For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not clustered.

Whereas, the same sequence with a different count, v_beta and j_beta is clustered:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-1

Any idea why this is happening?

The latent representations that are learned are a joint representation of sequence + V/D/J gene usage, if you supply this information. The latent representations of those two sequences will therefore be different, because they use different V and J genes. If you don't want to use V/D/J gene information, do not provide it to the model. The clustering algorithms provided in DeepTCR are "off the shelf" algorithms that I have not personally developed, so I cannot speak to how they cluster data. You can always extract the learned features from the DeepTCR object after fitting the autoencoder and apply your own clustering algorithms in Python. After training, the learned features can be found under DTCR.features.

I made a mistake when I was explaining the problem.
The sequences are not clustered because they are not even featurized.

In the following case I provide a single file, in which I checked that there are no duplicate entries:

%%capture
import sys
import numpy as np
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U
import pandas as pd

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

a_target="MA0_alltogether"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

Here I check the number of unique sequences

f=open("data_deep_tcr/MA0_alltogether/all_together.tsv", "r")
data=f.readlines()
f.close()
len(data)
27768
#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
len(DTCRU.features)
21910

About 6000 sequences are not featurized.
This also happens if I try to load the sequences in other ways.

DeepTCR removes certain sequences. In general, a sequence is removed if it is longer than the max_length parameter defined when instantiating the object (default is 40) or if it uses non-IUPAC letters. You can find this logic below.

def Process_Seq(df,col):
    #Drop null values
    df = df.dropna(subset=[col])

    #strip any white space and remove non-IUPAC characters
    df[col] = df[col].str.strip()
    df = df[~df[col].str.contains(r'[^A-Z]')]
    iupac_c = set((list(IUPAC.IUPACProtein.letters)))
    all_c = set(''.join(list(df[col])))
    searchfor = list(all_c.difference(iupac_c))
    if len(searchfor) != 0:
        df = df[~df[col].str.contains('|'.join(searchfor))]
    return df
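
To check by hand whether a particular sequence would be removed under the two rules described above, a minimal stdlib-only sketch could look like the following. It is not DeepTCR code: the helper name would_be_dropped is hypothetical, the alphabet is the standard 20-letter IUPAC protein alphabet, and max_length=40 is the default quoted above.

```python
# Standard 20-letter IUPAC protein alphabet (assumed to match what
# IUPAC.IUPACProtein.letters provides in the snippet above).
IUPAC_PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def would_be_dropped(seq, max_length=40):
    """Hypothetical helper: True if `seq` fails the length or alphabet rule."""
    if seq is None:
        # Null values are dropped before any other check.
        return True
    seq = seq.strip()
    if len(seq) > max_length:
        return True
    # Any character outside the IUPAC protein alphabet (including
    # lowercase letters, digits or symbols) disqualifies the sequence.
    return any(ch not in IUPAC_PROTEIN_LETTERS for ch in seq)


print(would_be_dropped("CANEGASENTEAFFG"))  # False
print(would_be_dropped("CASSB"))            # True, 'B' is not in the alphabet
```

A sequence that passes this check is not being filtered on length or alphabet, so a missing entry would have to be explained by some other step.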

But something else is going on here...
For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not featurized.

Whereas, the same sequence with a different count, v_beta and j_beta is featurized:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-1

It can't be the sequence. It must be something to do with the count, the v_beta or the j_beta.
But, if I understand correctly, v_beta and j_beta are just labels.

Please reopen the issue. The fact that the same sequence with a different v_beta and j_beta is not featurized is still a problem.

I think I know what your issue is. If you pass the parameter aggregate_by_aa=True, DeepTCR collapses all TCRs with the same amino acid sequence. In this case, only one of the V/D/J labels is kept in the collapse. Try setting the parameter to aggregate_by_aa=False and it should keep both of those sequences separate in the featurization. Let me know if this works.
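
The effect of collapsing by amino acid sequence can be illustrated outside DeepTCR with a small stdlib-only sketch. It is not the library's implementation: the helper collapse_by_aa is hypothetical, and both summing the counts and keeping the V/J labels of the highest-count row are assumptions made for illustration (the thread only says that one set of labels survives).

```python
import csv
import io

# The two rows from this thread that share an amino acid sequence,
# plus one unrelated row from the sample file.
TSV = (
    "aminoAcid\tcounts\tv_beta\tj_beta\n"
    "CANEGASENTEAFFG\t86447\thTRBV06-1\thTRBJ01-1\n"
    "CANEGASENTEAFFG\t73703\thTRBV06-8\thTRBJ01-1\n"
    "CASGGGEQFFG\t104692\thTRBV12-3\thTRBJ02-1\n"
)


def collapse_by_aa(rows):
    """Hypothetical helper: merge rows that share an amino acid sequence."""
    collapsed = {}
    best_count = {}
    for row in rows:
        seq = row["aminoAcid"]
        count = int(row["counts"])
        entry = collapsed.setdefault(
            seq, {"counts": 0, "v_beta": None, "j_beta": None}
        )
        # Assumption: counts of identical sequences are summed.
        entry["counts"] += count
        # Assumption: the labels of the highest-count row are kept.
        if count > best_count.get(seq, -1):
            best_count[seq] = count
            entry["v_beta"] = row["v_beta"]
            entry["j_beta"] = row["j_beta"]
    return collapsed


rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
collapsed = collapse_by_aa(rows)

print(len(rows), "rows ->", len(collapsed), "unique sequences")
print(collapsed["CANEGASENTEAFFG"])
```

Running the same row-versus-unique comparison on the real TSV file would show whether the gap between the 27768 lines read and the 21910 features is explained by repeated amino acid sequences.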

Thank you. That was the problem.