sidhomj/DeepTCR

Can't load own Data using DTCR_SS.Get_Data

junyho486 opened this issue · 7 comments

TRB.txt
I have TCR-seq data that was annotated with IGB and preprocessed for DeepTCR as described in the tutorial.
I have 9 samples with many TCRs each; here is an excerpt of the data for one sample:

cdr3_aa	v_call	d_call	j_call	Count
ASSARQDLQQY	TRBV2*01	TRBD1*01	TRBJ2-7*01	39890
ASKDRALLRAV	TRBV21-1*01	TRBD1*01	TRBJ2-7*01	32323
ASSFSATNTGELF	TRBV5-1*01	TRBD2*01	TRBJ2-2*01	26637
ASSPGEQNTGELF	TRBV7-8*01	TRBD2*01	TRBJ2-2*01	26258
ASSGAGTGGYNEQF	TRBV12-3*01	TRBD1*01	TRBJ2-1*01	16692
ASSFSGHTGELF	TRBV7-2*01	TRBD2*01	TRBJ2-2*01	13838
ASSVETGTEKY	TRBV7-9*01	TRBD1*01	TRBJ2-3*01	13831
PPVIWTATSST	TRBV24-1*01	TRBD1*01	TRBJ2-7*01	13819
ASSSGLAGAYEQY	TRBV7-2*02	TRBD2*01	TRBJ2-7*01	13216
ASSFGVSGANVLT	TRBV7-9*03	TRBD2*01	TRBJ2-6*01	11449
ASSGLAGGPGTGELF	TRBV9*01	TRBD2*02	TRBJ2-2*01	11292
ASSPLAGGVAQF	TRBV7-6*01	TRBD2*02	TRBJ2-1*01	11019
ASSSTGQGNSYEQY	TRBV28*01	TRBD1*01	TRBJ2-7*01	10466

If I run the tutorial for supervised sequence classification using the example data from the repository, loading data, clustering, etc. work perfectly. The one exception is DTCR_SS.Train(), which throws:

AttributeError: 'DeepTCR_SS' object has no attribute 'test_pred'

DTCR_SS.Monte_Carlo_CrossVal, DTCR_SS.K_Fold_CrossVal, etc. work.

If I then replace the folders in Data/Murine_Antigens with my samples, DTCR_SS.Get_Data(), which usually takes just a moment to load the data, gets stuck (I stopped it after 40 min).

Restricting the input to TCRs with >= 1000 reads, which leaves tables of 50-80 rows, does not resolve the issue.
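For reference, the read-count filter described above could look roughly like this. It is only a minimal sketch, assuming a tab-separated file with a header row and a `Count` column as in the excerpt; the helper name `filter_by_count` is hypothetical and not part of DeepTCR, and the second data row in `sample` is made up for illustration.

```python
import csv
import io


def filter_by_count(tsv_text, min_count=1000, count_field="Count"):
    """Return (header fields, rows) keeping only rows with count >= min_count."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [row for row in reader if int(row[count_field]) >= min_count]
    return reader.fieldnames, kept


# First data row is taken from the excerpt; the second row is invented
# purely to show a clonotype falling below the threshold.
sample = (
    "cdr3_aa\tv_call\td_call\tj_call\tCount\n"
    "ASSARQDLQQY\tTRBV2*01\tTRBD1*01\tTRBJ2-7*01\t39890\n"
    "ASSEXAMPLE\tTRBV2*01\tTRBD1*01\tTRBJ2-7*01\t12\n"
)

fields, rows = filter_by_count(sample)
```

Here `rows` keeps only the first clonotype, and `fields` preserves the original column order so the result can be written back out unchanged.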

import sys
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_SS

# Instantiate training object
DTCR_SS = DeepTCR_SS('Tutorial')

#Load Data from directories
DTCR_SS.Get_Data(directory='../../Data/TRB',Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=4,v_beta_column=1,j_beta_column=3)

Output:

Loading Data ...

Is there anything that could cause this kind of Bug?

Attached you will find the data for one sample, restricted to TCR sequences with > 1000 reads (a .tsv file saved with a .txt extension).

Thank you in advance for your help!

That bug should be fixed now in v2.1.17.

As for the Get_Data method, it takes files that are csv/tsv. Just change the extension of the file and it should work.
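If many files need their extension changed, the rename can be scripted. The following is a minimal sketch using only the standard library; the helper name `rename_txt_to_tsv` is hypothetical, and it assumes every `.txt` file under the given directory is really tab-separated data.

```python
from pathlib import Path


def rename_txt_to_tsv(directory):
    """Rename every *.txt file under directory (recursively) to *.tsv."""
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(Path(directory).rglob("*.txt")):
        target = path.with_suffix(".tsv")
        if target.exists():
            continue  # never overwrite an existing .tsv file
        path.rename(target)
        renamed.append(target)
    return renamed
```

It could be called as `rename_txt_to_tsv('../../Data/TRB')`, using the directory from the original post. The function returns the list of new paths so the result can be checked before loading.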

Well, I did a clean install today, like so:

  1. conda create -n DEEPTCR python=3.8.0
  2. conda activate DEEPTCR
  3. pip3 install DeepTCR (also tried pip3 install git+https://github.com/sidhomj/DeepTCR.git; same bug)
  4. conda install ipykernel
  5. python -m ipykernel install --user --name DEEPTCR

The files are in .tsv format; I only changed the extension to .txt to upload them to GitHub.

Sorry, are both bugs still there, or just the latter?

Thank you for the quick response!

I installed DeepTCR several times today trying to find the bug. Currently I am running the stable installation and see both bugs.

Edit: I reinstalled into a new env using pip3 install git+https://github.com/sidhomj/DeepTCR.git and the first bug seems to be resolved, but the second one persists.
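When switching between the PyPI release and the GitHub install, it can help to confirm which version an environment actually has. Below is a minimal sketch using the standard-library `importlib.metadata` (available from Python 3.8); the helper names are hypothetical, and it assumes the version string is purely numeric and dotted, like 2.1.17.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a dotted numeric version such as '2.1.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def installed_at_least(package, minimum):
    """Return True or False, or None when the package is not installed."""
    try:
        found = version(package)
    except PackageNotFoundError:
        return None
    return parse_version(found) >= parse_version(minimum)


# Run inside the activated environment; None means DeepTCR is not installed there.
print(installed_at_least("DeepTCR", "2.1.17"))
```

Comparing tuples of integers avoids the string-comparison pitfall where "2.1.9" would sort after "2.1.17".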

The second bug is fixed. It was an issue with the expected order of columns in the files. I fixed the loading function so the order does not matter anymore. Let me know if it works now!
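For anyone who wants to double-check column positions on their own files, the indices passed to Get_Data can be derived from the header instead of being typed by hand. This is a minimal sketch and not the actual fix in DeepTCR; the helper name `column_indices` is hypothetical, and it assumes the file has a header row with the column names shown in the excerpt.

```python
import csv
import io


def column_indices(header_line, wanted, delimiter="\t"):
    """Map each wanted column name to its zero-based position in the header."""
    header = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    missing = [name for name in wanted if name not in header]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    return {name: header.index(name) for name in wanted}


header = "cdr3_aa\tv_call\td_call\tj_call\tCount"
idx = column_indices(header, ["cdr3_aa", "v_call", "j_call", "Count"])
# idx["cdr3_aa"], idx["v_call"], idx["j_call"] and idx["Count"] could then be
# passed as aa_column_beta, v_beta_column, j_beta_column and count_column.
```

For the header in the excerpt this gives positions 0, 1, 3 and 4, matching the values used in the Get_Data call in the original post.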

Thank you so much! I was struggling with this one all day...
Now both issues are resolved for the unsupervised and supervised model!

PS: Also, congrats on creating DeepTCR; it is a very impressive tool!