sidhomj/DeepTCR

The dataset used in the regression model

zguo235 opened this issue · 3 comments

Hello,

I checked the dataset used in the regression model. It seems that simply dropping duplicate TCRs does not reproduce it. Could you tell me where I can find the preprocessing details needed to obtain a dataset for the regression model?

Thanks!

Scripts to train the regression models can be found under ancillary_analysis/supervised/supervised_reg/ in the following files: mart1_train.py, flu_train.py, ebv_train.py.

The CSV file at Data/10x_Data/Data_Regression.csv already contains no duplicate alpha/beta pairs.

Thank you for your prompt response. I have an in-house dataset and I want to train the regression model on it. My dataset resembles the count matrix in the original 10x dataset, where each row holds the UMI counts for one cell. I checked the ancillary_analysis/supervised/supervised_reg/*_train.py files, but they do not describe the data preprocessing. How should I clean my dataset to get a file like Data/10x_Data/Data_Regression.csv to train the regression model?

Unfortunately, I am not able at this time to find the scripts I wrote to convert the 10x outputs to that csv file. But it should be rather simple to do with basic pandas functions.
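
For anyone arriving with the same question, here is a minimal sketch of one way this could be done with pandas. It is not the original conversion script, and it rests on several assumptions: the per-cell table has columns named `alpha` and `beta` holding the CDR3 sequences plus one UMI count column per antigen (`antigen_1` and `antigen_2` are hypothetical names), cells missing either chain are dropped, and counts from cells sharing the same alpha/beta pair are summed. Whether the published file used a sum, a mean, or some other rule is not stated in this thread, so treat the aggregation as a placeholder.

```python
import pandas as pd


def collapse_to_pairs(cells, count_cols, alpha_col="alpha", beta_col="beta", agg="sum"):
    """Collapse a per-cell table into one row per alpha/beta pair.

    Column names and the aggregation rule are assumptions made for
    illustration; they are not taken from the DeepTCR repository.
    """
    # Keep only cells where both chains were recovered.
    paired = cells.dropna(subset=[alpha_col, beta_col])
    # Aggregate the count columns across cells that share the same pair.
    return paired.groupby([alpha_col, beta_col], as_index=False)[list(count_cols)].agg(agg)


# Toy per-cell table (made-up sequences and counts, for illustration only).
cells = pd.DataFrame(
    {
        "alpha": ["CAVA", "CAVA", "CAVB", None],
        "beta": ["CASSX", "CASSX", "CASSY", "CASSZ"],
        "antigen_1": [3, 5, 0, 7],
        "antigen_2": [0, 1, 2, 4],
    }
)

result = collapse_to_pairs(cells, ["antigen_1", "antigen_2"])
print(result)

# To save the table, something like the following would work
# (the output filename is hypothetical):
# result.to_csv("my_regression_data.csv", index=False)
```

The resulting table has no duplicate alpha/beta pairs, which matches the property described above for Data_Regression.csv. Passing `agg="mean"` instead averages counts across cells of the same clonotype, which may be preferable if clone sizes vary widely.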