sidhomj/DeepTCR

Recommendations for handling large datasets

leeanapeters opened this issue · 1 comment

Hi, thank you for creating this great tool!

Could you offer some guidance on handling large datasets in the unsupervised workflow? The clustering and KNN classification steps in particular seem to be prohibitively memory-intensive.

I think that downsampling is hurting the classification accuracy, so I would like to use all of the data if possible.

Thanks so much for your help!

Leeana
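
For context on the memory question above: a brute-force KNN that holds every pairwise distance at once needs 8 × N² bytes for N sequences in float64, which is 80 GB at N = 100,000. Whether DeepTCR's KNN step works this way is an assumption here, not something confirmed in this thread. The sketch below is a generic illustration, not DeepTCR code, of a KNN vote that streams distances one at a time and keeps only the k smallest; `knn_predict`, `train_features`, and `train_labels` are hypothetical names, and `math.dist` requires Python 3.8 or later.

```python
import heapq
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=5):
    """Majority-vote label for one query vector.

    Distances are generated lazily, so at most k (distance, label)
    pairs are held in memory instead of a full N x N matrix.
    """
    # Generator expression: nothing is computed until nsmallest consumes it.
    distances = (
        (math.dist(query, features), label)
        for features, label in zip(train_features, train_labels)
    )
    # Keep only the k closest training points.
    nearest = heapq.nsmallest(k, distances, key=lambda pair: pair[0])
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy usage with made-up 2-D feature vectors (illustrative data only).
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train, labels, (0.2, 0.1), k=3))
print(knn_predict(train, labels, (10.5, 10.2), k=3))
```

This trades speed for memory: each query still scans all training points, so a tree-based or approximate nearest-neighbor index would likely be faster in practice.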

Hi, I am also using this tool with a large dataset (~150k sequences). The KNN classification produces an empty knn_seq.pkl and the error below. Have you encountered this error before? I suspect it may be an out-of-memory issue in the KNN step.


ValueError Traceback (most recent call last)
/tmp/ipykernel_15992/968723552.py in <module>
----> 1 DTCRU.KNN_Sequence_Classifier(metrics=['AUC'],plot_metrics=True,n_jobs=-1, Load_Prev_Data=True,by_class=True)

~/deeptcr/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in KNN_Sequence_Classifier(self, folds, k_values, rep, plot_metrics, by_class, plot_type, metrics, n_jobs, Load_Prev_Data)
2429 if plot_metrics is True:
2430 if by_class is True:
-> 2431 sns.catplot(data=df_out, x='Metric', y='Value', hue='Classes', kind=plot_type)
2432 else:
2433 sns.catplot(data=df_out, x='Metric', y='Value', kind=plot_type)

~/deeptcr/lib/python3.7/site-packages/seaborn/_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48

~/deeptcr/lib/python3.7/site-packages/seaborn/categorical.py in catplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, seed, order, hue_order, row_order, col_order, kind, height, aspect, orient, color, palette, legend, legend_out, sharex, sharey, margin_titles, facet_kws, **kwargs)
3801 # so we need to define palette to get default behavior for the
3802 # categorical functions
-> 3803 p.establish_colors(color, palette, 1)
3804 if kind != "point" or hue is not None:
3805 palette = p.colors

~/deeptcr/lib/python3.7/site-packages/seaborn/categorical.py in establish_colors(self, color, palette, saturation)
317 # Determine the gray color to use for the lines framing the plot
318 light_vals = [colorsys.rgb_to_hls(*c)[1] for c in rgb_colors]
--> 319 lum = min(light_vals) * .6
320 gray = mpl.colors.rgb2hex((lum, lum, lum))
321

ValueError: min() arg is an empty sequence
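
The final frame of the traceback is `min(light_vals)` called on an empty list, which means seaborn had no colors to work with. That is consistent with the plotting data being empty, and with the empty knn_seq.pkl reported above, though the thread does not confirm the root cause. A minimal sketch of a guard that fails earlier with a clearer message is shown below; `load_nonempty_pickle` is a hypothetical helper, not part of DeepTCR, and the location and internal structure of knn_seq.pkl are not assumed.

```python
import os
import pickle


def load_nonempty_pickle(path):
    """Load a pickle and fail early, with a clear message, if it holds no data."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is a zero-byte file")
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    # Objects without a length are passed through unchecked.
    if hasattr(obj, "__len__") and len(obj) == 0:
        raise ValueError(f"{path} unpickled to an empty object")
    return obj
```

Checking the saved results this way before calling the plotting step with Load_Prev_Data=True would separate "the KNN step produced nothing" from "the plot failed".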