Supervised learning train error: need at least one array to concatenate

Question

hejing3283 opened this issue 5 years ago · 12 comments

I am running a test using my own data.
After loading the data successfully, I got an error when training:
# Load data from directories
DTCR_WF.Get_Data(directory='data_test/',
                 Load_Prev_Data=False,
                 aggregate_by_aa=True,
                 aa_column_beta=1, v_beta_column=3, d_beta_column=4, j_beta_column=5,
                 count_column=6, n_jobs=2, sep=",")
DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
DTCR_WF.Train()
error msg start ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
----> 2 DTCR_WF.Train()
3
4 # DTCR_WF.Monte_Carlo_CrossVal(folds=5,test_size=0.3,stop_criterion=0.25,epochs_min=100,
5 # suppress_output = False)

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3148
3149 valid_loss, valid_accuracy, valid_predicted, valid_auc =
-> 3150 Run_Graph_WF(self.valid, sess, self, GO, batch_size, random=False, train=False)
3151
3152

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, random, train, drop_out_rate)
390 loss = np.mean(loss)
391 accuracy = np.mean(accuracy)
--> 392 predicted_out = np.vstack(predicted_list)
393 try:
394 auc = roc_auc_score(set[-1], predicted_out)

~/anaconda3/envs/dl/lib/python3.7/site-packages/numpy/core/shape_base.py in vstack(tup)
281 """
282 _warn_for_nonsequence(tup)
--> 283 return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
284
285

ValueError: need at least one array to concatenate
End of error msg ------------------------------------
The directory structure is as follows:

data_test/
├── A
│   ├── A_1.csv
│   └── A_2.csv
├── B
│   ├── B_1.csv
│   └── B_2.csv
├── C
│   ├── C_1.csv
│   └── C_2.csv
└── D
    ├── D_1.csv
    └── D_2.csv

Each CSV file contains beta-chain information:
AAACCTGCAGGCTGAA-1,CASSIRDTETLYF,498,TRBV16,TRBD1,TRBJ2-3,1
AAACGGGAGGGTGTGT-1,CASGEGQTNSDYTF,568,TRBV13-2,TRBD1,TRBJ1-2,5
AAACGGGGTCTTCAAG-1,CASSGQNQDTQYF,503,TRBV15,TRBD1,TRBJ2-5,1
AAACGGGTCTAACTGG-1,CASSLGWHSYEQYF,572,TRBV16,None,TRBJ2-7,3
AAAGATGAGAATTGTG-1,CASGPGQSNTEVFF,527,TRBV13-2,TRBD1,TRBJ1-1,7
AAAGCAATCTGGCGAC-1,CASSDGLGGLEQYF,481,TRBV13-1,TRBD2,TRBJ2-7,7
AAATGCCCAATCCAAC-1,CAWVDWAQNTLYF,544,TRBV31,TRBD2,TRBJ2-4,3
AAATGCCTCGGCTTGG-1,CSAQGAHTEVFF,566,TRBV1,TRBD1,TRBJ1-1,18
AACACGTGTATAATGG-1,CASSSPLAGQDTQYF,519,TRBV3,None,TRBJ2-5,1
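
For reference, the column arguments passed to Get_Data above line up with these rows if they are read as zero-based indices. The sketch below is an illustration in plain Python, not DeepTCR code; the dictionary keys and the parse_row helper are hypothetical names, and the zero-based reading is an assumption that happens to match the sample data.

```python
import csv
import io

# First sample row from the post above.
SAMPLE = "AAACCTGCAGGCTGAA-1,CASSIRDTETLYF,498,TRBV16,TRBD1,TRBJ2-3,1\n"

# Column indices as passed to Get_Data, assumed here to be zero-based.
COLUMNS = {"aa_beta": 1, "v_beta": 3, "d_beta": 4, "j_beta": 5, "count": 6}


def parse_row(row):
    """Pick the configured columns out of one CSV row (hypothetical helper)."""
    record = {name: row[index] for name, index in COLUMNS.items()}
    record["count"] = int(record["count"])  # counts arrive as text
    return record


rows = list(csv.reader(io.StringIO(SAMPLE), delimiter=","))
record = parse_row(rows[0])
print(record)
```

Under that reading, column 1 is the CDR3 amino-acid sequence, columns 3 to 5 are the V, D, and J genes, and column 6 is the count.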

Number of records for each input file :
808 data_test/A/A_1.csv
1920 data_test/A/A_2.csv
2163 data_test/B/B_1.csv
1879 data_test/B/B_2.csv
836 data_test/C/C_1.csv
1182 data_test/C/C_2.csv
1705 data_test/D/D_1.csv
2091 data_test/D/D_2.csv

Answer 1 · 2019-04-23T14:26:15.000Z

I think the problem is that your data set has only 8 samples, so the test size is too small. The test_size fraction is the share of samples held out for the validation and test sets combined. With it set to 0.2, that is 1.6 samples split across 2 sets, which cannot work. In this case I would recommend training with LOO=1, where one sample is used for validation and one for the test set. You can set this parameter in either Monte Carlo or k-fold cross-validation. Let me know if this was the issue and I'll write something into the code to catch this case and alert the user.
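
The arithmetic can be sketched as follows. This is only an illustration of the kind of check described above, not DeepTCR's actual code: the rounding rule (truncate, then share the held-out samples between the two sets) and the function names are assumptions. An empty validation set is what leads to the traceback in the question, because np.vstack raises "need at least one array to concatenate" when it is given an empty list.

```python
def held_out_sizes(n_samples, test_size):
    """Return (n_valid, n_test) under an assumed rounding rule.

    Assumption: the held-out count is truncated to an integer and then
    shared between the validation and test sets.
    """
    n_held_out = int(n_samples * test_size)
    n_valid = n_held_out // 2
    n_test = n_held_out - n_valid
    return n_valid, n_test


def check_split(n_samples, test_size):
    """Raise a readable error instead of failing later on an empty set."""
    n_valid, n_test = held_out_sizes(n_samples, test_size)
    if n_valid == 0 or n_test == 0:
        raise ValueError(
            f"test_size={test_size} with {n_samples} samples leaves "
            f"{n_valid} validation and {n_test} test samples; "
            "use a larger test_size or leave-one-out."
        )
    return n_valid, n_test


for fraction in (0.2, 0.5):
    try:
        print(fraction, check_split(8, fraction))
    except ValueError as err:
        print(fraction, "rejected:", err)
```

With 8 samples, a fraction of 0.2 is rejected under this rule, while 0.5 leaves 2 samples for each held-out set.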

Answer 2 · 2019-04-23T15:00:39.000Z

Thanks for the explanation. I realized that and tried with more data: each label now has 8 samples, and I changed test_size to 0.5, which leaves 2 samples each for validation and test. Now I am getting a new error:

err msg start-----------
Traceback (most recent call last):
File "run_deepTCR_1_main.py", line 84, in
DTCR_WF.Train()
File "/Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py", line 3164, in Train
TypeError: unsupported format string passed to list.__format__
err msg end ------------------------------

Answer 3 · 2019-04-23T15:07:06.000Z

It seems that something incorrect is being passed to the print statement in the output statistics. You said each folder now has 8 CSV files?
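
For context, this TypeError is what Python raises when a list reaches a numeric format specifier. The snippet below reproduces the message in isolation; format_metric is a hypothetical helper, and the actual print statement in DeepTCR.py is not shown in this thread.

```python
def format_metric(value):
    """Format a single number with five decimal places (hypothetical helper)."""
    return "{:.5f}".format(value)


print(format_metric(0.91234567))  # a float formats without trouble

try:
    format_metric([0.9, 0.8])  # a list cannot take a float format spec
except TypeError as err:
    message = str(err)
    print(message)
```

A metric that comes back as a list rather than a scalar would produce exactly this failure at print time.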

Answer 4 · 2019-04-23T15:23:39.000Z

Yes. I added more samples. Now each folder has 8 .csv files, in the same format as before.

Answer 5 · 2019-04-23T15:27:23.000Z

If you send your data, or a part of it, to my email, I might be able to better assess the issue you are having: jsidhom1@jhmi.edu

Answer 6 · 2019-04-23T15:28:59.000Z

I would also recommend trying this and seeing if it works after you load the data.

DTCR_WF.Monte_Carlo_CrossVal(folds=5,LOO=1)
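
As a rough picture of what a Monte Carlo split with leave-one-out sizes looks like, here is a generic sketch in plain Python. It is not DeepTCR's implementation: the function name, the seed argument, and the choice to draw the held-out samples from the whole pool (rather than per class) are all assumptions made for illustration.

```python
import random


def monte_carlo_loo_splits(n_samples, folds, loo, seed=0):
    """Return `folds` random (train, valid, test) index splits.

    Each split holds out `loo` samples for validation and `loo` for
    testing; everything else is used for training.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(folds):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)
        valid = shuffled[:loo]
        test = shuffled[loo:2 * loo]
        train = shuffled[2 * loo:]
        splits.append((train, valid, test))
    return splits


splits = monte_carlo_loo_splits(n_samples=8, folds=5, loo=1)
for train, valid, test in splits:
    print(len(train), len(valid), len(test))
```

Because the held-out sizes are fixed counts rather than fractions, neither set can come out empty as long as there are more than 2 * loo samples.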

Answer 7 · 2019-04-23T15:31:54.000Z

Thanks in advance for your help! I am sending you 2 of the 4 directories.

Answer 8 · 2019-04-23T15:36:35.000Z

I just tried MCCV and got a similar error:

err msg start---------------------------------------------------------------
Traceback (most recent call last):
File "run_deepTCR_1_main.py", line 85, in
DTCR_WF.Monte_Carlo_CrossVal(folds=5,LOO=1)
File "/Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py", line 3373, in Monte_Carlo_CrossVal
File "/Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py", line 3164, in Train
TypeError: unsupported format string passed to list.__format__
err msg end---------------------------------

Also, I was getting warning messages saying that some TensorFlow functions are deprecated. I am not sure whether this is related.

warning msg start------------------------------------------------------------
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/Layers.py:98: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/Layers.py:99: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/Layers.py:102: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py:3098: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /Users/jing.he1/anaconda3/envs/dl/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-04-23 11:32:31.648901: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
warning msg end------------------------------------------------------------

Answer 9 · 2019-04-23T15:41:32.000Z

I just ran the following code and it worked fine.

The TensorFlow deprecation warnings are normal. I will eventually need to update the code for TensorFlow 2.0, but for now it should work fine.

Answer 10 · 2019-04-23T16:12:29.000Z

The only difference on my side is the order of the Get_Data parameters, but I don't think the call is sensitive to their order.
I changed it anyway, used the same script as you did, uninstalled and reinstalled the package, and it works now!

Thanks so much!! Much appreciated!
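
On the question of parameter order: arguments passed by keyword in Python are matched by name, so their order in the call does not matter. A minimal illustration, using a hypothetical stand-in function rather than the real Get_Data:

```python
def get_data(directory, sep="\t", n_jobs=1, count_column=None):
    """Hypothetical stand-in for a loader; it only echoes its arguments."""
    return {
        "directory": directory,
        "sep": sep,
        "n_jobs": n_jobs,
        "count_column": count_column,
    }


# The same keyword arguments in two different orders.
first = get_data(directory="data_test/", count_column=6, n_jobs=2, sep=",")
second = get_data(sep=",", n_jobs=2, directory="data_test/", count_column=6)
print(first == second)
```

Both calls produce the same result, since each value is bound to its parameter by name.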

Answer 11 · 2019-04-23T16:18:27.000Z

Awesome! I just made some final updates. I would recommend re-installing the latest version, 1.2.17.

Thanks!

Answer 12 · 2019-04-23T16:49:08.000Z

Got you! 👍
