linwhitehat/ET-BERT

Should this be 'append' instead of '='?


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [3], in <cell line: 6>()
      1 main.samples = main.count_label_number(main.samples)
      3 train_model = ["pre-train"]
----> 6 main.dataset_extract(train_model)

File ~/projects/ET-BERT-main/data_process/main.py:111, in dataset_extract(model)
    109 pprint(y_test)
    110 pprint(x_payload_test)
--> 111 for test_index, valid_index in split_2.split(x_payload_test, y_test):
    112     x_payload_valid, y_valid = \
    113         x_payload_test[valid_index], y_test[valid_index]
    114     x_payload_test, y_test = \
    115         x_payload_test[test_index], y_test[test_index]

File ~/anaconda3/envs/etbert/lib/python3.10/site-packages/sklearn/model_selection/_split.py:1600, in BaseShuffleSplit.split(self, X, y, groups)
   1570 """Generate indices to split data into training and test set.
   1571 
   1572 Parameters
   (...)
   1597 to an integer.
   1598 """
   1599 X, y, groups = indexable(X, y, groups)
-> 1600 for train, test in self._iter_indices(X, y, groups):
   1601     yield train, test

File ~/anaconda3/envs/etbert/lib/python3.10/site-packages/sklearn/model_selection/_split.py:1923, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
   1921 n_samples = _num_samples(X)
   1922 y = check_array(y, ensure_2d=False, dtype=None)
-> 1923 n_train, n_test = _validate_shuffle_split(
   1924     n_samples,
   1925     self.test_size,
   1926     self.train_size,
   1927     default_test_size=self._default_test_size,
   1928 )
   1930 if y.ndim == 2:
   1931     # for multi-label y, map each distinct row to a string repr
   1932     # using join because str(row) uses an ellipsis if len(row) > 1000
   1933     y = np.array([" ".join(row.astype("str")) for row in y])

File ~/anaconda3/envs/etbert/lib/python3.10/site-packages/sklearn/model_selection/_split.py:2098, in _validate_shuffle_split(n_samples, test_size, train_size, default_test_size)
   2095 n_train, n_test = int(n_train), int(n_test)
   2097 if n_train == 0:
-> 2098     raise ValueError(
   2099         "With n_samples={}, test_size={} and train_size={}, the "
   2100         "resulting train set will be empty. Adjust any of the "
   2101         "aforementioned parameters.".format(n_samples, test_size, train_size)
   2102     )
   2104 return n_train, n_test

ValueError: With n_samples=1, test_size=0.5 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

Hello, when I ran data_process/main.py as the last step of generating the fine-tuning data, I got this error. I printed (pprint) the x_payload_test and y_test arguments passed in at this point:

array(['fd82 8296 9602 0200 0000 0000 0000 0080 8002 0220 2000 00c0 c008 0800 0000 0002 0204 0405 05b4 b401 0103 0303 0308 0801 0101 0104 040201f9 f9df df95 95fd fd82 8296 9603 0380 8012 1272 7210 10df df6f 6f00 0000 0002 0204 0405 05b4 b401 0101 0104 0402 0201 0103 0303 0307fd82 8296 9603 0301 01f9 f9df df96 9650 5010 1001 0100 00bf bffc fc00 0000fd82 8296 9603 0301 01f9 f9df df96 9650 5018 1801 0100 00c2 c202 0200 0000 0050 504f 4f53 5354 5420 202f 2f66 6669 6973 7368 682f 2f67 6773 736c 6c2e 2e6a 6a73 7370 7020 2048 4854 5454 5450 502f 2f31 312e 2e31 310d 0d0a 0a55 5573 7365 6572 722d 2d41 4167 6765 656e 6e74 743a 3a20 204d 4d6f 6f7a 7a69 696c 6c6c 6c61 612f 2f35 352e 2e30 3020 2028 2857 5769 696e 6e64 646f 6f77 7773 7320 204e 4e54 5420 2031 3130 302e 2e30 303b 3b20 2057 5769 696e 6e36 3634 343b 3b20 2078 7836 3634 343b 3b20 2072 7276 763a 3a38 3834 342e 2e30 3029 2920 2047 4765 6563 636b 6b6f 6f2f 2f32 3230 3031 3130 3030 3031 3130 3031 3120 2046 4669 6972 7265 6566 666f01f9 f9df df96 96fd fd82 8298 9809 0950 5010 1000 00ed ed8f 8f5f 5f00 0000 0000 0000 0000 0000 0000 0000'],
      dtype='<U1085')

array([0])

Both arrays contain only one element.
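For context, the error is reproducible outside of ET-BERT: StratifiedShuffleSplit with test_size=0.5 needs at least two samples, otherwise the computed train set is empty. A minimal sketch with hypothetical data (the payload string and n_splits value are placeholders, not taken from the repo):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# A single surviving sample, as in the output above
x = np.array(["fd82 8296 ..."])  # placeholder payload string
y = np.array([0])                # one label

split_2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
for test_index, valid_index in split_2.split(x, y):
    pass
# Raises: ValueError: With n_samples=1, test_size=0.5 and train_size=None,
# the resulting train set will be empty. Adjust any of the aforementioned parameters.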

I solved it. It was my oversight: the trial-run data I had generated by copying the same samples over and over was deduplicated down to too few samples. Sorry for the trouble.
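If anyone else hits this, a small guard before the split makes the failure mode clearer. This is only a sketch with assumed names (can_stratify is not part of the repo):

import numpy as np
from collections import Counter

def can_stratify(y, min_per_class=2):
    # Every class needs at least min_per_class samples for a 50/50 stratified split
    counts = Counter(np.asarray(y).ravel().tolist())
    return len(counts) > 0 and min(counts.values()) >= min_per_class

# Before calling split_2.split(x_payload_test, y_test) in dataset_extract:
# if not can_stratify(y_test):
#     raise ValueError("Too few samples per class after deduplication; "
#                      "generate more distinct flows before splitting.")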