CLUEbenchmark/CLUE

TensorArray Not Used on line 865 of tokenization_utils.py

CodeSmileBot opened this issue · 0 comments

Hello!

I found an AI-Specific Code smell in your project.
The smell is called: TensorArray Not Used

You can find more information about it in this paper: https://dl.acm.org/doi/abs/10.1145/3522664.3528620.

According to the paper, the smell is described as follows:

Problem: If the developer initializes an array with tf.constant() and then tries to grow it by assigning new values to it inside a loop, the code will raise an error. The error can be worked around with the low-level tf.while_loop() API, but coding that way is inefficient: many intermediate tensors are built along the way.
Solution: In TensorFlow 2, using tf.TensorArray() to grow an array inside a loop is the better solution for this kind of problem.
Impact: Efficiency, Error-proneness

Example:

### TensorFlow
import tensorflow as tf
@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)
-   c = tf.constant([1, 1])
+   c = tf.TensorArray(tf.int32, n)
+   c = c.write(0, a)
+   c = c.write(1, b)

    for i in range(2, n):
        a, b = b, a + b
-       c = tf.concat([c, [b]], 0)
+       c = c.write(i, b)

-   return c
+   return c.stack()
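Assembling the "+" lines of the diff above into one piece, a complete, runnable version of the corrected function might look like the following sketch (assumes TensorFlow 2):

```python
import tensorflow as tf

@tf.function
def fibonacci(n):
    a = tf.constant(1)
    b = tf.constant(1)
    # Preallocate a TensorArray of n int32 slots instead of
    # concatenating onto a tf.constant inside the loop.
    c = tf.TensorArray(tf.int32, size=n)
    c = c.write(0, a)
    c = c.write(1, b)
    for i in range(2, n):
        a, b = b, a + b
        # write() returns a new TensorArray handle; reassigning c
        # is required for the update to take effect under tf.function.
        c = c.write(i, b)
    # stack() packs the written elements into a single tensor.
    return c.stack()

print(fibonacci(6).numpy())  # [1 1 2 3 5 8]
```

Because all writes go into one preallocated TensorArray, no intermediate concatenated tensors are materialized on each iteration.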

You can find the code related to this smell at the location named in the title (line 865 of tokenization_utils.py):

if add_special_tokens:
    sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
    token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
    encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
else:
    sequence = ids + pair_ids if pair else ids
    token_type_ids = [0] * len(ids) + ([1] * len(pair_ids) if pair else [])
if return_tensors == 'tf' and is_tf_available():
    sequence = tf.constant([sequence])
    token_type_ids = tf.constant([token_type_ids])
elif return_tensors == 'pt' and is_torch_available():
    sequence = torch.tensor([sequence])
    token_type_ids = torch.tensor([token_type_ids])
elif return_tensors is not None:
    logger.warning("Unable to convert output to tensors format {}, PyTorch or TensorFlow is not available.".format(return_tensors))
encoded_inputs["input_ids"] = sequence
encoded_inputs["token_type_ids"] = token_type_ids
if max_length and len(encoded_inputs["input_ids"]) > max_length:
.

I also found instances of this smell in other files, such as:

File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/bert_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/ernie/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_ext/optimization_test.py#L26-L36 Line: 31
File: https://github.com/CLUEbenchmark/CLUE/blob/master/baselines/models/roberta_wwm_large_ext/optimization_test.py#L26-L36 Line: 31
.

I hope this information is helpful!