linwhitehat/ET-BERT

data preprocessing

Closed this issue · 3 comments

Hello author, I’m sorry to bother you again. I didn’t find more detailed data-preprocessing information in the paper. During preprocessing, is the sliced data 256 bytes, 784 bytes, or 900 bytes? Looking forward to your reply.

Hello. In this paper the datagram is sliced, and the slice length is kept within 512. This does not refer to bytes, but to the length of the token sequence produced by slicing.
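
To make the distinction concrete, here is a minimal sketch of slicing a payload into a capped token sequence. The hex-bigram tokenization follows the paper's description of datagram encoding; the function name, the example payload, and the way the 512 cap is applied are illustrative assumptions, not code from this repository.

```python
# Minimal sketch (not the authors' code): slice a datagram payload into a
# token sequence capped at 512 *tokens* (not bytes), as clarified above.

MAX_SEQ_LEN = 512  # cap on the token-sequence length, per this thread


def payload_to_tokens(payload: bytes, max_len: int = MAX_SEQ_LEN) -> list[str]:
    """Turn raw payload bytes into hex-bigram tokens, truncated to max_len."""
    hex_str = payload.hex()  # e.g. b'\x16\x03\x01' -> '160301'
    # Slide one byte (two hex chars) at a time, taking two bytes per token,
    # which yields overlapping bigrams: '1603', '0301', ...
    tokens = [hex_str[i:i + 4] for i in range(0, len(hex_str) - 2, 2)]
    return tokens[:max_len]  # keep the sequence within the 512 cap


if __name__ == "__main__":
    sample = bytes(range(256)) * 3          # dummy 768-byte payload
    toks = payload_to_tokens(sample)
    print(len(toks), toks[:3])              # 512 ['0001', '0102', '0203']
```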

[screenshot of the provided dataset]
According to the dataset (packet-level) you provided, I found that there are 64 tokens in each category. How does this relate to the 512 tokens you mentioned? Looking forward to your reply, and sorry to bother you again.

The 512 mentioned in the paper refers to the maximum length of the final embedded representation used for pre-training. In preprocessing, it is sufficient, at both the packet level and the flow level, to keep the actual sequence length plus the special tokens from exceeding 512.
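
As a minimal sketch of that rule (assumed names, not the repository's actual code): the raw token sequence is truncated so that, together with BERT-style special tokens such as [CLS] and [SEP], the input never exceeds the 512-position embedding.

```python
# Sketch of "actual length + special tokens <= 512". The use of exactly
# [CLS] and [SEP] here is an assumption based on standard BERT-style input.

MAX_EMBED_LEN = 512       # final embedded-representation length (paper)
NUM_SPECIAL_TOKENS = 2    # [CLS] and [SEP] (assumed)


def build_input(tokens: list[str]) -> list[str]:
    """Truncate raw tokens so that tokens + special tokens stay within 512."""
    budget = MAX_EMBED_LEN - NUM_SPECIAL_TOKENS       # room for real tokens
    return ["[CLS]"] + tokens[:budget] + ["[SEP]"]    # total length <= 512


if __name__ == "__main__":
    seq = build_input(["abcd"] * 600)   # oversized packet-level sample
    assert len(seq) <= MAX_EMBED_LEN
    print(len(seq))                     # 512
```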