IBM/TabFormer

How to reproduce the results in the paper

shaoyijia opened this issue · 8 comments

Hi,

I noticed that there is no code for the downstream tasks in this repo, and I'm wondering if you could upload it. I want to reproduce the results in your ICASSP paper (e.g., Table 1). Based on the paper, you feed the extracted features into an MLP / LSTM. However, the detailed configuration of the MLP / LSTM is not clear, and I'm not sure how you do upsampling for the fraud detection task.

It'll be super helpful if you could provide the code for downstream tasks.

Thanks for your help!

Adding more details on what puzzles me in reproducing the results:

  1. How do you construct the fraud detection dataset?
    Your paper says: "For fraud detection, we create samples by combining 10 contiguous rows (with a stride of 10) in a time-dependent manner for each user. In total, there are 2.4M samples with 29,342 labeled as fraudulent". Looking at card_transaction.v1.csv, there are 24,386,900 rows, so combining 10 contiguous rows into one sample can indeed yield about 2.4M samples. However, there are 29,757 rows with 'Is Fraud?' == 'Yes'. That statistic is at the row level, and I'm not sure how it maps to the samples (each combining 10 contiguous rows) to give 29,342 fraudulent labels (see the counting sketch after this list).

  2. How do you get row embeddings?
    Your paper says: 'we pool the encoder outputs at individual row level to create Ei (see Fig. 2) before doing classification'. I'm still unclear how you get the sample features from TabFormer. There is no Ei in Fig. 2, but based on your writing, you pool row embeddings to get the embedding of a sample. How can I get the row embeddings in the first place? (Issue #16 is related but still unresolved.)

  3. How do you actually do upsampling?
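To make question 1 concrete, here is how I currently understand the windowing (my assumption, not the authors' code): windows of 10 contiguous rows are taken per user, and a window counts as fraudulent if any of its rows is fraudulent.

import numpy as np

def count_window_labels(row_labels, seq_len=10, stride=10):
    """Count windows and fraudulent windows for one user's 0/1 row labels."""
    n_windows = n_fraud = 0
    for jdx in range(0, len(row_labels) - seq_len + 1, stride):
        window = np.asarray(row_labels[jdx:jdx + seq_len])
        n_windows += 1
        n_fraud += int(window.any())  # fraudulent if any row in the window is
    return n_windows, n_fraud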

The questions above have really been troubling me while I try to reproduce the results. Hoping for your help, and best wishes!

Hi @EchoShao8899

I ran into a similar problem. In https://github.com/IBM/TabFormer/blob/main/dataset/card.py, line 198, there is code (shown below) that generates the samples. However, the number of samples labeled as fraudulent generated this way seems to differ from the number mentioned in the paper. I guess the dataset provided in the repo is also different.

   def prepare_samples(self):
        log.info("preparing user level data...")
        trans_data, trans_labels, columns_names = self.user_level_data()

        log.info("creating transaction samples with vocab")
        for user_idx in tqdm.tqdm(range(len(trans_data))):
            user_row = trans_data[user_idx]
            user_row_ids = self.format_trans(user_row, columns_names)

            user_labels = trans_labels[user_idx]

            bos_token = self.vocab.get_id(self.vocab.bos_token, special_token=True)  # will be used for GPT2
            eos_token = self.vocab.get_id(self.vocab.eos_token, special_token=True)  # will be used for GPT2
            for jdx in range(0, len(user_row_ids) - self.seq_len + 1, self.trans_stride):
                ids = user_row_ids[jdx:(jdx + self.seq_len)]
                ids = [idx for ids_lst in ids for idx in ids_lst]  # flattening
                if not self.mlm and self.flatten:  # for GPT2, need to add [BOS] and [EOS] tokens
                    ids = [bos_token] + ids + [eos_token]
                self.data.append(ids)

            for jdx in range(0, len(user_labels) - self.seq_len + 1, self.trans_stride):
                ids = user_labels[jdx:(jdx + self.seq_len)]
                self.labels.append(ids)

                fraud = 0
                if len(np.nonzero(ids)[0]) > 0:
                    fraud = 1
                self.window_label.append(fraud)

        assert len(self.data) == len(self.labels)

        '''
            ncols = total fields - 1 (special tokens) - 1 (label)
            if bert:
                ncols += 1 (for sep)
        '''
        self.ncols = len(self.vocab.field_keys) - 2 + (1 if self.mlm else 0)
        log.info(f"ncols: {self.ncols}")
        log.info(f"no of samples {len(self.data)}")

Thank you so much for pointing out the code snippet!

Yes, I think there is some inconsistency between the code and the paper. According to the paper, the window stride for the transaction dataset should be 10 instead of 5. If we set self.trans_stride = 10, then len(self.data) = 2437789, which is consistent with the paper. But then there are only 9,492 fraudulent samples. Maybe the provided data is slightly different.
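For reference, the count can also be redone directly on the CSV, windowing per user with seq_len=10 and stride=10 as in prepare_samples. This is only a sketch under my assumptions: 'User' and 'Is Fraud?' are the column names of the public dataset, and rows are assumed to be in time order within each user.

import pandas as pd

df = pd.read_csv("card_transaction.v1.csv")
is_fraud = (df["Is Fraud?"] == "Yes").astype(int)

n_windows = n_fraud = 0
for _, labels in is_fraud.groupby(df["User"]):  # one pass of windows per user
    vals = labels.to_numpy()
    for jdx in range(0, len(vals) - 10 + 1, 10):
        n_windows += 1
        n_fraud += int(vals[jdx:jdx + 10].any())

print(n_windows, n_fraud)  # roughly 2.4M windows expected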

If the stride is set to 10, there seem to be even fewer fraudulent samples.

Is $E_i$ the embedding of row $i$?

In the paper, the stride is set to 5 and the window size to 10.

@pphhzz Where is $E_i$?

Hi, any update on this yet?

Hello,

I think some of the hyperparameters and configs used in this paper are not very clear. To reproduce the results in the paper, I use an LSTM head like this:

import torch
from torch import nn


class LSTMPredictionHead(nn.Module):
    """LSTM prediction head for binary classification."""

    def __init__(self):
        super().__init__()
        # Note: with the default num_layers=1, PyTorch ignores `dropout`
        # (it only applies between stacked LSTM layers), so it is omitted.
        self.lstm = nn.LSTM(input_size=768, hidden_size=768,
                            batch_first=True)
        self.linear = nn.Linear(768, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, rows per window, 768) row embeddings
        out, (h_n, c_n) = self.lstm(x)
        # Use the output at the last time step for prediction
        out = self.linear(out[:, -1, :])

        pred = self.sigmoid(out)

        return pred
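A hypothetical usage, with shapes that are my assumption (windows of 10 rows, each row pooled to a 768-dim TabBERT embedding):

head = LSTMPredictionHead()
x = torch.randn(32, 10, 768)  # (batch, rows per window, embedding dim)
pred = head(x)                # (32, 1) fraud probabilities in [0, 1]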

As for the upsampling, I upsample the fraudulent (minority) samples until they match the number of non-fraudulent samples:

import logging

import torch
from sklearn.utils import resample
from torch.utils.data import Dataset

logger = logging.getLogger(__name__)


class TransactionFeatureDataset(Dataset):
    """Transaction feature dataset for the fraud detection task."""

    def __init__(self, data, label, with_upsample=False):
        """Args:
            - data: sample features extracted from TabBERT.
            - label: label at the sample (window) level.
            - with_upsample: if True, upsample fraudulent data to the same
              amount as non-fraudulent data.
        """
        self.data = data
        self.label = label
        if with_upsample:
            self._upsample()

    def __getitem__(self, item):
        return self.data[item], self.label[item]

    def __len__(self):
        return len(self.data)

    def _upsample(self):
        logger.info('Upsampling fraudulent samples.')
        non_fraud = self.data[self.label == 0]
        fraud = self.data[self.label == 1]
        # Resample the fraudulent features with replacement until they
        # match the number of non-fraudulent features.
        fraud_upsample = resample(fraud, replace=True, n_samples=non_fraud.shape[0], random_state=2022)
        self.data = torch.cat((fraud_upsample, non_fraud))
        self.label = torch.cat((torch.ones(fraud_upsample.shape[0]), torch.zeros(non_fraud.shape[0])))
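And a hypothetical usage with a DataLoader (the tensors below are random stand-ins for features extracted from TabBERT):

from torch.utils.data import DataLoader

features = torch.randn(1000, 10, 768)  # stand-in TabBERT window features
window_labels = torch.zeros(1000)
window_labels[:10] = 1.0               # pretend 10 windows are fraudulent

train_set = TransactionFeatureDataset(features, window_labels, with_upsample=True)
loader = DataLoader(train_set, batch_size=64, shuffle=True)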

With this setting, I'm able to reproduce roughly the same results as those reported in the paper.

Since this reproduction work is part of a research project during my internship at a company, I cannot share my implementation of the downstream task pipelines right now.

I'll close this issue. I hope the information above helps anyone who runs into the same problem, and I hope I'll get the chance to share the downstream task pipelines later.

@jinxmirror13

I missed this earlier. $E_i$ comes from the fraud detection discussion on page 3 of the paper: "As a baseline, we use a multi-layer perceptron (MLP) trained directly on the embeddings of the raw features. In order to model temporal dependencies, we also use an LSTM network baseline on the raw embedded features. In both cases, we pool the encoder outputs at individual row level to create Ei (see Fig. 2) before doing classification."
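In other words, $E_i$ is a pooled row embedding. A minimal sketch of how one could produce it (my interpretation, not the released code; mean pooling is assumed since the paper only says "pool"):

import torch

def pool_row_embeddings(hidden, n_rows=10):
    """Mean-pool token-level encoder outputs per row to create E_i.

    hidden: (batch, n_rows * fields_per_row, 768) TabBERT encoder outputs.
    Returns: (batch, n_rows, 768), one embedding E_i per row.
    """
    batch, total_len, dim = hidden.shape
    fields = total_len // n_rows
    return hidden.reshape(batch, n_rows, fields, dim).mean(dim=2)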