songlab-cal/gpn

Question about Model

gdolsten opened this issue · 11 comments

Hi, I am reading through the code and trying to understand how the model works. Could you clarify what batch.pop("Y") means? What does batch.pop do, and what is the data type of batch?

class Module(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        Y = batch.pop("Y")
        logits = self(**batch)
        loss = self.loss(logits, Y)
        self.log("train/loss", loss)
        return loss


Never mind, I see now:

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        x = row.seq
        # x = x[400:600]
        # x = x[300:700]
        x = self.tokenizer(
            x,
            padding="max_length",
            max_length=self.max_length,
            return_token_type_ids=False,
            return_tensors="pt",
            truncation=True,
        )
        d = dict(
            input_ids=x["input_ids"].flatten(),
            attention_mask=x["attention_mask"].flatten(),
            Y=torch.tensor(row[self.features].values.astype(np.uint8)),
        )
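
For anyone else reading: a minimal sketch of why batch is a dict and what pop does, assuming the default PyTorch collate function is used. The toy items and toy_forward below are hypothetical stand-ins, not the actual GPN dataset or model.

```python
import torch
from torch.utils.data import DataLoader

# Toy stand-in for the dataset: each item is a dict of tensors,
# mirroring the dict built in __getitem__ (values here are made up).
items = [
    dict(
        input_ids=torch.tensor([1, 2, 3, 4]),
        attention_mask=torch.ones(4, dtype=torch.long),
        Y=torch.tensor([0, 1], dtype=torch.uint8),
    )
    for _ in range(6)
]

# A plain list works as a map-style dataset. The default collate function
# stacks each key across items, so a batch is a dict of batched tensors.
loader = DataLoader(items, batch_size=3)
batch = next(iter(loader))
print(type(batch).__name__, {k: tuple(v.shape) for k, v in batch.items()})

# dict.pop removes the key and returns its value, so only the
# model inputs (input_ids, attention_mask) remain in the dict.
Y = batch.pop("Y")
print(sorted(batch.keys()), tuple(Y.shape))


def toy_forward(input_ids, attention_mask):
    # Hypothetical stand-in for the model's forward(); not the GPN model.
    return input_ids.float() * attention_mask


# **batch unpacks the remaining keys as keyword arguments.
logits = toy_forward(**batch)
```

So self(**batch) in training_step is equivalent to calling the model with input_ids=... and attention_mask=... once the labels have been popped out.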

Alright, will close for now. Let me know if you have any other questions!

Thanks so much!

I was just wondering if you could explain why GPNDataModule is used in /chromatin/trainer.py but Trainer is used with DataCollatorForLanguageModelingSpan in /mlm/run_mlm_custom.py? What is the difference between these two training scripts? Is /chromatin/ for fine-tuning and /mlm/ for the language modeling task?

Another question – in run_mlm_custom you have:
config = CONFIG_MAPPING[model_args.model_type]()
And when running the model you have: --model_type ConvNet \
But I get the following error:

config = CONFIG_MAPPING['ConvNet']()
KeyError: 'ConvNet'

One more question:

        self.conv = nn.Sequential(
            TransposeLayer(),
            nn.Conv1d(
                in_channels=hidden_size,
                out_channels=hidden_size,
                padding="same",
                **kwargs,
            ),
            TransposeLayer(),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )

Why do you have a TransposeLayer() here? Won't the convolution output be the same with or without a transpose? (since the convolution itself would just be transposed, no?)

Secondly, you have:

class OneHotEmbedding(nn.Module):
    def __init__(self, hidden_size=None):
        super().__init__()
        self.hidden_size = hidden_size

    def forward(self, x):
        return F.one_hot(x, num_classes=self.hidden_size).float()

with

class ConvNetConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=7,
        hidden_size=512,
        n_layers=30,
        kernel_size=9,
        dilation_double_every=1,
        dilation_max=32,
        dilation_cycle=6,
        initializer_range=0.02,
        **kwargs
    ):

Is there a reason you are initializing the OneHotEmbedding in 512 dimensions?

Thanks so much!

I was just wondering if you could explain why GPNDataModule is used in /chromatin/trainer.py but Trainer is used with DataCollatorForLanguageModelingSpan in /mlm/run_mlm_custom.py? What is the difference between these two training scripts? Is /chromatin/ for fine-tuning and /mlm/ for the language modeling task?

That's right, everything in mlm/ is pre-training and everything in chromatin/ is fine-tuning. Unfortunately, we used a different framework for each: Hugging Face for pre-training and PyTorch Lightning for fine-tuning. I'm hoping to reorganize the code in the next 2 months, using just Hugging Face.

Another question – in run_mlm_custom you have:
config = CONFIG_MAPPING[model_args.model_type]()
And when running the model you have: --model_type ConvNet \
But I get the following error:

config = CONFIG_MAPPING['ConvNet']()
KeyError: 'ConvNet'

Usually import gpn.mlm makes sure that 'ConvNet' is registered. Could you provide some more details on how you run the script?
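
For context, a minimal sketch of how a custom model type can end up in CONFIG_MAPPING, using the public AutoConfig.register call from transformers. ToyConvNetConfig and the "ToyConvNet" key are hypothetical names for illustration; the assumption that gpn.mlm does something along these lines at import time is not confirmed here.

```python
from transformers import CONFIG_MAPPING, AutoConfig, PretrainedConfig


class ToyConvNetConfig(PretrainedConfig):
    # Hypothetical config for illustration; not the GPN ConvNetConfig.
    # model_type must match the key passed to AutoConfig.register.
    model_type = "ToyConvNet"

    def __init__(self, hidden_size=512, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size


# Before registration the lookup raises KeyError, like the error above.
try:
    CONFIG_MAPPING["ToyConvNet"]
    found_before = True
except KeyError:
    found_before = False

# Registration adds the custom type to the mapping. A package would
# typically run this once as a side effect of being imported.
AutoConfig.register("ToyConvNet", ToyConvNetConfig)

config = CONFIG_MAPPING["ToyConvNet"]()
print(found_before, type(config).__name__, config.hidden_size)
```

If the registering module is never imported in the process that runs the script, the key is simply missing, which would produce exactly this kind of KeyError.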

The transpose layer is there to make sure the tensor dimensions are compatible with the different operations. For example, nn.Conv1d expects input shaped (batch, channels, position), while nn.LayerNorm(hidden_size) normalizes over the last dimension and so expects (batch, position, channels).
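
A small sketch of the shape bookkeeping, assuming TransposeLayer simply swaps the last two dimensions (its definition is not shown in this thread, so the class below is a guess). The sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn


class TransposeLayer(nn.Module):
    # Assumed behaviour: swap the position and channel dimensions.
    def forward(self, x):
        return x.transpose(1, 2)


batch_size, seq_len, hidden_size = 2, 10, 8
x = torch.randn(batch_size, seq_len, hidden_size)  # (batch, position, channels)

conv = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, padding="same")
norm = nn.LayerNorm(hidden_size)

# Conv1d treats dim 1 as channels, so passing (batch, position, channels)
# directly fails whenever seq_len != hidden_size.
try:
    conv(x)
    direct_ok = True
except RuntimeError:
    direct_ok = False

h = TransposeLayer()(x)  # (batch, channels, position), as Conv1d expects
h = conv(h)              # convolve along the position axis
h = TransposeLayer()(h)  # back to (batch, position, channels)
h = norm(h)              # LayerNorm normalizes over the last dim (channels)
print(direct_ok, tuple(h.shape))
```

Even in the special case where seq_len equals hidden_size and the call without a transpose does not error, the convolution would slide along the channel axis and mix across positions, which is a different operation rather than a transposed version of the same one.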

The OneHotEmbedding into 512 dimensions is just a simplification so that all the convolutional layers have the same dimensions. It's certainly a waste of parameters and compute.
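
To make the "waste" concrete, here is a short sketch using the default vocab_size=7 and hidden_size=512 from the config quoted earlier in the thread. The input_ids values are made up.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden_size = 7, 512
input_ids = torch.tensor([[0, 3, 6, 2, 5]])  # toy token ids, all < vocab_size

# Same call as in OneHotEmbedding.forward: one-hot into hidden_size classes.
one_hot = F.one_hot(input_ids, num_classes=hidden_size).float()
print(one_hot.shape)

# Token ids never reach vocab_size, so channels vocab_size..hidden_size-1
# are always zero and the first conv layer's weights on them go unused.
unused = one_hot[..., vocab_size:]
print(unused.abs().sum().item())
```

Only the first 7 of the 512 input channels can ever be nonzero, so the benefit is purely that the first layer has the same shape as every other layer.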

Thanks, with your help I understand all of these! I now have the model running, but I just wanted to check: how long does one batch of 128 take (for the MLM task) on a single GPU for you?

Hey, I would love to know this too, for benchmarking purposes.

Hey! Sorry, I don't have everything set up to easily check this scenario right now.