Add Event Stream Modeling
Oufattole commented
We should add support for EventStream models. Tokenization is already supported in the PyTorch dataset class: just set the collate type to event_stream in your Hydra config. I think we just need to copy some code from the ESGPT GitHub repository to run this.
Oufattole commented
Getting Started
- Branch off the dev branch (most up-to-date)
- Event stream data support is implemented in collating (see test output batch examples)
Implementation Steps
- Create event_stream_input_encoder.py:
  - Location: input_encoder folder
  - Add corresponding default config in input_encoder config subdirectory
  - Purpose: Convert padded raw event-stream data (integers and numeric values) to token sequences for sequence models
- Implement ESGPT custom hierarchical architecture:
  - Location: model/backbone folder
  - Add corresponding config in backbone config subdirectory
- Supervised Model:
  - Use existing supervised_model PyTorch Lightning class
  - Override model.input_encoder and model.backbone with new ESGPT components
  - For ESGPT pretraining: Create a new PyTorch Lightning class in the models folder
- Add Integration Tests:
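To make the input encoder step above more concrete, here is a rough sketch of what event_stream_input_encoder.py might contain. Everything in it is an assumption for illustration and is not taken from this repo or from ESGPT: the class name, the argument names (codes, values, values_mask), the (batch, events, measurements) layout, and the use of index 0 for padding. It embeds each measurement code, adds a value-scaled embedding where a numeric value is observed, and sums over the measurements of each event to get one token per event.

```python
import torch
from torch import nn


class EventStreamInputEncoder(nn.Module):
    """Hypothetical sketch: turns a padded event-stream batch into one token per event."""

    def __init__(self, vocab_size: int, dim: int, pad_idx: int = 0):
        super().__init__()
        # padding_idx keeps the padding row at zero, so padded slots add nothing.
        self.code_emb = nn.Embedding(vocab_size, dim, padding_idx=pad_idx)
        self.value_emb = nn.Embedding(vocab_size, dim, padding_idx=pad_idx)
        self.pad_idx = pad_idx

    def forward(self, codes, values, values_mask):
        # codes:       (B, E, M) long, measurement codes per event, pad_idx where padded
        # values:      (B, E, M) float, numeric values (may be NaN where missing)
        # values_mask: (B, E, M) bool, True where a numeric value is observed
        safe_values = torch.where(values_mask, values, torch.zeros_like(values))
        per_measurement = self.code_emb(codes) + self.value_emb(codes) * safe_values.unsqueeze(-1)
        tokens = per_measurement.sum(dim=2)              # (B, E, dim)
        event_mask = (codes != self.pad_idx).any(dim=2)  # (B, E), True for real events
        return tokens, event_mask


# Toy batch: 2 subjects, up to 3 events, up to 2 measurements per event.
codes = torch.tensor([[[1, 2], [3, 0], [0, 0]],
                      [[4, 5], [0, 0], [0, 0]]])
values = torch.tensor([[[0.5, float("nan")], [1.0, 0.0], [0.0, 0.0]],
                       [[2.0, 3.0], [0.0, 0.0], [0.0, 0.0]]])
values_mask = ~torch.isnan(values) & (codes != 0)

encoder = EventStreamInputEncoder(vocab_size=10, dim=4)
tokens, event_mask = encoder(codes, values, values_mask)
```

The returned event_mask could then be passed to the backbone as a padding mask. The real batch keys should be taken from the event_stream collate output in the test batch examples rather than from this sketch.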
Lemme know what else I can clarify @mmcdermott