Determine cause of MGH workstation crashes
erikr opened this issue · 2 comments
erikr commented
What
`olympus` and `cybertronpc` keep crashing. Is this due to a RAM shortage or a PSU limitation?
Why
Crashes interrupt training and require manual reboot.
How
Determine how much RAM is used by `train` mode on ECGs.
Determine power draw during `train` and see if it exceeds the supply (we only have 450W PSUs and may need 650W+).
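A rough sketch of the power check, in case it helps. It assumes NVIDIA GPUs with `nvidia-smi` on the PATH, and the 20% headroom margin is an arbitrary assumption, not a measured requirement. The function names are hypothetical. GPU draw is only part of the total, so the peak system figure would still need to come from a wall meter or from adding CPU and other components.

```python
import subprocess


def parse_power_csv(text: str) -> list[float]:
    """Parse one wattage per line; skip blank or non-numeric entries such as '[N/A]'."""
    watts = []
    for line in text.splitlines():
        try:
            watts.append(float(line.strip()))
        except ValueError:
            continue
    return watts


def read_gpu_power_watts() -> list[float]:
    """Query the current power draw of each GPU in watts via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_power_csv(out)


def psu_has_headroom(peak_system_watts: float, psu_rated_watts: float,
                     headroom: float = 0.2) -> bool:
    """True if peak draw stays within the PSU rating minus an assumed safety margin."""
    return peak_system_watts <= psu_rated_watts * (1.0 - headroom)
```

For example, `psu_has_headroom(400, 450)` is false under the assumed 20% margin while `psu_has_headroom(400, 650)` is true; the 400W figure is made up for illustration.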
Acceptance Criteria
- decide if we need to revamp code to be more RAM-efficient, and/or
- buy larger PSUs
StevenSong commented
Since running `train` on `olympus` results in it crashing, I don't think it'd be good to determine RAM/power usage there. Instead, I'll try to log info on `mithril`.
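One possible way to log RAM while `train` runs, sketched under the assumption that the machine runs Linux and exposes `/proc/meminfo`. The function names and CSV layout are hypothetical. Each sample is flushed and fsynced so the last reading should survive a hard crash, which would also make the same script usable on the crashing machines later.

```python
import os
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Map /proc/meminfo field names to their numeric values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_fraction(fields: dict[str, int]) -> float:
    """Fraction of RAM in use, based on MemAvailable rather than MemFree."""
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


def log_memory(path: str, interval_s: float = 1.0, samples=None) -> None:
    """Append 'timestamp,MemTotal_kB,MemAvailable_kB' rows until stopped."""
    n = 0
    with open(path, "a") as log:
        while samples is None or n < samples:
            with open("/proc/meminfo") as f:
                fields = parse_meminfo(f.read())
            log.write(
                f"{time.time():.0f},{fields['MemTotal']},{fields['MemAvailable']}\n"
            )
            # Force the row to disk so it is not lost if the machine goes down.
            log.flush()
            os.fsync(log.fileno())
            n += 1
            time.sleep(interval_s)
```

Running `log_memory("ram.csv")` in a second terminal before starting `train` would give a trace to compare against installed RAM.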
StevenSong commented
@ndiamant has `montserrat` ever crashed while you were training models on it?