HazyResearch/metal

Postpone averaging loss as long as possible

bhancock8 opened this issue · 3 comments

We should pass around the total loss (the sum of per-example losses for the batch) as long as possible, and divide by the total number of examples only right before we report it. Otherwise we convert back and forth between summed and averaged loss multiple times in the code, which is likely to introduce errors (more on the programmer side than the computational precision side).

I've had a change of heart. Let's assume that the loss we get back is batch-averaged, since I think that's more standard and will require fewer changes for most people. Then in our train loop we can keep track of the total loss by multiplying by the batch size where we need to. This will likely be touched by #129 as well.
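
To make the bookkeeping concrete, here is a minimal sketch of the accumulation described above, assuming each step yields a batch-averaged loss and a batch size. The function name `epoch_average_loss` and the `(batch_mean_loss, batch_size)` pairs are hypothetical, chosen for illustration; this is not the code in the repo.

```python
def epoch_average_loss(batches):
    """Return the per-example average loss over a whole epoch.

    `batches` is an iterable of (batch_mean_loss, batch_size) pairs.
    Both the function name and this input shape are assumptions made
    for this sketch, not an existing API.
    """
    total_loss = 0.0
    total_examples = 0
    for batch_mean_loss, batch_size in batches:
        # Undo the batch averaging so we accumulate a summed loss.
        total_loss += batch_mean_loss * batch_size
        total_examples += batch_size
    if total_examples == 0:
        raise ValueError("no examples seen")
    # Divide exactly once, right before reporting.
    return total_loss / total_examples


# Two full batches of 4 and a final partial batch of 2.
batches = [(1.0, 4), (1.0, 4), (4.0, 2)]
per_example = epoch_average_loss(batches)

# Naively averaging the batch means gives the partial batch too much weight.
naive = sum(loss for loss, _ in batches) / len(batches)
```

With uneven batch sizes the two numbers differ (here the weighted value is lower than the naive mean of batch means), which is the kind of mistake that dividing only once at the end is meant to avoid.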

Related to #77

Fixed in #134.