Ensuring Numerical Stability In Online Attention Updates (2)
In my last post, I derived an online update formula for the attention mechanism. However, the resulting formula isn’t ideal for high-performance real-world implementations, and two issues were left undiscussed:
Deriving The Online Update Formula For Attention (1)
Recently, while working through some everyday tasks, I came across the derivation of the online update formula in Flash Attention[1]. Even though I’ve worked it out before, the details are easy to forget over time, so I thought it would be worthwhile to jot down some notes about it.
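For context, the heart of that derivation is a one-pass (online) softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale the previous partial results whenever a larger score appears. Here is a minimal scalar sketch of that idea; the function name and variables are my own illustration, not taken from the post:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum over (score, value) pairs.

    Maintains a running max m, running normalizer d = sum(exp(s - m)),
    and running output o, rescaling d and o by exp(m_old - m_new)
    whenever a new maximum appears, so exponents never overflow.
    """
    m = float("-inf")  # running max of scores seen so far
    d = 0.0            # running normalizer
    o = 0.0            # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        w = math.exp(s - m_new)
        d = d * scale + w
        o = o * scale + w * v
        m = m_new
    return o / d
```

The result matches the two-pass softmax-weighted sum, but each score is visited only once, which is what makes the blockwise accumulation in Flash Attention possible.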