Ensuring Numerical Stability In Online Attention Updates (2)
In my last post, I derived an online update formula for the attention mechanism. However, the resulting formula isn’t ideal for high-performance real-world implementations, and two issues were left undiscussed:
Deriving The Online Update Formula For Attention (1)
Recently, while working through some everyday tasks, I came across the derivation of the online update formula in Flash Attention[1]. Even though I’ve worked it out before, the details are easy to forget over time, so I thought it would be worthwhile to jot down some notes about it.
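For context, the heart of that derivation is a one-pass (online) softmax: keep a running maximum, a running normalizer, and a running weighted sum, and rescale the previous partial results whenever a larger score appears. Here is a minimal scalar sketch of that idea; the function name and variables are my own illustration, not taken from the post:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum over (score, value) pairs.

    Maintains a running max m, running normalizer d = sum(exp(s - m)),
    and running output o, rescaling d and o by exp(m_old - m_new)
    whenever a new maximum appears, so exponents never overflow.
    """
    m = float("-inf")  # running max of scores seen so far
    d = 0.0            # running normalizer
    o = 0.0            # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first step
        w = math.exp(s - m_new)
        d = d * scale + w
        o = o * scale + w * v
        m = m_new
    return o / d
```

The result matches the two-pass softmax-weighted sum, but each score is visited only once, which is what makes the blockwise accumulation in Flash Attention possible.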