- aims to fix the quadratic computational complexity of sequence processing in LLMs
- In SWAT, softmax is replaced by a sigmoid function for efficient information compression and retention
- it then uses ALiBi (Attention with Linear Biases) and Rotary Position Embeddings (RoPE) to stabilize training
- linear computational complexity is maintained using sliding window attention (SWA)
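The linear-complexity claim is easy to see from the attention mask. Below is a minimal sketch (hypothetical helper, not from the paper) of a causal sliding-window mask: each token attends to at most `window` predecessors, so total cost is O(seq_len × window) instead of O(seq_len²):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask: token i attends only to tokens in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    # j <= i enforces causality; j > i - window limits the lookback
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Each row has at most `window` True entries, so per-token work is O(window).
```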
Introduction
Problem with current methods:
- sparse attention → reduces computation by selectively calculating attention scores; sequence models → process seq. through recurrent hidden states. Both either compromise model performance to achieve efficiency or propose new, complex architectures that cannot fully exploit existing techniques for convenient implementation and deployment. SO, NOT EFFECTIVE
- current research on SWA focuses on the attention sink problem: excessive attention to initial tokens causes an uneven distribution of attention weights across the seq., creating a gap b/w training and inference.
- tokens outside the attention window are ignored during prediction → info loss in long seq.
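A quick numeric sketch of why softmax's normalization contributes to the uneven-weight issue above: softmax forces all weights in the window to share a fixed budget of 1.0, so each token's weight is diluted as the sequence grows, while a sigmoid scores each token independently (this comparison is my illustration, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Identical logits for simplicity: softmax spreads a budget of 1.0 over n
# tokens (each weight = 1/n), while sigmoid gives each token a fixed 0.5
# regardless of n.
for n in (4, 64, 1024):
    scores = np.zeros(n)
    w_softmax = softmax(scores)             # per-token weight shrinks as 1/n
    w_sigmoid = 1 / (1 + np.exp(-scores))   # per-token weight independent of n
    print(n, w_softmax[0], w_sigmoid[0])
```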
SWAT framework:
- To address all these bottlenecks, SWAT is introduced.
- softmax → sigmoid: prevents the attention sink problem and maintains dense attention weights for higher information capacity per token, but lacks sparsity
- to compensate for the lack of sparsity, balanced ALiBi is used, which introduces position-dependent differentiation and prevents info overload in the dense representation.
- ALiBi (Read this: https://medium.com/@pajakamy/alibi-attention-with-linear-biases-942abe042e9f)
- further, RoPE is used to explicitly encode position information in the hidden states (training stability)
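The ingredients above can be combined into one sketch: sigmoid-scored attention with an ALiBi-style linear distance bias inside a causal sliding window. This is a simplified single-head illustration under my own assumptions (the slope sign convention, scaling, and the balanced positive/negative slopes across heads follow the ALiBi idea loosely, not the paper's exact formulation; RoPE is omitted):

```python
import numpy as np

def sigmoid_swa(q, k, v, window: int, slope: float) -> np.ndarray:
    """Single-head sliding-window attention with sigmoid scores and a
    linear (ALiBi-style) distance bias. Hypothetical sketch."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = i - j                              # distance from query to key
    scores = scores - slope * dist            # bias grows linearly with distance
    weights = 1 / (1 + np.exp(-scores))       # sigmoid: no cross-token normalization
    valid = (j <= i) & (j > i - window)       # causal sliding window
    weights = np.where(valid, weights, 0.0)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = sigmoid_swa(q, k, v, window=4, slope=0.5)
```

Note that because sigmoid weights are not normalized, the output magnitude depends on how many tokens fall inside the window; the position-dependent bias is what keeps the dense weights differentiated.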
Understanding Transformer’s Attention