- aims to fix the quadratic computational complexity of sequence processing in LLMs
- In SWAT, softmax is replaced by a sigmoid function for efficient information compression and retention
- it then uses ALiBi (Attention with Linear Biases) and Rotary Position Embeddings (RoPE) to stabilize training
- linear computational complexity is maintained using sliding window attention (SWA)
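The linear-complexity claim is easy to see from the attention mask. Below is a minimal sketch (hypothetical helper, not from the paper) of a causal sliding-window mask: each token attends to at most `window` predecessors, so total cost is O(seq_len × window) instead of O(seq_len²):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask: token i attends only to tokens in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    # j <= i enforces causality; j > i - window limits the lookback
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Each row has at most `window` True entries, so per-token work is O(window).
```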
Introduction
Problem with current methods:
- sparse attention → reduces computation by selectively calculating attention scores; sequence models → process seq. through recurrent hidden states. Both either compromise model performance to achieve efficiency or propose new, complex architectures that cannot fully exploit existing techniques for convenient implementation and deployment. SO, NOT EFFECTIVE
- current research on SWA focuses on the attention sink problem: excessive attention to initial tokens causes an uneven distribution of attention weights across the seq., creating a gap b/w training and inference.
- tokens outside the attention window are ignored during prediction → info loss in long seq.
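A quick numeric sketch of why softmax's normalization contributes to the uneven-weight issue above: softmax forces all weights in the window to share a fixed budget of 1.0, so each token's weight is diluted as the sequence grows, while a sigmoid scores each token independently (this comparison is my illustration, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Identical logits for simplicity: softmax spreads a budget of 1.0 over n
# tokens (each weight = 1/n), while sigmoid gives each token a fixed 0.5
# regardless of n.
for n in (4, 64, 1024):
    scores = np.zeros(n)
    w_softmax = softmax(scores)             # per-token weight shrinks as 1/n
    w_sigmoid = 1 / (1 + np.exp(-scores))   # per-token weight independent of n
    print(n, w_softmax[0], w_sigmoid[0])
```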
SWAT framework:
- To address all these bottlenecks, SWAT is introduced.
- softmax → sigmoid: prevents the attention sink problem and maintains dense attention weights for higher information capacity per token, but lacks sparsity
- to compensate for the lack of sparsity, balanced ALiBi is used, which introduces position-dependent differentiation and prevents info overload in the dense representation.
- ALiBi (Read this: https://medium.com/@pajakamy/alibi-attention-with-linear-biases-942abe042e9f)
- further, RoPE is used to explicitly encode position information in the hidden states (training stability)
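The ingredients above can be combined into one sketch: sigmoid-scored attention with an ALiBi-style linear distance bias inside a causal sliding window. This is a simplified single-head illustration under my own assumptions (the slope sign convention, scaling, and the balanced positive/negative slopes across heads follow the ALiBi idea loosely, not the paper's exact formulation; RoPE is omitted):

```python
import numpy as np

def sigmoid_swa(q, k, v, window: int, slope: float) -> np.ndarray:
    """Single-head sliding-window attention with sigmoid scores and a
    linear (ALiBi-style) distance bias. Hypothetical sketch."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = i - j                              # distance from query to key
    scores = scores - slope * dist            # bias grows linearly with distance
    weights = 1 / (1 + np.exp(-scores))       # sigmoid: no cross-token normalization
    valid = (j <= i) & (j > i - window)       # causal sliding window
    weights = np.where(valid, weights, 0.0)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = sigmoid_swa(q, k, v, window=4, slope=0.5)
```

Note that because sigmoid weights are not normalized, the output magnitude depends on how many tokens fall inside the window; the position-dependent bias is what keeps the dense weights differentiated.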
Understanding Transformer’s Attention