Reza Hagel

New xLSTM paper introduces the following innovations

Among them is a matrix memory cell that works similarly to the attention used in transformers.



Problems with LSTMs:

- Inability to revise storage decisions

- Limited storage capacities

- Lack of parallelizability



Problems with transformers:

- Quadratic scaling of compute with sequence length





The xLSTM paper introduces the following innovations:



1. Exponential gating: Enables LSTMs to revise storage decisions more effectively. Exponential activation functions are used for the input and forget gates, the components of an LSTM that control the flow of information into and out of the memory cell (see the sketch after this list).



2. Memory mixing: Information from the hidden state vector is fed back into the memory cells through recurrent connections and gating mechanisms. With multiple heads, memory mixing happens within each head independently, contributing to the model's capacity for capturing complex dependencies.



3. Matrix Memory Cell: Instead of a scalar memory cell, a matrix memory cell of size d×d is introduced, allowing retrieval via matrix multiplication. This approach enables more efficient storage and retrieval of key-value pairs.



4. Covariance Update Rule: The matrix memory cell is updated with the outer product of the value and key vectors derived from the input, a covariance update rule that optimizes the separability of the retrieved binary vectors.
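
To make points 1 and 2 a bit more concrete, here is a minimal sketch of a scalar LSTM-style cell with exponential gating, a normalizer state, and recurrent memory mixing. It loosely follows the sLSTM idea from the paper, but the parameter names, shapes, initialization, and the exact log-space stabilization are simplifying assumptions rather than the paper's precise formulation.

```python
import numpy as np

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One step of a simplified sLSTM-style cell with exponential gating.

    Memory mixing comes from the recurrent matrices R[...], which feed the
    previous hidden state back into the cell input and all three gates; in
    a multi-head variant these matrices would be block-diagonal, so mixing
    stays within each head.
    """
    # Pre-activations: input projection + recurrent (memory-mixing) projection.
    z_tilde = W["z"] @ x + R["z"] @ h_prev + b["z"]   # cell input
    i_tilde = W["i"] @ x + R["i"] @ h_prev + b["i"]   # input gate
    f_tilde = W["f"] @ x + R["f"] @ h_prev + b["f"]   # forget gate
    o_tilde = W["o"] @ x + R["o"] @ h_prev + b["o"]   # output gate

    # Exponential input/forget gates, kept stable with a log-space state m
    # so the exponentials never overflow.
    m = np.maximum(f_tilde + m_prev, i_tilde)
    i_gate = np.exp(i_tilde - m)
    f_gate = np.exp(f_tilde + m_prev - m)
    o_gate = 1.0 / (1.0 + np.exp(-o_tilde))           # sigmoid output gate
    z = np.tanh(z_tilde)

    # Cell state plus a normalizer state tracking the accumulated gate mass;
    # a large input gate can effectively overwrite what was stored earlier,
    # which is what "revising storage decisions" refers to.
    c = f_gate * c_prev + i_gate * z
    n = f_gate * n_prev + i_gate
    h = o_gate * (c / n)
    return h, c, n, m

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8
W = {k: rng.normal(scale=0.1, size=(d_hidden, d_in)) for k in "zifo"}
R = {k: rng.normal(scale=0.1, size=(d_hidden, d_hidden)) for k in "zifo"}
b = {k: np.zeros(d_hidden) for k in "zifo"}

h = c = n = m = np.zeros(d_hidden)
for t in range(5):
    h, c, n, m = slstm_step(rng.normal(size=d_in), h, c, n, m, W, R, b)
print(h)
```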



The way the matrix memory cell works is a lot like how attention works in transformers. But unlike attention, it stays scalable because it relies on the LSTM's compressive properties, avoiding the quadratic scaling of transformers. Memory mixing is also similar to multi-head attention. It's a clever blend of transformer ideas into the LSTM setup.
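
To illustrate that resemblance, here is a minimal numerical sketch of points 3 and 4: a d×d matrix memory updated with outer products of value and key vectors (the covariance update rule) and read out with a query. The gates are fixed to 1 and the projections W_k, W_v, W_q are random placeholders, so treat this as an illustration of the mechanism rather than the paper's full mLSTM.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                               # the memory is a d x d matrix, not a scalar

# Illustrative (random) projections standing in for learned W_k, W_v, W_q.
W_k, W_v, W_q = (rng.normal(scale=1.0 / np.sqrt(d), size=(d, d)) for _ in range(3))

C = np.zeros((d, d))                 # matrix memory cell
n = np.zeros(d)                      # normalizer: running sum of stored keys

inputs = rng.normal(size=(6, d))
for x in inputs:
    k = W_k @ x / np.sqrt(d)         # key for this step
    v = W_v @ x                      # value to be stored
    # Covariance update rule: write the pair into memory as an outer product.
    # (Gates fixed to 1 here; the paper additionally scales both terms with
    # exponential input and forget gates.)
    C = C + np.outer(v, k)
    n = n + k

# Retrieval looks like (linear) attention: multiply a query against the
# accumulated memory and normalize the readout.
q = W_q @ inputs[2]
h = (C @ q) / max(abs(n @ q), 1.0)
print(h[:4])
```

Because the state stays a fixed d×d matrix regardless of sequence length, each step costs the same amount of compute, which is the compressive property contrasted above with the quadratic cost of full attention.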



Read the paper here: https://lnkd.in/eanUp9sS
