similar to the attention used in transformers.
Problems with LSTMs:
- Inability to revise storage decisions
- Limited storage capacities
- Lack of parallelizability
Problems with transformers:
- Quadratic scaling of compute and memory with sequence length
The xLSTM paper introduces the following innovations:
1. Exponential gating: Enables LSTMs to revise storage decisions more effectively. It involves using exponential activation functions for the input and forget gates, which are crucial components of LSTMs responsible for controlling the flow of information.
2. Memory mixing: Information from the previous hidden state is fed back into the memory cells through recurrent connections and the gating mechanisms. With multiple heads, memory mixing happens within each head independently, adding capacity for capturing complex dependencies (see the sLSTM-style sketch after this list).
3. Matrix Memory Cell: Instead of a scalar memory cell, a matrix memory cell of size d×d is introduced, allowing retrieval via matrix multiplication. This approach enables more efficient storage and retrieval of key-value pairs.
4. Covariance Update Rule: The matrix memory cell is updated with the outer product of the value and key vectors projected from the current input, weighted by the gates. This covariance-style update maximizes the separability of retrieved binary vectors.
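To make points 1 and 2 concrete, here is a minimal NumPy sketch of a scalar-memory (sLSTM-style) step, written from the paper's equations rather than its code: the input and forget gates use exponential activations with a running-max stabilizer, and the previous hidden state feeds every gate, which is the memory mixing. All weight names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, params):
    """One recurrence step of a scalar-memory (sLSTM-style) cell."""
    Wz, Wi, Wf, Wo, Rz, Ri, Rf, Ro, bz, bi, bf, bo = params

    # Pre-activations: memory mixing happens because the previous hidden
    # state h_prev feeds every gate through the recurrent matrices R*.
    z_tilde = Wz @ x + Rz @ h_prev + bz
    i_tilde = Wi @ x + Ri @ h_prev + bi
    f_tilde = Wf @ x + Rf @ h_prev + bf
    o_tilde = Wo @ x + Ro @ h_prev + bo

    # Exponential gates can overflow, so a running max m keeps the
    # exponentials numerically stable (a log-sum-exp-style trick).
    m = np.maximum(f_tilde + m_prev, i_tilde)
    i_gate = np.exp(i_tilde - m)           # exponential input gate
    f_gate = np.exp(f_tilde + m_prev - m)  # exponential forget gate
    o_gate = 1.0 / (1.0 + np.exp(-o_tilde))

    # Cell update plus a normalizer state n that tracks the accumulated gate
    # mass, so h stays bounded even though the exp gates are unbounded.
    c = f_gate * c_prev + i_gate * np.tanh(z_tilde)
    n = f_gate * n_prev + i_gate
    h = o_gate * (c / n)
    return h, c, n, m

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_in, d = 4, 8
params = ([rng.normal(scale=0.1, size=(d, d_in)) for _ in range(4)]
          + [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
          + [np.zeros(d) for _ in range(4)])
h, c, n, m = np.zeros(d), np.zeros(d), np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, d_in)):   # a short random input sequence
    h, c, n, m = slstm_step(x, h, c, n, m, params)
```

The normalizer state n is what lets the cell revise earlier storage decisions: a large exponential input gate can dominate what was stored before without the output blowing up.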
The matrix memory cell works a lot like attention in transformers: keys and values are stored, and a query retrieves them. But unlike attention, it keeps a fixed-size state that compresses the history, so compute and memory grow linearly with sequence length instead of quadratically. Memory mixing across multiple heads is likewise reminiscent of multi-head attention. It's a clever blend of transformer ideas into the LSTM setup.
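Here is a similarly simplified sketch of the matrix-memory (mLSTM-style) step from points 3 and 4, again with illustrative names only: the cell is a d×d matrix updated with the outer product of value and key (the covariance update), and retrieval is a matrix-vector product with a query, which is where the resemblance to attention shows up.

```python
import numpy as np

def mlstm_step(x, C_prev, n_prev, params):
    """One step of a matrix-memory (mLSTM-style) cell with scalar gates."""
    Wq, Wk, Wv, wi, wf = params
    d = Wq.shape[0]

    # Queries, keys and values are projections of the current input.
    q = Wq @ x
    k = (Wk @ x) / np.sqrt(d)
    v = Wv @ x

    # Scalar exponential input gate and sigmoid forget gate
    # (the stabilization trick is omitted to keep the sketch short).
    i_gate = np.exp(wi @ x)
    f_gate = 1.0 / (1.0 + np.exp(-(wf @ x)))

    # Covariance update rule: store the key-value pair as an outer product
    # inside the d x d matrix memory.
    C = f_gate * C_prev + i_gate * np.outer(v, k)
    # Normalizer accumulates the gated keys for bounded retrieval.
    n = f_gate * n_prev + i_gate * k

    # Retrieval via matrix multiplication with the query -- the
    # attention-like part -- normalized so the readout stays bounded.
    h = (C @ q) / max(abs(n @ q), 1.0)
    return h, C, n

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_in, d = 4, 8
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d_in)) for _ in range(3))
wi, wf = rng.normal(scale=0.1, size=d_in), rng.normal(scale=0.1, size=d_in)
C, n = np.zeros((d, d)), np.zeros(d)
for x in rng.normal(size=(5, d_in)):
    h, C, n = mlstm_step(x, C, n, (Wq, Wk, Wv, wi, wf))
```

Because nothing in this step depends on the previous hidden state (no memory mixing), the mLSTM variant can be computed in parallel across the sequence, which addresses the parallelizability problem listed above; the sLSTM variant keeps memory mixing and stays sequential.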
Read the paper here: https://lnkd.in/eanUp9sS