Scaling neural networks has been popular in recent years. Several potent deep networks are produced with the depth largely increased for exponential expressivity. Then, the hidden dimension is effectively expanded using sparse MoE models and model parallelism techniques. As the last atomic dimension of the neural network, the sequence length should be as long as possible. There are several benefits when the sequence length restriction is removed. First, it gives models a sizable memory and receptive field, enabling them to interact with people and the outside environment. Second, lengthier contexts include more intricate causal chains and thought processes, which models