We argue that the need to represent and propagate lexical features in each layer limits the transformer’s capacity for learning and representing contextual information. To alleviate this bottleneck, we introduce gated shortcut connections between the embedding layer and each subsequent layer within the encoder and decoder. These connections enable the model to access relevant lexical content dynamically, without expending limited capacity on storing it within intermediate states.
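
As an illustration of the idea, the following minimal sketch shows one way a gated shortcut from the embedding layer to a later layer could be computed. The specific gate form (a sigmoid over a linear function of both inputs, followed by an element-wise convex combination) and all names (`gated_shortcut`, `w_e`, `w_h`, `bias`) are assumptions made for this example and are not taken from the text above; where in the layer the shortcut is applied is likewise left open.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_shortcut(embeddings, hidden, w_e, w_h, bias):
    """Blend embedding-layer output into a later layer's hidden states.

    Assumed formulation (illustrative only):
        gate  = sigmoid(embeddings @ w_e + hidden @ w_h + bias)
        mixed = gate * embeddings + (1 - gate) * hidden
    Shapes: embeddings, hidden are (seq_len, d_model);
            w_e, w_h are (d_model, d_model); bias is (d_model,).
    """
    gate = sigmoid(embeddings @ w_e + hidden @ w_h + bias)
    # Gate near 1 passes lexical (embedding) features through;
    # gate near 0 keeps the layer's own contextual representation.
    return gate * embeddings + (1.0 - gate) * hidden


# Toy inputs with hypothetical sizes.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
embeddings = rng.normal(size=(seq_len, d_model))
hidden = rng.normal(size=(seq_len, d_model))
w_e = rng.normal(scale=0.1, size=(d_model, d_model))
w_h = rng.normal(scale=0.1, size=(d_model, d_model))
bias = np.zeros(d_model)

mixed = gated_shortcut(embeddings, hidden, w_e, w_h, bias)
```

Because the gate is computed per position and per feature, each layer can decide how much lexical content to read from the embeddings rather than carrying it forward in its own hidden states.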