Generative Modeling with Sparse Transformers
We’ve developed the Sparse Transformer, a deep neural network that sets new records at predicting what comes next in a sequence, whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30x longer than previously possible.
One existing challenge in AI research is modeling long-range, subtle interdependencies in complex data like images, videos, or sounds. The Sparse Transformer incorporates an O(N \sqrt{N}) reformulation of the O(N^2) Transformer self-attention mechanism, along with several other improvements, to apply it directly to these rich data types. Previously, models used on these data were specifically crafted for one domain or were difficult to scale to sequences more than a few thousand elements long. In contrast, our network can model sequences with tens of thousands of elements using hundreds of layers, achieving state-of-the-art performance across multiple domains. At OpenAI, we’re using it to help us build AI systems that possess a greater ability to understand the world.
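To make the complexity reduction concrete, here is a minimal NumPy sketch rather than the released implementation: it builds a strided sparsity mask (a local window plus every stride-th earlier position, with stride ≈ √N) and applies it to ordinary masked attention by blocking disallowed scores. The function names and the combination of both patterns into a single dense mask are illustrative assumptions; the actual Sparse Transformer splits such patterns across attention heads and computes them with efficient GPU kernels instead of materializing an N×N mask.

```python
import numpy as np

def strided_sparse_mask(n, stride):
    """Boolean causal mask combining a local window (previous `stride`
    positions) with a strided pattern (every `stride`-th earlier position).
    With stride ~ sqrt(N), each row allows O(sqrt(N)) positions, so the
    total number of attention connections is O(N * sqrt(N))."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # local window: positions i - stride + 1 ... i
        mask[i, max(0, i - stride + 1):i + 1] = True
        # strided positions: columns j <= i with j % stride == stride - 1
        cols = np.arange(i + 1)
        mask[i, cols[cols % stride == stride - 1]] = True
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to positions allowed by `mask`.
    q, k, v: (n, d) arrays; mask: (n, n) boolean."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)      # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    n, d = 64, 16
    stride = int(np.sqrt(n))                   # stride ≈ sqrt(N)
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    mask = strided_sparse_mask(n, stride)
    out = masked_attention(q, k, v, mask)
    print(out.shape, f"{mask.sum()} of {n * n} connections used")
```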