The document introduces the Transformer, a novel neural network architecture for sequence transduction that relies solely on attention mechanisms. Previously, the dominant models were recurrent or convolutional encoder-decoder networks, often augmented with attention. By dispensing with recurrence and convolutions entirely, the Transformer achieves higher translation quality, allows substantially more parallelization, and trains in significantly less time.
Key Innovations:
- Attention-Based Architecture: The Transformer uses only attention mechanisms to draw global dependencies between input and output.
- Parallelization: It allows for significantly more parallelization during training.
- Multi-Head Attention: Employs multiple attention heads to capture information from different representation subspaces.
- Scaled Dot-Product Attention: An attention function that divides the query-key dot products by √dk, preventing large values from pushing the softmax into regions where its gradients are extremely small (see the sketch after this list).
- Positional Encoding: Adds sinusoidal positional encodings to the token embeddings so the model can make use of the order of tokens in a sequence (a second sketch follows the list).
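
As a concrete illustration of the scaled dot-product attention item above, here is a minimal NumPy sketch; the function name, toy shapes, and the -1e9 masking trick are illustrative choices rather than the authors' reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Scale the dot products by 1/sqrt(d_k) so large magnitudes do not
    # push the softmax into regions with vanishingly small gradients.
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions (e.g. future tokens in the decoder) get a
        # large negative score so their softmax weight is ~0.
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy example: 4 query positions attending over 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Multi-head attention runs several such attention functions in parallel on separately learned linear projections of the queries, keys, and values, then concatenates the per-head outputs and projects them back to the model dimension.

The sinusoidal positional encodings follow PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sketch below generates them, with the function name and example shapes being illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) table of sine/cosine positional encodings."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

These encodings are simply added to the input embeddings at the bottoms of the encoder and decoder stacks.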
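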
Performance Highlights:
- Achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, surpassing existing results by over 2 BLEU.
- Attained a new single-model state-of-the-art BLEU score of 41.8 on the WMT 2014 English-to-French translation task after training for 3.5 days on eight GPUs.
- Generalizes well to other tasks, as demonstrated by its successful application to English constituency parsing.
Model Architecture Details:
- The Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder.
- The encoder and decoder each consist of a stack of N = 6 identical layers.
- Residual connections and layer normalization are applied around each sub-layer (see the sketch after this list).
- The model uses learned embeddings to convert input and output tokens to vectors and shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation.
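
A minimal PyTorch sketch of the residual-plus-layer-normalization pattern around a sub-layer, shown here wrapping the position-wise feed-forward network; the class names, dropout placement, and example shapes are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Add the sub-layer output to its input, then normalize.
        return self.norm(x + self.dropout(sublayer(x)))

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

x = torch.randn(2, 10, 512)        # (batch, sequence length, d_model)
block = SublayerConnection(d_model=512)
ffn = PositionwiseFeedForward()
y = block(x, ffn)                  # residual + layer norm around the feed-forward sub-layer
print(y.shape)                     # torch.Size([2, 10, 512])
```

In a full encoder layer, the same wrapper is applied twice: once around the multi-head self-attention sub-layer and once around the feed-forward sub-layer.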
Overall, the Transformer architecture presents a significant advancement in sequence transduction, offering improvements in quality, parallelization, and training efficiency compared to traditional recurrent and convolutional models.