Seq-to-Seq Transformer Model from Scratch in PyTorch

This project implements a Transformer model from scratch using PyTorch. It includes a training script to train the model on a custom dataset and supports distributed training.

Features

Transformer model implementation from scratch
Customizable model parameters
Training script to train the model
Easy-to-follow code structure

Installation

To install the required dependencies, run:

pip install -r requirements.txt

Training

To train the Transformer model, run the training script with a configuration file. The configuration file allows you to specify all necessary parameters for training.

You can use the config.json file where you provide the following parameters:

train_data: Path to the training dataset. It can be on your local machine or on HuggingFace.
epochs: Number of training epochs.
batch_size: Batch size for training.
learning_rate: Learning rate for the optimizer.
valid_data: Path to the validation dataset.
model_dim: Dimensionality of the model.
hidden_dims: Dimensionality of the hidden layers.
num_heads: Number of attention heads.
num_layers: Number of encoder/decoder layers.
dropout: Dropout rate.
max_seq_length: Maximum sequence length.

The provided training script is set up for language translation using the wmt14 dataset and the tokenizer of the Helsinki-NLP/opus-mt-en-de model.

Distributed Training

The training script supports distributed training using Fully Sharded Data Parallel (FSDP) with Hugging Face's accelerate library. This allows you to efficiently train large models across multiple GPUs or nodes.

The settings for accelerate are found in accelerate_config.yaml. To start the training you can use the CLI command:

accelerate launch --config-file accelerate_config.yaml train.py

Model Architecture

The Transformer model consists of an encoder and a decoder.

Encoder: Processes the input sequence and produces a representation of it.
Decoder: Takes the encoder's representation and generates the output sequence.

Encoder

The Encoder in a Transformer model is responsible for processing the input sequence and generating a contextualized representation of it. The Encoder consists of several layers, each containing two main sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Each sub-layer is followed by layer normalization and a residual connection.

Components of the Encoder

Input Embedding:
- Converts token indices into dense vectors of fixed size (model dimension).
- Includes positional encoding to provide information about the position of each token in the sequence.
Multi-Head Self-Attention Mechanism:
- Self-Attention: Allows the model to focus on different parts of the input sequence, capturing dependencies between words regardless of their distance.
- Multi-Head: Multiple attention mechanisms run in parallel (heads), each with its own set of parameters, enabling the model to capture various relationships within the input.
Position-Wise Feed-Forward Network:
- A two-layer fully connected network applied independently to each position, with a non-linear activation function (GELU) in between.
- Applies a non-linear transformation to the attention outputs.
Layer Normalization and Residual Connections:
- Layer Normalization: Normalizes the output of each sub-layer to stabilize and accelerate training.
- Residual Connections: Adds the input of the sub-layer to its output, ensuring the retention of original information and aiding gradient flow during backpropagation.
Dropout:
- A regularization technique to prevent overfitting by randomly setting a fraction of input units to zero during training.

Decoder

The Decoder in a Transformer model generates the output sequence based on the encoded representation of the input sequence and the previously generated tokens. The Decoder consists of several layers, each containing three main sub-layers: a masked multi-head self-attention mechanism, a multi-head attention mechanism over the encoder's output, and a position-wise fully connected feed-forward network. Each sub-layer is followed by layer normalization and a residual connection.

Components of the Decoder

Input Embedding:
- Converts token indices into dense vectors of fixed size (model dimension).
- Includes positional encoding to provide information about the position of each token in the sequence.
Masked Multi-Head Self-Attention Mechanism:
- Self-Attention: Allows the model to focus on different parts of the decoder input sequence, capturing dependencies between words regardless of their distance.
- Masked: Ensures that the prediction for a particular position depends only on the known outputs at preceding positions, preventing information leakage.
Multi-Head Attention Mechanism over Encoder Output:
- Uses the encoder's output as keys and values and the decoder's output as queries to focus on relevant parts of the input sequence.
Position-Wise Feed-Forward Network:
- A two-layer fully connected network applied independently to each position, with a non-linear activation function (GELU) in between.
- Applies a non-linear transformation to the attention outputs.
Layer Normalization and Residual Connections:
- Layer Normalization: Normalizes the output of each sub-layer to stabilize and accelerate training.
- Residual Connections: Adds the input of the sub-layer to its output, ensuring the retention of original information and aiding gradient flow during backpropagation.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
model		model
tests		tests
utils		utils
.gitignore		.gitignore
README.md		README.md
accelerate_config.yaml		accelerate_config.yaml
config.json		config.json
pytest.ini		pytest.ini
requirements.txt		requirements.txt
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Seq-to-Seq Transformer Model from Scratch in PyTorch

Table of Contents

Features

Installation

Training

Distributed Training

Model Architecture

Encoder

Components of the Encoder

Decoder

Components of the Decoder

About

Releases

Packages

Languages

BGabor01/Transformer

Folders and files

Latest commit

History

Repository files navigation

Seq-to-Seq Transformer Model from Scratch in PyTorch

Table of Contents

Features

Installation

Training

Distributed Training

Model Architecture

Encoder

Components of the Encoder

Decoder

Components of the Decoder

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages