Super-transformers

Important
This is not a formal implementation of the referenced papers. Instead, it's a simple experiment to explore these new ideas and test them out for fun.

Overview

Super-transformers is an experimental implementation that explores modifications to the standard GPT-2 transformer architecture. This project combines three key architectural innovations:

  • Scalable Softmax: A modified attention mechanism that replaces the standard softmax with Scalable-Softmax (SSMax), scaling the attention logits with the context length to potentially improve performance on longer sequences (see the sketch after this list).

  • Positional Encoding Ablation: Exploring the model's capability to learn without explicit positional encodings, investigating whether contextual information alone is sufficient for sequence modeling.

  • Super Weights for Pruning: Dynamic weight pruning that can be applied at inference time to reduce computational overhead without significant performance degradation (also sketched below).
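
For reference, here is a minimal sketch of SSMax attention in PyTorch, following the logit-scaling idea from the cited paper: multiply the attention logits by s * log(n), where n is the context length. The scalar s is learnable per head in the paper and is fixed here for brevity; this is an illustration, not the repository's exact implementation.

import math
import torch

def ssmax_attention(q, k, v, s: float = 1.0):
    """Causal attention with Scalable-Softmax (SSMax) logit scaling.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    n = k.size(-2)                                        # context length
    logits = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    logits = logits * (s * math.log(n))                   # SSMax: scale logits by s * log(n)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    logits = logits.masked_fill(~causal, float("-inf"))   # standard causal mask
    return torch.softmax(logits, dim=-1) @ v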
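
And a rough sketch of inference-time magnitude pruning that protects a hand-picked set of "super weight" coordinates, in the spirit of the cited super-weight paper. The layer shape, sparsity level, and coordinates below are made up for illustration; the repository's pruning code may differ.

import torch
import torch.nn as nn

def prune_preserving_super_weights(linear: nn.Linear,
                                   super_coords: list,
                                   sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights in place, never touching
    the listed (row, col) super-weight coordinates."""
    w = linear.weight.data
    k = max(1, int(w.numel() * sparsity))                 # number of weights to drop
    threshold = w.abs().flatten().kthvalue(k).values      # magnitude cutoff
    mask = w.abs() > threshold                            # keep weights above the cutoff
    for r, c in super_coords:                             # always keep super weights
        mask[r, c] = True
    w.mul_(mask.to(w.dtype))

# Example: prune a hypothetical MLP projection while keeping one super weight.
layer = nn.Linear(768, 3072)
prune_preserving_super_weights(layer, super_coords=[(42, 7)], sparsity=0.5)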

Experimental Results

Our preliminary experiments compare training and validation loss between:

  • A model using scalable softmax without positional encodings
  • A model using standard softmax with positional encodings

Figure 1: Comparison of training and validation losses between the two architectural configurations.

Getting Started

git clone https://github.com/dame-cell/Super-transformers.git
cd Super-transformers
pip install -r requirements.txt 
cd src

Dataset

This project uses the fineweb-small dataset from Hugging Face. The preprocessing script handles downloading and preparing the data.

python3 data.py \
    --max_length 1024 \
    --sample_size 100000 \
    --data_name eliplutchok/fineweb-small-sample \
    --stride 256 \
    --split_ratio 0.9 \
    --batch_size 1000 \
    --output_dir .
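
One plausible reading of the --max_length and --stride flags is a sliding-window chunking step: tokenize the corpus, then slice it into overlapping windows of max_length tokens, stepping forward by stride. A hypothetical sketch of that step (the actual data.py may differ):

import tiktoken

def chunk_tokens(text: str, max_length: int = 1024, stride: int = 256):
    """Split tokenized text into overlapping windows (illustrative only)."""
    enc = tiktoken.get_encoding("gpt2")                   # GPT-2 BPE tokenizer
    ids = enc.encode(text)
    return [ids[start:start + max_length]
            for start in range(0, max(1, len(ids) - max_length + 1), stride)]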

Training

python3 train.py \
    --train_data path_to_train_data \
    --test_data path_to_test_data \
    --size small \
    --ssmax True \
    --use_pos_enc False \
    --wandb False \
    --batch_size 2 \
    --generating_step 2000 \
    --validation_step 1000 \
    --save_model 1000 \
    --epoch 1

Inference
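
A trained checkpoint can be sampled with a standard autoregressive generation loop. The sketch below is illustrative: it assumes the model is a causal LM that returns raw next-token logits of shape (batch, seq_len, vocab_size), and it is not the repository's actual inference script.

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=100, max_length=1024, temperature=1.0, top_k=50):
    """Top-k sampling from a causal LM (illustrative only)."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_length:]                   # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature  # last-position logits
        if top_k is not None:
            k = min(top_k, logits.size(-1))
            v, _ = torch.topk(logits, k)
            logits[logits < v[:, [-1]]] = float("-inf")   # keep only the top-k tokens
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)            # append the sampled token
    return idx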

Citations

@article{Nakanishi2025ScalableSoftmax,
    title   = {Scalable-Softmax Is Superior for Attention},
    author  = {Ken M. Nakanishi},
    year    = {2025},
    url     = {https://arxiv.org/pdf/2501.19399}
}

@article{Irie2024WhyAP,
    title   = {Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph},
    author  = {Kazuki Irie},
    year    = {2024},
    url     = {https://arxiv.org/pdf/2501.00659}
}

@article{Yu2024TheSW,
    title   = {The Super Weight in Large Language Models},
    author  = {Mengxia Yu and De Wang and Qi Shan and Colorado Reed and Alvin Wan},
    year    = {2024},
    url     = {https://arxiv.org/pdf/2411.07191v1}
}
