A lightweight modular framework for language modeling in PyTorch.
Mini is a lightweight modular framework for pre-training, fine-tuning, and evaluating transformer-based language models. It is designed to support sequence-to-sequence learning tasks such as next-word prediction, instruction tuning, and text generation.
- Pre-training on structured & unstructured text
- Fine-tuning for instruction-based tasks (coming soon)
- Evaluation with perplexity and future metric support (BLEU, ROUGE, etc.)
- Easy CLI interface for training, inference, and evaluation
- Support for model checkpointing & resuming
- Optimized training with RMSNorm, SiLU activation, and RoPE Attention
Mini implements both classical and state-of-the-art transformer architectures with a simplified, efficient design:
- PositionalEncoding - Adds positional information to input tokens.
- BertEncoding - Uses residual learning to improve positional encoding.
- LinearEncoding - Uses linear transformations for positional encoding.
- RotaryEncoding - Uses rotary positional encoding for efficient computation.
- PositionalEmbedding - Embeds input tokens with positional information.
- BertEmbedding - Uses residual learning to improve embedding.
- LinearEmbedding - Uses linear transformations for embedding.
- SelfAttention - Computes self-attention between tokens.
- RotaryAttention - Uses rotary positional encoding for efficient computation.
- LayerNorm - Normalizes across the last dimension.
- RMSNorm - Normalizes activations using the root mean square.
- PositionWiseFeedForward - Applies feed-forward transformations to each token.
- GatedFeedForward - Applies feed-forward transformations with gating.
- PositionWiseBlock - Combines self-attention and feed-forward layers.
- GatedBlock - Combines rotary self-attention and feed-forward layers with gating.
Current implementations focus on position-wise and gated architectures. The goal is to provide a flexible and efficient framework for building transformer-based models. Mini includes a variety of components and modules that allow for easy experimentation and customization.
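To make the component list above concrete, here is a minimal, self-contained sketch of two of the listed building blocks: RMSNorm and a SiLU-gated feed-forward layer. The class and parameter names below are illustrative assumptions and do not necessarily match Mini's internal API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Normalizes activations by their root mean square (no mean subtraction)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x^2) + eps), scaled by a learned per-dimension weight
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class GatedFeedForward(nn.Module):
    """SwiGLU-style feed-forward: SiLU(x W_gate) * (x W_up), projected back down."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


if __name__ == "__main__":
    x = torch.randn(2, 8, 64)  # (batch, sequence, embedding)
    y = GatedFeedForward(64, 256)(RMSNorm(64)(x))
    print(y.shape)  # torch.Size([2, 8, 64])
```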
git clone https://github.com/teleprint-me/mini.git
cd mini
python3.12 -m venv .venv
source .venv/bin/activate
- CPU
pip install torch --index-url https://download.pytorch.org/whl/cpu
- CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu126
- ROCm
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2.4
pip install -r requirements.txt
sha256sum -c sha256sum.txt
Expected output:
models/tokenizer.model: OK
data/mini-fairy.md: OK
data/mini-owl.md: OK
data/mini-owl-fairy.md: OK
Check the dataset character count.
wc -c data/mini-owl.md
Expected output:
1078 data/mini-owl.md
Train a model from scratch on a dataset:
python -m mini.cli.train \
--processor models/tokenizer.model \
--model models/misty-owl.pth \
--dataset data/mini-owl.md \
--architecture misty \
--num-epochs 10 \
--batch-size 2 \
--batch-stride 8 \
--lr 1e-4 \
--optimizer adamw \
--scheduler none \
--criterion cross_entropy \
--verbose
- Parameters:
  - `--processor models/tokenizer.model`: Path to the tokenizer model.
  - `--model models/misty-owl.pth`: Path to the model state.
  - `--dataset data/mini-owl.md`: Path to the dataset.
  - `--architecture misty`: Model architecture.
  - `--num-epochs 10`: Number of training epochs.
  - `--batch-size 2`: Batch size.
  - `--batch-stride 8`: Sequence length per sub-batch (see the sketch below).
  - `--lr 1e-4`: Learning rate.
  - `--optimizer adamw`: Optimizer algorithm.
  - `--scheduler none`: Learning rate scheduler.
  - `--criterion cross_entropy`: Loss function.
  - `--verbose`: Enable verbosity for debugging.
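As a rough illustration of how `--batch-size` and `--batch-stride` relate, the sketch below splits a token stream into blocks of `batch_size × batch_stride` tokens. It assumes a simple non-overlapping windowing scheme and is not necessarily how Mini's dataset loader works.

```python
import torch


def chunk_tokens(tokens: list[int], batch_size: int, batch_stride: int) -> list[torch.Tensor]:
    """Split a token stream into (batch_size, batch_stride) blocks.

    Illustrative only: the real loader may pad, overlap, or shuffle differently.
    """
    window = batch_size * batch_stride
    batches = []
    for start in range(0, len(tokens) - window + 1, window):
        block = torch.tensor(tokens[start : start + window])
        batches.append(block.view(batch_size, batch_stride))
    return batches


# Example: each training step sees 2 sub-batches of 8 tokens each.
batches = chunk_tokens(list(range(64)), batch_size=2, batch_stride=8)
print(len(batches), batches[0].shape)  # 4 torch.Size([2, 8])
```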
NOTE: Any plaintext file will work. `mini-owl.md` is used for isolated and controlled experimentation. See training.md for more information.
Run inference on a trained model:
python -m mini.cli.infer \
--processor models/tokenizer.model \
--model models/misty-owl.pth \
--temperature 0.5 \
--prompt "The young bird listened"
- Parameters:
  - `--processor models/tokenizer.model`: Path to the tokenizer model.
  - `--model models/misty-owl.pth`: Path to the model state.
  - `--temperature 0.5`: Temperature for sampling. Lower values make the output more deterministic; higher values make it more diverse (see the sketch below).
  - `--prompt "The young bird listened"`: Input sequence for inference.
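The sketch below shows the usual way temperature scaling works during sampling: logits are divided by the temperature before the softmax, so lower values sharpen the distribution and higher values flatten it. This is a generic illustration, not necessarily Mini's exact decoding loop.

```python
import torch


def sample_next_token(logits: torch.Tensor, temperature: float = 0.5) -> int:
    """Sample one token id from a logits vector with temperature scaling.

    Lower temperature -> sharper distribution (more deterministic);
    higher temperature -> flatter distribution (more diverse).
    """
    if temperature <= 0:
        return int(torch.argmax(logits))  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))


logits = torch.tensor([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.5))
```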
Fine-tune on an instruction-based dataset.
# Placeholder for fine-tuning command
Evaluate model performance with perplexity, BLEU, and other metrics.
# Placeholder for evaluation script
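Until the evaluation script lands, the sketch below shows how perplexity is typically computed: the exponential of the mean next-token cross-entropy loss. The `model` call and tensor shapes are assumptions for illustration, not Mini's actual evaluation API.

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def perplexity(model, token_ids: torch.Tensor) -> float:
    """Compute perplexity = exp(mean cross-entropy) for next-token prediction.

    token_ids: (batch, seq_len) tensor; the model is assumed to return
    logits of shape (batch, seq_len, vocab_size).
    """
    logits = model(token_ids)
    # Predict token t+1 from positions up to t.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = token_ids[:, 1:].reshape(-1)
    loss = F.cross_entropy(shift_logits, shift_labels)
    return math.exp(loss.item())
```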
- Pre-training on custom datasets
- Inference support for text generation
- Fine-tuning for instruction-based tasks (up next! 🚀)
- Evaluation with additional NLP metrics
- Distributed training & performance optimizations
This project is licensed under AGPL-3.0. See the LICENSE file for details.
Contributions are welcome! If you have ideas or improvements, feel free to open an issue or submit a pull request. Be sure to follow the Code of Conduct.
If you find this project useful and would like to support its continued development, consider making a donation. Your support is greatly appreciated!