Mini is a language modeling framework built with PyTorch, designed to support modern transformer architectures. This guide walks through the process of training a model from scratch using a simple plaintext dataset.
Before training a model, ensure you have the required dependencies installed:
```bash
pip install -r requirements.txt
```
Mini provides a command-line tool (`train.py`) to manage training runs. The tool is highly configurable, supporting various optimizers, loss functions, and scheduling options.
For this tutorial, we use the `data/mini-owl.md` dataset: a tiny, synthetic short story distilled from a larger parent model.
- Size: ~1078 characters
- Purpose: Demonstrates model convergence on small datasets
- Impact: Model parameters must align with dataset size to avoid underfitting or overfitting.
A small dataset like this limits generalization but allows quick iterations to observe learning patterns.
Mini uses SentencePiece tokenization. The framework includes a pre-trained `tokenizer.model` from Mistral, allowing immediate use without training a custom tokenizer.
Why SentencePiece?

- Subword-based for flexibility
- No need for custom tokenization pretraining
- Works efficiently with small and large datasets
```bash
# Tokenizing a sample input
python -m mini.cli.tokenizer --model models/tokenizer.model --input data/mini-owl.md
```
This command will generate a tokenized version of the dataset, which is required for training.
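If you want to inspect the tokenization from Python, the `sentencepiece` package can load the same model directly. This is a minimal exploration sketch (it assumes `sentencepiece` is installed), not part of Mini's CLI:

```python
# Minimal sketch: inspect the tokenizer directly with the sentencepiece package.
# Assumes `pip install sentencepiece`; paths mirror the CLI command above.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="models/tokenizer.model")

with open("data/mini-owl.md", "r", encoding="utf-8") as f:
    text = f.read()

ids = sp.encode(text, out_type=int)   # token ids for the whole dataset
print("vocab size:", sp.vocab_size())
print("total tokens:", len(ids))      # total token count of the dataset
print("first tokens:", ids[:10])
```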
The tokenizer determines vocabulary size, token distribution, and sequence length, all of which impact model efficiency and learning stability. The max sequence length (`--max-seq-len`) should be set appropriately for the dataset to ensure that tokenized sequences fit within the model's attention window.
Mini's training pipeline processes data in batched input-target pairs, structured as follows:
- `--max-seq-len`: Defines the length of each tokenized sequence. Shorter sequences are padded.
- `--batch-size`: The number of sequences included in a training batch.
- `--batch-stride`: The step size for shifting sequences in the dataset to form overlapping training examples.
The batch stride (`--batch-stride`) determines how much overlap exists between training examples by shifting the input sequence by a defined number of tokens. This works like a sliding window, where overlapping sequences allow for better gradient updates while training.
Example Behavior: Given a tokenized dataset with 807 tokens, `--max-seq-len 128`, and `--batch-stride 8`:
- The first batch will take tokens 1 → 128.
- The second batch will take tokens 9 → 136.
- The third batch will take tokens 17 → 144.
- This continues until all tokens are processed.
This setup allows the model to capture more contextual dependencies instead of treating batches as fully independent sequences.
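To make the sliding-window behavior concrete, here is a short illustrative sketch in plain Python. It mirrors the example above but is not Mini's internal implementation; `token_ids` is a stand-in for the tokenized dataset:

```python
# Illustrative sketch of the sliding-window idea (not Mini's internals).
def sliding_windows(token_ids, max_seq_len=128, batch_stride=8):
    """Collect overlapping windows of max_seq_len tokens, shifted by batch_stride."""
    return [
        token_ids[start : start + max_seq_len]
        for start in range(0, len(token_ids) - max_seq_len + 1, batch_stride)
    ]

token_ids = list(range(807))            # stand-in for the 807-token dataset
windows = sliding_windows(token_ids)
print(len(windows))                     # 85 overlapping training examples
print(windows[0][:3], windows[1][:3])   # [0, 1, 2] [8, 9, 10] -- shifted by the stride
```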
The training batches are stored as PyTorch tensors in the shape `[B, T]`:

- `B` = batch size (number of sequences per batch)
- `T` = sequence length (defined by `--max-seq-len`)
Example: With `--batch-size 2`, `--batch-stride 8`, and `--max-seq-len 128`, the dataset will generate multiple overlapping sequences for better learning.
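As a rough illustration of the `[B, T]` layout, the sketch below stacks two overlapping windows into a PyTorch tensor and forms shifted input-target pairs, a common way to set up next-token prediction. It is shown for intuition only and may differ from how Mini builds batches internally:

```python
# Illustration of the [B, T] layout (assumed batching scheme, not Mini's code).
import torch

max_seq_len, batch_size, batch_stride = 128, 2, 8
token_ids = list(range(807))                    # stand-in tokenized dataset

windows = [
    token_ids[start : start + max_seq_len]
    for start in range(0, len(token_ids) - max_seq_len + 1, batch_stride)
]

batch = torch.tensor(windows[:batch_size])      # shape [B, T] = [2, 128]
inputs, targets = batch[:, :-1], batch[:, 1:]   # targets are inputs shifted by one token
print(batch.shape, inputs.shape, targets.shape)
```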
Since batch settings depend on dataset size, choosing a stride that is too large may lead to non-overlapping sequences, limiting training efficiency. A smaller stride (e.g., `max-seq-len / 2`) is often a good default.
Model hyperparameters directly affect how well the model converges. The dataset size, tokenizer, and model architecture all influence:
- Learning Rate (`--lr`): Too high → exploding loss; too low → slow convergence.
- Batch Size (`--batch-size`): Limited by memory constraints.
- Optimizer (`--optimizer`): Supports Adam, AdamW, SGD.
- Loss Function (`--criterion`): Choose between MSE, MAE, CE.
Example configuration:
```bash
python -m mini.cli.train \
  --dataset data/mini-owl.md \
  --processor models/tokenizer.model \
  --model models/mini.pt \
  --optimizer 'adamw' \
  --scheduler 'none' \
  --criterion 'cross_entropy' \
  --batch-size 2 --batch-stride 8 \
  --lr 1e-4 --epochs 100
```
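For intuition, the optimizer and criterion names above correspond to standard PyTorch components. The mapping below is an assumption about roughly what the flags select, not a copy of Mini's internals; the `nn.Linear` model is a placeholder:

```python
# Rough PyTorch equivalents of the CLI choices (illustrative assumption, not Mini's code).
import torch
import torch.nn as nn

model = nn.Linear(256, 32000)   # placeholder model standing in for the transformer

# --optimizer 'adamw' with --lr 1e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# --criterion 'cross_entropy' (the usual choice for next-token prediction)
criterion = nn.CrossEntropyLoss()

# Other options mentioned above map to:
#   Adam -> torch.optim.Adam, SGD -> torch.optim.SGD
#   MSE  -> nn.MSELoss,       MAE -> nn.L1Loss
```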
During training, monitor logs for:
- ✅ Steady loss decrease → Healthy training
Run the CLI tool to start training:
```bash
python -m mini.cli.train \
  --dataset data/mini-owl.md \
  --processor models/tokenizer.model \
  --model models/mini.pt \
  --batch-size 2 \
  --batch-stride 8 \
  --num-epochs 100 \
  --save-every 10 \
  --num-layers 4 \
  --num-heads 8 \
  --head-dim 16 \
  --embed-dim 256 \
  --lr 1e-4 \
  --scheduler none \
  --bias \
  --verbose
```
Here, we use a small number of layers and a moderate number of attention heads so that the model can generalize appropriately on the Mini Owl dataset. The head and embedding dimensions are kept conservative, sized to the expected model capacity.
Example Output:

```
[Epoch: 100/100] [Avg Loss: 1.8174] [Perplexity: 6.155758]
Training complete! Checkpoint saved: models/mini.pt
```
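Perplexity is conventionally the exponential of the average cross-entropy loss, which is consistent with the numbers in the example output:

```python
# Perplexity as exp(average cross-entropy loss); consistent with the log line above.
import math

avg_loss = 1.8174
print(math.exp(avg_loss))   # ~6.156, matching the reported perplexity
```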
Training speed depends heavily on whether you train on a GPU or a CPU; expect significantly longer runtimes on CPU. The device is determined automatically at runtime.
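A typical PyTorch device check looks like the sketch below; Mini handles this for you, so this is just for reference:

```python
# Typical PyTorch device selection (for reference; Mini selects the device automatically).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("training on:", device)
```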
- Vanishing gradients? Try increasing the learning rate.
- Exploding loss? Reduce learning rate or check batch size.
- Loss plateaus? Consider adjusting optimizer or dataset size.
- Poor convergence? Ensure dataset and model size are balanced.
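If the loss does explode, gradient clipping is a common general-purpose mitigation. The sketch below shows the standard PyTorch call inside a toy training step; whether Mini exposes clipping through its CLI is not covered here:

```python
# General PyTorch technique for taming exploding gradients (not a Mini-specific flag).
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                   # toy placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

x, y = torch.randn(4, 16), torch.randn(4, 16)
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before stepping
optimizer.step()
optimizer.zero_grad()
```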
After training, the next step is evaluation and inference, covered in separate guides. This tutorial focuses on training fundamentals to get a working model.
This guide provides a high-level overview of training with Mini. The CLI tool simplifies pretraining, and the choice of tokenizer, dataset, and hyperparameters all impact model convergence.
For more advanced topics like evaluation, fine-tuning, and inference, see the dedicated documentation.