The open-source materials for the paper *Sparsing Law: Towards Large Language Models with Greater Activation Sparsity* ([arXiv:2411.02335](https://arxiv.org/abs/2411.02335)).
Please kindly cite using the following BibTeX:
```bibtex
@article{luo2024sparsinglaw,
  title={{Sparsing Law}: Towards Large Language Models with Greater Activation Sparsity},
  author={Yuqi Luo and Chenyang Song and Xu Han and Yingfa Chen and Chaojun Xiao and Zhiyuan Liu and Maosong Sun},
  year={2024},
  journal={arXiv preprint arXiv:2411.02335},
  url={https://arxiv.org/pdf/2411.02335.pdf}
}
```
- Paper Abstract
- Repository Overview
- Requirements
- Sparsity Metrics
- Evaluation
- Model Links
## Paper Abstract

Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs), such as computation acceleration and model interpretability.
Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors.
In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs.
Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function.
## Repository Overview

To enhance the reproducibility of our work, we open-source the following materials in this GitHub repository:
- The pre-trained checkpoints (before the decay stage) mentioned in our paper, with five different scales (i.e., 0.1B, 0.2B, 0.4B, 0.8B, and 1.2B) as well as two distinct activation functions (i.e., ReLU and SiLU). These checkpoints are the ones most frequently used in our experimental analyses. You can download them from Huggingface.
- The codes for applying different sparsity metrics, including Top-$k$, FAT-$\epsilon$, and PPL-$p\%$, to pre-trained checkpoints. As it is not convenient for us to open-source the datasets, we provide corresponding demos to evaluate the sparsity level on an input toy dataset.
- Steps to evaluate the sparsely-activated checkpoints on task-specific benchmarks.
## Requirements

- python==3.10.8

Run `pip install -r requirements.txt` to install the required Python packages.
To run scripts, first copy the corresponding `*.sh` files into the `src` directory, and then appropriately set the environment variables (e.g., `model_path`, `input_file`, etc.).
## Sparsity Metrics

### Top-$k$ and FAT-$\epsilon$

In this section, we mainly introduce how to measure the activation sparsity of a single checkpoint using the two baseline sparsity metrics mentioned in Section 4.1 of our paper: Top-$k$ and FAT-$\epsilon$.
First, set the following variables in `calc_activation.sh`:
- `model_path`: The path to the model checkpoint folder, which can be loaded by Huggingface.
- `input_file`: The path to the text file serving as the input toy dataset of the model.
- `prune_strategy`: Must be one of the following (see the sketch after this list):
  - "fat": corresponding to FAT-$\epsilon$ sparsity.
  - "topk": corresponding to Top-$k$ sparsity.
- `prune_arg`: A float value representing:
  - The threshold $\epsilon$ for "fat".
  - The fixed ratio of activated neurons in each layer for "topk", which means that the value of $k$ in Top-$k$ equals `prune_arg * ffn_intermediate_dimension`.
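To make the two baseline metrics concrete, below is a minimal, hypothetical PyTorch sketch of how each strategy decides which intermediate FFN activations count as active. The function names and the toy tensor are illustrative assumptions and not part of the repository's code, whose implementation may differ.

```python
import torch

def fat_mask(acts: torch.Tensor, epsilon: float) -> torch.Tensor:
    # FAT-eps: an element counts as active only if its magnitude reaches epsilon.
    return acts.abs() >= epsilon

def topk_mask(acts: torch.Tensor, ratio: float) -> torch.Tensor:
    # Top-k: per token, keep only the k largest-magnitude elements, where
    # k = ratio * ffn_intermediate_dimension (cf. prune_arg above).
    k = max(1, int(ratio * acts.shape[-1]))
    kth = acts.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest magnitude per token
    return acts.abs() >= kth

# Toy activations with shape (num_tokens, ffn_intermediate_dimension).
acts = torch.randn(4, 4096)
print("FAT-eps sparsity:", 1.0 - fat_mask(acts, 0.1).float().mean().item())
print("Top-k sparsity:  ", 1.0 - topk_mask(acts, 0.1).float().mean().item())
```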
Finally, run the following command:
```bash
bash calc_activation.sh
```

The average activation and the PPL ratio (i.e., the increase in perplexity) will be printed to the screen, and an image of the visualized results will be saved to `outputs/`.
### PPL-$p\%$

In this section, we introduce how to apply the PPL-$p\%$ sparsity metric.
First, set the following variables in `calc_activation.sh`:
- `model_path`: The path to the model checkpoint folder, which can be loaded by Huggingface.
- `input_file`: The path to the text file serving as the input toy dataset of the model.
- `prune_strategy`: Must be "pplp".
- `prune_arg`: $1+p\%$ for PPL-$p\%$ sparsity (e.g., `prune_arg=1.01` for PPL-$1\%$ sparsity).
Finally, run the following command:
```bash
bash calc_activation.sh
```
In addition to the information mentioned in the previous subsection, the threshold for each layer will also be printed; these thresholds can be employed in the evaluation step below.
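For intuition only, the hypothetical sketch below shows one way a PPL-$p\%$-style constraint can drive a threshold search: keep enlarging the pruning threshold as long as the resulting perplexity stays within `prune_arg` (i.e., $1+p\%$) times the dense perplexity. The bisection procedure and the `eval_ppl` callback are assumptions made for illustration and do not reproduce the repository's actual implementation.

```python
from typing import Callable

def search_threshold(eval_ppl: Callable[[float], float], dense_ppl: float,
                     prune_arg: float, lo: float = 0.0, hi: float = 1.0,
                     iters: int = 20) -> float:
    """Bisection for the largest threshold whose PPL ratio stays below prune_arg."""
    # eval_ppl(t): perplexity of the model when activations with magnitude < t are zeroed.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if eval_ppl(mid) <= prune_arg * dense_ppl:
            lo = mid   # constraint still satisfied: try pruning more aggressively
        else:
            hi = mid   # perplexity degraded beyond 1 + p%: back off
    return lo
```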
The resulting activation ratios within our 0.8B ReLU-activated model, using three different sparsity metrics on the sample text, are shown in the figure below.
## Evaluation

In this section, we discuss how to evaluate a specific checkpoint (generally after the decay stage) on downstream benchmarks. This is intended to reproduce the evaluation scores reported in our paper, including Table 1 and Tables 4-6.
Notably, our evaluation is conducted under a specific PPL-$p\%$ sparsity setting, which relies on the layer-wise thresholds obtained in the previous section. We use the `sparsing_law` branch of UltraEval to evaluate our models. Remember to check out this branch rather than `main`.
To support inference under PPL-$p\%$ sparsity, the `FFNBlock` in the Huggingface codes of the model architecture should be modified as shown below. Note that the environment variable `thresholds` is obtained in the previous step, and `mode` is set to `sparse` for all PPL-$p\%$ evaluations.
```python
def forward(self, x: torch.Tensor):
    # Lazily derive the per-neuron threshold from the `thresholds` environment
    # variable (one value per layer), scaled by the column norms of down_proj.
    if os.environ.get('mode') == 'sparse' and self.threshold is None:
        with torch.no_grad():
            t = float(os.environ.get('thresholds').split(',')[self.layer_idx])
            out_norm = self.down_proj.weight.norm(dim=0)
            self.threshold = t / out_norm
    # Gated FFN: act_fn(gate_proj(x)) * up_proj(x)
    x = self.act_fn(self.gate_proj(x)) * self.up_proj(x)
    # Zero out weakly-contributed activations under the sparse mode.
    if os.environ.get('mode') == 'sparse':
        x[x.abs() < self.threshold] = 0.
    down_proj = self.down_proj(x)
    return down_proj
```
After appropriately setting the above two environment variables (e.g., by exporting `mode=sparse` and the comma-separated per-layer `thresholds` string printed by `calc_activation.sh` in the shell), just run the following commands for evaluation:
```bash
bash eval_entrance.sh /path/to/the/huggingface/model/ 8 piqa,siqa,hellaswag,winogrande,copa,boolq,agieval ppl
bash eval_entrance.sh /path/to/the/huggingface/model/ 8 humaneval,mbpp,lambada,tydiqa,gsm8k,mmlu,bbh gen
```
Finally, you can obtain the evaluation results under `/path/to/the/huggingface/model/eval_results/`.
## Model Links

You can download the checkpoints of our model from the links below.
| | ReLU-activated | SiLU-activated |
|---|---|---|
| 0.1B | 0.1b-relu | 0.1b-silu |
| 0.2B | 0.2b-relu | 0.2b-silu |
| 0.4B | 0.4b-relu | 0.4b-silu |
| 0.8B | 0.8b-relu | 0.8b-silu |
| 1.2B | 1.2b-relu | 1.2b-silu |
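As a quick start, here is a minimal sketch of loading one of the released checkpoints with the Huggingface `transformers` library. The model path below is a placeholder for a downloaded folder or hub id from the table above, and `trust_remote_code=True` is an assumption in case the checkpoints ship custom modeling code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/0.8b-relu"  # placeholder: local folder or hub id from the table above

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("Activation sparsity denotes", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```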