
How to accelerate the process of dictionary training in zstd? #4053

Closed
riyuejiuzhao opened this issue May 23, 2024 · 6 comments

@riyuejiuzhao

I am facing a time-consuming issue with zstd dictionary training when working with large datasets. The slow process has led me to search for ways to speed it up.

I would be grateful for any suggestions, code examples, or guidance on how to accelerate the dictionary training process using the zstd library. Although I want to use multithreading for dictionary training, I am unsure about how to implement it.

Thanks

@Cyan4973 Cyan4973 self-assigned this May 23, 2024
@Cyan4973
Contributor

-T0 will trigger multi-threading, scaling the number of working threads to the number of detected cores on the local system. It's active during training.

Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# command, and the trainer will randomize its selection up to the requested amount.

There are several dictionary trainers available, and --train-fastcover is the fastest one.
It's enabled by default, and also features multiple advanced parameters, some of which can impact speed in major ways.
Try --train-fastcover=accel=#, with # within [1,10]. It will trade accuracy for speed.
Other advanced parameters exist, but can be harder to understand and employ.
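
The flags above can be combined in a single invocation. A minimal sketch (the samples/ path and the sizes are illustrative, not from the thread):

```shell
# Train with the fastcover trainer, trading some accuracy for speed
# (accel=5), using all detected cores (-T0), and letting zstd randomly
# subsample the training input down to 256 MB (--memory=256MB).
# --maxdict caps the output dictionary size in bytes.
zstd --train-fastcover=accel=5 -T0 --memory=256MB \
     --maxdict=112640 -o trained.dict samples/*
```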

@riyuejiuzhao
Author

> -T0 will trigger multi-threading, scaling the number of working threads to the number of detected cores on the local system. It's active during training.
>
> Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# command, and the trainer will randomize its selection up to the requested amount.
>
> There are several dictionary trainers available, and --train-fastcover is the fastest one. It's enabled by default, and also features multiple advanced parameters, some of which can impact speed in major ways. Try --train-fastcover=accel=#, with # within [1,10]. It will trade accuracy for speed. Other advanced parameters exist, but can be harder to understand and employ.

Thank you very much for your help! I am actually using the Python interface of zstd to train dictionaries, and I tried setting the threads parameter, only to find that the training process entered optimization mode, which actually took even longer than the regular training.

I think the main issue is that the dataset is too large overall. I am curious about the principles behind dictionary training. Is it possible to split the entire dataset into smaller parts, train them separately, and then combine the results?

@Cyan4973
Contributor

> I think the main issue is that the dataset is too large overall. I am curious about the principles behind dictionary training. Is it possible to split the entire dataset into smaller parts, train them separately, and then combine the results?

Nope.

If your sample set is too large, your best option is to consider using --memory=# to limit the amount used for training.
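
For users of the Python bindings, which have no direct --memory=# equivalent, the same effect can be approximated by subsampling before training. A rough sketch in plain Python (function name and budget are illustrative):

```python
import random

def subsample(samples, budget_bytes, seed=0):
    """Randomly pick samples until their total size reaches budget_bytes,
    mimicking the effect of zstd's --memory=# cap on training input."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    picked, total = [], 0
    for s in shuffled:
        if total + len(s) > budget_bytes:
            break
        picked.append(s)
        total += len(s)
    return picked

samples = [bytes(100) for _ in range(1000)]   # 100 KB of dummy samples
subset = subsample(samples, budget_bytes=10_000)
print(len(subset))                            # at most the 10 KB budget
```

The returned subset can then be passed to the trainer in place of the full sample list.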

@riyuejiuzhao
Author

> Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# command, and the trainer will randomize its selection up to the requested amount.

Thank you. If I want to delve deeper into the specific principles of the dictionary training process, such as debugging the libzstd source code with gdb, are there any resources or references that you could recommend for me to consult?

@Cyan4973
Contributor

The source code itself points at a few resources, but beyond that, don't expect any tutorial to exist on the matter.
These algorithms are fairly complex and rare. There isn't a CS corpus of knowledge developed around this topic yet.

@xiehengjian

> -T0 will trigger multi-threading, scaling the number of working threads to the number of detected cores on the local system. It's active during training.
>
> Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# command, and the trainer will randomize its selection up to the requested amount.
>
> There are several dictionary trainers available, and --train-fastcover is the fastest one. It's enabled by default, and also features multiple advanced parameters, some of which can impact speed in major ways. Try --train-fastcover=accel=#, with # within [1,10]. It will trade accuracy for speed. Other advanced parameters exist, but can be harder to understand and employ.
>
> Thank you very much for your help! I am actually using the Python interface of zstd to train dictionaries, and I tried setting the threads parameter, only to find that the training process entered optimization mode, which actually took even longer than the regular training.
>
> I think the main issue is that the dataset is too large overall. I am curious about the principles behind dictionary training. Is it possible to split the entire dataset into smaller parts, train them separately, and then combine the results?

Perhaps splitting the dataset for training is feasible after all. Dictionary training itself seems to divide the input samples into multiple epochs and then select an optimal segment from each epoch.
