Oh, this is super interesting! Thanks for your investigation! I was actually somewhat optimistic that using So here's where the differences come from:
Once the last issue is fixed, I can release a version that can produce bit-identical images without too much effort.
-
Actually, as of v0.6.1
-
Suppose I run
tar --sort=name -cf folder.tar folder && mkdwarfs -i folder -o image.dwarfs --order=path
on two different computers at two different times, with different CPU architectures and OSes, but with the same version of mkdwarfs, the same shared libraries, and the same command-line arguments. If the .tar files on both computers end up bit-identical (meaning file contents and metadata must have been identical; I haven't verified that tar and mkdwarfs examine identical sets of metadata, but you get the idea), under what circumstances will the .dwarfs images be bit-identical too? Zstd notably guarantees reproducibility of its archives in this manner, even when using different numbers of threads to compress them.
I tried the test above using a folder structure with ~120k files totaling a few gigabytes and failed to reproduce the dwarfs image even on the same computer. The mkdwarfs man page explicitly calls out the default order option as non-reproducible, which is why I used --order=path in my test, but clearly there are other factors.
Are creation timestamps embedded in the image? What if the two computers use a different number of threads? I would be willing to trade some creation time in order to perform sorting and guarantee reproducibility, but I'm not sure what engineering work would be involved or what constraints there are in the design.
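For anyone who wants to repeat the same-machine check, here is a rough sketch of what I mean. The function names images_identical and check_reproducible are just placeholders I made up for illustration; the only mkdwarfs options used are the ones from the command above, and cmp is the standard POSIX byte-comparison tool.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of mkdwarfs.

# Succeeds (exit status 0) only if both files are byte-for-byte identical.
images_identical() {
    cmp -s "$1" "$2"
}

# Builds an image from the same folder twice, then compares the two results.
# $1 = input folder, $2 = existing scratch directory for the two images.
# Returns 0 if identical, 1 if they differ, 2 if mkdwarfs itself failed.
check_reproducible() {
    src=$1
    workdir=$2
    mkdwarfs -i "$src" -o "$workdir/first.dwarfs" --order=path || return 2
    mkdwarfs -i "$src" -o "$workdir/second.dwarfs" --order=path || return 2
    images_identical "$workdir/first.dwarfs" "$workdir/second.dwarfs"
}
```

Something like check_reproducible folder /tmp/scratch && echo reproducible would then report the result; the cross-machine case is the same comparison applied to images copied from each computer.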
It would additionally be nice to provide options to guarantee reproducibility even if the input folders are not exactly identical. In my use cases, I typically care about directory structure, file names and contents, and the executable bit, and would like to ignore all other metadata. There are already options to sort by path, remove timestamps and uid/gid values, and so forth, but it would be nice to document which of these are needed for reproducibility or combine them under one flag once the basic image format is reproducible.