Regarding long-term archival of DwarFS images #23
-
This is really interesting, thanks for the links! Just a few quick replies:
I've moved to frozen (thrift) precisely because the previous metadata format was almost impossible to change in a backwards-compatible way. It's a double-edged sword, though, because of the extra dependency. It definitely made the code much simpler, certainly less buggy, way easier to debug, and so much easier to extend. If fbthrift dies one day, well, I'd probably rip out the frozen stuff. (Actually, the metadata schema also uses thrift.) Frozen itself is actually a simple format, although looking at it with e.g. a hex editor is pointless due to the bit packing, and recovery by manually "fixing bits" is probably close to impossible. But under the hood, it's really just bit offsets, bit widths and lengths.
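For a rough mental model of what "bit offsets, bit widths and lengths" means in practice, here's a tiny sketch of reading values out of a bit-packed array. This is a generic illustration only, not the actual Frozen layout or API:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bit-packed storage (NOT the actual Frozen layout): values are
 * stored back-to-back at a fixed bit width, so the i-th value starts at bit
 * offset `base + i * width`. */
static uint64_t read_bits(const uint8_t *buf, uint64_t bit_off, unsigned width)
{
    uint64_t value = 0;
    for (unsigned i = 0; i < width; ++i) {
        uint64_t bit = bit_off + i;
        uint64_t byte = bit >> 3;
        unsigned shift = (unsigned)(bit & 7);   /* LSB-first within each byte */
        value |= (uint64_t)((buf[byte] >> shift) & 1u) << i;
    }
    return value;
}

/* Read element `index` from a packed array starting at `base` bits. */
static uint64_t packed_get(const uint8_t *buf, uint64_t base,
                           unsigned width, size_t index)
{
    return read_bits(buf, base + (uint64_t)index * width, width);
}
```

The point is simply that nothing is byte-aligned, which is exactly why poking at the metadata with a hex editor isn't useful.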
Not yet, though it's something I've thought about at least for the metadata block and the section headers. Without the metadata block, there's not much you can extract anymore. If the metadata is intact, you can lose individual data blocks and still recover (at least in theory) data that's stored in the blocks that are still valid. Unless a file is larger than a single block, it'll never be spread out across more than 3 blocks.
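To make the recovery argument concrete, here's a simplified model, assuming (roughly as described in the metadata docs) that a file's contents are a list of chunks of the form (block, offset, size). With intact metadata, a file is recoverable exactly when every block its chunks reference is still valid. The types and field names below are illustrative, not the real data structures:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Simplified model: a regular file is a list of chunks, each referencing a
 * byte range inside one (uncompressed) data block. */
typedef struct {
    uint32_t block;   /* index of the data block */
    uint32_t offset;  /* byte offset inside the uncompressed block */
    uint32_t size;    /* number of bytes */
} chunk_t;

/* A file is recoverable iff every block referenced by its chunks is intact. */
static bool file_recoverable(const chunk_t *chunks, size_t nchunks,
                             const bool *block_ok)
{
    for (size_t i = 0; i < nchunks; ++i) {
        if (!block_ok[chunks[i].block]) {
            return false;  /* at least one referenced block is corrupted */
        }
    }
    return true;
}
```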
None of these at the moment, unless the underlying compression algorithm actually detects an error.
-
More thoughts:
-
In general I agree. However, the documentation should perhaps state a few options. (For example, at some point in time there was
For long-term archival I would personally go with something more "classical", like perhaps
This is a very good idea. My assumption is that for each "file" / "folder" you'll export a JSON object holding perhaps the offsets inside the original image file where the data actually is found (obviously taking into account the compression). Perhaps those JSON objects could also contain the

On this topic, given that archived images can contain many (millions of) entries, I would suggest that instead of outputting one large JSON file, the tool output in JSON-stream format, i.e. one JSON object per entry, each such object written on a separate line. This way one could easily use

All that remains afterwards is a small sample C code that, given as arguments some fields from that JSON and an image path, is able to stream to
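For illustration, here is a minimal sketch of such a helper, assuming the JSON fields have already been resolved to a plain byte offset and length inside the image file. Decompression is deliberately left out, and the command-line interface is entirely hypothetical:

```c
/* Hypothetical helper: stream `length` bytes starting at byte `offset` of an
 * image file to stdout. In a real tool the offset/length would come from the
 * exported JSON metadata, and the extracted bytes would still need to be
 * decompressed afterwards. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>  /* off_t */

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <image> <offset> <length>\n", argv[0]);
        return 1;
    }

    FILE *img = fopen(argv[1], "rb");
    if (!img) { perror("fopen"); return 1; }

    long long offset = atoll(argv[2]);
    long long length = atoll(argv[3]);

    if (fseeko(img, (off_t)offset, SEEK_SET) != 0) {  /* fseeko: POSIX */
        perror("fseeko");
        return 1;
    }

    char buf[64 * 1024];
    while (length > 0) {
        size_t want = length < (long long)sizeof buf ? (size_t)length : sizeof buf;
        size_t got = fread(buf, 1, want, img);
        if (got == 0) break;  /* EOF or read error */
        fwrite(buf, 1, got, stdout);
        length -= (long long)got;
    }

    fclose(img);
    return 0;
}
```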
-
While some of this is tempting, it can easily defeat the purpose of "compressing" the data by totally blowing up the JSON metadata. For example, the metadata for the

Another advantage of actually leaving the metadata in the exact same space-efficient structure that is being used internally (just a different representation) is that it makes it trivial to recover a corrupted metadata block.
-
Given that from the documentation I understand that the image is composed of three sections (schema, metadata and actual data), might I suggest adding one strong hash (say SHA2-256) for each of these sections, thus the

(I've suggested something similar in your initial comment, however it didn't give details.)
-
Block/schema/metadata are the section types. There are multiple block sections, but only one each of schema & metadata. The schema isn't needed for the exported JSON metadata; it's only necessary to interpret the bit-packing in the frozen metadata. The plan is indeed for each section to have its own checksum, and quite likely that's going to be SHA256.
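As a rough sketch of the per-section checksum idea, assuming each section header were to store a SHA-256 digest of the section payload (the actual header layout isn't specified here), verification could look like this using OpenSSL's SHA256():

```c
/* Sketch only: verify one section's payload against a stored SHA-256 digest.
 * The notion of a "stored digest" per section is the assumption here; the
 * real DwarFS section header layout is not modeled. */
#include <openssl/sha.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool section_checksum_ok(const unsigned char *section, size_t len,
                                const unsigned char stored[SHA256_DIGEST_LENGTH])
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(section, len, digest);
    return memcmp(digest, stored, SHA256_DIGEST_LENGTH) == 0;
}
```

The nice property is that each section can be checked independently, so a corrupted block can be pinpointed without trusting anything outside that section's header.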
-
cde36cf introduces the new section header, which adds a couple of things:
You can easily convert an old image to the new format in a few seconds using:
-
FWIW, here is a brief description of the general filesystem format and in particular the metadata format.
-
Because DwarFS has excellent compression and allows one to easily offload large datasets (especially text files) while still being able to access them easily, it looks like a perfect candidate for long-term archival.
So the question is: are DwarFS images suitable for such a use case?
For example, here are a few items that I believe are important:
Personally, I would say that it's important to at least have a simple format, and to include strong hashes to detect corruption.
For example, the author of Lzip has two nice articles on the topic of long-term archival of compressed bundles: