API limits the maximum archive entry size on 32bit systems #15
Comments
Oh shoot, after looking into it a bit more, it seems not even the LZMA SDK supports archive entries whose size exceeds 4294967295 bytes when compiled for 32 bits (at least not the C version of the 7z SDK).
The 7z SDK decompresses all data into memory before returning it. This will of course fail on systems that can't address that much memory.
The problem with size_t and huge file entries is that size_t reflects the maximum chunk of memory your system can address. This means that even if you changed the uncompress function to always take a 64-bit buffer size, it would still fail on a 32-bit system for the simple reason that such a buffer is too large to address. There is a simple way around this, though: use a smaller buffer and call uncompress repeatedly to decompress the data in chunks that you write to disk (see the sketch below).

This still leaves the problem with the entry size. size_t is mainly used in this context because it indicates a size. If the data type prevents us from reporting the true size, that is a bug and it should be fixed. The issue I see is that changing the return type will require digging deeper into the respective archive implementations, and we need to take care to only change size_t to a 64-bit type in the correct contexts. I also need to consider how to handle the (minor) API breakage this might incur for 32-bit users.
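A minimal sketch of that chunked workaround, assuming the true entry size is available as a 64-bit value. The `ar_entry_uncompress` prototype below is an approximation written out for illustration, not copied from unarr.h:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct ar_archive ar_archive;

/* assumed shape of the existing call: decompress the next `count` bytes of
   the current entry into `buffer`, returning false on error */
bool ar_entry_uncompress(ar_archive *ar, void *buffer, size_t count);

/* Stream an entry of `entry_size` bytes to `out` through a small fixed
   buffer, so the whole entry never has to fit into the 32-bit address space. */
bool extract_entry_to_file(ar_archive *ar, uint64_t entry_size, FILE *out)
{
    enum { CHUNK = 1 << 20 };                 /* 1 MiB working buffer */
    uint8_t *buffer = malloc(CHUNK);
    if (!buffer)
        return false;

    bool ok = true;
    uint64_t remaining = entry_size;
    while (ok && remaining > 0) {
        size_t count = remaining < CHUNK ? (size_t)remaining : CHUNK;
        ok = ar_entry_uncompress(ar, buffer, count) &&
             fwrite(buffer, 1, count, out) == count;
        remaining -= count;
    }
    free(buffer);
    return ok;
}
```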
The 7z SDK has this piece of code in its extraction path:

```c
if (*tempBuf == NULL || *blockIndex != folderIndex)
{
    UInt64 unpackSizeSpec = SzAr_GetFolderUnpackSize(&p->db, folderIndex);
    /*
    UInt64 unpackSizeSpec =
        p->UnpackPositions[p->FolderToFile[(size_t)folderIndex + 1]] -
        p->UnpackPositions[p->FolderToFile[folderIndex]];
    */
    size_t unpackSize = (size_t)unpackSizeSpec;
    if (unpackSize != unpackSizeSpec)
        return SZ_ERROR_MEM;
```

This is clearly designed to fail on 32-bit systems; I'm guessing they wanted to fail early here because the implementation also has problems deeper down. Regarding the problem you mention, that addressing such a large piece of memory is impossible: that is true, but the way I work around it is by using memory mapping, where I map the largest possible free chunk of memory, unpack that much data into the chunk, unmap that chunk, map the next chunk, unpack, and so on (see the sketch below).
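A rough sketch of that memory-mapping approach, for reference. This is POSIX-only, and `decompress_next_chunk` is a hypothetical callback standing in for whatever produces the next run of uncompressed bytes; it is not part of unarr or the 7z SDK:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* hypothetical: produce the next `len` uncompressed bytes into `dst`,
   returning 0 on success */
int decompress_next_chunk(void *stream, uint8_t *dst, size_t len);

/* Write a `total_size`-byte entry to `path` by mapping one window of the
   output file at a time, unpacking into it, then unmapping and moving on. */
int extract_via_mmap(void *stream, const char *path, uint64_t total_size)
{
    const size_t WINDOW = 64u << 20;          /* 64 MiB windows, page-aligned */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* preallocate the full output size; needs a 64-bit off_t on 32-bit
       builds (e.g. -D_FILE_OFFSET_BITS=64) */
    if (ftruncate(fd, (off_t)total_size) != 0) {
        close(fd);
        return -1;
    }

    uint64_t done = 0;
    int err = 0;
    while (err == 0 && done < total_size) {
        size_t chunk = (size_t)(total_size - done < WINDOW ? total_size - done
                                                           : WINDOW);
        uint8_t *dst = mmap(NULL, chunk, PROT_READ | PROT_WRITE, MAP_SHARED,
                            fd, (off_t)done);
        if (dst == MAP_FAILED) {
            err = -1;
            break;
        }
        err = decompress_next_chunk(stream, dst, chunk);
        munmap(dst, chunk);
        done += chunk;
    }
    close(fd);
    return err;
}
```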
Don't bother too much with the 7z SDK. This memory limitation is the main reason I have marked 7z support as experimental. The underlying decompression code should be able to handle large files just fine, but the C code for archives insists on decompressing huge blocks into a memory cache instead of returning the files as they are decompressed. Fixing this would need a partial rewrite of the SDK.
I see, does that mean that extraction of large files (>4 GB) with 7z will never be supported on 32-bit systems?
No, it just means that there is much more work involved to make this work. SzArEx_Extract and probably other parts of 7zDec.c would need to be rewritten to support decompressing directly into a caller-supplied buffer while keeping the decompression state and dictionary, instead of decompressing everything into a private buffer and allowing access to that. The question is: is it worth it? It might actually be a better idea to rebuild the parsing code from scratch in C99 and only use the low-level parts of the 7z SDK. That way we have better control over what is happening, and it is easier to implement full support for 7z archives and not only the limited subset the SDK provides.
Hey, just a quick heads up. I recently rechecked this issue to see if I could include a fix in the upcoming unarr release, but the problem goes deeper than just the 7z SDK and the API signature. The bad pattern of using size_t for file sizes is present in a lot of the internal code and structures. Working on this without proper unit tests to catch regressions is asking for trouble. I will check whether I can improve the situation in the next development cycle; by then I should have a proper test system set up.
The current API, namely the functions `ar_entry_get_size` and `ar_entry_uncompress`, uses `size_t` in their signatures, which limits the maximum size of archive entries on 32-bit systems. Would you consider changing the API to use a fixed-width 64-bit unsigned integer type, so that both 32-bit and 64-bit systems would be able to handle large archive entries?
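A sketch of what such a change could look like. The declarations below are paraphrased from the function names mentioned in this issue rather than copied from unarr.h, and the `64`-suffixed name is made up for illustration; the request is only for a fixed-width 64-bit type, not for a specific spelling:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ar_archive ar_archive;

/* current shape: the reported size is capped at SIZE_MAX, i.e. 4 GiB - 1 on
   typical 32-bit targets */
size_t ar_entry_get_size(ar_archive *ar);
bool ar_entry_uncompress(ar_archive *ar, void *buffer, size_t count);

/* one possible 64-bit variant (hypothetical name): the reported size no
   longer depends on the platform's pointer width */
uint64_t ar_entry_get_size64(ar_archive *ar);
```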