
[Proposal] Features to get maximum possible encode length of given mnemonic(or +given operands) #548

Open
laie opened this issue Jan 14, 2025 · 9 comments


@laie commented Jan 14, 2025

Currently I am preallocating 15 bytes (the absolute maximum size) per instruction for a recompilation.
However, most instructions are not that large, so the preallocation ends up much bigger than it needs to be.

I tried out ZydisGetEncodableInstructions, but there was no straightforward length information, and it seems the Zydis encoder is designed to figure out the length only by actually attempting the encoding.

If there were something like ZydisEncoderGetMaxLengthForMnemonic() or ZydisEncoderGetMaxLength(mnem, operands), it would be very helpful.
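
To make the idea concrete, the prototypes I have in mind would look roughly like this (these names and signatures are purely hypothetical on my side, nothing that exists in Zydis today):

```cpp
// Purely hypothetical prototypes, only to illustrate the proposal:
ZyanStatus ZydisEncoderGetMaxLengthForMnemonic(ZydisMnemonic mnemonic, ZyanU8* max_length);
ZyanStatus ZydisEncoderGetMaxLength(const ZydisEncoderRequest* request, ZyanU8* max_length);
```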

@laie laie changed the title [Proposal] Cannot get maximum encode length of given mnemonic(or +given operands) [Proposal] Features to get maximum possible encode length of given mnemonic(or +given operands) Jan 14, 2025

@ZehMatt (Contributor) commented Jan 14, 2025

There is no simple way to determine the instruction length so it would be almost as complex as encoding it fully. Why do you have to actually pre-allocate per instruction? Just have 15 bytes on the stack and then copy it to your real buffer.

@laie (Author) commented Jan 24, 2025

Well, wouldn't even a database of the "max possible size" for a given mnemonic be feasible?
For example, Jmp Imm has (I presume) a maximum of 5 bytes.

I am writing a recompiler. Because I don't know the max size of each instruction, I just preallocate 15 bytes per instruction and then shrink hierarchically by actually trying to assemble. It would be fantastic if there were a database like the above.

Just a feature proposal.

*Btw, I am testing the recompiler with an UnrealEngine exe, which is almost 100 MB. Without the ability to determine a reasonable max size for instructions, the 'shrink' algorithm (actually trying the encoding) was essential to prevent the recompiler output from becoming prohibitively large.

@mappzor (Contributor) commented Jan 24, 2025

> There is no simple way to determine the instruction length so it would be almost as complex as encoding it fully

It's fundamental to understand that this statement is true. After figuring out that the instruction is possible to encode, finding the definition that matches the operands, validating that the operands obey all the rules imposed by the definition, and computing all the necessary parts of the instruction (ModR/M, SIB, prefixes, etc.), you are basically done encoding the whole thing. The instruction length becomes known only at the very end of the process. If you really need the length, you can just encode and discard the output.

> without ability to determine reasonable max size of insns, 'shrink' algorithm was essential

I don't see how that's essential at all; it seems rather unconventional. If you have 100 MB of instructions to recompile, you can start with the assumption that you will generate around 100 MB of code. That would be your initial buffer allocation (later you can add N% to this based on empirical observations). You keep encoding instructions, appending them to this buffer, and simply grow the buffer if it ever runs out of space. In other words, you are working with a std::vector, a structure that grows when necessary rather than shrinks.
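
As a rough sketch of what I mean (assuming the ZydisEncoderRequests are already prepared somewhere; error handling kept minimal):

```cpp
#include <Zydis/Zydis.h>
#include <cstdint>
#include <vector>

// Encode everything into one grow-only buffer; no per-instruction allocation needed.
std::vector<uint8_t> EncodeAll(const std::vector<ZydisEncoderRequest>& requests)
{
    std::vector<uint8_t> out;
    out.reserve(requests.size() * 4); // rough initial guess, the vector grows on demand

    for (const ZydisEncoderRequest& req : requests)
    {
        uint8_t buf[ZYDIS_MAX_INSTRUCTION_LENGTH]; // 15-byte scratch buffer on the stack
        ZyanUSize len = sizeof(buf);
        if (ZYAN_SUCCESS(ZydisEncoderEncodeInstruction(&req, buf, &len)))
        {
            out.insert(out.end(), buf, buf + len); // cheap copy of at most 15 bytes
        }
    }
    return out;
}
```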

@laie (Author) commented Jan 26, 2025

Feature Proposal: Maximum Instruction Size Lookup for ZydisEncoder

First, I want to express my appreciation for Zydis. The library's exceptional code quality has saved us significant development time, and its completeness is evident as it's the only external dependency in our project.

This is a feature proposal aimed at enhancing ZydisEncoder for compiler frontend use cases, particularly when encoding multiple functions. While I've already implemented a workaround for my specific needs, I believe this feature could benefit others who want to compile multiple functions at once with ZydisEncoder, especially for parallel compilation scenarios.

Current Challenge:
Without knowing the reasonable maximum size per mnemonic (beyond the 15-byte maximum), we must either:

  • Allocate unnecessarily large chunks for functions in multi-threaded compilation
  • Or compile them sequentially, determining function entries and sizes one at a time
  • Or accept executable size bloat (2-3x) from required padding
  • (Manual handling of each mnemonic is possible but a built-in solution would be preferable)

Proposed Solution:
A precomputed "reasonable max size" table per mnemonic would enable:

  • Assigning function entry points before actual encoding
  • Parallel compilation without excessive padding
  • Avoiding forced sequential compilation

I acknowledge implementing this may be more complex than initially apparent, given all the instruction encoding steps (checking encodability, operand validation, ModR/M/SIB/prefix computation, etc.) that you described.

However, precomputing those and making a table of sizes sounds possible to me (a rough illustration of the table shape is below).
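
Purely for illustration, the table I imagine would have roughly this shape (everything here is made up by me; nothing of this exists in Zydis):

```cpp
// Hypothetical table, imagined as generated alongside the existing encoder tables:
// the worst-case encoded length per mnemonic over all encodable operand combinations.
static const ZyanU8 hypothetical_max_length[ZYDIS_MNEMONIC_MAX_VALUE + 1] = {
    /* ...generated data... */
};

ZyanU8 GetMaxLengthForMnemonic(ZydisMnemonic mnemonic)
{
    return hypothetical_max_length[mnemonic];
}
```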

Current Workaround:
I'm currently using sequential compilation with multiple passes to shrink and remove padding. Even with sequential compilation, multiple shrinking passes are necessary since jump addresses and block addresses can't be determined until after the first full compilation pass.

There are of course various tricks and workarounds, but it would be more convenient if we had such a precomputed table.

Common value of this feature:
This feature would benefit anyone using ZydisEncoder to compile multiple functions at once, and it could be low-hanging fruit for improvement.

@laie (Author) commented Jan 26, 2025

For example:
Jmp, Imm -> 5 bytes
Jmp, Mem -> I don't know, but it will be much less than 15 bytes
Call imm -> 5 bytes

I gave up on building this manual table and just went with sequential compilation + multiple shrink passes, lol

@ZehMatt (Contributor) commented Jan 26, 2025

Why can't you first encode it into a stack buffer and then expand your function/global buffer? A memcpy of 16 bytes is fairly cheap; it is most likely two mov instructions but can also be done with just one.

In one of my projects I have the local buffer defined as:

struct EncodeResult {
  uint8_t size;
  uint8_t bytes[15];
};
static_assert(sizeof(EncodeResult) == 16);

So with that I just do the following:

const auto encodedInfo = encodeInstr(...);
finalBuffer.insert(finalBuffer.end(), std::begin(encodedInfo.bytes), std::begin(encodedInfo.bytes) + encodedInfo.size);

Also, perhaps you might want to have a look at https://github.com/zyantific/zasm; I used it for a binary rewriter as well, so I know it works fairly well.

@laie (Author) commented Jan 26, 2025

Thank you for the attention. Let me provide a more concrete example case:

## Core Problem Visualization

```nasm
Block0:
    push reg1
    push reg2
    sub rsp, const1 ; size uncertainty
    jmp Block1       ; Size uncertainty

Block1:
    jmp Block2

Block2:
    ...
```

### First Pass (Naive Parallel)

Maximum allocation approach enables parallelism with space overhead:

```nasm
Block0:
    nop * 15    ; Space for push reg1
    nop * 15    ; Space for push reg2
    nop * 15    ; Space for sub rsp, const1
    nop * 15    ; Space for jmp Block1
    ; Pre-determined addresses enable parallel compilation

Block1:
    ...
Block2:
    ...
```

Key Points:

  • Block addresses can be determined before encoding
  • 15 bytes per instruction is extremely wasteful, but it enables parallelism

### Second Pass (Sequential Shrinking)

```nasm
Block0:
    push reg1        ; Size resolved through encoding
    push reg2        ; Size resolved
    sub rsp, const1  ; Size determined by const1
    nop * 15         ; RIP-relative uncertainty remains

Block1:
    ...
```

Challenge:

  • RIP-relative instructions still need 15-byte allocation

### Third Pass & Loop Shrink

```nasm
Block0:
    push rbp         ; Known size
    push r15         ; Known size
    sub rsp, 28      ; Short sub
    jmp Block2       ; 2-byte short jmp possible

Block2:
    ...
```

Iterative Approach:

  • Assumes monotonic code shrinkage
  • Repeatedly re-encode jumps with the updated block addresses
  • Continue until no further shrinkage occurs
  • Drawback: requires multiple encoding attempts (a rough sketch of the loop follows below)
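
Roughly, the shrink loop looks like the following sketch (Block and Insn are placeholders standing in for my recompiler's own structures, so treat this as illustrative only):

```cpp
#include <Zydis/Zydis.h>
#include <vector>

// Placeholder structures for the recompiler's own representation.
struct Insn  { ZydisEncoderRequest request; ZyanU8 size = ZYDIS_MAX_INSTRUCTION_LENGTH; };
struct Block { ZyanU64 address = 0; std::vector<Insn> insns; };

// Repeatedly re-lay-out and re-encode until no instruction shrinks anymore.
void ShrinkToFixedPoint(std::vector<Block>& blocks, ZyanU64 image_base)
{
    bool changed = true;
    while (changed)
    {
        changed = false;
        ZyanU64 address = image_base;
        for (Block& block : blocks)
        {
            block.address = address;
            for (Insn& insn : block.insns)
            {
                ZyanU8 buf[ZYDIS_MAX_INSTRUCTION_LENGTH];
                ZyanUSize len = sizeof(buf);
                // Branch targets in insn.request are assumed to be patched to the
                // current block addresses before this call (not shown here).
                if (ZYAN_SUCCESS(ZydisEncoderEncodeInstructionAbsolute(&insn.request, buf, &len, address))
                    && len < insn.size)
                {
                    insn.size = static_cast<ZyanU8>(len); // monotonic shrink only
                    changed = true;
                }
                address += insn.size;
            }
        }
    }
}
```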

### Ideal First Pass (With Size Estimation)

Efficient single-pass approach using size prediction:

```nasm
Block0:
    nop * 2-3       ; push reg1 size predictable
    nop * 2-3       ; push reg2 size predictable
    sub rsp, const1 ; 4-byte immediate assumed
    jmp Block1      ; 5 bytes allocated (near jump)

Block1:
    ...
```

Key Benefits:

  1. Immediately parallelizable
  2. Immediate determination of block addresses
  3. Optional shrink pass (output size reasonable)
  4. Predictable block layout
  5. Single-pass efficiency
  6. Massive performance gain - only encode necessary instructions

@ZehMatt (Contributor) commented Jan 26, 2025

So I wrote this small test to see what the cost of doing 1'000'000 instructions is with zasm:

    TEST(SerializationTests, Encode1MillionRandom)
    {
        Program program(zasm::MachineMode::AMD64);
        x86::Assembler assembler(program);

        const auto count = 1'000'000;
        for (int64_t i = 0; i < count; ++i)
        {
            const auto& instr = tests::data::Instructions[i % std::size(tests::data::Instructions)];
            instr.emitter(assembler);
        }

        zasm::Serializer serializer{};
        ASSERT_EQ(serializer.serialize(program, 0x140015000), ErrorCode::None);
    }

This runs in under 500 ms and peak memory usage is 312,668 K (301.4 MiB); I would say that is fairly efficient already. Like before, getting the instruction length is not trivial, and since you have to consider all the corner cases you will end up with the same complexity as just encoding it.

Anyway, if you just need the instruction size you could do the following:

    ZyanUSize ZydisEncoderGetInstructionSize(ZydisEncoderRequest* req, ZyanU64 address)
    {
        // Encode into a scratch buffer and report only the resulting length.
        ZyanU8 buf[ZYDIS_MAX_INSTRUCTION_LENGTH];
        ZyanUSize len = sizeof(buf);
        ZyanStatus res = ZydisEncoderEncodeInstructionAbsolute(req, buf, &len, address);
        if (!ZYAN_SUCCESS(res))
        {
            return 0;
        }
        return len;
    }

This is how the API would probably work if Zydis were ever to add something like this; it would have near-identical complexity, minus the writes to the buffer, which are trivial anyway.
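
For completeness, using that helper would look something like this (request setup abbreviated and purely illustrative):

    ZydisEncoderRequest req {}; // zero-initialize the request
    req.machine_mode = ZYDIS_MACHINE_MODE_LONG_64;
    req.mnemonic = ZYDIS_MNEMONIC_CALL;
    req.operand_count = 1;
    req.operands[0].type = ZYDIS_OPERAND_TYPE_IMMEDIATE;
    req.operands[0].imm.u = 0x140016000; // absolute branch target

    // Should yield 5 here (E8 + rel32); 0 means the request could not be encoded.
    const ZyanUSize size = ZydisEncoderGetInstructionSize(&req, 0x140015000);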

@laie (Author) commented Jan 27, 2025

I still want to argue that this feature would be valuable:

  1. My UnrealEngine4 sample has 12,465,412 instructions, for example. Real-world samples often exceed 1,000,000 by a wide margin.
  2. The key point is easier parallelization: full UE4 decompilation takes 1,566 ms with multithreading vs. 15+ seconds without.

Other than that, I came up with a new workaround:

  • I can build the table myself by encoding all mnemonics and looking up the Zydis instruction table (rough sketch below)
  • I don't know... I still think it might be low-hanging fruit
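
A very rough sketch of that workaround (it only records the operand shapes that actually appear in the sample, so it is a heuristic rather than a true per-mnemonic maximum):

```cpp
#include <Zydis/Zydis.h>
#include <algorithm>
#include <array>

// Largest encoding observed so far for each mnemonic (0 = never seen).
std::array<ZyanU8, ZYDIS_MNEMONIC_MAX_VALUE + 1> g_max_len{};

// Called once per encoder request while walking the input binary.
void ObserveMaxLength(const ZydisEncoderRequest& req)
{
    ZyanU8 buf[ZYDIS_MAX_INSTRUCTION_LENGTH];
    ZyanUSize len = sizeof(buf);
    if (ZYAN_SUCCESS(ZydisEncoderEncodeInstruction(&req, buf, &len)))
    {
        g_max_len[req.mnemonic] = std::max<ZyanU8>(g_max_len[req.mnemonic], static_cast<ZyanU8>(len));
    }
}
```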

About Zydis:

  • Zydis is still great in its current form; I would argue it has the cleanest and best design overall among all x64 assemblers/disassemblers.
  • I understand if this proposal is dismissed, but I doubt that I have conveyed its importance properly.
  • Thank you for the attention again.
