
[Proposal] Features to get maximum possible encode length of given mnemonic(or +given operands) #548

Open
laie opened this issue Jan 14, 2025 · 9 comments


@laie commented Jan 14, 2025

Currently I am preallocating 15 bytes (the absolute maximum size) per instruction for a recompilation.
However, most instructions are not that large, so the preallocation ends up much bigger than it needs to be.

I tried out ZydisGetEncodableInstructions, but there was no straightforward length information, and it seems the Zydis encoder is designed to figure out the length only by actually attempting the encoding.

If there were something like ZydisEncoderGetMaxLengthForMnemonic() or ZydisEncoderGetMaxLength(mnem, operands), it would be very helpful.
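
To make the idea concrete, the prototypes I have in mind would look roughly like this (these names and signatures are purely hypothetical on my side, nothing that exists in Zydis today):

```cpp
// Purely hypothetical prototypes, only to illustrate the proposal:
ZyanStatus ZydisEncoderGetMaxLengthForMnemonic(ZydisMnemonic mnemonic, ZyanU8* max_length);
ZyanStatus ZydisEncoderGetMaxLength(const ZydisEncoderRequest* request, ZyanU8* max_length);
```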

@laie laie changed the title [Proposal] Cannot get maximum encode length of given mnemonic(or +given operands) [Proposal] Features to get maximum possible encode length of given mnemonic(or +given operands) Jan 14, 2025

@ZehMatt (Contributor) commented Jan 14, 2025

There is no simple way to determine the instruction length so it would be almost as complex as encoding it fully. Why do you have to actually pre-allocate per instruction? Just have 15 bytes on the stack and then copy it to your real buffer.

@laie (Author) commented Jan 24, 2025

Well, wouldn't even a database of the "max possible size" for a given mnemonic be feasible?
For example, Jmp Imm has (I presume) a maximum of 5 bytes.

I am writing a recompiler. Because I don't know the max size of each instruction, I just preallocate 15 bytes per instruction and then shrink hierarchically by actually trying to assemble. It would be fantastic if there were a database like the above.

Just a feature proposal.

*Btw, I am testing the recompiler with an UnrealEngine exe, which is almost 100 MB. Without the ability to determine a reasonable max size for instructions, the 'shrink' algorithm (actually trying the encoding) was essential to prevent the recompiler output from becoming prohibitively large.

@mappzor (Contributor) commented Jan 24, 2025

> There is no simple way to determine the instruction length so it would be almost as complex as encoding it fully

It's fundamental to understand that this statement is true. After figuring out that the instruction is possible to encode, finding the definition that matches the operands, validating that the operands obey all the rules imposed by the definition, and computing all the necessary parts of the instruction (ModR/M, SIB, prefixes, etc.), you are basically done encoding the whole thing. The instruction length becomes known only at the very end of the process. If you really need the length, you can just encode and discard the output.

> without ability to determine reasonable max size of insns, 'shrink' algorithm was essential

I don't see how that's essential at all; it seems rather unconventional. If you have 100 MB of instructions to recompile, you can start with the assumption that you will generate around 100 MB of code. That would be your initial buffer allocation (later you can add N% to this based on empirical observations). You keep encoding instructions, appending them to this buffer, and simply grow the buffer if it ever runs out of space. In other words, you are working with a std::vector, a structure that grows when necessary rather than shrinks.
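
As a rough sketch of what I mean (assuming the ZydisEncoderRequests are already prepared somewhere; error handling kept minimal):

```cpp
#include <Zydis/Zydis.h>
#include <cstdint>
#include <vector>

// Encode everything into one grow-only buffer; no per-instruction allocation needed.
std::vector<uint8_t> EncodeAll(const std::vector<ZydisEncoderRequest>& requests)
{
    std::vector<uint8_t> out;
    out.reserve(requests.size() * 4); // rough initial guess, the vector grows on demand

    for (const ZydisEncoderRequest& req : requests)
    {
        uint8_t buf[ZYDIS_MAX_INSTRUCTION_LENGTH]; // 15-byte scratch buffer on the stack
        ZyanUSize len = sizeof(buf);
        if (ZYAN_SUCCESS(ZydisEncoderEncodeInstruction(&req, buf, &len)))
        {
            out.insert(out.end(), buf, buf + len); // cheap copy of at most 15 bytes
        }
    }
    return out;
}
```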

@laie (Author) commented Jan 26, 2025

Feature Proposal: Maximum Instruction Size Lookup for ZydisEncoder

First, I want to express my appreciation for Zydis. The library's exceptional code quality has saved us significant development time, and its completeness is evident as it's the only external dependency in our project.

This is a feature proposal aimed at enhancing ZydisEncoder for compiler frontend use cases, particularly when encoding multiple functions. While I've already implemented a workaround for my specific needs, I believe this feature could benefit others who want to compile multiple functions at once with ZydisEncoder, especially for parallel compilation scenarios.

Current Challenge:
Without knowing the reasonable maximum size per mnemonic (beyond the 15-byte maximum), we must either:

  • Allocate unnecessarily large chunks for functions in multi-threaded compilation
  • Or compile them sequentially, determining function entries and sizes one at a time
  • Or accept executable size bloat (2-3x) from required padding
  • (Manual handling of each mnemonic is possible but a built-in solution would be preferable)

Proposed Solution:
A precomputed "reasonable max size" table per mnemonic would enable:

  • Assigning function entry points before actual encoding
  • Parallel compilation without excessive padding
  • Avoiding forced sequential compilation

I acknowledge implementing this may be more complex than initially apparent, given all the instruction encoding steps (checking encodability, operand validation, ModR/M/SIB/prefix computation, etc.) that you described.

However, precomputing those and making a table of sizes sounds possible to me (a rough illustration of the table shape is below).
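
Purely for illustration, the table I imagine would have roughly this shape (everything here is made up by me; nothing of this exists in Zydis):

```cpp
// Hypothetical table, imagined as generated alongside the existing encoder tables:
// the worst-case encoded length per mnemonic over all encodable operand combinations.
static const ZyanU8 hypothetical_max_length[ZYDIS_MNEMONIC_MAX_VALUE + 1] = {
    /* ...generated data... */
};

ZyanU8 GetMaxLengthForMnemonic(ZydisMnemonic mnemonic)
{
    return hypothetical_max_length[mnemonic];
}
```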

Current Workaround:
I'm currently using sequential compilation with multiple passes to shrink and remove padding. Even with sequential compilation, multiple shrinking passes are necessary since jump addresses and block addresses can't be determined until after the first full compilation pass.

There are of course various tricks and workarounds, but it would be more convenient if we had such a precomputed table.

Common value of this feature:
This feature would benefit anyone using ZydisEncoder to compile multiple functions at once, and it could be low-hanging fruit for improvement.

@laie (Author) commented Jan 26, 2025

For example:
Jmp, Imm -> 5 bytes
Jmp, Mem -> I don't know, but it will be much less than 15 bytes
Call imm -> 5 bytes

I gave up on building this manual table and just went with sequential compilation + multiple shrink passes, lol

@ZehMatt (Contributor) commented Jan 26, 2025

Why can't you first encode it into a stack buffer and then expand your function/global buffer? A memcpy of 16 bytes is fairly cheap; it is most likely two mov instructions but can also be done with just one.

In one of my projects I have the local buffer defined as:

struct EncodeResult {
  uint8_t size;
  uint8_t bytes[15];
};
static_assert(sizeof(EncodeResult) == 16);

So with that I just do the following:

const auto encodedInfo = encodeInstr(...);
finalBuffer.insert(finalBuffer.end(), std::begin(encodedInfo.bytes), std::begin(encodedInfo.bytes) + encodedInfo.size);

Also, perhaps you might want to have a look at https://github.com/zyantific/zasm; I used it for a binary rewriter as well, so I know it works fairly well.

@laie (Author) commented Jan 26, 2025

Thank you for the attention. Let me provide a more concrete example case:

## Core Problem Visualization

```nasm
Block0:
    push reg1
    push reg2
    sub rsp, const1 ; size uncertainty
    jmp Block1       ; Size uncertainty

Block1:
    jmp Block2

Block2:
    ...
```

### First Pass (Naive Parallel)

Maximum allocation approach enables parallelism with space overhead:

```nasm
Block0:
    nop * 15    ; Space for push reg1
    nop * 15    ; Space for push reg2
    nop * 15    ; Space for sub rsp, const1
    nop * 15    ; Space for jmp Block1
    ; Pre-determined addresses enable parallel compilation

Block1:
    ...
Block2:
    ...
```

Key Points:

  • Block addresses can be determined before encoding
  • 15 bytes per instruction is extremely wasteful, but it enables parallelism

### Second Pass (Sequential Shrinking)

```nasm
Block0:
    push reg1        ; Size resolved through encoding
    push reg2        ; Size resolved
    sub rsp, const1  ; Size determined by const1
    nop * 15         ; RIP-relative uncertainty remains

Block1:
    ...
```

Challenge:

  • RIP-relative instructions still need 15-byte allocation

### Third Pass & Loop Shrink

```nasm
Block0:
    push rbp         ; Known size
    push r15         ; Known size
    sub rsp, 28      ; Short sub
    jmp Block2       ; 2-byte short jmp possible

Block2:
    ...
```

Iterative Approach:

  • Assumes monotonic code shrinkage
  • Repeatedly re-encode jumps with the updated block addresses
  • Continue until no further shrinkage occurs
  • Drawback: requires multiple encoding attempts (a rough sketch of the loop follows below)
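
Roughly, the shrink loop looks like the following sketch (Block and Insn are placeholders standing in for my recompiler's own structures, so treat this as illustrative only):

```cpp
#include <Zydis/Zydis.h>
#include <vector>

// Placeholder structures for the recompiler's own representation.
struct Insn  { ZydisEncoderRequest request; ZyanU8 size = ZYDIS_MAX_INSTRUCTION_LENGTH; };
struct Block { ZyanU64 address = 0; std::vector<Insn> insns; };

// Repeatedly re-lay-out and re-encode until no instruction shrinks anymore.
void ShrinkToFixedPoint(std::vector<Block>& blocks, ZyanU64 image_base)
{
    bool changed = true;
    while (changed)
    {
        changed = false;
        ZyanU64 address = image_base;
        for (Block& block : blocks)
        {
            block.address = address;
            for (Insn& insn : block.insns)
            {
                ZyanU8 buf[ZYDIS_MAX_INSTRUCTION_LENGTH];
                ZyanUSize len = sizeof(buf);
                // Branch targets in insn.request are assumed to be patched to the
                // current block addresses before this call (not shown here).
                if (ZYAN_SUCCESS(ZydisEncoderEncodeInstructionAbsolute(&insn.request, buf, &len, address))
                    && len < insn.size)
                {
                    insn.size = static_cast<ZyanU8>(len); // monotonic shrink only
                    changed = true;
                }
                address += insn.size;
            }
        }
    }
}
```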

### Ideal First Pass (With Size Estimation)

Efficient single-pass approach using size prediction:

```nasm
Block0:
    nop * 2-3       ; push reg1 size predictable
    nop * 2-3       ; push reg2 size predictable
    sub rsp, const1 ; 4-byte immediate assumed
    jmp Block1      ; 5 bytes allocated (near jump)

Block1:
    ...
```

Key Benefits:

  1. Immediately parallelizable
  2. Immediate determination of block addresses
  3. Optional shrink pass (output size reasonable)
  4. Predictable block layout
  5. Single-pass efficiency
  6. Massive performance gain - only encode necessary instructions

@ZehMatt (Contributor) commented Jan 26, 2025

So I wrote this small test to see what the cost of doing 1'000'000 instructions is with zasm:

    TEST(SerializationTests, Encode1MillionRandom)
    {
        Program program(zasm::MachineMode::AMD64);
        x86::Assembler assembler(program);

        const auto count = 1'000'000;
        for (int64_t i = 0; i < count; ++i)
        {
            const auto& instr = tests::data::Instructions[i % std::size(tests::data::Instructions)];
            instr.emitter(assembler);
        }

        zasm::Serializer serializer{};
        ASSERT_EQ(serializer.serialize(program, 0x140015000), ErrorCode::None);
    }

This runs in under 500 ms and peak memory usage is 312,668 K (301.4 MiB); I would say that is fairly efficient already. Like before, getting the instruction length is not trivial, and since you have to consider all the corner cases you will end up with the same complexity as just encoding it.

Anyway, if you just need the instruction size you could do the following:

    ZyanUSize ZydisEncoderGetInstructionSize(ZydisEncoderRequest* req, ZyanU64 address)
    {
        // Encode into a scratch buffer and report only the resulting length.
        ZyanU8 buf[ZYDIS_MAX_INSTRUCTION_LENGTH];
        ZyanUSize len = sizeof(buf);
        ZyanStatus res = ZydisEncoderEncodeInstructionAbsolute(req, buf, &len, address);
        if (!ZYAN_SUCCESS(res))
        {
            return 0;
        }
        return len;
    }

This is how the API would probably work if Zydis were ever to add something like this; it would have near-identical complexity, minus the writes to the buffer, which are trivial anyway.
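
For completeness, using that helper would look something like this (request setup abbreviated and purely illustrative):

    ZydisEncoderRequest req {}; // zero-initialize the request
    req.machine_mode = ZYDIS_MACHINE_MODE_LONG_64;
    req.mnemonic = ZYDIS_MNEMONIC_CALL;
    req.operand_count = 1;
    req.operands[0].type = ZYDIS_OPERAND_TYPE_IMMEDIATE;
    req.operands[0].imm.u = 0x140016000; // absolute branch target

    // Should yield 5 here (E8 + rel32); 0 means the request could not be encoded.
    const ZyanUSize size = ZydisEncoderGetInstructionSize(&req, 0x140015000);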

@laie (Author) commented Jan 27, 2025

I still want to argue that this feature would be valuable:

  1. My UnrealEngine4 sample has 12,465,412 instructions, for example. Real-world samples often exceed 1,000,000 by a wide margin.
  2. The key point is easier parallelization: full UE4 decompilation takes 1,566 ms with multithreading vs. 15+ seconds without.

Other than that, I came up with a new workaround:

  • I can build the table myself by encoding all mnemonics and looking up the Zydis instruction table (rough sketch below)
  • I don't know... I still think it might be low-hanging fruit
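
A very rough sketch of that workaround (it only records the operand shapes that actually appear in the sample, so it is a heuristic rather than a true per-mnemonic maximum):

```cpp
#include <Zydis/Zydis.h>
#include <algorithm>
#include <array>

// Largest encoding observed so far for each mnemonic (0 = never seen).
std::array<ZyanU8, ZYDIS_MNEMONIC_MAX_VALUE + 1> g_max_len{};

// Called once per encoder request while walking the input binary.
void ObserveMaxLength(const ZydisEncoderRequest& req)
{
    ZyanU8 buf[ZYDIS_MAX_INSTRUCTION_LENGTH];
    ZyanUSize len = sizeof(buf);
    if (ZYAN_SUCCESS(ZydisEncoderEncodeInstruction(&req, buf, &len)))
    {
        g_max_len[req.mnemonic] = std::max<ZyanU8>(g_max_len[req.mnemonic], static_cast<ZyanU8>(len));
    }
}
```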

About Zydis:

  • Zydis is still great in its current form; I would argue it has the cleanest and best design overall among all x64 assemblers/disassemblers.
  • I understand if this proposal is dismissed, but I doubt that I have conveyed its importance properly.
  • Thank you for the attention again.
