Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend integer register PReg space to 128 registers #131

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Amanieu
Copy link
Contributor

@Amanieu Amanieu commented May 2, 2023

This PR steals space from the last currently unused register class to allow PReg to represent more registers, but only in RegClass::Int.

But why?

I'm using regalloc2's support for fixed stack slots (represented as PRegs) to let the register allocator manage VMContext state. This allows this state to be kept in registers as much as possible, even across basic blocks.

Since each piece of regalloc-managed VMContext state needs its own PReg, the current limit of 64 integer PRegs is insufficient: the first 32 are for actual physical registers, which leaves only 32 for fixed stack slots, but I need 46 fixed stack slots.

Better solutions

The code as shown is quite ugly, so I'm open to hearing about better solutions to this. #5 seems relevant here.

@cfallin
Copy link
Member

cfallin commented May 2, 2023

Interesting -- thank you for thinking about this problem at least!

I do agree that the current approach is a little ugly. This solves the problem for now, but I do wonder if this is the right time to look for a more general fix. Otherwise the limit seems likely to come into play again soon, or at least feels pretty arbitrary.

Rather than overload the concept of PReg for particular stack slots, it seems we should look for a way to encode a more general stack constraint. I think you're right that a side-table approach as mentioned in #5 would let us build this: we could take a whole u32 for "arbitrary user-defined location". I'd want to be careful that we don't add any metadata overhead to the common case: we should have one constraint encoding in Operand that means "look in the side-table", and we should use it once during our liverange-building scan which we do in (reverse) program order so we don't need any lookup tables to map instruction indices into the sparse side-sequence. I suspect we may need to rearrange things a bit so we can stash the index as an Alloc immediately when creating the liverange/bundle; I'd hope we don't need to increase VRegData's size by a word.

Incidentally, that would also give us a mechanism to solve bytecodealliance/wasmtime#6301, where we currently emit loads from BP+offset to bring stack-carried args into vregs in the function prologue; we could instead define Cranelift's use of this "custom location" index to map to BP offsets, and constrain args directly to the stack.

We'd need an Alloc variant for this as well, so taken as a packed u32 maybe this means the user only gets 30 bits or so of custom index-space, but that's almost certainly good enough.

Thoughts?

@Amanieu
Copy link
Contributor Author

Amanieu commented May 3, 2023

I spent some time thinking about what the future API of this crate might look like.

First of all, regarding #5, it is clear that we need more encoding space for Operand. I think the first step here would be to just use a 64-bit Operand and accept the (~3%) performance cost. Later, we can look into potential optimizations with side-tables to see if it is worth the cost in API complexity.

The main change that should happen is renaming PReg (both in the API and in the allocator internals) to Location which is defined as a 16-bit type with the following bit fields:

  • 2: class (Int, Float, Vector)
  • 1: kind (Mem or Reg)
  • 3: reserved for future expansion
  • 10: index (up to 1K values, should be more than enough)

Operand is then changed to be a 64-bit type with the following bit fields:

  • 30 30-bit index for vreg (leaving 2 bits for the register class)
  • 1: kind (Def or Use)
  • 1: pos (Early or Late)
  • 2: class (Int, Float, Vector)
  • 3: constraint (Fixed, Reuse, Reg, Stack, Any)
  • 16: Location for Fixed, index for Reuse.
  • 12: reserved for future expansion, or for size optimizations in the future

Allocation remains mostly the same, except that the 3 allocation kinds become:

  • None
  • Location (previously Reg)
  • SpillSlot (previously Stack)

Overall this is mostly an API/naming change. Most of the regalloc internals remain the same.

@cfallin
Copy link
Member

cfallin commented May 3, 2023

That's certainly an option. At least in Cranelift's use-case, though, we wouldn't accept a 3% regression; and also, 10 bits of index-space for custom locations would not be enough (we need to support up to 10K args for Wasm at least, per this implementation-limits discussion). I still think that having an out-of-band data structure (we can still have an OperandConstraint::UserStack(index) in the unpacked type, and push it to the main and side-vecs in OperandCollector) is the right approach here, because it decouples the upward pressure from index users and downward pressure from performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants