nextpnr-himbaechel placement fails on small design #1425

Open
zapta opened this issue Jan 19, 2025 · 15 comments

@zapta

zapta commented Jan 19, 2025

Consider the design below. It includes N flip-flops and a parity tree with N inputs and one output. With N=260, the build completes successfully and the PnR report shows very low utilization:

yosys -p "synth_gowin -top parity -json _build/hardware.json" -q parity.v
nextpnr-himbaechel --device GW1NR-LV9QN88PC6/I5 --json _build/hardware.json --write _build/hardware.pnr.json --report _build/hardware.pnr --vopt family=GW1N-9C --vopt cst=parity.cst
Using uarch 'gowin' for device 'GW1NR-LV9QN88PC6/I5'

...

Device utilisation:
	                 VCC:       1/      1   100%
	                 IOB:       2/    274     0%
	                LUT4:      90/   8640     1%
	              OSER16:       0/     80     0%
	              IDES16:       0/     80     0%
	            IOLOGICI:       0/    276     0%
	            IOLOGICO:       0/    276     0%
	           MUX2_LUT5:       4/   4320     0%
	           MUX2_LUT6:       0/   2160     0%
	           MUX2_LUT7:       0/   1080     0%
	           MUX2_LUT8:       0/   1080     0%
	                 ALU:     262/   6480     4%
	                 GND:       1/      1   100%
	                 DFF:     261/   6480     4%
	           RAM16SDP4:       0/    270     0%
	               BSRAM:       0/     26     0%
	              ALU54D:       0/     10     0%
	     MULTADDALU18X18:       0/     10     0%
	        MULTALU18X18:       0/     10     0%
	        MULTALU36X18:       0/     10     0%
	           MULT36X36:       0/      5     0%
	           MULT18X18:       0/     20     0%
	             MULT9X9:       0/     40     0%
	              PADD18:       0/     20     0%
	               PADD9:       0/     40     0%
	                 GSR:       1/      1   100%
	                 OSC:       0/      1     0%
	                rPLL:       0/      2     0%
	           FLASH608K:       0/      1     0%
	                BUFG:       0/     22     0%
	                DQCE:       0/     24     0%
	                 DCS:       0/      8     0%
	               DHCEN:       0/     24     0%
	              CLKDIV:       0/      8     0%
	             CLKDIV2:       0/     16     0%

However, when increasing N from 260 to 270, the placement fails with this error:

yosys -p "synth_gowin -top parity -json _build/hardware.json" -q parity.v
nextpnr-himbaechel --device GW1NR-LV9QN88PC6/I5 --json _build/hardware.json --write _build/hardware.pnr.json --report _build/hardware.pnr --vopt family=GW1N-9C --vopt cst=parity.cst -q
ERROR: Unable to find legal placement for cell 'counter_DFF_Q_267_D_ALU_SUM_CIN_ALU_COUT_CIN_ALU_COUT_HEAD_ALULC' of type 'ALU' after 52327 attempts, check constraints and utilisation. Use `--placer-heap-cell-placement-timeout` to change the number of attempts.
0 warnings, 1 error

The sample design used:

parity.v

module parity (
    input      sys_clk,
    output reg led  // Active low
);

  // This works with 260 but fails with 270. 
  localparam SIZE = 260;
  // localparam SIZE = 270;

  // Placement fails around 269.
  reg [SIZE-1:0] counter = 0;

  always @(posedge sys_clk) begin
    counter <= counter + 1;
    led <= ^counter;
  end

endmodule

parity.cst

IO_LOC "sys_clk"   52;
IO_PORT "sys_clk" IO_TYPE=LVCMOS33 PULL_MODE=NONE;

IO_LOC "led"       10; // Active low

Software versions used (on macOS, arm64):

$ yosys --version
Yosys 0.47+149 (git sha1 384c19119, aarch64-apple-darwin22.4-clang++ 18.1.8 -fPIC -O3)

$ nextpnr-himbaechel --version
"nextpnr-himbaechel" -- Next Generation Place and Route (Version nextpnr-0.7-135-g5eaa1b3f)
@yrabbit
Contributor

yrabbit commented Jan 19, 2025

The GW1NR-9 has 45 columns (not counting the 2 IO columns) and each cell accommodates 6 ALUs, so one row fits ~270 ALUs.
Keeping the whole chain in one row is critical because the fast arithmetic carry wire runs from left to right along a single row.
Dragging the carry from the right end to the beginning of the row on the left edge with regular wires seems irrational to me.
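
A quick sanity check of that arithmetic, as a sketch only. The 45 columns and 6 ALUs per cell come from the comment above; the 2-ALU overhead per chain is an assumption inferred from the SIZE=260 report (262 ALUs for a 260-bit counter), not a documented nextpnr figure.

```python
# Figures quoted in the thread; treat them as approximate.
COLUMNS = 45         # usable columns on the GW1NR-9
ALUS_PER_CELL = 6    # ALUs per cell
ROW_CAPACITY = COLUMNS * ALUS_PER_CELL

# Assumption: fixed per-chain overhead, taken from the SIZE=260 report
# above (262 ALUs used for a 260-bit counter).
OBSERVED_OVERHEAD = 262 - 260


def alus_needed(size, overhead=OBSERVED_OVERHEAD):
    """Estimated ALU cells for a `size`-bit counter chain."""
    return size + overhead


def fits_in_one_row(size):
    """True if the estimated chain length fits in a single row."""
    return alus_needed(size) <= ROW_CAPACITY


for size in (260, 268, 269, 270):
    print(size, alus_needed(size), fits_in_one_row(size))
```

Under these assumptions the chain stops fitting at SIZE=269, which is in line with the "fails around 269" note in the sample design.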

@zapta
Author

zapta commented Jan 19, 2025 via email

@yrabbit
Contributor

yrabbit commented Jan 19, 2025

It has nothing to do with frequency - we just don't make ALU chains longer than fits in a row in the chip.

@zapta
Author

zapta commented Jan 19, 2025 via email

@whitequark
Member

Dragging the carry from the right end to the beginning of the row on the left edge with regular wires seems irrational to me.

All other architectures will do this if they have to.

@marzoul
Contributor

marzoul commented Jan 19, 2025

Dragging the carry from the right end to the beginning of the row on the left edge with regular wires seems irrational to me.

Not necessarily all the way to the beginning of another row; only to the leftmost part of the next row such that this slow, non-carry wire is as short as possible.

Only if the chain is longer than 2 full rows should it be routed all the way to the beginning of the next row.

(Such a hard-coded decision to split long chains is the obvious starting point, but ultimately there will be conflicts with other long chains that need to be tightly placed in a small device... tasks for the future.)

Another way of solving the issue is to change the RTL code to adapt to the targeted device, so that the design contains only chains that are short enough. This applies to cases where logic synthesis does not do it automatically, or where reliable portability of the source code is needed.
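
One way to picture the RTL change is a counter split into two shorter chunks, where the high chunk advances only when the low chunk is all ones. The Python model below is a behavioural sketch of that idea, not code from this thread; the function names are hypothetical, and whether a synthesis tool maps each chunk to its own carry chain is an assumption that would need checking on the real flow.

```python
def step_single(value, width):
    """One clock tick of a plain `width`-bit counter."""
    return (value + 1) & ((1 << width) - 1)


def step_split(lo, hi, lo_width, hi_width):
    """One clock tick of the same counter built from two shorter chunks.

    The low chunk increments every cycle. The high chunk increments only
    when the low chunk is all ones, so no carry wire crosses the boundary.
    """
    lo_mask = (1 << lo_width) - 1
    hi_mask = (1 << hi_width) - 1
    lo_all_ones = lo == lo_mask
    new_lo = (lo + 1) & lo_mask
    new_hi = (hi + 1) & hi_mask if lo_all_ones else hi
    return new_lo, new_hi


def matches_single_counter(lo_width, hi_width, cycles):
    """Check that the split counter tracks the plain one cycle by cycle."""
    width = lo_width + hi_width
    single = lo = hi = 0
    for _ in range(cycles):
        single = step_single(single, width)
        lo, hi = step_split(lo, hi, lo_width, hi_width)
        if single != ((hi << lo_width) | lo):
            return False
    return True


print(matches_single_counter(3, 3, 200))
```

The trade-off is that the all-ones detection on the low chunk becomes ordinary logic instead of dedicated carry hardware.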

@zapta
Author

zapta commented Jan 19, 2025 via email

@whitequark
Member

whitequark commented Jan 19, 2025

A ripple-carry adder (RCA) is much simpler than a carry-lookahead adder and its performance is still mostly sufficient, so a lot of digital logic devices implement hardware support for the RCA. A single 4-LUT can process the addition of one bit position, since that requires three inputs (A, B, Ci); output O is provided by the LUT and output Co is provided by the hardware carry block.
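
As an illustration of the structure described here, a minimal bit-level model of a ripple-carry adder follows. It is a textbook sketch, not the Gowin ALU primitive: `full_adder` stands in for one LUT plus carry block, and the function names are chosen for this example.

```python
def full_adder(a, b, cin):
    """One bit position: sum from the 'LUT', carry-out from the carry block."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, width):
    """Add two `width`-bit integers by rippling the carry bit by bit.

    Returns (sum modulo 2**width, final carry-out).
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(5, 3, 4))
```

Each loop iteration depends on the carry from the previous one, which is why the hardware chain has to be laid out as one contiguous run of cells.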

@Ravenslofty
Collaborator

Why use a chain (linear time) and not a tree (logarithmic time)?

This is a case of the hidden factor of big-O mattering.

Because the ripple-carry has dedicated logic and routing between LUTs, carry-in to carry-out propagation time is about a tenth the time of logic propagating through a LUT (not even including routing delay, which is probably double the LUT propagation delay at minimum).

That means your adder needs to be huge for the tree to pay for itself.
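
A toy delay model makes the point concrete. It uses only the rough ratios quoted above (carry hop about a tenth of a LUT delay, routing about twice a LUT delay) and additionally assumes, purely for illustration, that a tree adder needs ceil(log2(n)) levels of one LUT plus one routing hop each. Real devices and real prefix adders will differ.

```python
# Delays in tenths of one LUT propagation delay, to keep the math integral.
LUT_DELAY = 10
CARRY_DELAY = 1      # assumption from the comment: about a tenth of a LUT
ROUTING_DELAY = 20   # assumption from the comment: about double a LUT


def chain_delay(n):
    """Ripple-carry chain: one dedicated carry hop per bit."""
    return n * CARRY_DELAY


def tree_delay(n):
    """Idealised tree adder: ceil(log2(n)) levels of LUT plus routing."""
    levels = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 1
    return levels * (LUT_DELAY + ROUTING_DELAY)


def crossover(limit=4096):
    """Smallest width at which the tree is strictly faster, if any."""
    for n in range(2, limit):
        if tree_delay(n) < chain_delay(n):
            return n
    return None


print(crossover())
```

With these toy numbers the tree only starts to win at a width of a couple of hundred bits, which is the "hidden factor" being described.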

@zapta
Author

zapta commented Jan 19, 2025

@whitequark, if my math is correct, each 4-input LUT contributes on average about 3 inputs. The table is for a tree but should be about the same for 4-input LUTs in a linear chain.

[Image: the table referred to above, not preserved]

@whitequark
Member

Averages mean nothing; you should look into how adder synthesis works.

@zapta
Author

zapta commented Jan 19, 2025 via email

@whitequark
Member

counter <= counter + 1;

This is a 260-bit adder.

@zapta
Author

zapta commented Jan 19, 2025 via email

@whitequark
Member

I'm fairly sure the wide XOR does get synthesized as a tree. At least, abc should do that.
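
For a rough idea of what such a tree costs, the sketch below counts 4-input LUTs for an n-input XOR, assuming each LUT merges up to four signals into one. This is a back-of-the-envelope estimate, not what abc actually emits, and the function names are made up for this example.

```python
def xor_tree_luts(n, lut_inputs=4):
    """LUT count for reducing n signals to one with `lut_inputs`-input XORs.

    Each LUT replaces up to `lut_inputs` signals with a single signal.
    """
    signals = n
    luts = 0
    while signals > 1:
        take = min(lut_inputs, signals)
        signals -= take - 1
        luts += 1
    return luts


def xor_tree_depth(n, lut_inputs=4):
    """Number of LUT levels in a balanced tree over n inputs."""
    depth = 0
    reach = 1
    while reach < n:
        reach *= lut_inputs
        depth += 1
    return depth


print(xor_tree_luts(260), xor_tree_depth(260))
```

For 260 inputs this gives 87 LUTs in 5 levels, the same ballpark as the 90 LUT4s in the utilisation report for SIZE=260.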
