nextpnr-himbaechel placement fails on small design #1425

zapta · 2025-01-19T04:17:36Z

Consider the design below. It includes N flip-flops and a parity tree with N inputs and one output. With N=260 the build completes successfully and the PnR report shows very low utilization:

yosys -p "synth_gowin -top parity -json _build/hardware.json" -q parity.v
nextpnr-himbaechel --device GW1NR-LV9QN88PC6/I5 --json _build/hardware.json --write _build/hardware.pnr.json --report _build/hardware.pnr --vopt family=GW1N-9C --vopt cst=parity.cst
Using uarch 'gowin' for device 'GW1NR-LV9QN88PC6/I5'

...

Device utilisation:
	                 VCC:       1/      1   100%
	                 IOB:       2/    274     0%
	                LUT4:      90/   8640     1%
	              OSER16:       0/     80     0%
	              IDES16:       0/     80     0%
	            IOLOGICI:       0/    276     0%
	            IOLOGICO:       0/    276     0%
	           MUX2_LUT5:       4/   4320     0%
	           MUX2_LUT6:       0/   2160     0%
	           MUX2_LUT7:       0/   1080     0%
	           MUX2_LUT8:       0/   1080     0%
	                 ALU:     262/   6480     4%
	                 GND:       1/      1   100%
	                 DFF:     261/   6480     4%
	           RAM16SDP4:       0/    270     0%
	               BSRAM:       0/     26     0%
	              ALU54D:       0/     10     0%
	     MULTADDALU18X18:       0/     10     0%
	        MULTALU18X18:       0/     10     0%
	        MULTALU36X18:       0/     10     0%
	           MULT36X36:       0/      5     0%
	           MULT18X18:       0/     20     0%
	             MULT9X9:       0/     40     0%
	              PADD18:       0/     20     0%
	               PADD9:       0/     40     0%
	                 GSR:       1/      1   100%
	                 OSC:       0/      1     0%
	                rPLL:       0/      2     0%
	           FLASH608K:       0/      1     0%
	                BUFG:       0/     22     0%
	                DQCE:       0/     24     0%
	                 DCS:       0/      8     0%
	               DHCEN:       0/     24     0%
	              CLKDIV:       0/      8     0%
	             CLKDIV2:       0/     16     0%

However, when increasing N from 260 to 270, the placement fails with this error:

yosys -p "synth_gowin -top parity -json _build/hardware.json" -q parity.v
nextpnr-himbaechel --device GW1NR-LV9QN88PC6/I5 --json _build/hardware.json --write _build/hardware.pnr.json --report _build/hardware.pnr --vopt family=GW1N-9C --vopt cst=parity.cst -q
ERROR: Unable to find legal placement for cell 'counter_DFF_Q_267_D_ALU_SUM_CIN_ALU_COUT_CIN_ALU_COUT_HEAD_ALULC' of type 'ALU' after 52327 attempts, check constraints and utilisation. Use `--placer-heap-cell-placement-timeout` to change the number of attempts.
0 warnings, 1 error

The sample design used:

parity.v

module parity (
    input      sys_clk,
    output reg led  // Active low
);

  // This works with 260 but fails with 270. 
  localparam SIZE = 260;
  // localparam SIZE = 270;

  // Placement fails around 269.
  reg [SIZE-1:0] counter = 0;

  always @(posedge sys_clk) begin
    counter <= counter + 1;
    led <= ^counter;
  end

endmodule

parity.cst

IO_LOC "sys_clk"   52;
IO_PORT "sys_clk" IO_TYPE=LVCMOS33 PULL_MODE=NONE;

IO_LOC "led"       10; // Active low

Software versions used (on macOS arm64):

$ yosys --version
Yosys 0.47+149 (git sha1 384c19119, aarch64-apple-darwin22.4-clang++ 18.1.8 -fPIC -O3)

$ nextpnr-himbaechel --version
"nextpnr-himbaechel" -- Next Generation Place and Route (Version nextpnr-0.7-135-g5eaa1b3f)

yrabbit · 2025-01-19T04:44:52Z

The GW1NR-9 has 45 columns (without 2x for IO) and each cell accommodates 6 ALUs, so one row fits ~270 ALUs.
Keeping the whole chain in one row is critical because the fast arithmetic carry wire runs from left to right along a single row.
Dragging the carry from the right end to the beginning of the row on the left edge with regular wires seems irrational to me.
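
As a rough check of these numbers against the utilisation report in the first post: the overhead of 2 extra ALUs per chain is inferred from that single report, where N=260 uses 262 ALUs, so treat it as an assumption rather than a documented rule.

```latex
\begin{align*}
\text{ALUs per row} &= 45 \times 6 = 270 \\
\text{ALUs needed for an } N\text{-bit counter} &\approx N + 2 \\
N + 2 \le 270 &\iff N \le 268
\end{align*}
```

Under that assumption N=269 would need 271 ALUs and N=270 would need 272, which is consistent with the comment in parity.v that placement fails around 269.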

zapta · 2025-01-19T05:20:03Z

It fails even if --freq is set to a low value. Is a total failure more rational than a successful completion with higher propagation delay? EDIT: The Gowin IDE has no problem handling this successfully.

yrabbit · 2025-01-19T05:44:09Z

It has nothing to do with frequency - we just don't make ALU chains longer than what fits in a single row of the chip.

zapta · 2025-01-19T05:52:04Z

FYI, the Gowin IDE is able to fit with N=1000 (I didn't try a higher number), so there must be a way to fit it.

whitequark · 2025-01-19T13:20:30Z

> Dragging the carry from the right end to the beginning of the row on the left edge with regular wires seems irrational to me.

All other architectures will do this if they have to.

marzoul · 2025-01-19T15:03:33Z

> Dragging the carry from the right end to the beginning of the row on the left edge with regular wires seems irrational to me.

Not necessarily all the way to the beginning of another row; only to the leftmost part of the next row such that this non-carry, slow wire is shortest.

Only if the chain is longer than 2 full rows should this be routed up to the beginning of the next row.

(Such a hard-coded decision about splitting long chains is the obvious starting point, but ultimately there will be conflicts with other long chains that need to be tightly placed in a small device... tasks for the future.)

Another way of solving the issue is to change the RTL code to suit the targeted device, so that the design contains only chains that are short enough. This is for cases where logic synthesis does not do it automatically, or for reliable portability of source code.
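
A minimal sketch of that RTL-level workaround, assuming the only goal is to keep every carry chain shorter than one row. The module name `split_counter` and the parameters `SIZE` and `SEG` are illustrative and do not come from this thread, and the result has not been checked against nextpnr-himbaechel. The low segment counts every cycle and the high segment increments only when the low segment is all ones, so the value sequence matches a single wide counter. The carry between segments is written as a wide AND, which synthesis is expected (but not guaranteed) to build from LUTs rather than from the ALU carry chain.

```verilog
// Sketch only: a wide counter split into two shorter adders.
// SIZE and SEG are hypothetical parameters chosen for illustration.
module split_counter #(
    parameter SIZE = 270,  // total counter width
    parameter SEG  = 135   // width of the low segment
) (
    input                 clk,
    output reg [SIZE-1:0] count = 0
);
  wire [SEG-1:0]      lo = count[SEG-1:0];     // low segment
  wire [SIZE-SEG-1:0] hi = count[SIZE-1:SEG];  // high segment

  always @(posedge clk) begin
    // Low segment: a SEG-bit adder that runs every cycle.
    count[SEG-1:0] <= lo + 1'b1;
    // High segment: a separate adder, enabled only when the low
    // segment is about to wrap from all ones to zero.
    if (&lo)
      count[SIZE-1:SEG] <= hi + 1'b1;
  end
endmodule
```

With SIZE = 270 and SEG = 135, each adder is 135 bits wide, well under the ~270 ALUs per row mentioned earlier in the thread.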

zapta · 2025-01-19T16:17:03Z

A couple of layman questions: 1. Why use a chain (linear time) and not a tree (logarithmic time)? 2. Why does each element in the chain process only 1 input and not let's say 3 (using a LUT)?

whitequark · 2025-01-19T16:29:01Z

A ripple-carry adder (RCA) is much simpler than a carry-lookahead adder and the performance is still mostly sufficient, so a lot of digital logic devices implement hardware support for the RCA. A single 4-LUT can process the addition of one bit position, since it requires three inputs (A, B, Ci); output O is provided by the LUT and output Co is provided by the hardware carry block.
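
As an illustration of the single-bit cell described above, here is a generic ripple-carry adder sketch in plain Verilog. The module names `full_adder_bit` and `ripple_adder` are made up for this example. It shows the logic only and is not how the Gowin ALU primitive is instantiated.

```verilog
// One bit position of a ripple-carry adder (illustrative only).
module full_adder_bit (
    input  a,
    input  b,
    input  ci,
    output o,
    output co
);
  // Sum: a 3-input function, so it fits in a single 4-input LUT.
  assign o  = a ^ b ^ ci;
  // Carry out: on an FPGA this is what the dedicated carry block computes.
  assign co = (a & b) | (ci & (a ^ b));
endmodule

// W bit positions chained through their carry signals.
module ripple_adder #(
    parameter W = 8
) (
    input  [W-1:0] a,
    input  [W-1:0] b,
    input          cin,
    output [W-1:0] sum,
    output         cout
);
  wire [W:0] c;          // c[i] is the carry into bit i
  assign c[0] = cin;

  genvar i;
  generate
    for (i = 0; i < W; i = i + 1) begin : g_bit
      full_adder_bit fa (
          .a (a[i]),
          .b (b[i]),
          .ci(c[i]),
          .o (sum[i]),
          .co(c[i+1])
      );
    end
  endgenerate

  assign cout = c[W];
endmodule
```

The `c[i]` to `c[i+1]` path is the chain under discussion: each bit position depends on the carry from the one before it, which is why the cells need to sit next to each other along the dedicated carry wire.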

Ravenslofty · 2025-01-19T17:41:26Z

> Why use a chain (linear time) and not a tree (logarithmic time)?

This is a case of the hidden factor of big-O mattering.

Because the ripple-carry has dedicated logic and routing between LUTs, carry-in to carry-out propagation time is about a tenth the time of logic propagating through a LUT (not even including routing delay, which is probably double the LUT propagation delay at minimum).

That means your adder needs to be huge for the tree to pay for itself.

zapta · 2025-01-19T19:03:43Z

@whitequark, if my math is correct, each 4-input LUT contributes on average about 3 inputs. The table is for a tree but should be about the same for 4-input LUTs in a linear chain.

whitequark · 2025-01-19T19:19:37Z

Averages mean nothing; you should look into how adder synthesis works.

zapta · 2025-01-19T19:42:26Z

Why an adder? The operation here is not a sum but a wide exclusive-or.

whitequark · 2025-01-19T19:56:46Z

counter <= counter + 1;

This is a 260-bit adder.

zapta · 2025-01-19T22:55:52Z

I see. I assumed that the difficulty here was due to the wide exclusive-or operation rather than the counter, but you are right, there is also a wide addition. I will try with a wide counter alone and see what happens.

whitequark · 2025-01-19T23:00:28Z

I'm fairly sure the wide XOR does get synthesized as a tree. At least, abc should do that.
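
For a sense of scale, here is a back-of-the-envelope estimate for an N-input XOR built only from 4-input LUTs. This is an idealised model, not the actual abc output: each LUT replaces up to 4 signals with 1, so it removes at most 3 signals, and each tree level reduces the signal count by at most a factor of 4.

```latex
\begin{align*}
\text{LUTs} &\ge \left\lceil \frac{N-1}{3} \right\rceil
  = \left\lceil \frac{259}{3} \right\rceil = 87
  \quad (N = 260) \\
\text{depth} &\ge \left\lceil \log_4 N \right\rceil
  = \left\lceil \log_4 260 \right\rceil = 5
\end{align*}
```

The estimate of 87 LUTs is close to the 90 LUT4 cells in the utilisation report from the first post, and a depth of about 5 LUT levels stands in contrast to the 260-bit carry chain of the counter.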
