-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prune the compiled code to its minimum expression #639
base: dev_branch
Are you sure you want to change the base?
Conversation
@davidmallasen @davideschiavone With the latest modifications i see that the whole code + data of a small application could very well fit inside 1 memory bank. Yet there is the limitation of needing to have 2 at least. Why is there such limitation? |
In principle there is not, it's legacy, if you want we can remove it - still, having 2 banks would be more efficient than one bank (Harvard architecture) so maybe two 16KB memories is better than one single 32kB - however, two 16kB of memories together are bigger than one bank of 32kB - so it's a trade-off - but only one bank would be very inefficient (2 cycles to fetch an instruction every time it is done at the same time of any other load or dma transaction, and 3/4 cycles for a lw/sw from the cve2 CPU) |
I agree, but for HEEPidermis, for example, there is a chance that accesses to memory from peripherals are sparse. |
having two small ones should be already possible (AFAIK) - maybe having only 1 is more tricky as the mcu-gen has probably been tailored around having at least 2, which is not just for the HW but for the generation of the linker scripts - nothing too hard to fix, but not just a parameter |
btw - question @JuanSapriza , why do we have the print functions if we are not using it in your example? shouldn't that be pruned? |
Tomorrow i will test having two smaller ones. I think i tried it a while ago but dont remember the result. WIll also see how to have only one. Regarding the printfs, it's sadly not as simple. The stdlibs are included in some of the drivers, so we might need to do some heavy prunning. Today i managed to remove several things, but realized that the cost of making the compilation flow so much more complex and sensitive vs having 4kB of printf definitions was probably not worth it now. |
With 2x16 kB banks everything fine. A Hello World:
|
@davideschiavone is suggesting to try this out. https://github.com/Velko/FsLibc/tree/master |
Just tried with 1 mem bank. Only needed to change a check in system.py where it was verifying that mem banks >=2 .... beyond that everything else worked out of the box jaahahah |
I went down to the syscalls, handlers and asserts and removed the use of Things that i need to check: 🔲 Changing this exception printings might break everything - we need tests over exceptions! |
Removed 1.7kB of program memory by adding a macro that resolved the UART NCo in compile time instead of before every printf. This does not stop you from changing the baudrate in runtime, but you need to recompute the NCO. |
* Don't disable global interrupts with DMA interrupt * Add vsim option * Add DMA functionality for SPI subaddressing * Add setup functions to bsp * Fix DMA subaddressing mode. * Add tests for DMA subaddressing mode. * Fix verible * Debug in CI * Delete commented parts in dma.sv and uncomment test sections in example_dma_subaddressing. Restore original version of example_power_manager. * Add hardware fifo mode to DMA This DMA mode allows data to be forwarded to an external accelerator with I/O fifos tightly coupled with the DMA * Fix verible * Revert "Debug in CI" This reverts commit 560df9e. * Fix CI * Test CI * Modify tb.vlt Add testharness signals related to hw fifo mode to tb.vlt, as it may happen that they are actually undrive if no accelerator is coupled with the DMA * Fix example_power_manager dma.c change in DMA_CSR_REG_MIE_MASK in this PR causes the external interrupts to be disabled before the GPIO-related code portion in the example_power_manager. The previous version relied on the dma HAL to enable the external interrupts. The example has been modified to enable the external interrupts in the main. * Revert "Fix CI" This reverts commit ecdd737. * Revert "Test CI" This reverts commit a12dbea. * Blacklist dma subaddress mode in CI * Add documentation and image for new DMA modes * Fix hw_fifo package imports and add copyright * Fix hw_fifo_ctrl files * Fix verible --------- Co-authored-by: alessionaclerio22 <[email protected]>
* Added script to visualize memory usage after compilation * Improved clarity * Corrected the script that was not taking the values correctly in the vcase of no interleaved bus * removed the need for numpy * Corrected parsing of max memory size of code * Improved way of computing start and end of text and data * Added comments and simplified some functions * Removed the parsing of the map file. Now parsing the output of readelf * Fixed issues with representation * Catch problems with flash_exec * added readelf to the CI * added apt insall binutils * Simply quit the script if readelf not available * removed all version * corrected flash load * Fixed display of interleaved space
From the remaining 6.8 kB I see that the distribution is more or less as follows:
I trust we can reduce the size of those interrupts based on the following: Size of each fast interrupt handlerEach interrupt handler is dominated by the push/pop of all the registers.
|
In order to print an error message in the case where an interrupt that does not have an associated handler we were using the Sacrificing those messages the code can be reduced by 3kB when you do not use printf. |
Added a function 8 kB of code int main(){
printf("hi world");
return 0;
} 3.7 kB of code int main(){
_writestr("hi world");
return 0;
} |
|
#define INTERRUPT_HANDLER_ABI __attribute__((aligned(4), naked)) | ||
|
||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is dangerous, i think this is wrong. When you jump to an async event (and return from it) you need to push to the stack all the registers you are gonna modify within the ISR and then restore them. How do you know that you are gonna use only those 5 registers ?
Especially if they are week, so could be redefined by users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the handler the registers used are well known, the problem is indeed in the weak implementation.
If I managed to make all the registers be backed-up upon entering the weak implementation, that would be perfect. However, I could not manage to make it work. Can you think of anything that would prevent that?
I would not add the _writesrtr function among the system calls. It's not standard. Regarding the interrupt puts, that's legacy from the core testbench, i would simply remove it or call exit or whatever dealing with the testbench. We don't need to print anything here |
Where would you add it? Needs to be somewhere low as
|
Handlers should not print in general, but indeed assert could. For now leave it there |
7c1c330
to
94f2e17
Compare
…educe 24->16kB code size
…bring some troubles
… computed every time before sending a printf.
…ced during the inclusions needed for puts
…functions to only use _writestr
94f2e17
to
0e13225
Compare
I tested this on FPGA and also works with hello world and example_dma. |
can you @JuanSapriza please remove the ISRs 32 to 63? https://github.com/esl-epfl/x-heep/blob/main/sw/device/lib/crt/vectors.S#L84 basically deleting from line 84 to 117 included |
/* | ||
REMOVING THIS CALL REDUCES CODE SIZE CONSIDERABLY | ||
This will not print an error message in case of | ||
jumping to an unknown error handler. | ||
la a0, unknown_msg | ||
jal ra, puts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we remove these puts? either replacing them with _write or simply handling them with the testbench? can be done here on in another Pr in the future, in this case pls open an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave the comment to leave a record that they can be added if someone ever needs them, if you uncomment it it should still work. I can add this additional info if you think thats ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no I just mean removing the puts, not the ISRs - maybe replacing them with _write?
/* | ||
THESE STRINGS ARE NOT LONGER NEEDED | ||
Only the last two were used, and their | ||
call was removed to reduce base code size | ||
.section .rodata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they are not needed why don't we delete them?
sw/device/lib/runtime/handler.c
Outdated
} | ||
|
||
__attribute__((weak)) void handler_ecall(void) { | ||
printf("Environment call encountered\n"); | ||
//printf("Environment call encountered\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why commenting it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, this should be changed to a _writestr as in the others
|
||
// Requested baudrate is too high for the given clock frequency. | ||
if (nco != nco_masked) { | ||
if (uart->nco != nco_masked) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is nco set and where the division is performed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should i add a comment on this?
@@ -49,20 +50,19 @@ pid_t _getpid (void); | |||
int _isatty (int __fildes); | |||
int _link (const char *__path1, const char *__path2); | |||
_off_t _lseek (int __fildes, _off_t __offset, int __whence); | |||
ssize_t _read (int __fd, void *__buf, size_t __nbyte); | |||
int _read (int __fd, void *__buf, int __nbyte); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you check this also with the OpenHW Group compiler used in HEEPatia? and with LLVM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with clang they compile (in the CI). I can check with the OHW compiler on HEEPatia. But i would say if this does not work with that compiler we open an issue later and i fix it quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately I do not agree as as soon as X-HEEP gets revendorized in HEEPatia, everything goes broken - I am sure these changes may break it indeed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am checking now, will let you know in a while
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it works with the OHW compiler.
However, few things to fix:
- The app size scripts goes crazy when code is on the IL banks -- will do a PR later
- OHW compiler requires you to explicitly include "syscalls.h" -- will try running all the apps later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I compiled all the apps from HEEPatia with make test TEST_FLAGS=--compile-only
and 33 apps do not compile, most of them did already not compile on HEEPatia, and the others do not seem to be related to these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is crazy, i thought @LuigiGiuffrida98 and @davidmallasen fixed them...
Let's discuss this tomorrow
// We assume that the lowest baudrate will be 9600, and the largest 256000. | ||
// Given this range, by dividing by 100 it always remains integer and below 32-bits. | ||
// This saves us the need of performing 64-bit divisions to compute NCO. | ||
#define UART_NCO ((uint32_t)(((uint32_t)((uint32_t)(UART_BAUDRATE/100))<<20)/((uint32_t)(REFERENCE_CLOCK_Hz/100)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we are using the FLL and we change the reference clock? did you try this in HEEPatia for example? I suggest you re-vendor this fork of X-HEEP in HEEPatia to try things out (tagging @davidmallasen for information)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an interesting question. In theory, as long as the frequency is multiple of 100 it should work just fine. this could be not true if e.g. we use 32'768 Hz.
What i can do is add a macro that chooses the way of computing this: precise/light. Then, would this be chosen at which level? Not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, a macro could be fine, like "TARGET_USE_FLL" or "USING_FIXED_CLK" etc. Keep in mind that this will increment the maintenance effort.... how big is the div64 bit function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
almost 2kB!!! it would be 30% of the code of the minimum application
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I added an #ifdef
in syscalls
to prevent using a NCO that might not be defined.
That way if NCO is defined (you are using one of the FPGA targets, then we use the hardcoded NCO (because also the system clock will be hardcoded --i understand--). If that is not the case (e.g. you are targetting asic) then you compute the NCO with the 64-bit division.
uart.base_addr = mmio_region_from_addr((uintptr_t)UART_START_ADDRESS);
uart.baudrate = UART_BAUDRATE;
uart.clk_freq_hz = soc_ctrl_get_frequency(&soc_ctrl);
#ifdef UART_NCO
uart.nco = UART_NCO;
#else
uart.nco = ((uint64_t)uart.baudrate << (NCO_WIDTH + 4)) / uart.clk_freq_hz;
#endif
Thanks @davidmallasen !!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this by removing the define for TARGET=pynq-z2
and compiles 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also works fine on FPGA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only thing I do not like is that for chips (or any other thing) that will have a fixed clock frequency, we need to remember to define UART_NCO, which is little descriptive about it... we should have something more generic like "USE_FIXED_CLK_FREQ" or any other thing... we can discuss about it -
as I am on this topic, back at HEEPocrates time (with my old crappy Make flow for SW), the target folder was also exposed at HEEPocrates level (i.e. different than the X-HEEP one) while I think now it is lost, which is wrong... we should have a target for each CHIP (see indeed the fixed FLL thing) - for HEEPnosis the only thing I had to change was the USE_TARGET_FLL which we solved it by having the COMPILIER_FLAGS passed command line (thx @JuanSapriza ) - but in principle, it does not scale - we should have a TARGET for the chip itself, and another one (maybe called BOARD or whatever else) for the BOARD, being the testbench in case of X-HEEP_SIM, being something else in the X-HEEP_FPGA_PYNQ, etc... now they are a bit mixed, but if you see HEEPocrates it has many targets depending on the configuration of the testing board, the simulation, etc... we should re-think this - this thing I am writing here does not belong to this PR - it's just a not that I will now transfer to a Github issue - just posting it here to open the discussion on this PR tomorrow (pls @JuanSapriza add this PR in the heepidermis slides)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Yesterday with @davidmallasen we reached a similar conclusion (or at least that crossed my mind).
|
the CPU has at most 32 ISRs (or let's say, the ID of the last one is 31, but physically they are 19) so it is physically impossible to jump to ISR 32 to 63. Don't know why the code is there, must be a copy and paste from another testbench |
@@ -83,38 +83,6 @@ vector_table: | |||
j __no_irq_handler | |||
// 64-32 : not connected on Ibex | |||
j verification_irq_handler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the ISR number 32? i.e. the 33th? if so we need to delete this one as well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but do you know why would it be different? (it did not jump to _no_irq_handler
Objective
The objective of this PR is to minimize the code as much as possible.
Some new improvements will not affect compilation at all, such as adding
--gc-sections
, which simply removes unused sections and can free several kB.Other changes might be more drastical and I will set them as optional and heavily disadviced configurations.
Background
Just leaving here as documentation. I am compiling the following code
The initial size was 24 kB. By removing the unused sections it goes down to 12 kB.
The remaining 12 kB are mostly due to:
Tasks
🟩 Remove unused sections
🟧
Not compile unused interrupt handlers⚠️ 🟩 Prune functions in syscalls to make the lighter
🟩 Remove unnecessary calls to standard libraries
🔲
Optionally make data come immediately after text🔲
Reduce default data usage🟩 Allow to have only 1 memory bank
🔲
Fix the use ofCOMPILER_FLAGS=-Os
❓ Move the power manager from
0x3000
(or any rounded address)🔲 Remove handlers 32-63 from here
🔲 Protect against frequencies not multiple of 100 in the calculation of the NCO
Important⚠️
This PR builds on top of #636. It is blocked by the merging of that one first (or i will have to cherry pick changes)