I've done everything I wanted to do to optimize the code and move everything I could into flash (program) space. The result is a Forth that can use 1.4K of the Arduino UNO's 2K SRAM. Not bad, especially when you consider that half of the remaining memory (roughly 0.6K in total) is basically reserved for the run-time (CPU) stack, and this number can be tweaked.
Lots of Forth code can fit now...
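
For anyone wondering what "moving things into flash" looks like on an AVR, here is a minimal sketch of the general technique. It is not the actual code of this project: the word table and the `findWord` helper are made up for illustration, and the `#else` branch is only a host fallback I'm assuming so the same file also builds on a PC. The idea is that constant data (such as word names) is tagged `PROGMEM` and then read with the `_P` variants from avr-libc, so it never gets copied into SRAM.

```cpp
#include <stdint.h>
#include <string.h>

#ifdef __AVR__
#include <avr/pgmspace.h>   // real PROGMEM and strcmp_P on the UNO
#else
// Host fallback (assumption, so the sketch also builds on a PC):
// there is only one address space, so plain strcmp does the job.
#define PROGMEM
#define strcmp_P(ram, flash) strcmp((ram), (flash))
#endif

// Hypothetical word-name table, NOT the project's real dictionary.
// PROGMEM keeps the strings in flash, so they take no SRAM.
const char kWordNames[][6] PROGMEM = { "dup", "drop", "swap", "over" };
const uint8_t kWordCount = sizeof(kWordNames) / sizeof(kWordNames[0]);

// Return the index of `token` in the flash table, or -1 if it is unknown.
// `token` lives in SRAM (e.g. the input buffer); the names stay in flash.
int findWord(const char *token) {
    for (uint8_t i = 0; i < kWordCount; ++i) {
        if (strcmp_P(token, kWordNames[i]) == 0) {
            return i;
        }
    }
    return -1;
}
```

Without `PROGMEM`, the AVR startup code would copy those strings from flash into SRAM before `main` runs, which is exactly the kind of cost that adds up on a 2K part.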
I proceeded with the "hello world" of the HW universe :-) https://www.youtube.com/watch?v=xePollbCzow