Memory Management
Memory is the limiting resource on the iCE40 FPGAs.
The largest iCE40 FPGA has 1 Mbit of single-port memory, divided into 4 blocks of 256 Kbits, each organized as 16-bit-wide words. The FPGA also has 80 Kbits of dual-port memory. The Hana CPU allocates one large block to each large core, and 20 Kbits of dual-port memory to each small core. To the developer, each large core looks like a 16-bit processor with 16K words of memory, and each small core looks like a 16-bit processor with 1.25K words of 16-bit memory. Because instructions are compressed, 4K words of memory could probably hold about 7K instructions.
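The figures above can be checked with a few lines of arithmetic. This is only a sketch: the names are made up for illustration, and it assumes binary Kbits (1 Kbit = 1024 bits). The limit of four small cores and the average instruction size are inferred from the numbers in the text, not stated by it.

```python
KBIT = 1024  # assumption: binary Kbits

def words(bits, word_width=16):
    """Number of whole words of the given width that fit in `bits`."""
    return bits // word_width

single_port_bits = 4 * 256 * KBIT        # four large blocks, 1 Mbit in total
large_core_words = words(256 * KBIT)     # one large block per large core
small_core_words = words(20 * KBIT)      # dual-port share of one small core
max_small_cores = (80 * KBIT) // (20 * KBIT)  # inferred upper bound

# "About 7K instructions in 4K words" implies this average instruction size.
avg_bits_per_instruction = (4 * 1024 * 16) / (7 * 1024)

print(large_core_words)                    # 16384, i.e. 16K words
print(small_core_words)                    # 1280, i.e. 1.25K words
print(max_small_cores)                     # 4
print(round(avg_bits_per_instruction, 2))  # 9.14
```

So the estimate of 7K instructions in 4K words corresponds to an average of a little over 9 bits per instruction.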
The Hana processor is based on the J1 and Mecrisp CPUs, which require two-port memory access: every clock cycle, the J1 processor reads an instruction from memory, and can also read or write data in memory. That works fine on more expensive FPGAs with dual-port memory, but breaks down on the single-port memory of the more economical iCE40 FPGAs. So in the large cores, we have to worry about contention for memory. The simple solution is to give priority to the data reads and writes that are actually required, and to pause instruction fetches. But that would slow the processor.
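To see why the simple solution slows the processor, here is a toy cycle count. It is not the Hana arbiter, only a sketch that assumes one memory port use per instruction fetch and one more per data read or write. The function names and the trace format are made up for illustration.

```python
def cycles_dual_port(trace):
    """Fetch and data access happen in the same cycle on separate ports."""
    return len(trace)

def cycles_single_port(trace):
    """Data access has priority, so the next fetch waits one extra cycle."""
    return sum(2 if needs_data else 1 for needs_data in trace)

# True marks an instruction that reads or writes data memory.
trace = [False, True, False, True]

print(cycles_dual_port(trace))    # 4
print(cycles_single_port(trace))  # 6
```

In this model every load or store costs an extra cycle, so code that touches memory in half of its instructions runs at two thirds of the dual-port speed.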
To reduce the number of instruction fetches, and to minimize memory consumption, instructions are compressed. Hana instruction size varies from 4 bits to 8 bits. The most frequently used instructions are stored in 4 bits. That includes call, exit, swap, >r, r>, dup, and over. Less frequently used instructions require 8 bits. Instructions which require an address, such as read, write, jump and conditional jump, require 4 bits plus the address. Small addresses (<4096) are stored in 12 bits (4 bits plus 1 byte). The Hana processor also compresses small literals. Small literals (between -127 and 128) are also stored in 12 bits.
The compression algorithm is available in the open source Forth code. If you want documentation, just ask. I expect it to change rapidly, so I am not yet documenting it. In particular, I am not yet confident which Forth commands are the most frequent. I also expect the choice of words implemented in hardware to evolve rapidly.
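As a rough illustration of these sizes, the sketch below adds up the encoded length of a short instruction sequence. It is not the real compression algorithm, which lives in the Forth source. The set of 4-bit words is only a subset of those listed above, treating every other named word as an 8-bit instruction is an assumption, and the comparison against one 16-bit word per instruction is a hypothetical uncompressed baseline.

```python
FOUR_BIT_WORDS = {"exit", "swap", "dup", "over"}  # subset of the 4-bit group
SMALL_LITERAL_BITS = 12                           # size stated for small literals

def encoded_bits(token):
    """Estimated size in bits of one instruction or small literal."""
    if isinstance(token, int):
        if -127 <= token <= 128:  # small-literal range as given in the text
            return SMALL_LITERAL_BITS
        raise ValueError("size of large literals is not specified here")
    if token in FOUR_BIT_WORDS:
        return 4
    return 8  # assumption: any other word is a less frequent instruction

def total_bits(tokens):
    return sum(encoded_bits(t) for t in tokens)

program = ["dup", 5, "swap", "over", "exit"]

print(total_bits(program))  # 28
print(len(program) * 16)    # 80, with one 16-bit word per instruction
```

Under these assumptions the five-token sequence fits in two 16-bit words instead of five.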
To further avoid contention and speed up processing, instructions are prefetched. They are read whenever possible and cached. This includes reading instructions at jump addresses, so that frequently there is no stall during a jump. Thus there is no need for instruction inlining, which saves precious memory.
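The effect of prefetching can be sketched with a toy model of straight-line code on a single memory port. This is not the Hana prefetcher: the unbounded buffer, the fixed number of instructions per fetched word, and all names are assumptions made for illustration, and jumps are left out.

```python
def simulate_prefetch(trace, instrs_per_word=2):
    """Return (cycles, stalls) for straight-line code on one memory port.

    trace[i] is True when instruction i needs the port for a data access.
    """
    total = len(trace)
    buffered = 0  # instructions waiting in the prefetch buffer
    fetched = 0   # instructions fetched from memory so far
    executed = 0
    cycles = 0
    stalls = 0
    while executed < total:
        cycles += 1
        if buffered == 0:
            # Nothing to execute: the port fetches and the core stalls.
            got = min(instrs_per_word, total - fetched)
            fetched += got
            buffered += got
            stalls += 1
            continue
        needs_data = trace[executed]
        executed += 1
        buffered -= 1
        if not needs_data and fetched < total:
            # The port is idle this cycle, so prefetch the next word.
            got = min(instrs_per_word, total - fetched)
            fetched += got
            buffered += got
    return cycles, stalls

print(simulate_prefetch([False, True, True, False]))  # (5, 1)
```

In this example the only stall is the initial fetch. Without compression and prefetching, the same four instructions with two data accesses would need six port uses, one per fetch and one per data access.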
The large memory blocks on the iCE40 boards are 16 bits wide, so a 16-bit processor can access memory in a single cycle. For applications which require wider data, a good compromise is to go with 24-bit-wide data. That is described below.