Documentation of J1 Versions.

There are many different versions of the J1 processor. Here I will introduce them.


The J1 processor is a stripped-down 16-bit stack machine designed to be very small and very fast. Its original job was to copy bits from a camera into UDP packets, and that is all it had to do: no interrupts, no lights, nothing else. It is less than 200 lines of Verilog and quite understandable, but it does require dual-port RAM. An instruction is read every clock cycle, and on average every tenth clock cycle application data is read or written.

The J1 by itself was otherwise not very useful, so James Bowman released the J1a, J1b, and J4. Mecrisp-Ice then released a bunch more versions.

The j1a is able to run on SRAMs that are only pseudo-dual-port, whereas the original J1 was designed to run on true dual-port SRAMs, which aren't available on all FPGAs.

https://github.com/jamesbowman/swapforth/issues/74
The J1 and J1B are all about one basic Forth instruction per clock, and so need true dual-port SRAM: the first port is always available each clock to read the instruction, while the second port may optionally read or write RAM, allowing the @ and ! words to run in a single cycle without interfering with the next instruction fetch.

Pseudo-dual-port SRAMs can read from one address while writing to a different address, but each of the two ports is dedicated to either reading or writing. With true dual-port SRAM, each port can read OR write. It's possible to 'emulate' a true dual-port SRAM using only pseudo-dual-port blocks, but it costs performance, since you need to clock the SRAM twice as fast to do it.
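For illustration, here is the kind of Verilog that infers a pseudo-dual-port block RAM (module and signal names are mine, not taken from j1a): one port can only read while the other can only write.

    // Pseudo dual-port RAM: a dedicated read port plus a dedicated write
    // port. True dual-port RAM would instead have two ports that can
    // each read OR write on any given cycle.
    module pdp_ram (
      input  wire        clk,
      input  wire [10:0] raddr,   // read-only port
      output reg  [15:0] rdata,
      input  wire        we,      // write-only port
      input  wire [10:0] waddr,
      input  wire [15:0] wdata
    );
      reg [15:0] mem[0:2047];
      always @(posedge clk) begin
        rdata <= mem[raddr];
        if (we) mem[waddr] <= wdata;
      end
    endmodule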

ice40-architecture FPGAs only have pseudo-dual-port embedded RAM blocks, and the j1a was written to run on an ice40hx1k chip.

So to accommodate memory access beyond just reading the next instruction, the j1a core has an 'alternate' mode, signalled by setting pc[12] in the program counter: if set, the next 'instruction fetch' is really just the second half of a two-cycle @ instruction started on the previous cycle; otherwise it's a normal instruction fetch. The return stack is used to save the actual next instruction location, so it also gets popped into the PC when this happens.

You can see pc[12] concatenated into the instruction decode on line 44 of j1a/verilog/j1.v, and likewise on lines 78, 88, 97 and 104, since any behaviour that normally depends on instruction decode needs to do something different during the second phase of @.
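As a sketch of the idea (illustrative only, not the actual j1a source; the real decode has many more cases):

    module decode_sketch (
      input  wire [12:0] pc,
      input  wire [15:0] insn,   // during phase 2 of @, this holds the RAM data
      input  wire [15:0] st0,
      output reg  [15:0] st0N
    );
      always @*
        casez ({pc[12], insn[15:8]})
          9'b1_????????: st0N = insn;               // phase 2 of @: RAM data becomes TOS
          9'b0_1???????: st0N = {1'b0, insn[14:0]}; // ordinary literal push
          default:       st0N = st0;                // ...and so on for the other opcodes
        endcase
    endmodule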

Of course, this means there is no need for opcode 8'b011?1100, which the j1b needs so that the second port's read data can be put onto the stack.

Instead, that opcode is used for a 'minus' op in the j1a; otherwise - would have to be compiled into a defined word as INVERT 1+ + rather than being a normal single instruction like + is.
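Without a dedicated opcode, - falls back on the two's-complement identity a - b = a + invert(b) + 1, built from single-instruction primitives; something like this sketch:

    \ a - b = a + (not b) + 1
    : -   invert 1+ + ;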

The other difference can be seen if you diff j1a/basewords.fs against j1b/basewords.fs: the j1a has opcodes for 2/ and 2*, whereas the j1b instead has opcodes for rshift and lshift, since the j1b has a full shifter unit rather than a single-step-only one.
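Conversely, on the j1a a multi-place shift has to be looped out of the single-step opcodes. A minimal sketch (not necessarily how swapforth's nucleus actually defines it):

    \ shift x left by n places, one 2* step per loop iteration
    : lshift ( x n -- x' )   0 ?do 2* loop ;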


pc[12] isn't actually used to address the SRAM: the SRAM is generated by j1a/mkrom.py so that the initial contents can be set at FPGA compile time, meaning the FPGA also bootstraps the core at configuration time. (This isn't so necessary now that the icestorm tools can replace SRAM contents without a recompile, but they couldn't do that back when the j1a was written, and it's a neat way to make the FPGA configuration logic do your SoC core's bootstrapping too.)

Which is to say that j1a.v, the 'top' for the j1a core, includes ../build/ram.v, which you won't find anywhere except as the template inside j1a/mkrom.py.

The highest RAM fetch address bit the design uses is code_addr[11] (which is pc[11]); higher bits are ignored. The design has 2^12 = 4096 addresses, but they're stored in two 2048-word blocks, each consisting of 8 2048x2 memories to make up the 16 bits.
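Behaviourally, that arrangement amounts to something like the following sketch (illustrative; the generated ram.v really instantiates the hard 2048x2 block-RAM primitives, eight per half, with their initial contents filled in by mkrom.py):

    // 4096x16 code RAM as two 2048-word halves selected by code_addr[11]
    module coderam_sketch (
      input  wire        clk,
      input  wire [11:0] code_addr,
      output reg  [15:0] insn
    );
      reg [15:0] lo [0:2047];
      reg [15:0] hi [0:2047];
      always @(posedge clk)
        insn <= code_addr[11] ? hi[code_addr[10:0]]
                              : lo[code_addr[10:0]];
    endmodule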

It's a little confusing IMHO, but 'din' in ram.v is the connection carrying data from the RAM to the core, and vice versa for 'dout'.

Another thing which makes the J1 very fast: note that top-of-stack st0 and next-on-stack st1 are not actually both stored in the stack modules. This is because you very often want to change both in one cycle, so st0 is actually an ordinary register, as are pc and dsp, the latter only being used to keep track of stack depth.

It makes one realize that pc is really the true top of the return stack, and Forth words like >r are really writing to 'next on return stack'. Every cycle starts with a memory read at the top of the return stack, be it to fetch an instruction or just to load TOS from RAM.

Stack movements are just encoded as two-bit signed integers in the ALU opcode format, one for the return stack and one for the data stack, although -2 isn't used. I suppose if you had some reason to pop a double-word in one cycle, you might change the stacks to allow that. basewords.fs defines r-2 but never uses it.
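A sketch of that decoding (field positions and pointer widths are illustrative):

    module spdelta_sketch (
      input wire        clk,
      input wire [15:0] insn
    );
      wire [1:0] dd = insn[1:0];        // data-stack delta: 00=0, 01=+1, 11=-1
      wire [1:0] rd = insn[3:2];        // return-stack delta (10, i.e. -2, unused)
      reg  [4:0] dsp, rsp;              // stack depth counters
      always @(posedge clk) begin
        dsp <= dsp + {{3{dd[1]}}, dd};  // sign-extend the 2-bit delta and add
        rsp <= rsp + {{3{rd[1]}}, rd};
      end
    endmodule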

You could in principle have opcodes that replace any number of stack items: you'd just rearrange the core so that the top few logical stack items are also registers, like st0 is, allowing the core to potentially update them all at once. Handy if you wanted to put an op for m*/ in there!

This makes the J1 design pretty interesting for custom FPGA SoC use, IMHO.

Of course, in practice I've found it much easier to extend the I/O section (in icestorm/j1a.v) to allow just hooking up 'accelerator' units, added to the design on an as-needed basis.

The only 'deep' core modding I did was the j4a, which is kind of 4x j(1/4)a in a sense. It has 4x the context, and 'looks' like a quarter-speed j1a to the code... until you put the other 'cores' to work (they're logical only; the ALU, SRAM and IO are all shared).

Mainly it just has funky 'stack' modules, with a little bit of pipelining and tuning. It's probably got a bug somewhere, but it has mostly worked out pretty well for me.

It lets me run multiple dumb spin-loop bit-bang IO routines to control and talk to different chips at different rhythms, without any interlocks or glitches. Just a maximum of 4 'threads', but that is heaps for simple things like a PID controller.

A nice consequence is that you can have a spin-loop-based app running and still talk to swapforth over RS-232 to get/set variables in SRAM without any timing changes. You can even actively hack and rewrite code for different jobs without upsetting the ones that are running at all.

Having no DRAM, no wait cycles, no bubbles, and only an 'emergency' interrupt system (to recover crashed cores) is incredibly freeing when you're writing a real-time controller. It's kind of like having an RTOS in hardware, only better, since the timing is FPGA-state-machine rock-solid and interlocks are impossible.

Anyway, the code is so short and beautiful for the J1 cores that 'documenting' them is probably more about learning to read Verilog than anything else.

Better to have a single source of truth and all that.
But certainly you are free to write your own paper on it ;)

One interesting observation: insn[12] is never actually used; it's completely ignored in all J1 cores.

There are other parts of the instruction space which are 'available':

The J1 uses a 4-bit field to select one of 16 ops, but that could easily be extended to one of 32 ops, since that thirteenth bit is already 'free'...
Also, within the 'func' codes in insn[6:4], only 5 of the 8 possible combinations are used.



