Extreme Stack Machines
Chuck Moore is the founding father of Forth. He has always sought simplicity in implementing software and silicon. Savage minimalism is apparent in the design of the SEAforth chips produced by IntellaSys, for whom Chuck Moore is the CTO. Everything about these chips is different. The result is astonishing processing power in a tiny chip that uses very little power. The SEAforth 40C18 has gone to production and will be shipping in December 2008. A full toolchain is available based around VentureForth, a Forth system tuned to the needs and capabilities of the SEAforth chips.
I do not have the space to do more than outline the capabilities of such a device and to discuss some of the software issues involved in writing applications for it. Technical information is available at the IntellaSys website.
Each chip has 40 cores. Each core contains a C18 stack machine, ROM, RAM, and interconnects to its nearest neighbors. The ROM for each core contains a BIOS that can be used by application code in RAM. A BIOS for an edge core can contain code that emulates a serial port, an SPI port, or a DRAM controller. Interconnects on cores at the edge of the chip can include I/O. The inner cores lacking I/O provide functions needed by their neighbors.
Each core runs asynchronously; there is no common clock or crystal. When a core reads data from or writes data to a neighbor that is not ready, it goes into a low-power sleep until the neighbor completes the transfer. Each core runs at 600+ MHz, giving a total processing power of up to 26 billion operations per second (about 650 million per core across the 40 cores). Programs are loaded from an external EEPROM or Flash.
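The blocking transfer between neighbors can be modeled in software. The following Python sketch is only an illustration of the idea, not SEAforth code: each "core" is a generator that asks to read or write a named port, and a small scheduler puts it to sleep until its neighbor arrives at the other end. The names `run`, `producer`, `consumer`, and the port name "east" are hypothetical.

```python
from collections import deque

def producer(port, values):
    """A core that writes each value to a port, blocking until it is taken."""
    for v in values:
        yield ("write", port, v)

def consumer(port, count, out):
    """A core that reads `count` values from a port, blocking until each arrives."""
    for _ in range(count):
        v = yield ("read", port)
        out.append(v)

def run(cores):
    """Run the cores until all have finished or all are asleep.

    Returns how many times a core had to sleep waiting for its neighbor.
    """
    ready = deque((core, None) for core in cores)  # (core, value to hand in)
    sleeping_readers = {}  # port -> core waiting for data
    sleeping_writers = {}  # port -> (core, value) waiting to be read
    sleeps = 0
    while ready:
        core, arg = ready.popleft()
        try:
            request = core.send(arg)
        except StopIteration:
            continue  # this core has finished its program
        if request[0] == "write":
            _, port, value = request
            if port in sleeping_readers:
                # The neighbor is already waiting: complete the transfer.
                ready.append((sleeping_readers.pop(port), value))
                ready.append((core, None))
            else:
                sleeping_writers[port] = (core, value)
                sleeps += 1
        else:
            _, port = request
            if port in sleeping_writers:
                writer, value = sleeping_writers.pop(port)
                ready.append((core, value))
                ready.append((writer, None))
            else:
                sleeping_readers[port] = core
                sleeps += 1
    return sleeps

received = []
run([producer("east", [1, 2, 3]), consumer("east", 3, received)])
print(received)
```

No clock appears anywhere in the model. Whichever side reaches the port first simply waits, which is the property that lets each real core run at its own speed.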
That was the good news. Each core is tiny, with an 18-bit data bus and 64 18-bit words each of ROM and RAM. Instructions are packed four to a word. The core implements a subset of the A and B register system discussed in the previous section. The B register is optimized to be a pointer. The two stacks cache the top items (two for the data stack and one for the return stack), and then contain an additional eight items arranged as a circular list. Every time a stack is pushed or popped, the next or previous item is selected. An odd but useful side effect of this is that a pop does not actually destroy data; the item just goes to the bottom of the stack. Devious programmers take advantage of this feature to use the stacks as small data caches. Unless you have followed Chuck Moore's work over the last 10 or so years, you will find this design radically different from anything else. In practice, it's just another CPU whose subtleties you have to learn.
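The circular stack is easier to grasp with a model. This Python sketch is an illustration under my own assumptions, not a description of the silicon: the class name `CircularStack`, the use of a list for the cached registers, and the index arithmetic are all hypothetical. It keeps the top items in "registers" and the rest in an eight-entry ring that is never cleared.

```python
class CircularStack:
    """Model of a stack with cached top items above an 8-entry circular list."""

    def __init__(self, cached=2, ring=8):
        self.regs = [0] * cached   # regs[0] is the top of the stack
        self.ring = [0] * ring     # circular list below the cached items
        self.idx = 0               # ring slot just below the cached items

    def push(self, value):
        # The deepest cached item spills into the next ring slot,
        # overwriting whatever was there (the oldest entry).
        self.idx = (self.idx + 1) % len(self.ring)
        self.ring[self.idx] = self.regs[-1]
        self.regs = [value] + self.regs[:-1]

    def pop(self):
        value = self.regs[0]
        # Refill from the ring. The slot is NOT cleared, so its contents
        # come round again if popping continues.
        self.regs = self.regs[1:] + [self.ring[self.idx]]
        self.idx = (self.idx - 1) % len(self.ring)
        return value

data = CircularStack(cached=2, ring=8)   # data stack: two cached items
for v in range(1, 11):
    data.push(v)
first = [data.pop() for _ in range(10)]  # the ten items, newest first
again = [data.pop() for _ in range(8)]   # the eight ring items come round again
print(first, again)
```

Popping past the items you pushed does not underflow; the ring contents simply repeat. That is what makes the stack usable as a small read-only cache. Passing `cached=1` gives the corresponding model of the return stack.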
With a maximum of a few hundred opcodes in each core, your major problem is not programming each core but floor planning. Chuck Moore says that he can program a core in a day. People with more experience of programming SEAforth than I have repeatedly say that getting the floor plan and interconnects right is the major part of programming a SEAforth chip.
You have to decide how to partition the cores and explore the consequences of how each core is used. To illustrate the issues of floor planning, I look at a QPSK transmitter and receiver system implemented on the previous IntellaSys 24-core device. Compared to a conventional microcontroller, the SEAforth devices have very few peripherals in hardware, the main ones being digital-to-analogue and analogue-to-digital converters. The ring of cores around the edge of the chip is programmed to perform functions that would normally be performed in hardware. Later versions of the SEAforth family will have enough cores to approach the "one core per pin" model.
The 24-core chip is a 6x4 array. The reference clock input (core 4) and the two outputs (cores 21 and 22) have to be at an edge. Because of other functions needed in other cores, the path through the chip includes one core (core 9) which simply passes data. Core 16 provides data to the modulator in core 15. What we see here is that each core performs a simple task and then passes data to the next core in the chain. In this example the partitioning is relatively simple. The receiver is a different matter.
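Floor-plan constraints of this kind are easy to check mechanically. The Python sketch below assumes that cores are numbered row by row from 0 across the six columns. That numbering is my assumption, not something stated by IntellaSys, although it is consistent with the core numbers quoted above. The helper names `position`, `is_edge`, `are_neighbors`, and `check_plan` are hypothetical.

```python
COLS, ROWS = 6, 4  # the 24-core array described in the text

def position(core):
    """Return (row, col) for a core, ASSUMING row-major numbering from 0."""
    return divmod(core, COLS)

def is_edge(core):
    """True if the core sits on the outer ring of the array."""
    row, col = position(core)
    return row in (0, ROWS - 1) or col in (0, COLS - 1)

def are_neighbors(a, b):
    """True if two cores are directly connected (one step up, down, left or right)."""
    (ra, ca), (rb, cb) = position(a), position(b)
    return abs(ra - rb) + abs(ca - cb) == 1

def check_plan(edge_cores, links):
    """Return a list of problems found in a proposed floor plan."""
    problems = []
    for core in edge_cores:
        if not is_edge(core):
            problems.append(f"core {core} needs a pin but is not at an edge")
    for a, b in links:
        if not are_neighbors(a, b):
            problems.append(f"cores {a} and {b} are linked but not adjacent")
    return problems

# The constraints stated in the text: clock in on core 4, outputs on
# cores 21 and 22, and core 16 feeding the modulator in core 15.
print(check_plan(edge_cores=[4, 21, 22], links=[(16, 15)]))
```

A check like this catches a link between cores that are not adjacent before any core is programmed. Under the same numbering assumption, core 9 is an interior core, which fits its role as a pure pass-through.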
It is at this point that we see the need for a model to help us partition programs across cores. We can view each core as a node executing a process block. Each node is connected to its neighbors by signal carriers (sometimes called "wires"). Thinking of the chip as a set of process blocks and signal carriers is the key to successfully factoring problems to fit the SEAforth chip. Multicore chip programming introduces communication as a design element that is as important as the programming of the cores.
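The process-block view maps naturally onto a chain of small functions, each doing one job and handing its output to the next. The sketch below uses Python generators as stand-ins for cores. The stage names, the bit-pairing step, and the phase table are hypothetical and are not taken from the IntellaSys design; the only facts relied on are that QPSK carries two bits per symbol and that one block in the chain may do nothing but pass data along.

```python
def bit_source(bits):
    """Process block: emit the input bits one at a time."""
    for b in bits:
        yield b

def pair_bits(stream):
    """Process block: group bits into 2-bit symbols (QPSK is two bits per symbol)."""
    it = iter(stream)
    for hi in it:
        lo = next(it, 0)  # pad an odd-length input with a 0 bit
        yield (hi << 1) | lo

def pass_through(stream):
    """Process block: a core that only relays data to its neighbor."""
    for item in stream:
        yield item

# Hypothetical Gray-coded mapping from symbol to carrier phase in degrees.
PHASES = {0b00: 45, 0b01: 135, 0b11: 225, 0b10: 315}

def modulator(stream):
    """Process block: turn each symbol into a phase."""
    for symbol in stream:
        yield PHASES[symbol]

def transmitter(bits):
    """Wire the blocks together; each call is one 'signal carrier'."""
    return list(modulator(pass_through(pair_bits(bit_source(bits)))))

print(transmitter([0, 0, 0, 1, 1, 1, 1, 0]))
```

Each block is trivial on its own. The design effort goes into deciding which blocks exist and how they are wired together.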
In the receiver floor plan, the designer has to take account of where the analogue-to-digital converters are placed on the chip. The majority of application code runs from RAM. This permits a programmer to redefine the function of a core at run time. One example of this is selecting the modulation in a multi-mode receiver. Because the SEAforth 40C18 is a low-power device, it is ideal for software-defined radio (SDR) applications. All this raises the question of how programs are loaded into RAM.
The architecture of the core places the I/O ports in memory from which code can be executed. If the BIOS code sets the core to execute an instruction from an interconnect port and there is no data, the core goes to sleep until the data is available. If the data is an opcode set (up to four instructions), the instructions are executed. It is possible in these four instructions to read a data block into core memory or to pass it on to another port. The latter is known colloquially as a "port pump". Port pumps need not use any local RAM space. When a chip is reset, the BIOS in one or more edge cores looks for data, e.g., from a serial SPI Flash device. The start of this stream triggers a port pump to transfer data (code) to a neighboring core, which in turn passes the stream to its neighbors.
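The way a boot stream ripples through a chain of cores can be sketched as follows. The stream layout used here (a count of words to pump onward, those words, then a count of words to keep, then those words) is a hypothetical format invented for illustration; the real SEAforth boot format is defined by the IntellaSys tools and documentation. The function names `build_stream` and `boot` are likewise assumptions.

```python
RAM_WORDS = 64  # each core has 64 words of RAM

def build_stream(images):
    """Nest per-core code images for a linear chain of cores.

    images[0] is for the boot core, images[1] for its neighbor, and so on.
    Hypothetical layout: [pump_count, *downstream, own_count, *own].
    """
    if not images:
        return []
    own = images[0]
    if len(own) > RAM_WORDS:
        raise ValueError("image does not fit in a core's RAM")
    downstream = build_stream(images[1:])
    return [len(downstream)] + downstream + [len(own)] + own

def boot(stream, n_cores):
    """Deliver a nested stream along a chain; return what lands in each core's RAM."""
    rams = []
    incoming = list(stream)
    for _ in range(n_cores):
        if not incoming:
            rams.append([])  # nothing was sent this far down the chain
            continue
        pump_count = incoming[0]
        # Port pump: these words go straight on to the next core
        # and are never stored in this core's RAM.
        forwarded = incoming[1:1 + pump_count]
        own_count = incoming[1 + pump_count]
        start = 2 + pump_count
        rams.append(incoming[start:start + own_count])
        incoming = forwarded
    return rams

images = [[1, 2], [3], [4, 5, 6]]
print(boot(build_stream(images), n_cores=3))
```

In this model each core first relays everything destined for the cores beyond it and only then loads its own code, so the pump costs the core no RAM.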
You can take advantage of this mechanism to aid development and debugging. The compiler remains interactive after your code has compiled, and is connected to a core on the target device through an umbilical link.
The use of interactive compilers and umbilical debugging links is common practice in cross-compiled Forth systems. These same techniques are used to load and execute programs in multicore chips. Because port pumps can be essentially non-invasive, they can be used for debugging as well as for loading programs. Once you have selected a boot core for the umbilical link, you can get to any core you want to debug. However, there is not quite a free lunch.
In practice, you work by debugging the nodes furthest away first, and then pull back toward the boot/umbilical link core. Because each core is asynchronous, the simulator cannot be cycle accurate across multiple cores. As with any software, you have to do some testing on the real target. You have to design testability into the system as well as design the code.
Debugging is a major issue in multicore programming, regardless of the interconnect system, and is a work in progress for all multicore architectures. The approach used in the IntellaSys SEAforth chips provides small low-power multicore chips that can be programmed by mortals like us. It is the change in thought processes in going from single core programming to multicore programming that causes the learning curve.
Acknowledgements
The staff at IntellaSys provided a great deal of support over the last year, in particular Chet Brown, Debbie Davis, Dean Sanderson and Dylan Smeder. Chuck Moore started all this a while ago, and has shown what can be done with so little.
Gary Bergstrom has provided trenchant comments and input for this and the previous article. His code is published with permission.