The default version of the system is made up of a controller, two nodes, a memory controller and a NoC. To keep things simple, only one node is shown in the images. The controller, node and memory controller are also referred to as components: higher-level modules without any logic of their own, which should make the documentation more organized.
There are a lot of different licenses used for the different parts. With the exception of the NoC, everything is licensed under a free-software-compatible license. The license of the NoC restricts its use to research.
1.3. Design decisions
- The whole system should be as easy to understand as possible. Code is often written in a very verbose way and clever tricks are avoided.
- It should be as reusable as possible. Every module of the system has a specific function and could be used in a different system. The additional communication this requires is accepted.
- The NoC bridges are not considered to be part of the NoC, but of the nodes. This should make it easier to swap the NoC for a different one.
- It was made with simulation in mind. There are parts of the system (like the defines_xxx.vh files) that might be cumbersome to use for synthesis especially if the block designer is used.
- For FPGAs the only proprietary software considered is Vivado because as the old saying goes: You either love Vivado or you have never used Quartus Prime before.
1.4. Control scheme
The control is done over the NoC and uses the control module to record what programs the nodes should execute.
1.4.1 Step 1 – node reading a program (no program set)
Once the system is started, the self-aware modules of the nodes start to read from the control to see if there are any programs their respective CPUs should execute. This is done via a simple AXI read request. The programs are represented by their starting address in memory, which is returned to the self-aware module in the rdata field. If rdata is 0, there is no program set for the node. Address 0 is reserved for the start of the controller program, so no other program can start at this address.
The read request from the self-aware module is special because of the address that is used. Every read and write request where the msb of the address is set is considered a control signal. In the memory controller, the detector identifies these requests by checking the address and forwards these requests to the control by setting the demux output. Every other request is sent to the memory.
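The routing decision of the detector can be sketched in C as follows. This is only a software model under the assumption of 32-bit addresses; the names route_request, CONTROL_BIT and demux_target_t are made up for this illustration and do not appear in the RTL.

```c
#include <stdint.h>

/* Assumption: 32-bit addresses, so the msb is bit 31. */
#define CONTROL_BIT (UINT32_C(1) << 31)

/* Hypothetical names for the two demux outputs. */
typedef enum { TARGET_MEMORY = 0, TARGET_CONTROL = 1 } demux_target_t;

/* Mirrors the detector: the target is chosen from the address alone. */
static demux_target_t route_request(uint32_t addr)
{
    return (addr & CONTROL_BIT) ? TARGET_CONTROL : TARGET_MEMORY;
}
```

A read of 0x80000098 would be routed to the control, while a read of 0x1f0e8 would go to the memory.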
1.4.2 Step 2 – controller setting a program
Via the software running on the controller CPU a program can be set for a node like this:
NODE_6 = PROG_MUL;
The controller software makes heavy use of #defines. For the previous assignment the following is necessary:
#define PROG_MUL ( 0x1f0e8 )
#define NODE_6_ADDR ( 0b10000000000000000000000010011000 )
#define NODE_6 *( (volatile int*) NODE_6_ADDR )
An explanation of these lines:
- The 1st line defines a hex value that is the starting address of the corresponding program in memory. If a CPU started to read from this address, it would execute the mul program. Please note that the addresses are byte addresses: if the memory is modeled using an array of words, the corresponding index is 0x1f0e8 / 4.
- The 2nd line is the address which is used to identify the node in the control module. Although it looks more like an ID, it is referred to as an address, as this is the address that is used in the AXI requests. It is declared using binary, which causes warnings during compilation but makes reading the value a lot simpler. As you can see, the msb is set to 1. Recall that this indicates a control signal: requests using this address are sent to the control module and not to the memory.
Going from right to left, i.e. from the back to the front:
- The first 2 bits are set to 0. This is due to the byte addressing previously mentioned. There are no sections of memory that can be accessed using these bits and the PicoRV32 that is used for the CPU even discards them. If a different CPU were used, the bits could be utilized, although this could cause incompatibilities and should be avoided.
- The next 5 bits identify the node. In this case those bits are 00110, which is 6 in decimal and explains why the corresponding define is called NODE_6_ADDR.
- The rest of the bits are used to declare the nature of the control signal. Looking only at this remaining section, its lsb is set to 1, which tells the control module that the request is about setting a program.
- The 3rd and last line hides all the crazy pointer stuff behind a single define making the code a lot more readable.
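The bit layout described above can be checked with a few lines of C. This is a sketch and not part of the controller software: the helper names node_prog_addr, node_id_of and word_index are invented here, and the values of PICO_LSB, PICO_MSB and INDEX_PROG are inferred from the description (2, 6 and 7) rather than copied from the define files.

```c
#include <assert.h>
#include <stdint.h>

#define CONTROL_BIT  (UINT32_C(1) << 31) /* msb: request goes to the control */
#define INDEX_PROG   7                   /* assumed: bit 7 = "program" request */
#define PICO_LSB     2                   /* assumed: node ID occupies bits 6..2 */
#define PICO_MSB     6
#define NODE_ID_MASK ((UINT32_C(1) << (PICO_MSB - PICO_LSB + 1)) - 1)

/* Hypothetical helper: builds the control address used to set a program. */
static uint32_t node_prog_addr(unsigned node_id)
{
    assert(node_id <= NODE_ID_MASK);
    return CONTROL_BIT | (UINT32_C(1) << INDEX_PROG)
         | ((uint32_t)node_id << PICO_LSB);
}

/* Hypothetical helper: extracts the node ID from a control address. */
static unsigned node_id_of(uint32_t addr)
{
    return (unsigned)((addr >> PICO_LSB) & NODE_ID_MASK);
}

/* Byte address to index in a memory modeled as an array of 32-bit words. */
static uint32_t word_index(uint32_t byte_addr)
{
    return byte_addr / 4;
}
```

With these assumptions node_prog_addr(6) yields 0x80000098, which is the binary value of NODE_6_ADDR written in hex, and word_index(0x1f0e8) yields 0x7c3a.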
If the previously noted code:
NODE_6 = PROG_MUL;
is executed by the CPU, the result is an AXI write request. The address is NODE_6_ADDR and the data is PROG_MUL. Contrary to the controller, the self-aware modules don’t run any software. They are fully implemented in hardware.
On the hardware side the code for the control module can be reduced to:
pico_sel = latched_awaddr[ PICO_MSB:PICO_LSB ];
if ( latched_awaddr[ INDEX_PROG ] == 1'b1 ) begin
    axi_offsets[ pico_sel ] = latched_wdata;
    if ( latched_wdata == 0 )
        active[ pico_sel ] = 1'b0;
    else
        active[ pico_sel ] = 1'b1;
end
- The ID of the node that this request concerns is extracted.
- If the bit at index 7 (INDEX_PROG) is set to 1 the request is about setting a program.
- The starting address of the program is stored in the wdata field of the AXI request and saved in an array using the ID of the node as the index.
- If wdata is 0, the write comes from the self-aware module of a node and indicates the completion of its program. Anything else comes from the controller software and sets an address.
Here you can also see that the node is set to active / busy even though at this point the CPU of the node is not executing anything. The CPU is only started after the self-aware module reads the address.
To recap: The assignment of a program is simply a write to a memory address. The address is 0b10000000000000000000000010011000 and the data is 0x1f0e8. Due to the msb being 1 the write is sent to the control where the other bits of the address are used to determine what the control signal is about. In this case it is setting a program for node 6 which causes the 0x1f0e8 to be written into an array at index 6.
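For experimenting without a simulator, the write path of the control module can be mirrored in C. This is a rough software model of the hardware excerpt above, not the actual implementation: NUM_NODES and control_write are hypothetical names, and the bit positions are the ones inferred from the address description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES  32  /* assumed: 5-bit node ID field */
#define PICO_LSB   2   /* assumed: node ID starts at bit 2 */
#define INDEX_PROG 7   /* bit 7 marks a "program" request */

static uint32_t axi_offsets[NUM_NODES]; /* program start address per node */
static bool     active[NUM_NODES];      /* busy flag per node */

/* Software model of a write that has already been routed to the control. */
static void control_write(uint32_t awaddr, uint32_t wdata)
{
    assert(awaddr >> 31); /* only requests with the msb set arrive here */

    unsigned pico_sel = (awaddr >> PICO_LSB) & (NUM_NODES - 1);

    if ((awaddr >> INDEX_PROG) & 1u) {
        axi_offsets[pico_sel] = wdata;
        /* 0 signals completion, anything else assigns a program */
        active[pico_sel] = (wdata != 0);
    }
}
```

Calling control_write(0x80000098, 0x1f0e8) stores 0x1f0e8 at index 6 and marks node 6 as busy; a later control_write(0x80000098, 0) clears both again.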
1.4.3 Step 3 – node reading a program
Once the self-aware module reads an address that is not 0, it turns on the CPU and sets the AXI offset to the address it read. This offset is needed as every CPU starts reading from address 0 and each program is compiled the same way. The offset essentially moves all the reads and writes to the right memory space. So instead of reading from 0, the CPU reads from 0+0x1f0e8 in the case of the previous example.
TODO maybe add code of the self-aware module
The same signal that is used to turn on the CPU is also connected to the AXI mux and switches the AXI communication from the self-aware module to the CPU. The self-aware module now stops reading from the control and waits for the CPU to finish. Please note that during this time the done signal of the self-aware module is asserted, which can look like a deadlock.
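The behaviour of the self-aware module during this step can be approximated by the following C model. It is a simplified sketch of the description above and not derived from the RTL; the struct and function names are invented, and the AXI mux is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state of one self-aware module. */
typedef struct {
    bool     cpu_on;     /* turns on the CPU and also selects it in the AXI mux */
    uint32_t axi_offset; /* start address of the assigned program */
} self_aware_t;

/* Called with the rdata of each poll of the control module. */
static void self_aware_poll_result(self_aware_t *sa, uint32_t rdata)
{
    assert(sa != NULL);
    if (!sa->cpu_on && rdata != 0) { /* 0 means "no program set" */
        sa->axi_offset = rdata;
        sa->cpu_on = true;
    }
}

/* Address seen by the memory for an access issued by the CPU. */
static uint32_t cpu_to_memory_addr(const self_aware_t *sa, uint32_t cpu_addr)
{
    assert(sa != NULL);
    return cpu_addr + sa->axi_offset;
}
```

After a poll that returns 0x1f0e8, a CPU fetch from address 0 ends up at 0x1f0e8 and a fetch from address 8 at 0x1f0f0.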
In the hardware such a read can be processed in the control module like this:
pico_sel = latched_araddr[ PICO_MSB:PICO_LSB ];
if ( latched_araddr[ INDEX_PROG ] == 1'b1 )
    latched_rdata = axi_offsets[ pico_sel ];
1.4.4 Step 4 – node’s CPU executing the program
The CPU is now executing the program. Instead of reading from the control, the CPU is reading from the memory as the msb of the addresses is never set to 1. Although the CPU of a node could also read from the control it is not intended to do so. Think of the CPU as a brainless unit just executing instructions without realizing the existence of the other CPUs.
While the CPU is running, the controller can read from the control to get the state of the nodes. This simply returns the busy flag register. There is no way for the controller to interfere with the execution in any way or terminate it.
The busy flag register can be obtained like:
int busy = GET_BUSY;
There are defines in ./TODO/defines.h that make checking the status of the individual nodes easy. One example is NODE_1_READY, which can be used in an if to check whether a new program can be assigned to node 1.
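How such a check might look is sketched below. Please treat the bit layout as an assumption: the sketch supposes that bit n of the busy flag register is 1 while node n is busy, which may differ from the real defines in defines.h. The helpers node_ready and first_ready_node are hypothetical and only operate on a value that has already been read via GET_BUSY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit n of the busy register is 1 while node n is busy. */
static bool node_ready(uint32_t busy, unsigned node_id)
{
    assert(node_id < 32);
    return ((busy >> node_id) & 1u) == 0;
}

/* Lowest-numbered idle node among the first num_nodes nodes, or -1 if none. */
static int first_ready_node(uint32_t busy, unsigned num_nodes)
{
    assert(num_nodes <= 32);
    for (unsigned n = 0; n < num_nodes; n++) {
        if (node_ready(busy, n))
            return (int)n;
    }
    return -1;
}
```

In the controller software this could be used as first_ready_node(GET_BUSY, 2) for the default system with two nodes.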
1.4.5 Step 5 – node’s CPU signaling completion
Once the CPU is finished, it writes a certain value to a certain address. This is detected by the detector, which signals the completion to the self-aware module. The self-aware module turns the CPU off and writes a 0 to the control, indicating that there is now no program set for this node. Once this request has been received, the control also clears the node’s flag in the busy register, so the controller learns of the completion the next time it reads the busy flag register.
You can find the corresponding instructions at the end of the start.S file. The inclusion of this write can be controlled via a define. TODO.
It is possible for the controller to set a program for a busy node, but this will never be executed as the value is set back to 0 once the self-aware module signals the CPU’s completion.
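The lost assignment can be reproduced with a tiny C model of a single entry in the control module. This is an illustrative sketch only; node_slot_t, slot_write and assign_if_idle are invented names, and the guard shown is one possible way for controller software to avoid the problem, not something the system enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the control module's state for one node. */
typedef struct {
    uint32_t offset; /* program start address, 0 = no program set */
    bool     busy;   /* flag reported in the busy register */
} node_slot_t;

/* Any write to the node's program address, from either side. */
static void slot_write(node_slot_t *s, uint32_t wdata)
{
    assert(s != NULL);
    s->offset = wdata;
    s->busy = (wdata != 0);
}

/* Controller-side guard: only assign a program when the node is idle. */
static bool assign_if_idle(node_slot_t *s, uint32_t prog)
{
    assert(s != NULL);
    if (s->busy)
        return false;
    slot_write(s, prog);
    return true;
}
```

If slot_write is called with a second program while the node is busy, the completion write of 0 that follows erases it, so the second program never runs. Checking the busy flag first, as assign_if_idle does, avoids this.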
By default every CPU is connected to a debugger to show what it is currently doing. This is the case for the nodes as well as the controller itself. Using special print functions in the programs, ASCII characters can be collected by the debugging system and sent to a host computer or testbench. The main goal of the debugger relay is to condense all the output into a single connection, which makes it easy to use a UART connection.
The output of the debuggers is connected to a debugger relay, where all the characters are buffered until a newline (10 in decimal) is received. At this point the debugger relay starts to output the characters together with an additional signal that is set to 1 for a single clk cycle. The message always starts with an identifier to match the message to a CPU. At the moment this ID depends on the order in which the CPU’s debugger is connected to the debugger relay, as it is based on the ASCII character for 0 and incremented for each additional debugger. According to the ASCII table this means that special characters are used after node 9, something that should be improved in the future. This ID was previously sent by the debuggers themselves, which knew the node ID and were able to use it. However, this needed bigger buffers in the debugger relay, and resource efficiency was the highest goal at the time this system was created, as an FPGA was targeted.
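The buffering scheme can be modeled in C roughly like this. The sketch only follows the description above; the buffer size, the number of debuggers and all names (relay_t, relay_push) are assumptions, and the per-character valid signal of the hardware is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_DEBUGGERS 4  /* hypothetical number of connected debuggers */
#define BUF_SIZE      64 /* hypothetical buffer size per debugger */

typedef struct {
    char   buf[NUM_DEBUGGERS][BUF_SIZE];
    size_t len[NUM_DEBUGGERS];
} relay_t;

/*
 * Feeds one character from the debugger connected at `port`.
 * Returns the number of characters written to `out`, which is 0 until a
 * newline completes the message. `out` must hold BUF_SIZE + 1 characters.
 */
static size_t relay_push(relay_t *r, unsigned port, char c, char *out)
{
    assert(port < NUM_DEBUGGERS);

    /* Keep one slot free so the terminating newline always fits;
       characters beyond that are dropped. */
    if (c == '\n' || r->len[port] < BUF_SIZE - 1)
        r->buf[port][r->len[port]++] = c;
    if (c != '\n') /* '\n' is 10 in decimal */
        return 0;

    size_t n = r->len[port];
    out[0] = (char)('0' + port); /* debug ID derived from connection order */
    memcpy(out + 1, r->buf[port], n);
    r->len[port] = 0;
    return n + 1;
}
```

With this scheme a debugger at position 10 would be reported as ':', the ASCII character following '9', which is the special-character issue mentioned above.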
At the moment it is not possible to dynamically connect all nodes to the debugger relay and this has to be done by hand. Furthermore, the connections are not as efficient as possible, as every debugger is connected via a 7-bit parallel connection which could be reduced to a serial one. This would mean that for each character 7 bits have to be transmitted one after another, which would degrade the system’s performance as the CPU has to stall until the character has been transmitted. This could be avoided by adding a buffer to the debugger that allows it to send all the characters while the CPU is already working on the next instructions.
There are different IDs used in the system that must not be confused.
- Node ID: Using a parameter while instantiating a node sets the ID that is used by the control and the debugging system. This ID is also used in the controller software to set programs and read the node status.
- NoC ID: Used for the communication over the NoC and not as noticeable. The ID depends on which ports of the NoC a node is connected to and is used to determine the sender of packets.
- Debug ID: The first character sent by the debugger relay. It depends on the order in which the debuggers are connected: it starts at the ASCII value for the character 0 and is incremented for each debugger. This should be fixed soon.
1.7. Known problems
There are a few problems that are known and tolerated at the moment. If something does not work in a different situation (e.g. on an FPGA), they might be the cause. Some might be related to bugs in the tools and are going to be reported once the issues can be condensed into smaller examples.
- In the controller component one AXI Light interface requires a parameter.
The line in question: if_axi_light #( .AXI_WSTRB_WIDTH(`AXI_WSTRB_WIDTH) ) if_axi_light_debugger();
Without the parameter this causes the following error:
%Error: Internal Error: ../../rtl/controller.sv:40: ../V3LinkDot.cpp:1317: No symbol for interface alias rhs
Solution: Just provide the parameter as this is only redundant information.
- The read_resp task in the AXI Light interface requires non-blocking assignments (<= instead of =).
The lines in question: rresp <= t_rresp; rdata <= t_rdata;
Solution: Ignore for the time being and hope for the best.
Resolved: This problem apparently just disappeared. It was most likely a side effect caused by a different bug.
- The debugging system causes a segmentation fault in the testbench.
In the sim_main.cpp the chars from the debuggers are collected in a string and printed to the terminal once a newline has been received. On one computer this is not possible as the char array causes a segmentation fault.
Solution: Print every char directly without collecting them in a string.
- The debug function print_dec() does not always work.
Solution: Use print_hex instead.
- It is not possible to set the entry point during compilation. The compiler always defaults to calling the main function.
Solution: The function in the assembly startup file has been renamed to main to make sure this one is called. The main in main.c has been renamed to my_main and called in the aforementioned assembly file.
- Stack pointer is used before it is set.
When libraries are linked, the stack pointer is used during some initialization before it can be set in the main function located in start.S.
Solution: The stack pointer is set in the CPU and constant for every program.
Resolved: The stack pointer can now be set before it is used.
2. Getting started
Please note: The system was developed on a standard GNU/Linux distribution and I am unsure how well everything works on Windows or macOS.
2.1.1. RISC-V GNU toolchain
Specifically the toolchain for RV32I.
The Makefiles expect the toolchain to be installed in /opt/riscv32i/. It is advised that the following guide is used for the installation:
2.1.2. Verilator
For simulating the system.
The newest version available is recommended:
2.1.3. GTKwave – optional
To display the tracefile and only used for debugging.
Any version your package manager offers should suffice.
2.2. Running a taskset on a system
Go to the directory ./tools/ and run “python3 ./main.py _2x2_main.conf”.
The main Makefile can be run from the project root.
| Command | Effect |
|---|---|
| make | compiles the HDL |
| make run | executes the simulation |
| make wave | executes the simulation with a tracefile enabled |
| make clean | removes the compiled simulation environment and any tracefile |
| make sw | compiles all the programs and the controller software |
| make clean_sw | removes all the compiler output of the programs and controller software |
| make programs | compiles all the programs |
| make clean_programs | removes all the compiler output of the programs |
| make controller | compiles the controller software |
| make clean_controller | removes all the compiler output of the controller software |
Located in ./sw/programs
This Makefile is used by all the programs and should not be called from the ./sw/programs directory. Instead, each program directory contains a Makefile where specific additions can be made, like the inclusion of an additional library.
| Command | Effect |
|---|---|
| make small | compiles the code for a small node (rv32i) |
| make big | compiles the code for a big node (rv32im) |
| make clean | removes compiler output |
This Makefile produces many different files for debugging purposes. The files rv32i_main.hex and rv32im_main.hex are the ones used by the system.
Located in ./sw/controller
Similar to the programs Makefile, but kept separate should the need for a greater difference arise.
In contrast to the AXI Light interface, the NoC one does not contain any tasks at the moment.
Once the master bridge has received a response, the response is merged and sent to the flit buffer, which sends it back over the NoC.
The flit buffer collects every flit that is received via the NoC. By observing the bits identifying the sender, each flit is placed in the respective buffer. Once all the flits of one AXI request have been collected, their data is merged and passed to the master bridge, where the AXI request is extracted and sent to the bridge_master. While the AXI request is being processed, the flit_buffer continues to collect flits from the NoC, but only sends the next request to the master_bridge once the previous one has been returned.
The memory shipped with the system should only be used during simulation. Due to difficulties synthesizing the memory correctly it is advised to use one that is provided by the tools.
7.8.1 register based memory
This is a very simple memory only meant to be used with the Verilator. Due to the way the memory is accessed it cannot be synthesized as a BRAM; Vivado, for example, will use registers instead, severely reducing the efficiency of the system. This is most notable when a post-implementation timing simulation is run. Synthesizing a small memory and running it on an FPGA does work however (if my memory serves me well).
7.8.2. bram based memory
The bram based memory was an effort to create a memory that Vivado recognizes correctly.
If I remember correctly, Vivado did create a BRAM, but there were issues reading from it. During post-implementation timing simulations the PicoRV32 would assert the unknown instruction trap because an earlier read had returned wrong values. This was discovered by Sayandip De. Using the BRAM IP-Core provided by Vivado fixed this issue.
The debugger is listening for a specific write that uses a predefined address.
If this address is detected, the write data is passed to the top level together with a signal pulse indicating the arrival of a new char to be output.
In the simulation environment this is handled in the sim_main.cpp file where the chars of the respective debugger modules are collected until a newline is received indicating that the debug string is complete. At this time the debug string is printed to the terminal.
The debugger_relay collects all the messages sent by the different debuggers and outputs them on a single output. It is meant to be connected to a UART to allow communication with a host computer.
Internally the debugger_relay contains buffers for all the debuggers and once a newline (TODO add hex value) has been received the message is output.
- If new debuggers are added, they have to be manually connected to the debugger_relay in the top file. This could be avoided by using the generate construct, but was omitted as adding the signals by hand is not too bad in the short term.
- There was an effort to use the UART IP core found in Vivado that uses an AXI-Interface, but this interface never sent a response to the debugger_relay. All further efforts in this regard have been abandoned.
Software compiled with no initialization is considerably smaller and the preferred way, especially if many programs need to be placed in memory. It is advised to just try running the programs without initialization and see if it works. By default software is compiled with initialization.
TODO how to change that
The initialization should be found in some crt0.S file, but I was not able to locate this file, only its compiled counterpart crt0.o (TODO somewhere in the toolchain). As a quick fix this mystical file was copied into the programs directory and linked there by hand. It is possible to look into the file by (TODO objdump), but no attempts at reverse engineering have been made and the – for this CPU – unfit initialization has been kept.
Additionally nostart is set so the place of where crt0 will be in the resulting program can be influenced.
The first file that should be executed is sp.S or sp_nostart.S depending on the need for initialization. Both set the stack pointer and only differ in what function they call next. sp.S calls _start that is located in crt0.o. main is called sometime later by unknown forces while sp_nostart.S calls main directly. Please note that main is not as usual in main.c, but in the start.S (TODO rename file) file where the registers are initialized. There the actual main function is called which has been renamed to my_main.
The initialization can also be omitted by disabling the linker entirely (TODO arg., I think it is -c).
As is evident there is a lot of room for improvement.
A linker script has been avoided as it would only increase the complexity of the system.
There are a few things to keep in mind:
- Due to difficulties setting the entry point, the main function has to be called my_main at the moment.
- There are various print functions in the util.h (found in _libs) that can be used with the debugger. They should be used as little as possible as they can greatly increase the size of the program.
- At the end the function signal_fin() has to be called to signal to the self-aware module that the program execution is finished. An endless loop afterwards is recommended. This should be moved into the start.S in the future.
- The system does not support malloc. There is a library provided called memmgr (found in _libs) that can be used to replace the usual functions. Please have a look at dhrystone to see how it can be used.
- No optimization (-O0) is advised. -O1 optimizes the mul away and anything higher causes EBREAKs / ECALLs. Normally the latter gives control to an underlying system, but as nothing is there the CPU crashes.
The source code should tell you everything you need to know, especially the defines.h. If anything is unclear, please feel free to contact me.
This tool can be used to translate the instructions from binary to hex. This makes inspecting the program files for debugging purposes easier. The tool has been released by SiFive under the Apache 2.0 license.
It can be used like this:
freedom-bin2hex.py --bit-width 32 main.bin > main.hex
This tool adds
The mampacker can be used to pack different programs into memory without wasting any space between them and without causing any stack overflows. The crucial part is the placement of the stack pointer. As there is no memory management, the stack could grow into the text section of a program (the instructions), which causes the CPU to crash as it tries to execute the data that is stored there as instructions. A simple remedy would be a generous placement of the stack pointer, but this wastes memory that might be limited. If you think about placing programs into memory, you might want to spread them out, leaving enough space between them to accommodate the stack and changes to the program that cause the .text section to grow. However, this wastes space and the threat of stack overflows is not really addressed. Ideally, each program would start right after the stack pointer of the previous one.
When using simulators, especially the Verilator, vast address spaces might not be a problem, but on an FPGA small memories could allow the usage of a simple BRAM instead of the actual RAM. Furthermore the memory working with the Cadence simulator Incisive is set up to read every word at the beginning of the simulation, which takes a relatively long time.
In order to determine the memory the stack is going to use, a few methods have been explored. Statically determining the stack might seem straightforward, but gets complicated if libraries are linked. Simulators that simulate the RISC-V ISA have been tried but abandoned. As this was very tangential to the actual project, not a lot of time was expended troubleshooting. One of the simulators, Spike, expects the program to start at a specific address, requiring messing with Makefiles that are hard to grasp to begin with. Furthermore I was unable to execute anything. Every attempt resulted in the following message: “terminate called after throwing an instance of ‘trap_load_access_fault'”. Another simulator, rv8, seemed more promising initially, but as far as I can remember it always crashed with a segmentation fault.
As the determination of the optimal stack pointer “only” requires the execution of the code on a CPU and was possible by simulating a PicoRV32 with the Verilator this rather crude approach was chosen.
At first the programs are compiled with a generous stack pointer and executed on a PicoRV32. The simulation environment checks all writes to memory to determine what addresses (or indices if you consider the memory as an array) are used. While referencing the program instructions, the bss and stack sections of the memory are identified and the minimal stack position is calculated. The program is then compiled again with the new stack pointer. After the minimal stack pointers for all programs have been identified, a header file is created that contains all the starting addresses of the programs. The instructions of the programs are also collected in one .hex file where the space between the instructions reserved for the stacks is filled with zeros.
For the controller software a somewhat generous stack pointer is used, as it is hard to determine its stack in this setup. The stack pointer for the controller software can be thought of as an offset to the programs.
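The placement step can be sketched in C as shown below. This is not the code of the tool but a simplified model of the idea that each program starts right after the stack of the previous one. The struct, the function names and the 16-byte alignment are assumptions made for the example; base stands for the space reserved for the controller software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN 16u /* assumption: keep stack pointers 16-byte aligned */

/* Hypothetical description of one program. */
typedef struct {
    uint32_t image_bytes; /* in: instructions, data and bss */
    uint32_t stack_bytes; /* in: deepest stack usage seen in simulation */
    uint32_t start;       /* out: start address in memory */
    uint32_t sp;          /* out: stack pointer relative to start */
} prog_t;

static uint32_t align_up(uint32_t x)
{
    return (x + ALIGN - 1u) & ~(ALIGN - 1u);
}

/* Places programs back to back; returns the first free address after them. */
static uint32_t pack_programs(prog_t *p, size_t n, uint32_t base)
{
    assert(p != NULL || n == 0);

    uint32_t next = align_up(base);
    for (size_t i = 0; i < n; i++) {
        p[i].start = next;
        /* the stack grows downwards from sp towards the end of the image */
        p[i].sp = align_up(p[i].image_bytes + p[i].stack_bytes);
        next += p[i].sp;
    }
    return next;
}
```

For two programs with 0x100 + 0x40 and 0x95 + 0x20 bytes and a base of 0x1000, the start addresses come out as 0x1000 and 0x1140, and the first free address as 0x1200.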
10. FPGA (using Vivado)
how to create a memory file
TODO – create hex to coe python script.
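Until that script exists, the conversion can be sketched as follows (in C here, to match the rest of the software). The sketch assumes that the memory words are 32 bits wide and already available in an array; parsing the .hex file is left out, and format_coe is a made-up name. The output uses the COE keywords memory_initialization_radix and memory_initialization_vector with comma-separated entries and a terminating semicolon.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Formats n 32-bit words as the text of a COE file with radix 16.
 * Returns the number of characters written (without the terminating 0),
 * or -1 if the buffer is too small.
 */
static int format_coe(char *buf, size_t cap, const uint32_t *words, size_t n)
{
    assert(buf != NULL && words != NULL && n > 0);

    int len = snprintf(buf, cap, "memory_initialization_radix=16;\n"
                                 "memory_initialization_vector=\n");
    if (len < 0 || (size_t)len >= cap)
        return -1;

    for (size_t i = 0; i < n; i++) {
        size_t room = cap - (size_t)len;
        /* entries are comma separated, the last one ends with a semicolon */
        int w = snprintf(buf + len, room, "%08" PRIx32 "%c\n",
                         words[i], i + 1 < n ? ',' : ';');
        if (w < 0 || (size_t)w >= room)
            return -1;
        len += w;
    }
    return len;
}
```

The resulting text can be written to a .coe file and selected as the initialization file of a memory IP core.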
Zynq supported boards
The Xilinx SDK can only be used to program the Zynq SoC on the board and not the CPUs of the MPSoC.
10.1. IP Cores
An IP core can be created for each module or for each component. The latter requires less work and results in a clearer block design as fewer boxes have to be connected, but makes debugging using the Integrated Logic Analyzer more difficult.
To package an IP core please follow these steps:
- Create a wrapper that does not contain any interfaces as input or output as this is not allowed by Vivado. There is a wrapper for the AXI offset included you can use as a reference.
- Create a Vivado project and include the following files:
  - The module and the wrapper
  - Every interface used (found in /rtl/interfaces)
  - The define files for the interfaces (found in /configurations/x like “defines_axi.vh”)
- Click on the files and make sure that they are recognized as the correct type under “Type” in “Source File Properties”. Should Vivado complain about assignments, this might be the cause of the issue.
  - The module, wrapper and interfaces should be “System Verilog”
  - The define files should be “Verilog Header”
- Open all the System Verilog files and include the Verilog Headers at the beginning, e.g. `include "defines_axi.vh". If the interfaces are not shown under “Sources” in the “Hierarchy” tab, select the “Library” tab to find them.
- Make sure that every parameter has a default value.
- Synthesize the code to make sure it is working. If the Verilog Headers have not been included correctly, the synthesis might still work. However, you will get an error during the next step.
- Package the IP core with the following recommendation:
  - Remove all the memory mapped stuff under “Addressing and Memory”. Vivado likes to assign everything AXI related a memory space. This should not be needed most of the time.
10.1.1 CONNECT NoC
The CONNECT NoC requires special attention.
- Rename the .hex files to .data.
- Open the mkNetworkSimple.v file where the .hex (now .data) files are read and update the path.
- Create a Vivado project and include all .v and .data files.
- Synthesize the code to make sure it is working.
- Package the IP core.