Neural Chip Design [4/4: SoC Integration & Firmware]
This article is part of a series [overview] outlining the 15-step workflow I developed over the past few years while building my own DNN accelerator: Kraken [arXiv paper].
After building and testing each module and combining them hierarchically, it is time to build an SoC around the design and control it. I used a Xilinx ZC706 development board with a Zynq XC7Z045 chip, which combines a dual-core ARM Cortex-A9 processor with Kintex-7 FPGA fabric on the same silicon die.
The following is an overview of the design. Gray-colored modules are Xilinx IPs. Two soft DMAs pull the input \(\hat{X}\) and weights \(\hat{K}\) from the off-chip DDR and feed them in as two AXI4-Streams, which are then synchronized by the input pipe and presented to the Kraken Engine. The output \(\hat{Y}\) is stored back into the DDR through a third soft DMA. The three soft DMAs are controlled by commands issued by the ARM Cortex core, as dictated by the firmware described below.
11. SoC Block Design: Build the FPGA project manually in Vivado and synthesize it
12. TCL Automation: TCL scripts to automate project building and configuration
13. C++ Firmware: To control the custom modules
14. Hardware Verification: Test on FPGA, compare output to the golden model
15. Repeat 11-14
11. SoC Block Design
I add my custom modules to a Vivado block design, add soft DMAs from the IP catalog, configure them, connect them to my main module, and run block and connection automation, copying down the TCL commands at every step. Below is the final block design, which I first build manually and then reproduce with TCL scripts.
12. TCL Automation
Xilinx Vivado projects are notoriously buggy. They crash once in a while and get corrupted. Vivado also auto-generates hundreds of small files, which contain absolute paths and don't play well with other Vivado versions. It is therefore a bad idea to version-control the project itself.
The best practice is to script the project flow. Once I have manually copied down the TCL commands, I turn them into parameterized code.
I then spend a couple of days debugging the TCL script to ensure it can reliably rebuild the project from scratch. These TCL scripts and the source Verilog files are tracked by git.
13. C++ Firmware
I then write the C++ firmware that runs on the ARM processor and instructs the DMAs to pull data from memory and push the results back. With multiple DMAs involved, this is fairly tricky: right after starting a DMA transfer, the parameters for the next transfer must be calculated in advance, so the DMA never sits idle waiting for the processor.
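The pattern is easiest to see in isolation. Below is a minimal, self-contained sketch; dma_start, dma_wait_done, and params_of are hypothetical stubs standing in for the real driver calls and index arithmetic, not the Xilinx API.

#include <cstdio>

// Hypothetical stand-ins for a non-blocking DMA start and a blocking wait.
struct Params { unsigned addr; unsigned len; };

static void dma_start(const Params &p) { std::printf("start 0x%x (%u B)\r\n", p.addr, p.len); }
static void dma_wait_done() { /* poll the DMA status register in real firmware */ }

// Address/length of packet i; the real computation walks the layer indices.
static Params params_of(int i) { return { 0x10000u + 0x100u * (unsigned)i, 256u }; }

int main()
{
    Params cur = params_of(0);
    for (int i = 0; i < 8; i++)
    {
        dma_start(cur);                  // non-blocking: the burst begins
        Params next = params_of(i + 1);  // computed WHILE the burst runs
        dma_wait_done();                 // block only now, if still busy
        cur = next;                      // restart immediately on "done"
    }
    return 0;
}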
13.1. OOP Wrappers for DMA Drivers
I find the C drivers provided by Xilinx a bit counterintuitive. Therefore, I have written an OOP wrapper for the Xilinx DMA, which is open-sourced here:
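In use, the wrapper looks roughly like the minimal sketch below, built on the standard xaxidma.h polling API. The class and method names (DmaWrapper, s2mm_start, mm2s_start) mirror how it is called later in this article, but treat this as an illustrative reduction under those assumptions, not the actual open-sourced code, which also handles interrupts and completion callbacks.

#include "xaxidma.h"

class DmaWrapper
{
    XAxiDma dma;
public:
    // Look up and initialize the DMA from its xparameters.h device ID.
    int init(u16 device_id)
    {
        XAxiDma_Config *cfg = XAxiDma_LookupConfig(device_id);
        if (cfg == nullptr) return XST_FAILURE;
        return XAxiDma_CfgInitialize(&dma, cfg);
    }
    // S2MM: receive a stream packet into a DDR buffer (non-blocking).
    int s2mm_start(UINTPTR buf, u32 bytes)
    {
        return XAxiDma_SimpleTransfer(&dma, buf, bytes, XAXIDMA_DEVICE_TO_DMA);
    }
    // MM2S: send a DDR buffer out as a stream packet (non-blocking).
    int mm2s_start(UINTPTR buf, u32 bytes)
    {
        return XAxiDma_SimpleTransfer(&dma, buf, bytes, XAXIDMA_DMA_TO_DEVICE);
    }
    bool s2mm_busy() { return XAxiDma_Busy(&dma, XAXIDMA_DEVICE_TO_DMA); }
    bool mm2s_busy() { return XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE); }
};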
13.2. OOP Architecture for DNN models & config bits in C++
The firmware needs to be flexible, so that I can create any DNN by chaining layer objects. For this, I write the Layer class below, with the necessary features, such as computing the configuration bits and appending them to the data.
class Layer
{
public:
    int idx, H_IN, W_IN, C_IN, C_OUT, KH_IN, KW_IN;
    bool IS_NOT_MAX, IS_MAX, IS_LRELU;

    Layer * PREV_P = nullptr;
    Layer * NEXT_P = nullptr;

    int BLOCKS, BLOCKS_PER_ARR;
    u8 MAX_FACTOR, SUB_CORES, EFF_CORES, ITR, COUT_FPGA, COUT_VALID, COUT_INVALID;
    u8 KW_PAD;
    int OUT_W_IN, OUT_BLOCKS, OUT_MAX_FACTOR, OUT_BLOCKS_PER_ARR, OUT_KH;
    int DATA_BEATS_PIXELS;
    int BEATS_LRELU = 0;
    int WORDS_PIXELS_PER_ARR;
    int WORDS_WEIGHTS_PER_ITR, WORDS_WEIGHTS;
    int WORDS_OUT_PER_TRANSFER, TRANSFERS_OUT_PER_ITR;
    int WORDS_OUT_PER_TRANSFER_ARR [3];

    chunk_s * input_chunk_p  = nullptr;
    chunk_s * output_chunk_p = nullptr;

    bool done_write = false;

    Layer (int idx,
           int H_IN, int W_IN, int C_IN, int C_OUT,
           int KH_IN, int KW_IN,
           bool IS_NOT_MAX, bool IS_MAX, bool IS_LRELU):
        idx        (idx),
        H_IN       (H_IN),
        W_IN       (W_IN),
        C_IN       (C_IN),
        C_OUT      (C_OUT),
        KH_IN      (KH_IN),
        KW_IN      (KW_IN),
        IS_NOT_MAX (IS_NOT_MAX),
        IS_MAX     (IS_MAX),
        IS_LRELU   (IS_LRELU)
    {
        BLOCKS         = H_IN / UNITS;
        MAX_FACTOR     = IS_MAX ? 2 : 1;
        BLOCKS_PER_ARR = BLOCKS / MAX_FACTOR;
        KW_PAD         = KW_IN - 2*IS_MAX;

        SUB_CORES    = MEMBERS / KW_IN;
        EFF_CORES    = COPIES * GROUPS * SUB_CORES / MAX_FACTOR;
        ITR          = (int)(std::ceil((float)C_OUT / (float)EFF_CORES));
        COUT_FPGA    = EFF_CORES * ITR;
        COUT_VALID   = C_OUT % EFF_CORES;
        COUT_VALID   = (COUT_VALID == 0) ? EFF_CORES : COUT_VALID;
        COUT_INVALID = EFF_CORES - COUT_VALID;

        /* LRELU BEATS */
        BEATS_LRELU += 1;                     // D
        BEATS_LRELU += std::ceil(2.0/KW_IN);  // A
        for (int clr_i=0; clr_i < KW_IN/2+1; clr_i++){
            int clr = clr_i*2 +1;
            for (int mtb=0; mtb < clr; mtb++){
                int bram_width = MEMBERS/clr;
                int bram_size  = 2*SUB_CORES;
                int BEATS_ij   = std::ceil((float)bram_size/bram_width);
                BEATS_LRELU   += BEATS_ij;
            }
        }

        DATA_BEATS_PIXELS     = BLOCKS_PER_ARR * W_IN * C_IN;
        WORDS_PIXELS_PER_ARR  = DATA_BEATS_PIXELS * UNITS_EDGES;
        WORDS_WEIGHTS_PER_ITR = (S_WEIGHTS_WIDTH/8) + (BEATS_LRELU + C_IN*KH_IN) * COPIES * GROUPS * MEMBERS;
        WORDS_WEIGHTS         = ITR * WORDS_WEIGHTS_PER_ITR;

        if (IS_NOT_MAX && IS_MAX)
        {
            WORDS_OUT_PER_TRANSFER_ARR[0] = SUB_CORES * COPIES * GROUPS * UNITS_EDGES;
            WORDS_OUT_PER_TRANSFER_ARR[1] = COPIES * GROUPS * UNITS_EDGES;
            WORDS_OUT_PER_TRANSFER_ARR[2] = COPIES * GROUPS * UNITS_EDGES / MAX_FACTOR;
            TRANSFERS_OUT_PER_ITR = BLOCKS/MAX_FACTOR * W_IN/MAX_FACTOR * (1 + 2 * SUB_CORES);
        }
        else
        {
            WORDS_OUT_PER_TRANSFER = SUB_CORES * COPIES * GROUPS * UNITS_EDGES / MAX_FACTOR;
            TRANSFERS_OUT_PER_ITR  = BLOCKS/MAX_FACTOR * W_IN/MAX_FACTOR;
        }
    }

    // Write the per-layer configuration bits into the first bytes of the
    // input chunk and flush them from the data cache for the DMA to read.
    void set_config()
    {
        input_chunk_p->data_p[0] = (s8)(IS_NOT_MAX);
        input_chunk_p->data_p[1] = (s8)(IS_MAX);
        input_chunk_p->data_p[2] = (s8)(IS_LRELU);
        input_chunk_p->data_p[3] = (s8)(KH_IN/2);
#ifdef DEBUG
        for (int i=4; i<UNITS_EDGES; i++) input_chunk_p->data_p[i] = 0;
#endif
        Xil_DCacheFlushRange((UINTPTR)input_chunk_p->data_p, UNITS_EDGES);
    }

    void set_out_params()
    {
        /* The next layer can be null (if this is the last layer), or there can
         * be multiple next layers. We are interested in how to arrange this
         * layer's output values to match the next layer's input.
         */
        OUT_W_IN           = W_IN / MAX_FACTOR;
        OUT_BLOCKS         = (H_IN / MAX_FACTOR) / UNITS;
        OUT_MAX_FACTOR     = (NEXT_P == nullptr) ? 1 : NEXT_P->MAX_FACTOR;
        OUT_BLOCKS_PER_ARR = OUT_BLOCKS/OUT_MAX_FACTOR;
        OUT_KH             = (NEXT_P == nullptr) ? KH_IN : NEXT_P->KH_IN;
    }

    // Pixel data starts one UNITS_EDGES-sized config block past the chunk base.
    inline s8* get_input_pixels_base_p()
    {
        return (s8*)(input_chunk_p->data_p) + UNITS_EDGES;
    }
    inline s8* get_output_pixels_base_p()
    {
        return (s8*)(output_chunk_p->data_p) + UNITS_EDGES;
    }
};
auto build_yolo_mod()
{
    std::array<Layer,21> layers = {
        Layer( 1, H_RGB   , W_RGB   ,    3,   32, 3, 3, false, true , true ),
        Layer( 2, H_RGB/2 , W_RGB/2 ,   32,   64, 3, 3, false, true , true ),
        Layer( 3, H_RGB/4 , W_RGB/4 ,   64,  128, 3, 3, true , false, true ),
        Layer( 4, H_RGB/4 , W_RGB/4 ,  128,   64, 1, 1, true , false, true ),
        Layer( 5, H_RGB/4 , W_RGB/4 ,   64,  128, 3, 3, false, true , true ),
        Layer( 6, H_RGB/8 , W_RGB/8 ,  128,  256, 3, 3, true , false, true ),
        Layer( 7, H_RGB/8 , W_RGB/8 ,  256,  128, 1, 1, true , false, true ),
        Layer( 8, H_RGB/8 , W_RGB/8 ,  128,  256, 3, 3, false, true , true ),
        Layer( 9, H_RGB/16, W_RGB/16,  256,  512, 3, 3, true , false, true ),
        Layer(10, H_RGB/16, W_RGB/16,  512,  256, 1, 1, true , false, true ),
        Layer(11, H_RGB/16, W_RGB/16,  256,  512, 3, 3, true , false, true ),
        Layer(12, H_RGB/16, W_RGB/16,  512,  256, 1, 1, true , false, true ),
        Layer(13, H_RGB/16, W_RGB/16,  256,  512, 3, 3, false, true , true ),
        Layer(14, H_RGB/32, W_RGB/32,  512, 1024, 3, 3, true , false, true ),
        Layer(15, H_RGB/32, W_RGB/32, 1024,  512, 1, 1, true , false, true ),
        Layer(16, H_RGB/32, W_RGB/32,  512, 1024, 3, 3, true , false, true ),
        Layer(17, H_RGB/32, W_RGB/32, 1024,  512, 1, 1, true , false, true ),
        Layer(18, H_RGB/32, W_RGB/32,  512, 1024, 3, 3, true , false, true ),
        Layer(19, H_RGB/32, W_RGB/32, 1024, 1024, 3, 3, true , false, true ),
        Layer(20, H_RGB/32, W_RGB/32, 1024, 1024, 3, 3, true , false, true ),
        Layer(21, H_RGB/32, W_RGB/32, 1024,   45, 1, 1, true , false, false)
    };

    for (int i=0; i < N_LAYERS; i++)
    {
        if (i != 0)          layers[i].PREV_P = &layers[i-1];
        if (i != N_LAYERS-1) layers[i].NEXT_P = &layers[i+1];
        layers[i].set_out_params();
    }
    return layers;
}
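With this, the whole network is a std::array of Layer objects. The PREV_P and NEXT_P pointers turn it into a doubly linked list, which is what set_out_params() relies on: each layer reads its successor's MAX_FACTOR and KH_IN to decide how its own output words must be arranged in DDR.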
13.3. C++ Code to control multiple DMAs effectively
Next, I write the C++ functions that reshape the output \(\hat{Y}\) on the fly (after each small DMA packet) to generate the next layer's input \(\hat{X}\). The configuration bits also need to be calculated and appended to each packet to make it complete.
void restart_output()
{
    static int i_w=0, i_w_flipped=0, i_blocks=0, i_bpa=0, i_arr=0, i_cout=0, i_itr=0, i_layers=i_layers_start;
    static volatile s8 * write_p = layers[i_layers].get_output_pixels_base_p();
    static bool is_new_layer = true;
    static volatile s8 * write_p_old = 0;

    Xil_DCacheFlushRange((UINTPTR)write_p_old, UNITS_EDGES);

    if ((i_itr == 0 && i_blocks == 31) || (i_itr == 1 && i_blocks == 0)){
        for (int i=0; i<UNITS_EDGES; i++){
            PRINT(" %d,", write_p_old[i]);
        }
        PRINT("] \r\n");
        PRINT("(%d,%d,%d,%d-%d,:) -- %p [", i_arr, i_bpa, i_w_flipped, i_itr, i_cout, write_p);
    }
    write_p_old = write_p;

    // start transfer
    dma_weights_im_out.s2mm_start((UINTPTR)write_p,
                                  layers[i_layers].WORDS_OUT_PER_TRANSFER);
    pad_prev(i_w_flipped, i_blocks, i_bpa, i_arr, i_cout, i_layers);

    // set config
    if (is_new_layer && i_layers != N_LAYERS-1)
    {
        layers[i_layers].NEXT_P->set_config();
        layers[i_layers].NEXT_P->done_write = false;
        is_new_layer = false;
    }

    // PREPARE NEXT INDICES
    // blocks = 31 (a=1,bpa=15), w_f = 191 (w = 190), itr = 0
    if (i_w < layers[i_layers].OUT_W_IN-1)
    {
        i_w += 1;
        // Flip last KW-1 columns : flipped = 2w-(kw+iw)
        // For max: kw <- kw-2
        if (i_w > layers[i_layers].OUT_W_IN - layers[i_layers].KW_PAD)
            i_w_flipped = 2 * layers[i_layers].OUT_W_IN - (i_w + layers[i_layers].KW_PAD);
        else
            i_w_flipped = i_w;
    }
    else
    {
        i_w = 0;
        i_w_flipped = 0;
        PRINT(" i_blocks: %d, write_p: %p \r\n", i_blocks, write_p);

        if (i_blocks < layers[i_layers].OUT_BLOCKS-1)
        {
            i_blocks += 1;
            i_arr = i_blocks % layers[i_layers].OUT_MAX_FACTOR;
            i_bpa = i_blocks / layers[i_layers].OUT_MAX_FACTOR;
        }
        else
        {
            i_blocks = 0;
            i_arr = 0;
            i_bpa = 0;
            PRINT(" i_itr: %d \r\n", i_itr);

            if (i_itr >= layers[i_layers].ITR-1)
            {
                is_new_layer = true;
                i_itr = 0;
                i_cout = 0;

                if (i_layers < N_LAYERS-1)
                    i_layers += 1;
                else
                {
                    i_layers = 0;
                    done = true;
                    PRINT("All Layers done \r\n");
                }

                /* Chaining */
                if (i_layers == N_LAYERS-1)
                {
                    layers[0].input_chunk_p = &temp_in_chunk;
                    layers[i_layers].output_chunk_p = &temp_out_chunk;
                }
                else
                {
                    layers[i_layers].output_chunk_p = get_chunk();
                    layers[i_layers].NEXT_P->input_chunk_p = layers[i_layers].output_chunk_p;
                }
                PRINT("Writing to new layer: chained_chunks (idx:%d -> idx:%d), data_p= %p \r\n",
                      layers[i_layers].idx, layers[i_layers].NEXT_P->idx,
                      layers[i_layers].output_chunk_p->data_p);
                layers[i_layers].print_output_params();
            }
            else if (i_itr == 0)
            {
                i_itr += 1;
                i_cout = layers[i_layers].COUT_VALID;
            }
            else
            {
                i_itr += 1;
                i_cout += layers[i_layers].EFF_CORES;
            }
        }
    }
    // blocks = 31 (a=1,bpa=15), w_f = 191, itr = 0
    write_p = unravel_image_abwcu(layers[i_layers].get_output_pixels_base_p(),
                                  i_arr, i_bpa, i_w_flipped, i_cout, 0, i_layers);
}
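In short, restart_output() runs once per output packet: it flushes the previous packet from the data cache, restarts the S2MM DMA at the address computed during the previous call, and only then advances the nested indices (column, block, iteration, layer) that determine where the following packet lands. When a layer finishes, the chunks are chained: the completed layer's output chunk becomes the next layer's input chunk, and the next layer's configuration bits are written with set_config() at the start of its first transfer.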
14. Hardware (FPGA) Verification
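I load the bitstream onto the FPGA, run the firmware on the ARM core, and compare the bytes the accelerator writes to DDR against the outputs of the golden model. A minimal sketch of such a byte-for-byte check is below; compare_to_golden is a hypothetical helper for illustration, not code from the project.

#include <cstdio>

// Hypothetical helper: compare the hardware output buffer in DDR against
// the golden output exported from the software model. Prints the first
// few mismatches and returns the total count.
static int compare_to_golden(const signed char *hw, const signed char *golden, int n_bytes)
{
    int mismatches = 0;
    for (int i = 0; i < n_bytes; i++)
    {
        if (hw[i] != golden[i])
        {
            if (mismatches < 10)
                std::printf("mismatch @%d: hw=%d, golden=%d\r\n", i, hw[i], golden[i]);
            mismatches++;
        }
    }
    std::printf("%d / %d bytes differ\r\n", mismatches, n_bytes);
    return mismatches;
}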
15. Repeat 11-14
I spent many weeks repeating steps 11-14 to finally get the hardware outputs to match the golden model, and hence the original DNNs. Once, I spent a month hunting a bug where the system worked perfectly in randomized simulations, yet just 6 bytes out of 4 million came out wrong on hardware. It finally turned out to be a bug in Vivado's compiler.