Aba's Blog

Over two years ago, most of us (including myself) rooted #WeWantGota. A fair percentage of us were legit racists. Most of us were “I’m not a racist, but”. Many believed democracy is inefficient & Sri...

GitHub - abarajithan11/cocotb-example

CocoTB: FPGA/ASIC Testbenches in Python + Automated Testing in GitHub

Abarajithan Gnaneswaran — Mon, 04 Apr 2022 15:52:02 GMT

Verilog and VHDL are quite good for designing digital circuits. But it is often painful to write Verilog testbenches to simulate them. If you disagree, try writing a neural network or a multi-dimensional Fourier Transform in SystemVerilog.

I recently came across cocotb, a popular python-based alternative to SystemVerilog testbenches. It doesn’t do the simulation itself (a good thing), but interfaces with widely used simulators such as iVerilog, Verilator, Synopsys VCS, Cadence Incisive, and more. You just write your clock-cycle accurate testbenches in python, specify a simulator and run. You can dump and view the waveforms in GTK Wave.

GTK Wave, showing the output from Icarus Verilog Simulator

A simple register in SystemVerilog, python testbench, pytest setup for multiple parameters

This means you can use NumPy, SciPy, TensorFlow, PyTorch… the entire python arsenal to build your golden models, generate test vectors and compare with outputs. PDB, a powerful python debugger is also available at your disposal.

Wait, it gets better. Cocotb-test, another library uses PyTest to test your designs across several sets of parameters automatically. This whole thing can then be automated into regression tests in a CI/CD pipeline. Whenever a commit is pushed into the GitHub repository or a branch is merged into master, a set of integration testbenches run over several parameters to ensure your design stays clean.

GitHub actions setup, Automated test results, Failed test example

For more details, check out this talk at CERN & UPenn. They use cocotb to verify the chips that go into the Large Hadron Collider.

CERN + UPenn Talk

In the following repository, I have built two designs: a simple register and an AXI-Stream FIFO, built parameterized testbenches, pytest to run over multiple parameters, and set up GitHub Actions to run them on push to the repo. Feel free to check them out.

Contribute to abarajithan11/cocotb-example development by creating an account on GitHub.

GitHubabarajithan11

In a nutshell, cocotb brings the joy of the software world into the dark and desolate world of hardware verification.

First posted on Facebook:

FPGA/ASIC Testbenches in Python + Automated Testing in GitHubVerilog and VHDL are quite good for designing digital circuits. But it is often painful to write Verilog testbenches to simulate them....

GitHub - abarajithan11/zynq-oop-drivers: Object oriented drivers for Xilinx IPs (DMA...) for ZYNQ PSoC

Upper Gartmore Camping [2022]

Abarajithan Gnaneswaran — Sun, 27 Feb 2022 02:36:00 GMT

Gartmore & Frogmore are gorgeous tea estates near the Maskeliya Reservoir, next to Hatton. Just above them, on top of a mountain range between Ratnapura and Hatton, there is a relatively flat, wide plain of thick bushes, with a stream flowing through them, to fall over a cliff as the Upper Gartmore falls.

I took an 8-hour bus from Jaffna to Kandy, stayed at a backpacker's hostel there, met my school friends for a dinner. I slept for 2 hours, took the bus to Hatton at 3.30 am, and reached there at 5.30 am. We took a van and reached Gartmore estate around 11 am.

Hike through Frogmore Estate

Maskeliya Reservoir

Unfortunately, the supervisor of the tea estate does not allow tourists to reach the top of the mountain on a vehicle. We started hiking through the forest, lost track and ended up in a Hindu kovil as it started raining heavily. When the rain subsided, we asked a local to guide us. He took us through the 6-hour strenuous hike through thick jungle, at times climbing near vertically. We walked through thick, leech-infested undergrowth, with barely visible & perilous steep fall on one side to finally reach the flat camping grounds.

Camping Grounds

After reaching the gorgeous camping grounds at around 4 pm, we set up the camp, collected firewood, and washed off our legs oozing with blood. The sunset was amazing:

Sunset

Camping

We had bought a lot of food: Noodles, sausages, coffee, chips, biscuits, red bull... Dinner was divine. Few of us set up our DSLR cameras to shoot the milky way and the starry sky.

Upper Gartmore Falls

Sripada & Maskeliya Reservoir visible at once

The next morning, we walked through the stream to the Upper Gartmore falls, where we got this scenic view of Adam's peak (Sripada) and Maskeliya Reservoir.

When climbing down, we were fortunate to get a three-wheeler to a point after walking for an hour. We reached Hatton and then Colombo.

Neural Chip Design [4/4: SoC Integration & Firmware]

Abarajithan Gnaneswaran — Sat, 29 Jan 2022 10:37:39 GMT

This is a series of articles [overview] outlining the workflow of 15 steps, which I developed over the past few years through building my own DNN accelerator: Kraken [arXiv paper].

After building and testing each module and combining them hierarchically, it is time to build an SoC around it and control it. I used a Xilinx Zynq Z706 development board with a Z-4045 chip, which has an ARM Cortex processor and a Kintex FPGA on the same silicon die.

The following is the overview of the design. Gray-colored modules are Xilinx IPs. Two soft DMAs pull input $\hat{X}$ and weight $\hat{K}$ from the off-chip DDR and feed as two AXI4-Streams, which are then synchronized by the input pipe and provided to the Kraken Engine. The output $\hat{Y}$ is stored back into the DDR through another soft DMA. The three soft DMAs are controlled by commands issued by the ARM Cortex core, as dictated by the firmware which I then developed.

SoC Block Design: Build FPGA projects with Vivado manually and synthesize
Automation: TCL scripts to automate the project building and configuration
C++ Firmware: To control the custom modules
Hardware Verification: Test on FPGA, compare output to golden model
Repeat 11-14

11. SoC Block Design

I add my custom modules to a Vivado block design, add soft DMAs from the IP catalog, configure them, connect them to my main module, run block & connection automation, copying down TCL commands at every step. Below is the final block design I get, first manually, then automating it with TCL scripts.

12. TCL Automation

Xilinx Vivado projects are notoriously buggy. They crash once in a while and get corrupted. Vivado also auto-generates hundreds of small files, which contain absolute paths, and don't play well with a different Vivado version. Therefore, it is a bad idea to version control them.

The best practice is to script the project flow. Once I manually copy down the TCL commands, I change them into parameterized code.

set TUSER_WIDTH_MAXPOOL_IN     [expr $BITS_KW2      + $I_KW2]
set TUSER_WIDTH_LRELU_IN       [expr $BITS_KW       + $I_CLR]
set TUSER_CONV_DW_BASE         [expr 1 + $I_IS_BOTTOM_BLOCK ]
set TUSER_CONV_DW_IN           [expr $MEMBERS*$BITS_KW + $BITS_OUT_SHIFT + $BITS_MEMBERS + $TUSER_CONV_DW_BASE]
set TUSER_WIDTH_LRELU_FMA_1_IN [expr 1         + $I_IS_LRELU]
set TUSER_WIDTH_CONV_IN        [expr $I_IS_SUM_START     + 1]

set DEBUG_CONFIG_WIDTH_W_ROT   [expr 1 + 2*$BITS_KW2 + 3*($BITS_KH2      + $BITS_IM_CIN + $BITS_IM_COLS + $BITS_IM_BLOCKS)]
set DEBUG_CONFIG_WIDTH_IM_PIPE [expr 3 + 2 + $BITS_KH2      + 0]
set DEBUG_CONFIG_WIDTH_LRELU   [expr 3 + 4 + $BITS_FMA_2]
set DEBUG_CONFIG_WIDTH_MAXPOOL 1
set DEBUG_CONFIG_WIDTH         [expr $DEBUG_CONFIG_WIDTH_MAXPOOL + $DEBUG_CONFIG_WIDTH_LRELU + 2*$BITS_KH2      + $DEBUG_CONFIG_WIDTH_IM_PIPE + $DEBUG_CONFIG_WIDTH_W_ROT]

# **********    STORE PARAMS    *************
set file_param [open $RTL_DIR/include/params.v w]
if {$MAC_TYPE == "XILINX"} {
  puts $file_param "`define MAC_XILINX 1"
}
if {$REG_TYPE == "ASIC"} {
  puts $file_param "`define ASIC_REG 1"
}
puts $file_param "
`define SRAM_TYPE   \"$SRAM_TYPE\"  
`define MAC_TYPE    \"$MAC_TYPE\"  
`define UNITS    $UNITS  
`define GROUPS   $GROUPS 
`define COPIES   $COPIES 
`define MEMBERS  $MEMBERS
`define DW_FACTOR_1 $DW_FACTOR_1
`define OUTPUT_MODE \"$OUTPUT_MODE\"
`define KSM_COMBS_EXPR $KSM_COMBS_EXPR
`define KS_COMBS_EXPR $KS_COMBS_EXPR

Part of the tcl script that calculates parameters and writes the params.v file

I then spend a couple of days debugging the TCL script to ensure it can reliably rebuild a project from scratch. These TCL scripts and the source Verilog files are tracked by git.

create_bd_design "sys"

# ZYNQ IP
create_bd_cell -type ip -vlnv xilinx.com:ip:processing_system7:5.5 processing_system7_0
apply_bd_automation -rule xilinx.com:bd_rule:processing_system7 -config {make_external "FIXED_IO, DDR" apply_board_preset "1" Master "Disable" Slave "Disable" }  [get_bd_cells processing_system7_0]
set_property -dict [list CONFIG.PCW_FPGA0_PERIPHERAL_FREQMHZ $FREQ_LITE CONFIG.PCW_FPGA1_PERIPHERAL_FREQMHZ $FREQ_LOW CONFIG.PCW_FPGA2_PERIPHERAL_FREQMHZ $FREQ_HIGH CONFIG.PCW_EN_CLK2_PORT $FREQ_LITE CONFIG.PCW_USE_S_AXI_ACP {1} CONFIG.PCW_USE_DEFAULT_ACP_USER_VAL {1} CONFIG.PCW_USE_FABRIC_INTERRUPT {1} CONFIG.PCW_EN_CLK1_PORT {1} CONFIG.PCW_EN_CLK2_PORT {1} CONFIG.PCW_IRQ_F2P_INTR {1} CONFIG.PCW_QSPI_PERIPHERAL_ENABLE {0} CONFIG.PCW_SD0_PERIPHERAL_ENABLE {0} CONFIG.PCW_I2C0_PERIPHERAL_ENABLE {0} CONFIG.PCW_GPIO_MIO_GPIO_ENABLE {0}] [get_bd_cells processing_system7_0]

# Accelerator
create_bd_cell -type module -reference axis_accelerator axis_accelerator_0

# Weights & out DMA
set IP_NAME "dma_weights_im_out"
create_bd_cell -type ip -vlnv xilinx.com:ip:axi_dma:7.1 $IP_NAME
set_property -dict [list CONFIG.c_include_sg {0} CONFIG.c_sg_length_width {26} CONFIG.c_m_axi_mm2s_data_width {32} CONFIG.c_m_axis_mm2s_tdata_width {32} CONFIG.c_mm2s_burst_size {8} CONFIG.c_sg_include_stscntrl_strm {0} CONFIG.c_include_mm2s_dre {1} CONFIG.c_m_axi_mm2s_data_width $S_WEIGHTS_WIDTH_LF CONFIG.c_m_axis_mm2s_tdata_width $S_WEIGHTS_WIDTH_LF CONFIG.c_m_axi_s2mm_data_width $M_DATA_WIDTH_LF CONFIG.c_s_axis_s2mm_tdata_width $M_DATA_WIDTH_LF CONFIG.c_include_s2mm_dre {1} CONFIG.c_s2mm_burst_size {16}] [get_bd_cells $IP_NAME]

# Im_in_1
set IP_NAME "dma_im_in"
create_bd_cell -type ip -vlnv xilinx.com:ip:axi_dma:7.1 $IP_NAME
set_property -dict [list CONFIG.c_include_sg {0} CONFIG.c_sg_length_width {26} CONFIG.c_m_axi_mm2s_data_width [expr $S_PIXELS_WIDTH_LF] CONFIG.c_m_axis_mm2s_tdata_width [expr $S_PIXELS_WIDTH_LF] CONFIG.c_include_mm2s_dre {1} CONFIG.c_mm2s_burst_size {64} CONFIG.c_include_s2mm {0}] [get_bd_cells $IP_NAME]

# # Interrupts
create_bd_cell -type ip -vlnv xilinx.com:ip:xlconcat:2.1 xlconcat_0
set_property -dict [list CONFIG.NUM_PORTS {4}] [get_bd_cells xlconcat_0]
connect_bd_net [get_bd_pins dma_im_in/mm2s_introut] [get_bd_pins xlconcat_0/In0]
connect_bd_net [get_bd_pins dma_weights_im_out/mm2s_introut] [get_bd_pins xlconcat_0/In2]
connect_bd_net [get_bd_pins dma_weights_im_out/s2mm_introut] [get_bd_pins xlconcat_0/In3]
connect_bd_net [get_bd_pins xlconcat_0/dout] [get_bd_pins processing_system7_0/IRQ_F2P]

# Engine connections
connect_bd_net [get_bd_pins processing_system7_0/FCLK_CLK1] [get_bd_pins axis_accelerator_0/aclk]
connect_bd_net [get_bd_pins processing_system7_0/FCLK_CLK2] [get_bd_pins axis_accelerator_0/hf_aclk]
connect_bd_intf_net [get_bd_intf_pins dma_im_in/M_AXIS_MM2S] [get_bd_intf_pins axis_accelerator_0/s_axis_pixels]
connect_bd_intf_net [get_bd_intf_pins dma_weights_im_out/M_AXIS_MM2S] [get_bd_intf_pins axis_accelerator_0/s_axis_weights]
switch $OUTPUT_MODE {
  "CONV"    {connect_bd_intf_net [get_bd_intf_pins dma_weights_im_out/S_AXIS_S2MM] [get_bd_intf_pins axis_accelerator_0/conv_dw2_lf_m_axis]}
  "LRELU"   {connect_bd_intf_net [get_bd_intf_pins dma_weights_im_out/S_AXIS_S2MM] [get_bd_intf_pins axis_accelerator_0/lrelu_dw_lf_m_axis]}
  "MAXPOOL" {connect_bd_intf_net [get_bd_intf_pins dma_weights_im_out/S_AXIS_S2MM] [get_bd_intf_pins axis_accelerator_0/max_dw2_lf_m_axis ]}
}

# AXI Lite
startgroup
apply_bd_automation -rule xilinx.com:bd_rule:axi4 -config { Clk_master {/processing_system7_0/FCLK_CLK0 ($FREQ_LITE MHz)} Clk_slave {/processing_system7_0/FCLK_CLK0 ($FREQ_LITE MHz)} Clk_xbar {/processing_system7_0/FCLK_CLK0 ($FREQ_LITE MHz)} Master {/processing_system7_0/M_AXI_GP0} Slave {/dma_im_in/S_AXI_LITE} ddr_seg {Auto} intc_ip {New AXI Interconnect} master_apm {0}}  [get_bd_intf_pins dma_im_in/S_AXI_LITE]
apply_bd_automation -rule xilinx.com:bd_rule:axi4 -config { Clk_master {/processing_system7_0/FCLK_CLK0 ($FREQ_LITE MHz)} Clk_slave {/processing_system7_0/FCLK_CLK0 ($FREQ_LITE MHz)} Clk_xbar {/processing_system7_0/FCLK_CLK0 ($FREQ_LITE MHz)} Master {/processing_system7_0/M_AXI_GP0} Slave {/dma_weights_im_out/S_AXI_LITE} ddr_seg {Auto} intc_ip {New AXI Interconnect} master_apm {0}}  [get_bd_intf_pins dma_weights_im_out/S_AXI_LITE]
endgroup

# AXI4
startgroup
apply_bd_automation -rule xilinx.com:bd_rule:axi4 -config { Clk_master {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Clk_slave {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Clk_xbar {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Master {/dma_weights_im_out/M_AXI_MM2S} Slave {/processing_system7_0/S_AXI_ACP} ddr_seg {Auto} intc_ip {New AXI Interconnect} master_apm {0}}  [get_bd_intf_pins processing_system7_0/S_AXI_ACP]
apply_bd_automation -rule xilinx.com:bd_rule:axi4 -config { Clk_master {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Clk_slave {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Clk_xbar {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Master {/dma_im_in/M_AXI_MM2S} Slave {/processing_system7_0/S_AXI_ACP} ddr_seg {Auto} intc_ip {/axi_mem_intercon} master_apm {0}}  [get_bd_intf_pins dma_im_in/M_AXI_MM2S]
apply_bd_automation -rule xilinx.com:bd_rule:axi4 -config { Clk_master {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Clk_slave {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Clk_xbar {/processing_system7_0/FCLK_CLK1 ($FREQ_LOW MHz)} Master {/dma_weights_im_out/M_AXI_S2MM} Slave {/processing_system7_0/S_AXI_ACP} ddr_seg {Auto} intc_ip {/axi_mem_intercon} master_apm {0}}  [get_bd_intf_pins dma_weights_im_out/M_AXI_S2MM]
endgroup

# HF Reset
set IP_NAME "reset_hf"
create_bd_cell -type ip -vlnv xilinx.com:ip:proc_sys_reset:5.0 $IP_NAME
connect_bd_net [get_bd_pins processing_system7_0/FCLK_CLK2] [get_bd_pins $IP_NAME/slowest_sync_clk]
connect_bd_net [get_bd_pins processing_system7_0/FCLK_RESET0_N] [get_bd_pins $IP_NAME/ext_reset_in]
connect_bd_net [get_bd_pins $IP_NAME/peripheral_aresetn] [get_bd_pins axis_accelerator_0/hf_aresetn]

# LF Reset
# NOTE: axi_mem_intercon gets created after axi lite
connect_bd_net [get_bd_pins axis_accelerator_0/aresetn] [get_bd_pins axi_mem_intercon/ARESETN]

save_bd_design
validate_bd_design

generate_target all [get_files  $PROJ_FOLDER/$PROJ_NAME.srcs/sources_1/bd/sys/sys.bd]
make_wrapper -files [get_files $PROJ_FOLDER/$PROJ_NAME.srcs/sources_1/bd/sys/sys.bd] -top
add_files -norecurse $PROJ_FOLDER/$PROJ_NAME.gen/sources_1/bd/sys/hdl/sys_wrapper.v
set_property top sys_wrapper [current_fileset]

TCL file that builds the project.

13. C++ Firmware

I then write the C++ code to be run on the ARM processor, which instructs the DMA to pull data from memory and push it back. When multiple DMAs are involved, this is fairly tricky. Right after starting a DMA operation, the parameters for the next DMA iteration must be calculated in advance, to prevent stalling the DMA.

13.1. OOP Wrappers for DMA Drivers

I find the C code provided by Xilinx a bit counterintuitive. Therefore, I have written an OOP wrapper for the Xilinx DMA, which is open-sourced here:

Object oriented drivers for Xilinx IPs (DMA...) for ZYNQ PSoC - GitHub - abarajithan11/zynq-oop-drivers: Object oriented drivers for Xilinx IPs (DMA...) for ZYNQ PSoC

GitHubabarajithan11

13.2. OOP Architecture for DNN models & config bits in C++

The firmware needs to be flexible, such that I can create any DNN by chaining layer objects. For this, I write the layer class, with necessary features like extracting configuration bits and appending to data.

class Layer
{
public:
	 int idx, H_IN, W_IN, C_IN, C_OUT, KH_IN, KW_IN;
	 bool IS_NOT_MAX, IS_MAX, IS_LRELU;

	 Layer * PREV_P = nullptr;
	 Layer * NEXT_P = nullptr;

	 int BLOCKS, BLOCKS_PER_ARR;
	 u8 MAX_FACTOR, SUB_CORES, EFF_CORES, ITR, COUT_FPGA, COUT_VALID, COUT_INVALID;
	 u8 KW_PAD;

	 int OUT_W_IN, OUT_BLOCKS, OUT_MAX_FACTOR, OUT_BLOCKS_PER_ARR, OUT_KH;

	 int DATA_BEATS_PIXELS;
	 int BEATS_LRELU = 0;
	 int WORDS_PIXELS_PER_ARR;
	 int WORDS_WEIGHTS_PER_ITR, WORDS_WEIGHTS;

	 int WORDS_OUT_PER_TRANSFER, TRANSFERS_OUT_PER_ITR;
	 int WORDS_OUT_PER_TRANSFER_ARR [3];

	 chunk_s * input_chunk_p  = nullptr;
	 chunk_s * output_chunk_p = nullptr;
	 bool done_write = false;

	 Layer ( int idx,
			 int H_IN, int W_IN, int C_IN, int C_OUT,
			 int KH_IN, int KW_IN,
			 bool IS_NOT_MAX, bool IS_MAX, bool IS_LRELU):
					 idx    (idx),
					 H_IN   (H_IN),
					 W_IN   (W_IN),
					 C_IN   (C_IN),
					 C_OUT  (C_OUT),
					 KH_IN   (KH_IN),
					 KW_IN   (KW_IN),
					 IS_NOT_MAX(IS_NOT_MAX),
					 IS_MAX    (IS_MAX),
					 IS_LRELU  (IS_LRELU)
	 {
		 BLOCKS     = H_IN / UNITS;
		 MAX_FACTOR = IS_MAX ? 2 : 1;
		 BLOCKS_PER_ARR = BLOCKS / MAX_FACTOR;

		 KW_PAD = KW_IN - 2*IS_MAX;

		 SUB_CORES = MEMBERS / KW_IN;
		 EFF_CORES = COPIES * GROUPS * SUB_CORES / MAX_FACTOR;
		 ITR       = (int)(std::ceil((float)C_OUT / (float)EFF_CORES));
		 COUT_FPGA = EFF_CORES * ITR;

		 COUT_VALID = C_OUT % EFF_CORES;
		 COUT_VALID = (COUT_VALID == 0) ? EFF_CORES : COUT_VALID;

		 COUT_INVALID = EFF_CORES - COUT_VALID;

		 /* LRELU BEATS */

		 BEATS_LRELU += 1; //D
		 BEATS_LRELU += ceil(2.0/KW_IN); // A

		 for (int clr_i=0;  clr_i < KW_IN/2+1; clr_i ++){
			 int clr = clr_i*2 +1;
			 for (int mtb=0;  mtb < clr; mtb ++){
				 int bram_width = MEMBERS/clr;
				 int bram_size  = 2*SUB_CORES;
				 int BEATS_ij = ceil((float)bram_size/bram_width);

				 BEATS_LRELU += BEATS_ij;
			 }
		 }

		 DATA_BEATS_PIXELS = BLOCKS_PER_ARR * W_IN * C_IN;

		 WORDS_PIXELS_PER_ARR  =      DATA_BEATS_PIXELS  * UNITS_EDGES;
		 WORDS_WEIGHTS_PER_ITR = (S_WEIGHTS_WIDTH/8) + (BEATS_LRELU + C_IN*KH_IN) * COPIES * GROUPS * MEMBERS;
		 WORDS_WEIGHTS         = ITR * WORDS_WEIGHTS_PER_ITR;

		 if (IS_NOT_MAX && IS_MAX)
		 {
			 WORDS_OUT_PER_TRANSFER_ARR[0] = SUB_CORES * COPIES * GROUPS * UNITS_EDGES;
			 WORDS_OUT_PER_TRANSFER_ARR[1] =             COPIES * GROUPS * UNITS_EDGES;
			 WORDS_OUT_PER_TRANSFER_ARR[2] =             COPIES * GROUPS * UNITS_EDGES / MAX_FACTOR;

			 TRANSFERS_OUT_PER_ITR = BLOCKS/MAX_FACTOR * W_IN/MAX_FACTOR * (1 + 2 * SUB_CORES);
		 }
		 else
		 {
			 WORDS_OUT_PER_TRANSFER = SUB_CORES * COPIES * GROUPS * UNITS_EDGES / MAX_FACTOR;
			 TRANSFERS_OUT_PER_ITR  = BLOCKS/MAX_FACTOR * W_IN/MAX_FACTOR;
		 }
	 };

	 void set_config()
	 {
		 input_chunk_p->data_p[0] = (s8)(IS_NOT_MAX);
		 input_chunk_p->data_p[1] = (s8)(IS_MAX);
		 input_chunk_p->data_p[2] = (s8)(IS_LRELU);
		 input_chunk_p->data_p[3] = (s8)(KH_IN/2);

#ifdef DEBUG
		 for (int i=4; idata_p[i] = 0;
#endif
		 Xil_DCacheFlushRange((UINTPTR)input_chunk_p->data_p, UNITS_EDGES);
	 };

	 void set_out_params()
	 {
		 /* Next layer can be null (if this is last) or can have multiple next layers.
		  * We are interested in how to arrange the output values of this, to match the next
		  */

		 OUT_W_IN   = W_IN / MAX_FACTOR;
		 OUT_BLOCKS = (H_IN / MAX_FACTOR) / UNITS;

		 OUT_MAX_FACTOR     = (NEXT_P == nullptr) ? 1 : NEXT_P->MAX_FACTOR;
		 OUT_BLOCKS_PER_ARR = OUT_BLOCKS/OUT_MAX_FACTOR;

		 OUT_KH = (NEXT_P == nullptr) ? KH_IN : NEXT_P->KH_IN;
	 }

	 inline s8* get_input_pixels_base_p()
	 {
		 return (s8*)(input_chunk_p->data_p) + UNITS_EDGES;
	 }
	 inline s8* get_output_pixels_base_p()
	 {
		 return (s8*)(output_chunk_p->data_p) + UNITS_EDGES;
	 }
};

auto build_yolo_mod()
{
	std::array layers = {
		Layer(1,	H_RGB   ,W_RGB  ,    3,  32,   3,   3,false, true, true),
		Layer(2,	H_RGB/2 ,W_RGB/2,   32,  64,   3,   3,false, true, true),
		Layer(3,	H_RGB/4 ,W_RGB/4,   64, 128,   3,   3,true, false, true),
		Layer(4,	H_RGB/4 ,W_RGB/4,  128,  64,   1,   1,true, false, true),
		Layer(5,	H_RGB/4 ,W_RGB/4,   64, 128,   3,   3,false, true, true),
		Layer(6,	H_RGB/8 ,W_RGB/8,  128, 256,   3,   3,true, false, true),
		Layer(7,	H_RGB/8 ,W_RGB/8,  256, 128,   1,   1,true, false, true),
		Layer(8,	H_RGB/8 ,W_RGB/8,  128, 256,   3,   3,false, true, true),
		Layer(9,	H_RGB/16,W_RGB/16, 256, 512,   3,   3,true, false, true),
		Layer(10,	H_RGB/16,W_RGB/16, 512, 256,   1,   1,true, false, true),
		Layer(11,	H_RGB/16,W_RGB/16, 256, 512,   3,   3,true, false, true),
		Layer(12,	H_RGB/16,W_RGB/16, 512, 256,   1,   1,true, false, true),
		Layer(13,	H_RGB/16,W_RGB/16, 256, 512,   3,   3,false, true, true),
		Layer(14,	H_RGB/32,W_RGB/32, 512,1024,   3,   3,true, false, true),
		Layer(15,	H_RGB/32,W_RGB/32,1024, 512,   1,   1,true, false, true),
		Layer(16,	H_RGB/32,W_RGB/32, 512,1024,   3,   3,true, false, true),
		Layer(17,	H_RGB/32,W_RGB/32,  64, 128,1024, 512,true, false, true),
		Layer(18,	H_RGB/32,W_RGB/32,  64, 128, 512,1024,true, false, true),
		Layer(19,	H_RGB/32,W_RGB/32,1024,1024,   3,   3,true, false, true),
		Layer(20,	H_RGB/32,W_RGB/32,1024,1024,   3,   3,true, false, true),
		Layer(21,	H_RGB/32,W_RGB/32,1024,  45,   1,   1,true, false, false)
	};

	for (int i=0; i < N_LAYERS; i++)
	{
		if (i!=0         ) layers[i].PREV_P = &layers[i-1];
		if (i!=N_LAYERS-1) layers[i].NEXT_P = &layers[i+1];
		layers[i].set_out_params();
	}
	return layers;
}

13.3. C++ Code to control multiple DMAs effectively

Next, I write C++ functions to reshape the output (\hat{Y}\) on the fly (after each small DMA packet) to generate the next layers input $\hat{X}$. Also, configuration bits need to be calculated and appended to the packet to make it complete.

void restart_output()
{
	static int i_w=0, i_w_flipped=0, i_blocks=0, i_bpa=0, i_arr=0, i_cout=0, i_itr=0, i_layers=i_layers_start;
	static volatile s8 * write_p = layers[i_layers].get_output_pixels_base_p();
	static bool is_new_layer=true;

    static volatile s8 * write_p_old = 0;
	Xil_DCacheFlushRange((UINTPTR)write_p_old, UNITS_EDGES);

	if ((i_itr == 0 && i_blocks == 31) || (i_itr == 1 && i_blocks == 0)){
		for (int i=0; iset_config();
		layers[i_layers].NEXT_P->done_write = false;
		is_new_layer = false;
	}

	// PREPARE NEXT INDICES
	// blocks = 31 (a=1,bpa=15), w_f = 191 (w = 190), itr = 0
	if (i_w < layers[i_layers].OUT_W_IN-1)
	{
		i_w += 1;
		// Flip last KW-1 columns : flipped = 2w-(kw+iw)
		// For max: kw <- kw-2
		if (i_w > layers[i_layers].OUT_W_IN - layers[i_layers].KW_PAD)
			i_w_flipped = 2 * layers[i_layers].OUT_W_IN - (i_w + layers[i_layers].KW_PAD);
		else
			i_w_flipped = i_w;
	}
	else
	{
		i_w = 0;
		i_w_flipped = 0;

		 PRINT(" i_blocks: %d, write_p: %p \r\n", i_blocks, write_p);

		if (i_blocks < layers[i_layers].OUT_BLOCKS-1)
		{
			i_blocks  += 1;
			i_arr      = i_blocks % layers[i_layers].OUT_MAX_FACTOR;
			i_bpa      = i_blocks / layers[i_layers].OUT_MAX_FACTOR;
		}
		else
		{
			i_blocks   = 0;
			i_arr      = 0;
			i_bpa      = 0;

			 PRINT(" i_itr: %d \r\n", i_itr);

			if (i_itr >= layers[i_layers].ITR-1)
			{
				is_new_layer = true;
				i_itr = 0;
				i_cout= 0;

				if (i_layers < N_LAYERS-1)
					i_layers += 1;
				else
				{
					i_layers = 0;
					done = true;
					PRINT("All Layers done \r\n");
				}

				/* Chaining*/
				if (i_layers == N_LAYERS-1)
				{
					layers[0].input_chunk_p = &temp_in_chunk;
					layers[i_layers].output_chunk_p = &temp_out_chunk;
				}
				else
				{
					layers[i_layers].output_chunk_p = get_chunk();
					layers[i_layers].NEXT_P->input_chunk_p = layers[i_layers].output_chunk_p;
				}
				PRINT("Writing to new layer: chained_chunks (idx:%d -> idx:%d), data_p= %p \r\n",
						    layers[i_layers].idx, layers[i_layers].NEXT_P->idx,
							layers[i_layers].output_chunk_p->data_p);

				layers[i_layers].print_output_params();
			}
			else if (i_itr == 0)
			{
				i_itr += 1;
				i_cout = layers[i_layers].COUT_VALID;
			}
			else
			{
				i_itr  += 1;
				i_cout += layers[i_layers].EFF_CORES;
			}
		}
	}
	// blocks = 31 (a=1,bpa=15), w_f = 191, itr = 0
	write_p = unravel_image_abwcu(layers[i_layers].get_output_pixels_base_p(),
								  i_arr,i_bpa,i_w_flipped,i_cout,0, i_layers);
}

14. Hardware (FPGA) Verification

Hardware testing.

15. Repeat 11-14

I spent weeks or months repeating 11-14, to finally get the hardware outputs to match the golden model, and hence the original DNNs. Once I spent a month figuring out a bug where the system worked perfectly in randomized simulations but had wrong values for just 6 bytes out of 4 million bytes. Finally, I found it's a bug in Vivado's compiler.

Neural Chip Design [3/4: RTL Design & Verification]

Abarajithan Gnaneswaran — Sat, 29 Jan 2022 10:36:04 GMT

This is a series of articles [overview] outlining the workflow of 15 steps, which I developed over the past few years through building my own DNN accelerator: Kraken [arXiv paper].

After building golden models and understanding the operations, its time to design and implement digital circuits that can accelerate those operations with:

low on-chip area
high fmax (hence short critical paths)
minimal multiplexers, registers, and SRAM usage

For this, I first design my modules in detail, on a whiteboard. I spend days or weeks doing this: optimizing designs and mapping out state machines for each module. Once I'm satisfied. I sit with VSCode and start writing synthesizable RTL. Once I'm done, I generate test vectors from the golden model, write testbenches to read and compare them and start debugging.

Steps:

Whiteboard: Design hardware
RTL Design: SystemVerilog/Verilog for the whiteboard designs
Generate Test Vectors: using Python Notebooks
Testbenches: SystemVerilog OOP testbenches to read the input vector (txt file), randomly control the valid & ready signals and get output vectors (txt files)
Debug: Python notebooks to compare the expected output with simulation output and to find which dimensions have errors.
Microsoft Excel: I manually simulate the values in wires with excel to debug
Repeat 3-8: For every module & every level of integration
ASIC Synthesis

3. Whiteboard

I almost always design my modules fully on a whiteboard before sitting down to write RTL. This helps to map out almost every register and multiplexer, get an idea of the critical paths, and also to reduce bugs.

4. RTL Design

SystemVerilog / Verilog

I then start writing modules, state-machines...etc. using synthesizable SystemVerilog, converting my whiteboard drawings into code. This is fairly straightforward. I've given a stripped-down example code of my conv engine. The key things to note are:

SystemVerilog - Verilog lacks a lot of features and has the potential to cause serious bugs. SystemVerilog is beautiful and a breeze to write and read.
Multidimensional wires and ports - I use them a lot, to group ports meaningfully and connect with each other. I prefer packed over unpacked, so multidimensional SystemVerilog ports can be seamlessly connected to Verilog wrappers with flat ports, without having to manually flatten them.
Readability - I take this seriously. Order does not matter in HDL, but I write in a way that the code top to bottom corresponds to left to right in my whiteboard diagram (the way signal flows).
Macro Parameters - I put all the parameters, derived parameters in a common file and include it in all modules. That file itself is written through a tcl script. This way, the parameters of all files are guaranteed to be the same, avoiding bugs and also making the code readable.
No always@clk: This might be a surprise. In my entire synthesizable codebase of 9000 lines, I have only one sequential always block: in a parametrized module named register.v. The register module has optional clken, different types of reset...etc. All other modules instantiate this whenever needed. This helps me to avoid bugs, and to visualize the signal flow, as it directly translates from my whiteboard to code.
FPGA & ASIC - Using preprocessor directives (`ifdef), I write code to suit both FPGA and ASIC. Registers have async reset in ASIC mode and sync reset in FPGA mode.

`timescale 1ns/1ps
`include "../include/params.v"

module conv_engine #(ZERO=0) (
    clk            ,
    clken          ,
    resetn         ,
    s_valid        ,
    s_ready        ,
    s_last         ,
    s_user         ,
    s_data_pixels  ,
    s_data_weights ,
    m_valid        ,
    m_data         ,
    m_last         ,
    m_user         
  );
  input  logic clk, clken, resetn;
  input  logic s_valid, s_last;
  output logic s_ready;
  output logic m_valid, m_last;
  input  logic [`TUSER_WIDTH_CONV_IN-1:0] s_user;
  input  logic [`COPIES-1:0][`UNITS -1:0]                          [`WORD_WIDTH_IN    -1:0] s_data_pixels;
  input  logic [`COPIES-1:0][`GROUPS-1:0][`MEMBERS-1:0]            [`WORD_WIDTH_IN    -1:0] s_data_weights;                                                                        
  output logic [`COPIES-1:0][`GROUPS-1:0][`MEMBERS-1:0][`UNITS-1:0][`WORD_WIDTH_OUT   -1:0] m_data;
  output logic [`TUSER_WIDTH_CONV_OUT-1:0] m_user;
  
  // Code ommited
  logic [`KW_MAX/2:0][`SW_MAX -1:0][`MEMBERS -1:0] lut_sum_start;
  logic [`COPIES-1:0][`GROUPS-1:0][`MEMBERS-1:0][`UNITS-1:0][`WORD_WIDTH_IN*2-1:0] mul_m_data ;
  logic [`COPIES-1:0][`GROUPS-1:0][`MEMBERS-1:0][`UNITS-1:0][`WORD_WIDTH_OUT -1:0] acc_s_data ;
  logic [`COPIES-1:0][`GROUPS-1:0][`MEMBERS-1:0][`UNITS-1:0][`WORD_WIDTH_OUT -1:0] mux_s2_data;

  generate
    genvar c,g,u,m,b,kw2,sw_1;
    // Code ommitted
    for (c=0; c < `COPIES; c++)
      for (g=0; g < `GROUPS; g++)
        for (u=0; u < `UNITS; u++)
          for (m=0; m < `MEMBERS; m++)
            if (m==0) assign mux_s2_data [c][g][m][u] = 0;
            else      assign mux_s2_data [c][g][m][u] = m_data     [c][g][m-1][u];
    assign mux_sel_next = mul_m_valid && mul_m_user[`I_IS_CIN_LAST] && (mul_m_kw2 != 0);

    register #(
      .WORD_WIDTH     (1),
      .RESET_VALUE    (0)
    ) MUX_SEL (
      .clock          (clk   ),
      .resetn         (resetn),
      .clock_enable   (clken ),
      .data_in        (mux_sel_next),
      .data_out       (mux_sel )
    );
    assign clken_mul = clken && !mux_sel;
            
    for (m=0; m < `MEMBERS; m++) begin: Mb
      for (kw2=0; kw2 <= `KW_MAX/2; kw2++)
        for (sw_1=0; sw_1 < `SW_MAX; sw_1++) begin
          localparam k = kw2*2 + 1;
          localparam s = sw_1 + 1;
          localparam j = k + s -1;

          assign lut_sum_start[kw2][sw_1][m] = m % j < s; // m % 3 < 1 : 0,1
        end
      assign acc_m_sum_start [m] = lut_sum_start[acc_m_kw2][acc_m_sw_1][m] & acc_m_user[`I_IS_SUM_START];
      // Code ommited
    end

    for (c=0; c < `COPIES; c++) begin: Ca
      for (g=0; g < `GROUPS; g++) begin: Ga
        for (u=0; u < `UNITS; u++) begin: Ua
          for (m=0; m < `MEMBERS; m++) begin: Ma
            processing_element PROCESSING_ELEMENT (
              .clk           (clk           ),
              .clken         (clken         ),
              .resetn        (resetn        ),
              .clken_mul     (clken_mul     ),
              .s_data_pixels (s_data_pixels [c]      [u]), 
              .s_data_weights(s_data_weights[c][g][m]   ),
              .mul_m_data    (mul_m_data    [c][g][m][u]),
              .mux_sel       (mux_sel       ),
              .mux_s2_data   (mux_s2_data   [c][g][m][u]),
              .bypass        (bypass        [m]),
              .clken_acc     (clken_acc     [m]),
              .acc_s_data    (acc_s_data    [c][g][m][u]),
              .m_data        (m_data        [c][g][m][u])
            );
    end end end end

    // Code ommitted
    assign m_user_base[`I_IS_BOTTOM_BLOCK:`I_IS_NOT_MAX] = acc_m_user[`I_IS_BOTTOM_BLOCK:`I_IS_NOT_MAX];
    assign m_user  = {m_clr, m_shift_b, m_shift_a, m_user_base};
  endgenerate
endmodule

module processing_element (
    clk    ,
    clken  ,
    resetn ,

    clken_mul,
    s_data_pixels, 
    s_data_weights,
    mul_m_data,

    mux_sel,
    mux_s2_data,
    bypass,
    clken_acc,
    acc_s_data,
    m_data
  );
  input  logic clk, clken, resetn;
  input  logic clken_mul, mux_sel, bypass, clken_acc;
  input  logic [`WORD_WIDTH_IN  -1:0] s_data_pixels, s_data_weights;
  input  logic [`WORD_WIDTH_OUT -1:0] mux_s2_data;
  output logic [`WORD_WIDTH_IN*2-1:0] mul_m_data;
  output logic [`WORD_WIDTH_OUT -1:0] acc_s_data;
  output logic [`WORD_WIDTH_OUT -1:0] m_data;

`ifdef MAC_XILINX
  multiplier MUL (
`else
  multiplier_raw MUL (
`endif
    .CLK    (clk),
    .CE     (clken_mul),
    .A      (s_data_pixels ),
    .B      (s_data_weights),
    .P      (mul_m_data    )
  );
  assign acc_s_data = mux_sel ? mux_s2_data  : `WORD_WIDTH_OUT'(signed'(mul_m_data));
  
`ifdef MAC_XILINX
  accumulator ACC (
`else
  accumulator_raw ACC (
`endif
    .CLK    (clk),  
    .bypass (bypass     ),  
    .CE     (clken_acc  ),  
    .B      (acc_s_data ),  
    .Q      (m_data     )  
  );
endmodule

5. Test Vector Generation

Python

Next, I write python functions to extracts weights and inputs of each layer from my custom framework model (2.2), and convert them into input test vectors. Their dimensions need to be split, reshaped, transposed and flattened to get the final input $\hat{X}$, and weights $\hat{K}$ packets which can be understood by the hardware I designed. Output $\hat{Y}$ is also transformed to match the hardware's outputs. Also, configuration bits need to be calculated and appended to the packet to make it complete.

The python functions perform these reshapes to generate input and output vectors (multidimensional tensors)

def get_weights(i_layers, i_itr, c):
    weights = c.LAYERS[f'{c.PREFIX_CONV}{i_layers}'].weights
    KH, KW, CIN, COUT = weights.shape
    max_factor = 2 if f'{c.PREFIX_MAX}{i_layers}' in c.LAYERS.keys() else 1
    print(f"get_weights - shape_in:(KH, KW, CIN, COUT) = {weights.shape}")
    
    '''
    Reshape
    '''
    weights = weights.transpose(3,0,1,2) #(COUT,KH,KW,CIN)
    weights = fill_invalid_scg(weights,KW=KW,max_factor=max_factor,c=c) #(ITR,EFF_CORES,KH,KW,CIN)
    ITR,EFF_CORES = weights.shape[0:2]
    weights = weights.transpose(0,4,2,1,3) #(ITR,CIN,KH,EFF_CORES,KW)

    '''
    * Data comes out of maxpool in the order: S,CGU
    * Data comes out of conv in the order   : CGMU and is transposed into S,CGUby hardware
    * Conv in takes weights in order        : CGM

    * Since system_out is SCG, first invalid should be filled that way, so that output data is continous and cin matches cout
    * After filling, we transpose it to CGM
    '''
    SUB_CORES = c.MEMBERS//KW
    weights = weights.reshape((ITR,CIN,KH, SUB_CORES,c.COPIES//max_factor,c.GROUPS ,KW)) # EFF_CORES = (SCG)
    weights = weights.transpose(0,1,2, 4,5, 3,6) # CGS
    weights = weights.reshape((ITR,CIN,KH,1,c.COPIES//max_factor,c.GROUPS,SUB_CORES,KW)) # (CGS)
    weights = np.repeat(weights,repeats=max_factor,axis=3)
    weights = weights.reshape((ITR,CIN,KH,c.COPIES,c.GROUPS,SUB_CORES,KW))
    weights = weights.reshape((ITR,CIN,KH,c.COPIES,c.GROUPS,SUB_CORES*KW))
    zeros = np.zeros((ITR,CIN,KH,c.COPIES,c.GROUPS,c.MEMBERS), dtype=weights.dtype)
    zeros[:,:,:,:,:,0:SUB_CORES*KW] = weights
    weights = zeros

    KERNEL_BEATS = CIN*KH
    weights = weights.reshape(ITR,KERNEL_BEATS,c.COPIES,c.GROUPS,c.MEMBERS)

    '''
    Add LRELU Beats
    '''
    lrelu = get_lrelu_config(i_layers=i_layers,c=c) 
        
    LRELU_BEATS = lrelu.shape[1]
    weights_beats = np.concatenate([lrelu,weights], axis=1) # (ITR, LRELU_BEATS + KERNEL_BEATS, COPIES, GROUPS, MEMBERS)

    _,H,W,CIN = c.LAYERS[f'{c.PREFIX_CONV}{i_layers}'].in_data.shape
    BLOCKS    = H // (SH*max_factor*c.CONV_UNITS)

    bram_weights_addr_max = LRELU_BEATS + SW*KH*CIN-1
    print("bram_weights_addr_max: ", bram_weights_addr_max)

    weights_config = 0
    weights_config |= (KW//2)
    weights_config |= (KH//2)               << (BITS_KW2)
    weights_config |= SW-1                  << (BITS_KW2 + BITS_KH2)
    weights_config |= (CIN   -1)            << (BITS_KW2 + BITS_KH2 + BITS_SW)
    weights_config |= (W     -1)            << (BITS_KW2 + BITS_KH2 + BITS_SW + BITS_CIN_MAX)
    weights_config |= (BLOCKS-1)            << (BITS_KW2 + BITS_KH2 + BITS_SW + BITS_CIN_MAX + BITS_COLS_MAX)
    weights_config |= bram_weights_addr_max << (BITS_KW2 + BITS_KH2 + BITS_SW + BITS_CIN_MAX + BITS_COLS_MAX + BITS_BLOCKS_MAX)

    weights_config = np.frombuffer(np.uint64(weights_config).tobytes(),np.int8)
    weights_config = np.repeat(weights_config[np.newaxis,...],repeats=ITR,axis=0)

    '''
    ADD c BEATS
    '''
    weights_dma_beats = np.concatenate([weights_config,weights_beats.reshape(ITR,-1)], axis=1)

    assert weights_dma_beats.shape == (ITR, 8 + (LRELU_BEATS + CIN*KH*SW)*c.COPIES*c.GROUPS*c.MEMBERS)
    print(f"get_weights - weights_dma_beats.shape: (ITR, 4 + (LRELU_BEATS + CIN*KH)*COPIES*GROUPS*MEMBERS) = {weights_dma_beats.shape}")

    np.savetxt(f"{c.DATA_DIR}{i_layers}_weights.txt", weights_dma_beats[i_itr].flatten(), fmt='%d')
    return weights_dma_beats

6. Testbench & Simulation

SystemVerilog

Next, I write testbenches for the modules. They are built around two custom SystemVerilog classes: AXIS_Slave, which reads a text file and loads data into an AXI stream port while conforming to the protocol, and AXIS_Master which reads data from a port and writes into a text file.

The control signals: valid and ready are randomized. They get toggled according to a given probability, to simulate the effects of memory bus freezing up and clearing.

`timescale 1ns/1ps
class AXIS_Slave #(WORD_WIDTH=8, WORDS_PER_BEAT=1, VALID_PROB=100);

  string file_path;
  int words_per_packet, status, iterations, i_words;
  int file = 0;
  int i_itr = 0;
  bit enable = 0;
  bit first_beat = 1;

  rand bit s_valid;
  constraint c { s_valid dist { 0 := (100-VALID_PROB), 1 := (VALID_PROB)}; };

  function new(string file_path, int words_per_packet, int iterations);
    this.file_path = file_path;
    this.words_per_packet = words_per_packet;
    this.iterations = iterations;
    this.file = $fopen(this.file_path, "r");
  endfunction

  function void fill_beat(
    ref logic [WORDS_PER_BEAT-1:0][WORD_WIDTH-1:0] s_data, 
    ref logic [WORDS_PER_BEAT-1:0] s_keep,
    ref logic s_last);

    if($feof(file)) $fatal(1, "EOF found\n");
    if (first_beat) first_beat = 0;

    for (int i=0; i < WORDS_PER_BEAT; i++) begin
      status = $fscanf(file,"%d\n", s_data[i]);
      s_keep[i] = i_words < words_per_packet;
      i_words  += 1;
    end
    if(i_words >= words_per_packet)
      s_last = 1;
  endfunction

  function void reset(
    ref logic s_valid,
    ref logic [WORDS_PER_BEAT-1:0][WORD_WIDTH-1:0] s_data, 
    ref logic [WORDS_PER_BEAT-1:0] s_keep,
    ref logic s_last
  );
    enable = 0;
    s_data = '{default:0};
    s_valid = 0;
    s_keep = 0;
    s_last = 0;
    first_beat = 1;
    i_words = 0;

    if (file != 0) $fclose(file);
    file = $fopen(file_path, "r");
  endfunction

  task axis_feed(
    ref logic aclk,
    ref logic s_ready,
    ref logic s_valid,
    ref logic [WORDS_PER_BEAT-1:0][WORD_WIDTH-1:0] s_data, 
    ref logic [WORDS_PER_BEAT-1:0] s_keep,
    ref logic s_last
  );
    // Before beginning: set all signals zero
    if (~enable) begin
      this.reset(s_valid, s_data, s_keep, s_last);
      @(posedge aclk);
      return;
    end

    @(posedge aclk);
    this.randomize(); // random this.s_valid at every cycle
    if (s_ready && (first_beat ? this.s_valid : s_valid)) begin
      // If s_last has passed with handshake, packet done. start next itr
      if(s_last) begin #1;
        i_itr += 1;
        this.reset(s_valid, s_data, s_keep, s_last);

        if (i_itr < iterations) 
          enable = 1;
        else begin
          $fclose(file);
          return;
        end
      end
      else #1;

      // If file is not open, keep valid down.
      if(file == 0) begin
        s_valid = 0;
        return;
      end
      else this.fill_beat(s_data, s_keep, s_last);
    end
    else #1;

    if (~first_beat) s_valid = this.s_valid;
  endtask
endclass

Stripped down version of AXIS_Slave class

Following is an example on how AXIS slave and master classes are utilized. Each module gets a testbench like this. Some modules get multiple slave and multiple masters.

module axis_tb_demo();
  timeunit 1ns;
  timeprecision 1ps;
  localparam CLK_PERIOD = 10;
  logic aclk;
  initial begin
    aclk = 0;
    forever #(CLK_PERIOD/2) aclk <= ~aclk;
  end

  localparam WORD_WIDTH        = 8;
  localparam WORDS_PER_PACKET  = 40;
  localparam WORDS_PER_BEAT    = 4;
  localparam ITERATIONS        = 6;
  localparam BEATS = int'($ceil(real'(WORDS_PER_PACKET)/real'(WORDS_PER_BEAT)));

  logic [WORDS_PER_BEAT  -1:0][WORD_WIDTH-1:0] data;
  logic [WORDS_PER_BEAT  -1:0] keep;
  logic valid, ready, last;

  string path = "D:/cnn-fpga/data/axis_test.txt";
  string out_base = "D:/cnn-fpga/data/axis_test_out_";

  AXIS_Slave #(
    .WORD_WIDTH    (WORD_WIDTH    ), 
    .WORDS_PER_BEAT(WORDS_PER_BEAT), 
    .VALID_PROB    (70            )
    ) slave_obj  = new(
      .file_path       (path), 
      .words_per_packet(WORDS_PER_PACKET), 
      .iterations      (ITERATIONS)
      );
  AXIS_Master #(
    .WORD_WIDTH    (WORD_WIDTH    ), 
    .WORDS_PER_BEAT(WORDS_PER_BEAT), 
    .READY_PROB    (70            ), 
    .CLK_PERIOD    (CLK_PERIOD    ),
    .IS_ACTIVE     (1             )
    ) master_obj = new(
      .file_base(out_base),
      .words_per_packet(-1),
      .packets_per_file(2)
      );
  initial forever  slave_obj.axis_feed(aclk, ready, valid, data, keep, last);
  initial forever master_obj.axis_read(aclk, ready, valid, data, keep, last);

  initial begin
    @(posedge aclk);
    slave_obj.enable <= 1;
    master_obj.enable <= 1;
  end

  int s_words, s_itr, m_words, m_itr, m_packets, m_packets_per_file;
  initial
    forever begin
      @(posedge aclk);
      s_words = slave_obj.i_words;
      s_itr = slave_obj.i_itr;
      m_words = master_obj.i_words;
      m_itr = master_obj.i_itr;
      m_packets = master_obj.i_packets;
      m_packets_per_file = master_obj.packets_per_file;
    end
endmodule

7. Debugging

Python Notebooks & SystemVerilog Simulations

I then run simulations, collect output vectors and compare them with expected vectors using python notebooks. Notebooks allow one to play around with data, quickly print and observe different dimensions...etc.

Integration testing the whole system

Observing outputs

The dimension indices of the locations where output doesn't match expected output. Very helpful in debugging.

8. Debugging with Microsoft Excel

Yep :-)

In some cases, the output from a module is garbage and does not match the expected output at all. Since it is a convolution over several values, it is near impossible to guess the bug by looking at such garbage numbers. In that case, I resort to Excel, where I manually transform a set of small input vectors through the logic, step by step to see what I should expect in every clock cycle. I then compare it to the waveforms I see in the simulator to figure out where the bug is.

9. Repeat

I move back and forth between the whiteboard, RTL code, python code, and simulation to fix bugs one by one. Some take weeks and make me want to pull my hair out. I also do this for each module, then put them together hierarchically, write integration testbenches, and test that too.

10. ASIC Synthesis

Once the design is verified in randomized simulations, I write the scripts for ASIC synthesis. Our university uses Cadence tools, so the following script is for Cadence Genus, using 65nm CMOS PDK from TSMC.

set TOP axis_accelerator_asic
# set TOP axis_conv_engine

#--------- CONFIG
set RTL_DIR ../../rtl
set XILINX 0
source ../../tcl/config.tcl
set_db hdl_max_loop_limit 10000000

set TECH 65nm
set NUM_MACS [expr $MEMBERS*$UNITS*$GROUPS*$COPIES]
set REPORT_DIR ../report/${TECH}/${TOP}/${NUM_MACS}
exec mkdir -p $REPORT_DIR

#--------- LIBRARIES
set LIB_DIR ../../../tsmc/${TECH}/GP
set_db library [glob $LIB_DIR/cc_lib/noise_scadv10_cln65gp_hvt_tt_1p0v_25c.lib $LIB_DIR/cc_lib/scadv10_cln65gp_hvt_tt_1p0v_25c.lib]
set_db lef_library [glob $LIB_DIR/lef/tsmc_cln65_a10_6X1Z_tech.lef $LIB_DIR/lef/tsmc_cln65_a10_6X2Z_tech.lef $LIB_DIR/lef/tsmc65_hvt_sc_adv10_macro.lef]
set_db qrc_tech_file $LIB_DIR/other/icecaps.tch
# set LIB_DIR ../../../tsmc/${TECH}/LP
# set_db library [glob $LIB_DIR/lib/sc12_cln65lp_base_hvt_tt_typical_max_1p00v_25c.lib $LIB_DIR/lib/sc12_cln65lp_base_hvt_tt_typical_max_1p20v_25c.lib]
# set_db lef_library [glob $LIB_DIR/lef/sc12_cln65lp_base_hvt.lef]

#--------- READ
read_hdl -mixvlog [glob $RTL_DIR/include/*]
read_hdl -mixvlog [glob $RTL_DIR/external/*]
read_hdl -mixvlog [glob $RTL_DIR/src/*]

#--------- ELABORATE & CHECK
set_db lp_insert_clock_gating true
elaborate $TOP
check_design > ${REPORT_DIR}/check_design.log
uniquify $TOP

#--------- CONSTRAINTS
set PERIOD [expr 1000.0/$FREQ_HIGH]
create_clock -name aclk -period $PERIOD [get_ports aclk]
set_dont_touch_network [all_clocks]
set_dont_touch_network [get_ports {aresetn}]

set design_inputs [get_ports {m_axis_tready s_axis_pixels_tvalid s_axis_pixels_tlast s_axis_pixels_tdata s_axis_pixels_tkeep s_axis_weights_tvalid s_axis_weights_tlast s_axis_weights_tdata s_axis_weights_tkeep}]
set design_outputs [get_ports {s_axis_pixels_tready  s_axis_weights_tready m_axis_tvalid m_axis_tlast m_axis_tdata m_axis_tkeep}]

set_input_delay  [expr $PERIOD * 0.6] -clock aclk $design_inputs
set_output_delay [expr $PERIOD * 0.6] -clock aclk $design_outputs

#--------- RETIME OPTIONS
set_db retime_async_reset true
set_db design:${TOP} .retime true

#--------- SYNTHESIZE
set_db syn_global_effort high
syn_generic
syn_map
syn_opt

#--------- NETLIST
write -mapped > ../output/${TOP}.v
write_sdc > ../output/${TOP}.sdc

#--------- REPORTS
report_area > ${REPORT_DIR}/area.log
report_gates > ${REPORT_DIR}/gates.log
report_timing  -nworst 10 > ${REPORT_DIR}/timing.log
report_congestion > ${REPORT_DIR}/congestion.log
report_messages > ${REPORT_DIR}/messages.log
report_hierarchy > ${REPORT_DIR}/hierarchy.log
report_clock_gating > ${REPORT_DIR}/clock_gating.log

build_rtl_power_models -clean_up_netlist
report_power > ${REPORT_DIR}/power.log

Neural Chip Design [4/4: SoC Integration & Firmware]

SoC design involves integrating other IPs, automating projects with TCL, writing C++ firmware, and testing on PSoC hardware.

Neural Chip Design [3/4: Digital Design & Verification]

Neural Chip Design [2/4: Golden Model]

Abarajithan Gnaneswaran — Sat, 29 Jan 2022 10:30:50 GMT

This is a series of articles [overview] outlining the workflow of 15 steps, which I developed over the past few years through building my own DNN accelerator: Kraken [arXiv paper].

Golden Models are essential to hardware (FPGA/ASIC) development. They model the expected behavior of a chip using a high-level language, such that they can be built relatively fast, with almost zero chance of error. The input and expected output test vectors for every RTL module are generated using them, and the simulation output from the testbench is compared against their 'gold standard.'

I first obtain pretrained DNNs from PyTorch / Tensorflow model zoo, analyze them, then load them into the custom DNN inference framework I have built with NumPy stack to ensure I fully understand each operation. I then generate test vectors from those golden models.

Steps:

PyTorch/TensorFlow: Explore DNN models, quantize & extract weights
Golden Model in Python (NumPy stack): Custom OOP framework, process the weights, convert to custom datatypes

1. Tensorflow / PyTorch

Tensorflow (Google) and PyTorch (Facebook) are the two competing open source libraries used to build, train, quantize and deploy modern deep neural networks.

Both frameworks provide high-level, user-friendly classes and functions such as Conv2D, model.fit() to build & train networks. Each such high-level API is implemented using their own low-level tensor operations (matmul, einsum), which also can be used by the users. Those operations are implemented using their C++ backend, accelerated by high performant libraries like eigen and CUDA. Once we define the models using Python, the C++ code underneath pulls the load, making them fast as well as user-friendly.

1.1. Download & Explore Pretrained DNN Models

As the first step, I obtained the pretrained models from either Keras.Applications or PyTorch Model Zoo.

import tensorflow as tf
from tensorflow.keras.applications import VGG16

vgg16 = VGG16(
    include_top=True,
    weights="imagenet",
    input_tensor=None,
    input_shape=None,
    pooling=None,
    classes=1000,
    classifier_activation="softmax",
)

vgg16.save('saved_model/vgg16')
vgg16.summary()

Get VGG16 from TensorFlow zoo (Keras.Applications)

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'alexnet', pretrained=True)
model.eval()
torch.save(model, "alexnet.pt")

Get AlexNet from PyTorch Model Zoo

1.2. Build Models & Retrain if needed (PyTorch)

PyTorch is more intuitive, pythonic and bliss to work with. I use it to build new models and train them if needed.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets
from torchvision import transforms as T
from torch.optim.lr_scheduler import StepLR

H = 28
N = 32
device = torch.device("cuda")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(H**2, 32)
        self.fc2 = nn.Linear(32, 10)
    def forward(self, x):
        x = self.fc1(x)
        x = F.leaky_relu(x,0.01)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
        
model = Net().to(device)

Building a simple fully connected network in PyTorch

''' Create Data Loaders to efficiently pull data '''
transform = T.Compose([T.ToTensor(), T.Lambda(lambda x: torch.flatten(x))])
dataset1  = datasets.MNIST('../data', train=True, download=True, transform=transform)
dataset2  = datasets.MNIST('../data', train=False, transform=transform)

train_loader = torch.utils.data.DataLoader(dataset1,batch_size=N)
test_loader = torch.utils.data.DataLoader(dataset2,batch_size=N)

''' Functions to Test & Train '''

def train(model, device, train_loader, optimizer, epoch):
    model.train()
    for (data, target) in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        print(f'Train, epoch: {epoch}, loss: {loss.item():.6f}')

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    print(f'Test, loss: {test_loss}, acc: {100*correct/len(test_loader.dataset):.0f}')

optimizer = optim.Adadelta(model.parameters(), lr=1.0)
scheduler = StepLR(optimizer, step_size=1, gamma=0.7)

''' Train for 50 Epochs '''
for epoch in range(1, 51):
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)
    scheduler.step()

''' Save Trained Model '''
torch.save(model, "mnist_cnn.pt")

Training the model in PyTorch with MNIST dataset

1.3. Convert Torch Models to Tensorflow

However, the support for int8 quantization for PyTorch is still experimental. Therefore, for most of my work, I use pretrained models from Tensorflow, whose quantization library (TFLite) is much superior.

Some models, like AlexNet, are not found in Keras.Applications. Therefore, I load them from PyTorch Model Zoo and convert them to ONNX (the common open-source format) and then load them in Tensorflow.

import torch
from PIL import Image
from torchvision import transforms

model = torch.load('alexnet.pt')

input_image = Image.open('dog.jpg')
preprocess = transforms.Compose([
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)

torch.onnx.export(model, input_batch, "alexnet.onnx",
		input_names=['input'], output_names=['output'])

Open the saved AlexNet model and export as ONNX

# First install onnx2keras with: pip install onnx onnx2keras
import onnx
from onnx2keras import onnx_to_keras
onnx_model = onnx.load("alexnet.onnx")
k_model = onnx_to_keras(onnx_model, ['input'], change_ordering=True)
k_model.save('saved_model/alexnet')

Convert ONNX model to Keras (Tensorflow)

1.4. Quantize Models with TensorFlowLite

Following is an example of loading a float32 model (VGG16) from tensorflow's savedmodel format (1.1), testing it, quantizing it to int8, and testing & saving the quantized network.

import tensorflow as tf
filenames = glob("dataset/*.jpg")

'''
LOAD AND TEST FLOAT32 MODEL
'''
prep_fn = tf.keras.applications.vgg16.preprocess_input
model = tf.keras.models.load_model(f'saved_model/vgg16')
h = model.input_shape[1]

import cv2
from glob import glob
import numpy as np
def representative_data_gen():
    for im_path in filenames:
        im = cv2.imread(im_path)
        im = cv2.resize(im, (h,h))
        im = im[None,:,:,::-1]
        im = prep_fn(im)
        im = tf.convert_to_tensor(im)
        yield [im]

images = list(representative_data_gen())

predictions = np.zeros((len(images),), dtype=int)
for i, image in enumerate(images):
    output = model(image[0])[0]
    predictions[i] = output.numpy().argmax()
print(predictions)

'''
CONVERT AND SAVE INT8 MODEL (STATIC QUANTIZATION)
'''
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quant = converter.convert()

import pathlib
tflite_model_quant_file = pathlib.Path(f"tflite/vgg16.tflite")
tflite_model_quant_file.write_bytes(tflite_model_quant)

'''
LOAD AND TEST QUANTIZED MODEL
'''
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

images = list(representative_data_gen())
predictions = np.zeros((len(images),), dtype=int)

for i, image in enumerate(images):
    image = image[0]
    input_scale, input_zero_point = input_details["quantization"]
    image = image / input_scale + input_zero_point

    test_image = image.numpy().astype(input_details["dtype"])
    interpreter.set_tensor(input_details["index"], test_image)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details["index"])[0]
    predictions[i] = output.argmax()

print(predictions)

1.5 Explore Model Architecture

Netron is a great tool for opening tensorflow's 32-bit models (savedmodel), tflite's int8 models (tflite), pytorch models (pt), ONNX models, and more, to observe the architecture and tensor names.

2. Golden Model

Python (NumPy stack)

After obtaining the pretrained model, I need to 100% understand what operations are involved and how they are applied as data flows through the network. The best way to do this is to re-do it myself from scratch and obtain exactly the same results.

2.1. Custom Quantization Scheme

2.2. Custom Inference Framework (OOP, Python)

For this, I built a custom framework in Python. It is structured like Keras with the following classes, inheriting as follows:

MyModel
MyLayer
- MyConv
- MyLeakyReLU
- MyMaxpool
- MyConcat
- MySpaceToDepth
- MyFlatten

A MyModel object has a list of objects from MyLayer's children's classes. It's constructor extracts weights from tflite and sets them to the layers. A set of images can flow through the layers through a recursive call to the last layer. Following is the stripped-down version of the MyConv implementation.

class MyConv(MyLayer):
    def __init__(self,
                 weights_biases,
                 prev_layer=None,
                 bn_weights=None,
                 name='',
                 np_dtype=np.float64,
                 np_dtype_sum=np.float64,
                 np_dtype_conv_out=np.float64,
                 float_dtype=np.float64,
                 bits_conv_out=32,
                 quantize=False,
                 float_ieee=True):

        MyLayer.__init__(self,
                         name=name,
                         prev_layer=prev_layer,
                         np_dtype=np_dtype,
                         quantize=quantize,
                         float_ieee=float_ieee)
        self.np_dtype_sum = np_dtype_sum
        self.np_dtype_conv_out = np_dtype_conv_out
        self.float_dtype = float_dtype

        assert len(weights_biases[0].shape) == 4
        self.weights = weights_biases[0].astype(self.np_dtype)
        self.weights_flipped = np.flip(
            self.weights, [0, 1]).astype(self.np_dtype)
        self.kernel = self.weights.shape[0:2]
        self.in_ch_n = self.weights.shape[2]
        self.out_ch_n = self.weights.shape[3]
        self.biases = weights_biases[1].astype(self.np_dtype)

        self.fuse_bn(bn_weights)

        if self.quantize:
            self.clip_max = 2**(bits_conv_out-1)-1
            self.clip_min = -2**(bits_conv_out-1)

    def np_out(self, in_data):
        if self.quantize:
            in_data = self.in_data.copy().astype(self.np_dtype_sum)
            weights = self.weights.copy().astype(self.np_dtype_sum)

            self.np_out_data = self.conv2d_einsum(in_data, weights)

            self.np_out_data = np.clip(self.np_out_data, self.clip_min, self.clip_max)
            self.np_out_data = self.np_out_data.astype(self.np_dtype_conv_out)
        else:
            out = self.conv2d_einsum(self.in_data, self.weights)
            out += self.biases
            self.np_out_data = self.decode(self.encode(out))
        return self.np_out_data

    def fuse_bn(self, bn_weights, epsilon=0.001):
        self.gamma, self.beta, self.mean, self.variance = bn_weights
        self.epsilon = epsilon

        scale = self.gamma / np.sqrt(self.variance + self.epsilon)

        self.pure_weights = self.weights.copy()
        self.pure_biases = self.biases.copy()

        self.weights = self.weights * scale
        self.weights_flipped = np.flip(self.weights, [0, 1])
        self.biases = beta + scale * (self.biases - self.mean)

    @staticmethod
    def conv2d_einsum(img, kernel):
        pad_h = kernel.shape[0]//2
        pad_w = kernel.shape[1]//2

        out_batch = []
        for n in range(img.shape[0]):
            padding = ((pad_h, pad_h), (pad_w, pad_w), (0, 0))
            img_pad = np.pad(img[n], padding, 'constant')

            sub_shape = tuple(np.subtract(img_pad.shape, kernel.shape[0:-1])+1)
            shape = kernel.shape[0:-1] + sub_shape
            strd = np.lib.stride_tricks.as_strided
            submatrices = strd(img_pad,shape,img_pad.strides*2,writeable=False)

            out_batch += [np.einsum('ijkl,ijkmno->mnl', kernel, submatrices)]
        return np.array(out_batch)

stripped-down version of the MyConv implementation

2.3. Rebuilding the model & Debugging

I then rebuild the model using the above framework, pass data and tweak things until I get the exact same output. That tells me I have understood all the operations going on inside the model.

Once I've understood the model inside-out, I start designing the hardware on the whiteboard.

How I designed modules in the whiteboard, then wrote SystemVerilog RTL, built testbenches, prepared test vectors, and debugged them.

Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks

Neural Chip Design [1/4: Overview]

Abarajithan Gnaneswaran — Fri, 28 Jan 2022 16:57:48 GMT

Kraken Engine

In March 2020, I started building Kraken as a personal, passion project, an engine capable of accelerating any kind of Deep Neural Network (DNN) with convolutional layers, fully-connected layers, and matrix products using a single uniform dataflow pattern while consuming remarkably low on-chip area. I synthesized it in 65-nm CMOS technology at 400 MHz, packing 672 MACs in 7.3 mm2, with a peak performance of 537.6 Gops.

When benchmarked on AlexNet [336.6 fps], VGG16 [17.5 fps], and ResNet-50 [64.2 fps] (yeah, those are old, but those are the widely used benchmarks), it outperforms the state-of-the-art ASIC architectures in terms of overall performance efficiency, DRAM accesses, arithmetic intensity, and throughput, with 5.8x more Gops/mm2 and 1.6x more Gops/W.

I submitted the design as a journal paper to IEEE TCAS-1, the #3 journal in the field, and it is currently under review. You can find the paper at the following link.

Deep neural networks (DNNs) have been successfully employed in a multitude ofapplications with remarkable performance. As such performance is achieved at asignificant computational cost, several embedded applications demand fast andefficient hardware accelerators for DNNs. Previously proposed app…

arXiv.orgG Abarajithan

An overview of the series

Technologies Used

Throughout the project, I wrote code, moved back and forth, and interfaced between several technologies, from different domains such as hardware (RTL), software, and machine vision.

Python - Numpy, Tensorflow, PyTorch
SystemVerilog - RTL, Testbenches
TCL - Scripting the Vivado project, ASIC synthesis
C++ - Firmware to control the system-on-chip
Tools: Xilinx (Vivado, SDK), Cadence (Genus, Innovus)...

My Workflow

Through the Kraken project, I developed a workflow of 15 steps, which helps me to move between golden models in python, RTL designs & simulations in SystemVerilog, and firmware in C++. I have written them in detail as three more blog posts with code examples.

2/4: Golden Model

PyTorch/TensorFlow: Explore DNN models, quantize & extract weights
Golden Model in Python (NumPy stack): Custom OOP framework, process the weights, convert to custom datatypes

Neural Chip Design [2/4: Golden Model]

I wrote golden models in python to model the intended behavior of the chip. This outlines how I got pretrained models, extracted weights, applied my own quantization...etc.

Neural Chip Design [3/4: Digital Design & Verification]

3/4: RTL Design & Verification

Whiteboard: Design hardware modules, state machines
RTL Design: SystemVerilog/Verilog for the whiteboard designs
Generate Test Vectors: using Python Notebooks
Testbenches: SystemVerilog OOP testbenches to read the input vector (txt file), randomly control the valid & ready signals and get output vectors (txt files)
Debug: Python notebooks to compare the expected output with simulation output and to find which dimensions have errors.
Microsoft Excel: I manually simulate the values in wires with excel to debug
Repeat 3-8: For every module & every level of integration
ASIC Synthesis

How I designed modules in the whiteboard, then wrote SystemVerilog RTL, built testbenches, prepared test vectors, and debugged them.

Neural Chip Design [4/4: SoC Integration & Firmware]

4/4: System-on-Chip Integration & Firmware Development

SoC Block Design: Build FPGA projects with Vivado manually and synthesize
Automation: TCL scripts to automate the project building and configuration
C++ Firmware: To control the custom modules
Hardware Verification: Test on FPGA, compare output to golden model
Repeat 11-14

SoC design involves integrating other IPs, automating projects with TCL, writing C++ firmware, and testing on PSoC hardware.

Neural Chip Design [2/4: Golden Model]

Directory Structure

Simpler FPGA/ASIC projects can be done within a single folder. However, as the scope and complexity of the project grow, the need to work with multiple languages (Python, SystemVerilog, C, TCL...) and tools (Jupyter, Vivado, SDK, Genus, Innovus...) can grow out of control.

The project needs to be version-controlled (git) as well, to prevent data loss and to move between different stages of development like a time machine. However, FPGA and ASIC tools create a lot of internal files, which do not generalize between machines. Therefore, the building of such projects is automated via TCL scripts, and only such scripts and the source files are git-tracked.

The following is the structure I developed through my Kraken project.

kraken
│
├── hdl
│   ├── src        : rtl designs (SystemVerilog/Verilog)
│   ├── tb         : SystemVerilog testbenches
│   ├── include    : V/SV files with macros
│   └── external   : open-source SV/V libraries
│
├── fpga
│   ├── scripts    : TCL scripts to build & configure Vivado projects from source
│   ├── projects   : Vivado projects [not git tracked]
│   └── wave       : waveform scripts (wcfg)
│
├── asic
│   ├── scripts    : TCL scripts for synth, p&r from source
│   ├── reports    : reports from asic tools
│   ├── work       : working folder for ASIC tools, [not git tracked]
│   ├── log        : [not git tracked]
│   └── pdk        : technology node files, several GBs [not git tracked]
│
├── python
│   ├── dnns       : TensorFlow, Torch, TfLite extraction
│   ├── framework  : Custom framework
│   └── golden     : Golden models
│
├── data           : [not git tracked]
│   ├── input      : input test vectors, text files, generated by python scripts
│   ├── output_exp : expected output vectors
│   ├── output_sim : output vectors from hardware simulation
│   └── output_fpga: output vectors from FPGA
│
├── cpp
│   ├── src        : C++ firmware for the controller
│   ├── include    : header files
│   └── external   : external libraries
│
└── doc            : documentation: excel files, drawings...

I wrote golden models in python to model the intended behavior of the chip. This outlines how I got pretrained models, extracted weights, applied my own quantization...etc.

Custom Processor (Verilog), ISA, Compiler & Simulator (Python)

Vision-Based Adaptive Traffic Control on an MPSoC [ARM+FPGA]

Abarajithan Gnaneswaran — Fri, 28 Jan 2022 00:00:00 GMT

In a nutshell. My diagram from our patent

Overview

Overview of the project

As a modular, mass-manufacturable and decentralized edge solution to the adaptive traffic control problem, we designed and implemented a custom CNN accelerator on FPGA, modified & trained YOLOv2 object detector with custom Sri Lankan data to detect vehicles in the day, night & rain, and wrote custom algorithms to track vehicles, measure weighted critical flow with 100% (day) & 80% (night) accuracy, and calculate traffic green times (delta algorithm) on the ARM processor core.

A patent for our system is currently under review at NIPO and the system is being further developed with funding from World Bank via AHEAD into a product by a multidisciplinary team of engineers through the Enterprise (Business Linkage Cell) of the University of Moratuwa, together with RDA and SD&CC.

Our project won Gold at NBQSA 2020 (National ICT Awards) organized by British Computer Society in Tertiary Student Category and is currently (2021 December) competing at APICTA (Asia Pacific) Awards held in Malaysia.

It's storytime... How it started:

In my 3rd year (5th semester), by pure chance, we formed a team of four: Abrutech (Abarajithan, Rukshan, Tehara, Chinthana) for our processor project. Soon we figured out that we had great chemistry, we complemented each other's strengths and weaknesses, a team made in heaven. :-)

The first architecture I designedThe major challenge in designing a processor is the trade-off between the size of ISA, hardware complexity and user-friendliness. ABRUTECH is a unique custom processor highly optimized to manipulate matrices while preserving the functionalities of a generic processor…

CSIRO: End-to-End Machine Learning Pipeline

I had a knack for digital architecture design and I was good at coming up with ideas and algorithms. Rukshan was the best verification engineer I have seen. With infinite patience and thoroughness, he never skipped a corner case. A Verilog module he wrote and tested is as good as formally verified. Tehara had the passion and patience to modify neural networks and train them for weeks. Chinthana was the street smart guy, the jack-of-all-trades, who could learn and do any task within hours.

By the end of that project, I had decided that I want this group forever, especially for the final year project. Also, the feeling I got when I designed architecture was nothing but pure bliss. It was addictive, I had never felt that before. I was in love.

In the next semester, we all started our internships. At CSIRO (Australia), as I worked with building CNNs on Tensorflow/Keras, training them and implementing them on edge devices (Jetson TX2), I itched to design an architecture to accelerate complex CNNs, such as object detectors. That was a perfect idea for our team. Rukshan was implementing a Maxpool engine for a simpler CNN at NTU (Singapore), Chinthana was building a python compiler for ML on ASIC at Wave Computing and Tehera was working object detectors (RCNNs) at Zone 24x7. So, we collectively decided on this topic and were brainstorming remotely from three countries.

In May 2018, I was chosen for a 6 month paid internship program at DATA61 CSIRO (The national science agency of Australia), in Brisbane, Australia. I was supervised by Dr. Nicholas Hudson and Dr. Navinda Kottege. The main task assigned to me was to develop an end-to-end pipeline for robotics-related

Aba’s Inference Framework for Precision Experiments

However, when we proposed the idea to a few scientists at CSIRO and lecturers in our department, they suggested that an FPGA implementation of a custom CNN engine would be a waste of time, as GPU-based systems are the popular ones for inference back then, and that we will not be able to publish this. Disheartened, we searched for a staff-proposed project where this solution might make sense. We found Prof. Rohan's project, funded by the world bank, titled Vision-Based Traffic Control. It was a topic that was attempted a few times in the past and failed. One team had tried to fly hot air balloons, another MSc team tried a simpler approach with Raspberry Pi, of passing the images through an edge-detection kernel, counting white pixels and using a custom fully-connected network of 2 layers trained on a custom dataset of few hundred images to estimate a traffic level (1-5) from the white pixel count.

We proposed to tackle this problem with our own mass-manufacturable, robust solution. A pre-trained and then fine-tuned object detector running on our custom engine in an FPGA and a custom algorithm tested in VISSIM simulations to control the traffic lights based on those detections. This received a huge backlash from some staff members, who pointed out a few valid concerns. We may not have enough time to obtain government permission to demonstrate this on the road, and the CNNs may not work with Sri Lankan data. After some traumatizing back and forth via academic politics, we picked the project in February.

As a response to "Like everyone, you will start collecting data way too late... in August, and find your model doesn't work", I vowed to demonstrate it within ten days! Rukshan built a data collection device, powered by a power bank through a 40-feet wire and programmed a python GUI to control its 2-axis servo and camera via Wi-Fi. We collected 750 images, Tehara ran pretrained YOLOv2 on them and we showed that we can detect ALL vehicles in both day and night time, acing the feasibility presentation within 10 days. With that start, we were good to go.

Methodology

The loop detectors and radar-based traffic sensing methods lacked accuracy, especially for smaller vehicles like motorcycles, which jam the traffic in developing countries. Vision-based adaptive controls in developing countries primarily rely on processing in a central server, which requires a high s setup cost. Research level edge solutions based on processors use simple algorithms (like edge detection) and aren't robust enough in different lighting conditions. Ones that run on GPU-like systems aren't mass manufacturable. Therefore, we came to the conclusion that our method of a state-of-the-art, robust object detector accelerated on a mass-manufacturable FPGA-based system best solves all these issues.

1. Machine Learning

1.1 Inference Framework for Precision Experiments (Aba)

I built a Keras-like object-oriented inference framework using the multidimensional operations of NumPy. Layer types (Conv, Relu...etc) are implemented as subclasses of the common Layer class. They can be chained to build a model, which can forward propagate an image through multiple datapaths.

Aba’s Inference Framework for Precision Experiments - aba_framework.py

Gist262588213843476

Part of my framework code

Part of my code shows convolution via Einstein summation and custom requantization scheme

1.2 Data Collection Devices (Rukshan, Chinthana) & Annotation (Tehara)

Built four remotely powered, wirelessly data-collection devices. Collected, annotated and augment traffic images to create a Sri Lankan traffic dataset (1500 images).

1.3 CNN Architecture & Training (Tehara)

Optimized the architecture of the YOLOv2 object detection neural network for hardware implementation. Trained YOLOv2 and TinyYOLO.

Fused batch normalization into convolution by modifying the weights and biases accordingly.
Interchanged conv => leaky-relu => max-pool to conv => max-pool => leaky-relu to reduce power.
Changed the output layer from 80 classes to 5 classes, by reusing weights of appropriate classes.
Changed grid size from (13 x 13) to (12 x 8) and designed the sensing algorithm accordingly
Trained with custom Sri Lankan Traffic Dataset (three-wheelers)

2. CNN Accelerator Design (Aba)

I designed our first engine, which in retrospect resembled ShiDianNao (ISCA’15). With 9-muxes x 24, 3-muxes x 48, 16-bit registers x 144, multipliers x 3, accumulators x 3, each core, performed 12 of 3x3 convolutions in 9 clock cycles. Rukshan implemented it and demonstrated it in simulation. But we couldn't fit enough processing elements of it in our FPGA to run YOLOv2 at any respectable speed. The LUT count exceeded, thanks to large 9-way multiplexers. Several registers stayed unused over most clocks. In addition, Rukshan hated the ad-hoc solutions we came up with to handle edge cases, calling them cello tape solutions.

Limited by the resources and the memory bandwidth of our FPGA, I went back to the whiteboard. On an arid afternoon of August that I vividly remember, I conceived Kraken’s cornerstone. Staying in Colombo to work on this without going home for mid vacations, staring at the whiteboard and sweating in the unbearable Colombo heat, it flashed to me. That I can separate the convolution into horizontal and vertical, and shift them independently, such that partial sums snake through the 3 PEs in a core to compute a full 3x3 convolution. Within an hour, I had the accelerator design ready, with data rates and clock cycles calculated.

Rukshan hates my habit of changing the architecture, introducing new features every day, in the name of optimizing it. As he spends several weeks verifying designs without skipping a corner case, it is a nightmare when I change a verified design. Basically, I'm progressive and he's conservative, making us the ideal pair of designers. I spent a few days convincing him, showing that the core 2.0, was 4 times faster, used five times fewer 3-muxes, zero 9-muxes, about 20 times fewer registers (for the same speed), with 100% utilization of all multipliers and adders.

He finally came around and started implementation. I designed some support modules and implemented them, and I started working on the PS-PL (ARM-FPGA) coordination. It took me a while to wrap my head around the Vivado design flow and I finally cracked it after I got some hands-on experience at the workshop organized by International Center for Theoretical Physics, Italy in Assam, India.

Four of us stayed together at Tehara's home for a month-long sprint. The final system could run the 3x3, 1x1 layers of YOLOv2, implemented at 50 MHz on Xilinx Z706. The fmax was disappointing, and we knew why. Without any guidance in digital design, we implemented our AXI stream modules to handle the handshakes combinationally, resulting in an extremely long path from the back to the front of the pipe for the ready signal. Also, the enable signals of the nested counters of 5-levels were implemented combinationally, causing another long path. We didn't have time to redesign and verify it, so we decided to go with what we had.

System Design on ZYNQ Programmable SoC (left); One of the 192 cores, able to do 3x3 and 1x1 convolutions (right)

3. Tracking Algorithm (Aba)

Meanwhile, I designed a lightweight, IOU-based, standalone (no libraries) object tracking algorithm, robust to broken tracks, double-counting...etc and implemented them in both python and C (bare-metal on ARM side of ZYNQ FPGA.

Near 97% vehicle counting accuracy in the daytime, 85% accuracy in the night, rainy time, on test data (on a road the CNN has never seen before)
Object detector (YOLOv2) has less accuracy. But tracking algorithm is designed to obtain near 100% accuracy in vehicle counting and identification

4. Delta Algorithm for Traffic timing (Aba), VISSIM verification (Chinthana)

With that, I also designed and tested 8 algorithms based on density, bounding box count, flow...etc. Chinatha figured out the VISSIM software, built a sophisticated intersection based on a real-world one, and tested the algorithms. None of them converged.

Our supervisor had suggested an algorithm, where static time is changed a little, based on the ratio of the number of vehicles, which failed to converge on testing. Thinking about it, I figured a way to put it into an equation, but with traffic flow. We named it the Delta Algorithm. During poor visibility, it naturally falls back to static timing. Chinthana tested it and found it converges. We tuned the sensitivity parameter.

Acknowledgement

We are forever indebted to our families of Tehara and Chinthana for hosting our team for weeks/months during strikes and study breaks, allowing us to work together. In addition, we thank our Supervisors: Prof. Rohan and Prof. Saman Bandara for their support and assistance.

Behind the Scenes

My team: abrutech

We stayed for months at each other's homes to work together day and night. We became a part of each other's families, celebrating birthdays, night outs... the most icon team in our batch, bonded for life! <3

How I work

My whiteboard designs throughout the project. Some were not implemented. Just to show you how I work. :-)

My Paper-Writing Workflow [Inkscape, Python, Mendeley, VSCode, Git]

Abarajithan Gnaneswaran — Sun, 21 Nov 2021 06:32:20 GMT

I'm quite used to LaTeX, thanks to our department encouraging students to make assignments and reports in LaTeX from the sophomore level. Through countless such submissions, I have tried multiple workflows: TexMaker, Overleaf, VSCode + Latex Workshop and more.

This year, I began writing my first paper (Kraken), targeting a journal. Since it was on my personal, passion project, wrote the entire paper, drew all diagrams, made all figures and revised it several times, before handing it over to my advisor, who reviewed it. Because this is my first paper, I wanted everything: the diagrams, graphs, their text styles and all to be perfect. I experimented and learned a lot on the way to craft my own workflow for this and my future publications.

Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks

arXiv.orgG Abarajithan

Kraken paper

In a nutshell,

I draw diagrams with Inkscape, save them as EPS (without text) + TEX (text only), such that text is rendered in latex for a uniform style.
I use the python ecosystem to handle data. Pandas for spreadsheets, Matplotlib and Seaborn for beautiful plots. SciencePlots to conform to IEEE style.
I manage my references with Mendeley, sync them across my devices. I use Resilo to sync my handwritten notes on papers between my tablet and laptop
I write pure LaTeX with VSCode + LaTeX workshop, compile it with MiKTeX, version control everything with git and sync it to overleaf via GitHub for my supervisor.

Drawings - Inkscape

Which one looks better?

LEFT: one I drew with draw.io for our patent, RIGHT: one I drew with Inkscape for my paper

Tools like draw.io lack precision. Also, the size and style of their fonts do not match the rest of the LaTeX text. Inkscape [get] is a professional vector graphics tool, a fully-functional, lightweight, free and open-source alternative to Adobe Illustrator. Therefore, I learnt it to draw the two diagrams in the paper.

First, find the column/text width of your LaTeX document with \the\columnwidth in points (1 pt = 1/72.27 inch). Then set the Inkscape canvas size in File > Document Properties > Page > Custom Size. You can create a grid, either rectangular or isometric, in File > Document Properties > Grid. You can start drawing with Bezier Curve tool [B], organize the drawing by layers...etc

Adding Text

Text can be added in multiple ways using the Text tool [T]. You can simply add text, but that will get rendered with the image, resulting in any font/size you want. But I wanted the text to be rendered in LaTeX, such that it has uniform size and style, to match the rest of the document. For this, I type the text, with any required latex symbols as follows, and save a copy of the image as EPS+LaTeX. This creates an .eps file without text and a .tex file with text only. Make sure to keep the master copy of your drawing updated in the default SVG format (only save a copy in the EPS+TEX format) to avoid data loss.

(1) select the text, adjust the alignment. Inkscape has a bug here. You need to check few times that it is aligned correctly. (2) File > Save a Copy > EPS > Omit Text in PDF. This will create a pair of files.

Then add the tex file into your LaTeX document using either include or import:

Since this renders text seperately, the text might have moved slightly. Make nessasary adjustments, and recheck the text alignment to make it work.

The first figure (this one) took me over 10 days to make. To learn inkscape from scratch, try like five different text insertion methods and finally get it right. The next figure (following one) took me just a few hours.

Line, bar, pie charts - Python Ecosystem

If you prefer Matlab, skip this. I, on the other hand, love the python stack. Numpy is the best in manipulating like 6,7-dimensional arrays, and the object-oriented and intuitive nature of python libraries make them a treat to use. I also found this awesome package: SciencePlots [get], which works on top of other packages and makes IEEE & Science style plots with ease.

Science + Nature vs Science + IEEE styles

To build the spreadsheet, I populate a pandas dataframe with some regular python code. Yes, I had to spend a few days learning pandas for this, but it's a great investment. Then I use matplotlib to make plots and seaborn to make them prettier.

My code

Drawings in my paper

Reference Management - Mendeley

I had to read and summarize like hundred papers for a literature review. Keeping them organized is a nightmare with regular file manager. Which one do you prefer among the following?

(1) Regular file manager (2,3) Mendeley

Mendeley is available on Web, Windows, Mac and iOS. It has tons of features, including the following.

Add references as you browse the internet. Install the Mendeley extension and click it from arXiv, IEEE, ResearchGate, Springer, raw PDF...etc. It will fetch the data (author, journal, year, abstract, possibly pdf) and save it to its Web version. You can sync that to all your devices.
Group by labels, sort by author/year...etc.
Search within your references fast
Highlight and add comments using the built-in PDF tool. You may take notes as well. Everything will be synced between all your devices.
Easily export bibliography as LaTeX. Select a set of references, right-click, copy as BibTeX, then paste into your .bib file.

I summarize the papers into a Google / Online Excel Sheet, so my supervisor also can have a look.

Sync Handwriting between Tablet and PC

I prefer drawing on the papers and writing stuff on them with my Android tablet and pen. Mendeley does not work with Android and I'm not sure if its iOS version supports handwriting. Therefore, to sync documents, I use Resilio Sync. It can be installed in both PC [get] and tablet [get]. Folders from the PC can be connected to those in the tablet, such that any changes made on either are reflected in another one within minutes.

I use Xodo [get], an excellent PDF manager on my tablet to draw on the papers. These then reflect in the PC (and within Mendeley).

Pure LaTeX, but faster! - VSCode

"What's wrong with Overleaf?"
"Oh you're a simpleton happy with LyX!"

Nope, I write pure LaTeX. Overleaf is great, but I hate web-based tools. I prefer something local, more responsive (thanks to my shitty connection) and with proper version control. Since I wrote the entire journal paper alone, all 15 pages of it, and revised it like ten times, I had the freedom to develop my own workflow. When I finally passed it on to my supervisor for final revision, I synced it with Overleaf so he can work with it comfortably and his changes will be reflected in my local machine. To compile LaTeX locally, you need a compiler like MiKTeX: miktex.org.

VSCode + Latex Workshop

Maintained by Microsoft, yet free and open-source, VSCode [get] is the best editor/IDE out there for almost any language, hands-down. Unless you are a vim user, of course. By installing the right extensions, you can make it into a super-powerful, IDE of any language. Latex Workshop [get] is the extension we need. Coupled with VScode's native features, it provides a killer LaTeX experience. Check out some of its coolest features below. More features are listed here. The theme I use is One Dark Pro Monokai Darker [get].

Multiple tabs, symbols pane, preview equations on hover, ultra-fast intelligent autocomplete, easy-on-eyes dark theme, pretty colors (theme can be chosen independently), seamless git

Version Control - git

Git is a pain to a lot of people. It was the same to me, until mid-2020. When I was stranded in New Delhi due to COVID lockdowns amidst my solo backpacking, I started reading some books on git. Then I figured that it's what I have been missing all my life.

Git is a version control tool. Basically, a time machine that helps you to move to any point in the development, compare two points (diff), manage them collaboratively...etc. It guarantees that "commits" (aka snapshots) of your folder are immutable and non-deletable, giving you the confidence to clean up your code and try new things after each commit.

It has a horrible interface. The commands barely make sense. But underneath, it has a beautiful data model. Once I understood that deeply, I was able to recall or google commands at will and use git like a pro. Now I'm addicted to the safety it provides. Even for a personal project, I start tracking it with git. Note, a git is a local tool. You don't need an internet connection to use it.

An awesome MIT video that explains git beautifully. This video changed my life.

Version Control (Git)

the missing semester of your cs education

Corresponding article to the video

Okay, how do I use git for paper-writing? Well, simple. Once you understand git, start tracking all your TEX, EPS, SVG, BIB, python, ipynb files using git. You will feel confidence and power. When you add something and save (recompile) successfully, make a commit with an appropriate commit message (eg: "Modify table horz_conv, to add circles around y").

Overleaf Sync

Right, all good. But my supervisor is not familiar with git and Mendeley. What do I do? He is very familiar with overleaf though. So, I set up sync to overleaf via Github. I had to purchase a student account for $8/month.

First, create a GitHub repo (remote), connect overleaf to that empty repo, push your local repo to remote, then import changes into overleaf. When the supervisor changes something, I can push that from overleaf to GitHub and pull it into the local repo. Following is a tutorial

How do I connect an Overleaf project with a repo on GitHub, GitLab or BitBucket?

An online LaTeX editor that’s easy to use. No installation, real-time collaboration, version control, hundreds of LaTeX templates, and more.

Overleaf, Online LaTeX Editor

Conclusion

Yeah, I had to learn 80% of this over several weeks to set up my workflow. But I believe this is a worthy investment, as I can continue to write future papers with this easily.

If you think I am taking this too far, here's what I aspire to be. This mathematics freshman takes lecture notes with LaTeX using vim (yep) while drawing all mathematical drawings in real-time (as the prof draws on blackboard) on Inkscape.

How I’m able to take notes in mathematics lectures using LaTeX and Vim

A while back I answered a question on Quora: Can people actually keep up with note-taking in Mathematics lectures with LaTeX.There, I explained my workflow of taking lecture notes in LaTeX using Vim and how I draw figures in Inkscape.Howev…

Gilles Castel

Afghanistan: A Sad Story

Abarajithan Gnaneswaran — Sat, 07 Aug 2021 14:53:00 GMT

TLDR of some counterintuitive facts:

77% of Afghan people supported the US invasion.
The US invaded to avenge 9/11. There’s not enough oil there.
The US did war crimes, some serious development but failed to build a self-sustaining nation as they got distracted with Iraq in 2003.
Afghan people have been intercours'ed by the British, Soviets, local tribal leaders, Taliban and US, one after the other.
Taliban means "student" in Arabic. They were the fatherless refugee children orphaned in the Soviet war, raised in Wahhabist schools of Pakistan funded by Saudi Arabia.
Taliban committed at least 15 massacres and genocides targeting Shia and Hazara (Muslims) when they were in power. This is why people are desperate to flee.
Taliban assassinated the one honest leader (Massoud) who promoted peaceful democracy, as they couldn't bribe him.
Trump cut a secret deal with the Taliban to withdraw by May 1st.
Afghan state failed because military was “incompetent fools, corrupt to the patrol level”, 30% of police became bandits, public lost trust in govt.

Afghanistan is made of a lot of different ethnicities and tribes in rural areas scattered across mountains. British drew their borders arbitrarily. Their biggest ethnicity: Pashtun is split between Pakistan and Afghanistan. So people don’t respect that border and keep moving back and forth. This allowed the Taliban to regroup and train in Pakistan easily. Khyber pass, an extremely strategic path through Hindu Kush mountains through which, Indo-Aryans, Genghis Khan, Persians, Mughals and British invaded India, was later used by American troops and Taliban to enter Afghanistan.

Following British independence in 1919, the king tried to modernize the country by educating everyone (including women) and abolishing women's face veil. These liberal reforms led to a civil war with tribal leaders. After few more kings, in 1964 Afghanistan became more democratic, allowing multiple parties. One party (PDPA) took the power in a non-violent coup and started implementing communism, with the support of Soviet Russia.

That "democratic party" banned forced marriages and promoted the education & job security of women. They also tortured and killed local Muslim leaders and forced atheism. This and dependence on the Soviet Union angered the people. Islamist riots by Afghan Mujahedeen (armed tribal leaders) broke out. The Soviet Union invaded to quell the rebellion, killed millions and raped women. They systematically depopulated rural areas with landmines and even toy grenades targeted at children. In response, the US supplied anti-aircraft guns to Mujahedeen. After 9 years, in 1989, as USSR collapsed, the Soviets withdrew in defeat, much like the US does now.

History of Afghanistan - Wikipedia

Wikimedia Foundation, Inc.Contributors to Wikimedia projects

People cheered, considering Mujahedeen as liberators. They then became warlords, tore up their areas and started a civil war between themselves. Half of Kabul was reduced to rubble. UN tried to form a coalition govt of them in 1992 and failed. Fatherless children from the Soviet war grew up in refugee camps in Pakistan, and were indoctrinated into wahhabism by madarasas funded by Saudi Arabia. They called themselves Taliban, that is students. Pakistan opened the refugee camps, in 1994, they invaded Afghanistan. People welcomed them as liberators.

In 1996, the Taliban seized Kabul with the support of Saudi Arabia and Pakistan and brought the sharia rule. They forbade women from leaving homes and studying. They committed at least 15 systematic massacres and genocides targeting the Shia and Hazara, torturing and killing 4000 at a time. Ahmed Shah Massoud, a respected tribal leader, railed up enough opposition, set up democratic institutions and promoted women rights. Taliban tried to bribe him by giving the Prime Minister position, but he declined and asked for a democratic solution. He addressed the EU in 2001, stating that the Taliban and Al Qaeda had introduced "a very wrong perception of Islam". The Taliban assassinated him in 2001.

On 11/9/2001, Afghan-based Al-Qaeda attacked the world trade centre. Taliban gave Osama Bin Laden safe harbour and in return US-led forces invaded Afghanistan. There isn’t enough oil there though. In December 2001, the Taliban government was toppled. 77% of Afghan people supported the American invasion [source 1] [source 2].

NATO started building a government [source], training the Afghan army and police and implementing reforms. Women education rose from 0% to 60% (2016), free media and platforms for public debate were established. Infant mortality rates fell by half. In 2005, fewer than 1 in 4 Afghans had access to electricity. By 2019, nearly all did. The Afghan geography makes centralized control impossible. Most of the country being over 2000m in elevation makes the road networks non-existent. The US started rebuilding the Soviet-built circular highway with the help of the world bank, Saudi Arabia and Iran (wow), to unify the country. Then they got occupied with Iraq, Taliban strengthened themselves and started blowing up the highway and construction workers.

Afghan society is geared more towards family and tribe than having a “unified afghan” feeling. The popular people who got Afghan govt positions were corrupt tribal warlords. In the leaked secret documents, the US military officials call Afghan soldiers "incompetent fools and corrupt to the patrol level” and blame themselves “we moved too slowly initially when Taliban were defeated. When they rebounded, we trained too quickly” (and quality fell) [source]. The US built an Afghan military that dwarfs the military of even developed countries. The US spent 20 years and 2 trillion dollars there, which mostly went to corrupt Afghan officers. 30% of the new police escaped with their weapons to put up their private checkpoints, becoming bandits, robbing people. The rural public lost trust in the centralized govt, police and military.

Meanwhile, the Taliban regrouped in Pakistan and started attacking and controlling the strategic choke points. The US conducted war crimes, torture and killings of Taliban prisoners and suspects. The US killed 9 children in 2003 and 100 civilians (mostly children) in 2009. This turned the international perception and rural Afghans against the US occupation. Of civilian casualties, 40% was due to the US & Afghan govt and 60% was due to the Taliban.

After Osama was killed in 2011, the occupation became unpopular among the US public. They didn’t want their children being killed in an unrelated conflict. The US govt also did not have a clear objective after killing Osama. As a last attempt, Obama tried a "surge" of troop inflow, but the Taliban just surged their attacks. Obama started pulling out. Trump stroke a secret deal with the Taliban, without even consulting the Afghan govt they built, that the US would pull out by May 1st 2021. He reduced the 15,500 troops to 2,500. Biden says US succeeded in its objective: avenging 9/11 and making sure the region doesn't breed terrorism targeting mainland USA. Given Afghanistan is now a Taliban controlled terrorist state, US might have failed in that objective too.

An exit strategy wasn’t planned. When the Taliban attacked this year, with only 2500 US troops, Biden had to either send more resources and restart the war or stop the war by pulling out immediately, he decided on the latter [source]. The US military shut off the lights in their airbase and slipped disgracefully without informing the Afghan army. The president fled the country betraying his people. The incompetent and corrupt Afghan govt and army collapsed like a house of cards and most of them switched sides to the Taliban for bribes.

Now people are desperate to flee through the borders and Kabul airport, fearing the extremist rule, genocides and massacres of the Taliban. The US has a deal with the Taliban to continue the evacuation of allies and special visa Afghans. US & French ambassadors fled, but the UK ambassador stays back writing visas for the Afghan translators, helping them escape. China quickly joined hands with the Taliban and India is shocked to have a terrorist country in its footsteps. The future of Afghanistan and its people is as uncertain as ever before.

Disclaimer: I'm no expert. Feel free to mention any missing points in the comments, will add them.

Afghanistan: A Sad Story - Who is responsible?TLDR of some counterintuitive facts: -- 77% of Afghan people supported the US invasion.-- US invaded to avenge 9/11. There’s not enough oil...

Strange Loops: Ken Thompson and the Self-referencing C Compiler | ScienceBlogs

First posted on Facebook. Click to see comments & shares

How safe are our compilers?

Abarajithan Gnaneswaran — Thu, 05 Aug 2021 14:59:00 GMT

An important argument placed against e-voting machines is that there is absolutely no way to prevent large-scale fraud. Even if the firmware is open-source, how do you verify it's the same code programmed into the machine? Even if it was compiled in front of your eyes, how do u know the compiler itself is safe?

GCC, the most popular C/C++ compiler itself is actually written in C. Then how is it compiled? Using older versions of itself! This fact was used by Ken Thompson, the legend who designed and implemented UNIX, the OS on which Linux (hence all android phones, data centers, mars rovers) and macOS (an overpriced piece of shit ;-) ) are built upon.

When he wrote the 'login' program of UNIX (during early days), he put a backdoor for debugging purposes. That is, given his secret password, any UNIX machine would unlock. But anyone else who reads the source code of login would notice this and panic. So, he hid that backdoor in the compiler, such that if the login program is compiled, it would insert the backdoor into the binary (assembly), else it would compile other programs normally.

But, wouldn't people read compiler's source code? For that, he first built the backdoor in the compiler roughly as follows. If future compiler code is given to the compiler, it would generate a binary for a rigged compiler with the above back door. Else, it would compile other programs normally.

He then compiled the compiler into binary and then removed the backdoor from the compiler's source code. Now the backdoor is practically invisible. it only exists as 1s and 0s in the binary. Anyone who reads the source code of the compiler sees it's perfectly fine. But when they add features to future versions of compiler code & compile it with his rigged compiler, it generates a new, rigged compiler binary. When they compile the login source code with that rigged compiler in the future, it would place the back door into the login binary.

Anyway, it was temporary, he never got caught and he revealed it when he received the Turing Award. However, this is an interesting phenomenon, and any kind of malware could be injected like this since we usually apt-get / download .exe of the compiler, which was in fact compiled by the older version of the same compiler.

Btw, this is pretty common. Recently I heard from a podcast, that few years ago hackers infiltrated a company that makes network-security software. They injected their virus into their development toolchain. In the next release, that got into the software and was distributed to all their customers, who are big software companies and the US govt itself. It was found months later.

I’m currently reading “I am a Strange Loop” by Douglas Hofstadter. I’ll be posting a review of it after I finish it. A “strange loop” is Hofstadter’s term for a Gödel-esque self-referential cycle. A strange loop doesn’t have to involve Gödel style problems - any self-referential cycle is a strange l…

Homegoodmath

First posted on Facebook:

How safe are our software? - Compiler vulnerabilitiesAn important argument placed against e-voting machines is that there is absolutely no way to prevent large-scale fraud. Even if the firmware...