<![CDATA[Aba's Blog]]>https://aba-blog.xyz/https://aba-blog.xyz/favicon.pngAba's Bloghttps://aba-blog.xyz/Ghost 4.32Mon, 11 Apr 2022 07:50:11 GMT60<![CDATA[#GoHomeRajapaksas​ - An Unprecedented Revolution]]>https://aba-blog.xyz/gohomerajapaksas/6253d8dc3ad45a0b6a47e8b4Mon, 11 Apr 2022 07:46:21 GMT

Over two years ago, most of us (including myself) rooted for #WeWantGota. A fair percentage of us were legit racists. Most of us were "I'm not a racist, but". Many believed democracy is inefficient & Sri Lanka needs a dictator to develop. Some of my dearest friends justified the killing of Lasantha as “necessary”. Many were ready to trade democracy & freedom for the illusion of security & prosperity. Some believed the Kelaniya snakes story. Then some of us actually hoped that Gota, who cleaned Colombo, was the rational choice compared to Sajith, who was blabbering about a national Artificial Intelligence.

But all of us, except the racists, were deceived. The Rajapaksas cleverly used Gota as a Trojan horse, a foot in the door, to win a landslide victory and a two-thirds majority in parliament. Mahinda was made PM; they kept playing good cop, bad cop: "Gota is doing good, but Mahinda is ruining things"; the constitution was amended; Basil, who ran a petrol shed and was not elected by the people, was made the Finance Minister, on their logic of "Wada karanna Gota-Mahinda innawa, Wada karawanna Basil enna one" ("Gota and Mahinda are there to do the work; Basil must come to get the work done").

Sunil Rathnayake, who was unanimously sentenced to death for massacring 8 civilians in cold blood, was given a presidential pardon. Nationalists hailed that as "democratic justice". Duminda De Silva and corrupt politicians were similarly released, about which the same nationalists stayed silent. The investigation report on the Easter attacks is being withheld from the public. The cardinal keeps making cryptic allegations.

Apart from the corruption and looting that spanned decades, rotting our country to the core, and apart from the subtle racism by SLPP and some thug monks that divided and pitted us against each other, today their core values have driven Sri Lanka to the deathbed.

Their corruption starts with the habit of valuing blind loyalty over competence. Many of us were happy with this. “Yep, loyalty is the most important virtue. Who can you trust more than your family?”. With 2/3rd in the parliament, they gave the Health, Foreign, and Finance Ministries to absolute idiots, who kept fooling the public by throwing pots into rivers. The fertilizer ban wiped out the agricultural output (family of farmers, can confirm). A sinking ship was pulled into shallow waters, ruining the fishing industry.​

They surrounded themselves with yes-men “advisory committees”. The CBSL governor apparently believed in Modern Monetary Theory and kept printing money, which led to over 50% food inflation today. They kept the rupee pegged, which pushed most of the foreign workers' remittances into the black market and led IT employees to keep their dollars outside the country. In addition to bleeding billions to support the pegged rupee, the govt lost its dollar income and drained the existing dollars fast, tanking the economy.

And here we are. An unprecedented revolution in Sri Lankan history, without any political backing, even chasing out those thug monks. Politicians are unable to comprehend this. They first tried the usual racist cards: “Arab spring” and “extremists”. A bus was mysteriously burnt, just like the unsolved Easter attacks. They abducted the Facebook group admin. They sent armed, unidentified STF to mess with protests. They tried diluting the movement with “225ma epa” ("we don't want any of the 225"). Yohani made a fake fundraiser.

Most of the 69 lakhs have now changed their ‘position’. A lot of bayyo recently got enlightened. Hardcore racists are keeping silent or have gone underground to talk about how #GoHomeGota insults the triple gem. The rest of the washing machines formed the “jana ganga” which we quite enjoyed.​

They keep trying every trick in their book and they keep failing. Their only objective is to do anything to quell the protests in the short term, so they can stay in power and keep looting. A new government cannot independently negotiate with the IMF with the Rajapaksas at the table scheming how to loot from that too. A new govt cannot implement unpopular but necessary IMF reforms with the Rajapaksas in the opposition, wreaking havoc with publicity stunts like riding bicycles when the petrol price was raised by 3 rs.

We should abolish the Executive Presidency, to make sure we don't get fooled again by another "wada karana wiruwa" ("hero who gets work done"). The Rajapaksas need to be sent home at any cost. It is now or never. We should keep up the pressure.

#GoHomeGota​
#GoHomeRajapaksas​
#GoHomeGota2022
#GoToJailGota
#GoToJailRajapaksas

### First posted on Facebook:

]]>
<![CDATA[CocoTB: FPGA/ASIC Testbenches in Python + Automated Testing in GitHub​]]>https://aba-blog.xyz/cocotb/624b01143ad45a0b6a47e7eeMon, 04 Apr 2022 15:52:02 GMT

Verilog and VHDL are quite good for designing digital circuits. But it is often painful to write Verilog testbenches to simulate them. If you disagree, try writing a neural network or a multi-dimensional Fourier Transform in SystemVerilog.​

I recently came across cocotb, a popular Python-based alternative to SystemVerilog testbenches. It doesn’t do the simulation itself (a good thing), but interfaces with widely used simulators such as Icarus Verilog, Verilator, Synopsys VCS, Cadence Incisive, and more. You just write your clock-cycle-accurate testbenches in Python, specify a simulator and run. You can dump the waveforms and view them in GTKWave.

This means you can use NumPy, SciPy, TensorFlow, PyTorch… the entire Python arsenal to build your golden models, generate test vectors and compare them with the outputs. PDB, a powerful Python debugger, is also at your disposal.

Wait, it gets better. cocotb-test, another library, uses pytest to test your designs across several sets of parameters automatically. This whole thing can then be automated into regression tests in a CI/CD pipeline. Whenever a commit is pushed into the GitHub repository or a branch is merged into master, a set of integration testbenches runs over several parameters to ensure your design stays clean.

For more details, check out this talk at CERN & UPenn. They use cocotb to verify the chips that go into the Large Hadron Collider.​

In the following repository, I have built two designs (a simple register and an AXI-Stream FIFO), written parameterized testbenches for them, used pytest to run those over multiple parameters, and set up GitHub Actions to run everything on each push to the repo. Feel free to check them out.

In a nutshell, cocotb brings the joy of the software world into the dark and desolate world of hardware verification.

### First posted on Facebook:

]]>
<![CDATA[Upper Gartmore Camping [2022]]]>https://aba-blog.xyz/upper-gartmore/621f9bcf4992a4018daad484Sun, 27 Feb 2022 02:36:00 GMT

Gartmore & Frogmore are gorgeous tea estates near the Maskeliya Reservoir, next to Hatton. Just above them, on top of a mountain range between Ratnapura and Hatton, there is a relatively flat, wide plain of thick bushes, with a stream flowing through them, to fall over a cliff as the Upper Gartmore falls.

I took an 8-hour bus from Jaffna to Kandy, stayed at a backpacker's hostel there, and met my school friends for dinner. I slept for 2 hours, took the bus to Hatton at 3.30 am, and reached there at 5.30 am. We took a van and reached Gartmore estate around 11 am.

## Hike through Frogmore Estate

Unfortunately, the supervisor of the tea estate does not allow tourists to reach the top of the mountain by vehicle. We started hiking through the forest, lost track and ended up in a Hindu kovil as it started raining heavily. When the rain subsided, we asked a local to guide us. He took us on a strenuous 6-hour hike through thick jungle, at times climbing near vertically. We walked through thick, leech-infested undergrowth, with a barely visible, perilously steep drop on one side, to finally reach the flat camping grounds.

## Camping Grounds

After reaching the gorgeous camping grounds at around 4 pm, we set up the camp, collected firewood, and washed off our legs oozing with blood. The sunset was amazing:

## Camping

We had bought a lot of food: noodles, sausages, coffee, chips, biscuits, Red Bull... Dinner was divine. A few of us set up our DSLR cameras to shoot the Milky Way and the starry sky.

## Upper Gartmore Falls

The next morning, we walked through the stream to the Upper Gartmore falls, where we got a scenic view of Adam's Peak (Sripada) and the Maskeliya Reservoir.

When climbing down, we were fortunate to get a three-wheeler to a point after walking for an hour. We reached Hatton and then Colombo.

]]>
<![CDATA[Neural Chip Design [4/4: SoC Integration & Firmware]]]>https://aba-blog.xyz/dnn-to-chip-4/61f5189b33068f34ce882ef5Sat, 29 Jan 2022 10:37:39 GMTThis is a series of articles [overview] outlining the  workflow of 15 steps, which I developed over the past few years through building my own DNN accelerator: Kraken [arXiv paper].

After building and testing each module and combining them hierarchically, it is time to build an SoC around the design and control it. I used a Xilinx Zynq ZC706 development board with a Z-7045 chip, which has an ARM Cortex processor and a Kintex FPGA on the same silicon die.

The following is an overview of the design. Gray-colored modules are Xilinx IPs. Two soft DMAs pull the input $$\hat{X}$$ and the weights $$\hat{K}$$ from the off-chip DDR and feed them as two AXI4-Streams, which are then synchronized by the input pipe and provided to the Kraken Engine. The output $$\hat{Y}$$ is stored back into the DDR through another soft DMA. The three soft DMAs are controlled by commands issued by the ARM Cortex core, as dictated by the firmware, which I developed afterwards.

11. SoC Block Design: Build FPGA projects with Vivado manually and synthesize
12. Automation: TCL scripts to automate the project building and configuration
13. C++ Firmware: To control the custom modules
14. Hardware Verification: Test on FPGA, compare output to golden model
15. Repeat 11-14

## 11. SoC Block Design

I add my custom modules to a Vivado block design, add soft DMAs from the IP catalog, configure them, connect them to my main module, and run block & connection automation, copying down the TCL commands at every step. Below is the final block design, which I first build manually and then automate with TCL scripts.

## 12. TCL Automation

Xilinx Vivado projects are notoriously buggy. They crash once in a while and get corrupted. Vivado also auto-generates hundreds of small files, which contain absolute paths, and don't play well with a different Vivado version. Therefore, it is a bad idea to version control them.

The best practice is to script the project flow. Once I manually copy down the TCL commands, I change them into parameterized code.

I then spend a couple of days debugging the TCL script to ensure it can reliably rebuild a project from scratch. These TCL scripts and the source Verilog files are tracked by git.

## 13. C++ Firmware

I then write the C++ code to be run on the ARM processor, which instructs the DMA to pull data from memory and push it back. When multiple DMAs are involved, this is fairly tricky. Right after starting a DMA operation, the parameters for the next DMA iteration must be calculated in advance, to prevent stalling the DMA.
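
The overlap described above can be sketched in a few lines of Python. This is only an illustration of the scheduling pattern, not the actual firmware: `fake_dma_transfer`, `compute_params` and `run_transfers` are hypothetical names, and a single worker thread stands in for the DMA engine (real firmware would start the DMA through its driver and wait on an interrupt or status flag).

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_dma_transfer(params):
    """Hypothetical stand-in for one DMA transfer; returns the bytes moved."""
    time.sleep(0.001)  # pretend the hardware is busy moving data
    return params["length"]


def compute_params(i):
    """Hypothetical address/length calculation for transfer number i."""
    return {"address": 0x1000 + 64 * i, "length": 64}


def run_transfers(n):
    moved = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        next_params = compute_params(0)
        for i in range(n):
            # Start transfer i; the call returns immediately.
            in_flight = dma.submit(fake_dma_transfer, next_params)
            # While transfer i is in flight, prepare transfer i+1.
            if i + 1 < n:
                next_params = compute_params(i + 1)
            # Only now wait for transfer i to complete.
            moved.append(in_flight.result())
    return moved


print(run_transfers(4))
```

The point of the ordering is that the parameter calculation never sits between the end of one transfer and the start of the next.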

### 13.1. OOP Wrappers for DMA Drivers

I find the C code provided by Xilinx a bit counterintuitive. Therefore, I have written an OOP wrapper for the Xilinx DMA, which is open-sourced here:

### 13.2. OOP Architecture for DNN models & config bits in C++

The firmware needs to be flexible, such that I can create any DNN by chaining layer objects. For this, I write the layer class, with necessary features like extracting configuration bits and appending to data.

class Layer
{
public:
int idx, H_IN, W_IN, C_IN, C_OUT, KH_IN, KW_IN;
bool IS_NOT_MAX, IS_MAX, IS_LRELU;

Layer * PREV_P = nullptr;
Layer * NEXT_P = nullptr;

int BLOCKS, BLOCKS_PER_ARR, KW_PAD; // KW_PAD is assigned in the constructor
u8 MAX_FACTOR, SUB_CORES, EFF_CORES, ITR, COUT_FPGA, COUT_VALID, COUT_INVALID;

int OUT_W_IN, OUT_BLOCKS, OUT_MAX_FACTOR, OUT_BLOCKS_PER_ARR, OUT_KH;

int DATA_BEATS_PIXELS;
int BEATS_LRELU = 0;
int WORDS_PIXELS_PER_ARR;
int WORDS_WEIGHTS_PER_ITR, WORDS_WEIGHTS;

int WORDS_OUT_PER_TRANSFER, TRANSFERS_OUT_PER_ITR;
int WORDS_OUT_PER_TRANSFER_ARR [3];

chunk_s * input_chunk_p  = nullptr;
chunk_s * output_chunk_p = nullptr;
bool done_write = false;

Layer ( int idx,
int H_IN, int W_IN, int C_IN, int C_OUT,
int KH_IN, int KW_IN,
bool IS_NOT_MAX, bool IS_MAX, bool IS_LRELU):
idx    (idx),
H_IN   (H_IN),
W_IN   (W_IN),
C_IN   (C_IN),
C_OUT  (C_OUT),
KH_IN   (KH_IN),
KW_IN   (KW_IN),
IS_NOT_MAX(IS_NOT_MAX),
IS_MAX    (IS_MAX),
IS_LRELU  (IS_LRELU)
{
BLOCKS     = H_IN / UNITS;
MAX_FACTOR = IS_MAX ? 2 : 1;
BLOCKS_PER_ARR = BLOCKS / MAX_FACTOR;

KW_PAD = KW_IN - 2*IS_MAX;

SUB_CORES = MEMBERS / KW_IN;
EFF_CORES = COPIES * GROUPS * SUB_CORES / MAX_FACTOR;
ITR       = (int)(std::ceil((float)C_OUT / (float)EFF_CORES));
COUT_FPGA = EFF_CORES * ITR;

COUT_VALID = C_OUT % EFF_CORES;
COUT_VALID = (COUT_VALID == 0) ? EFF_CORES : COUT_VALID;

COUT_INVALID = EFF_CORES - COUT_VALID;

/* LRELU BEATS */

BEATS_LRELU += 1; //D
BEATS_LRELU += ceil(2.0/KW_IN); // A

for (int clr_i=0;  clr_i < KW_IN/2+1; clr_i ++){
int clr = clr_i*2 +1;
for (int mtb=0;  mtb < clr; mtb ++){
int bram_width = MEMBERS/clr;
int bram_size  = 2*SUB_CORES;
int BEATS_ij = ceil((float)bram_size/bram_width);

BEATS_LRELU += BEATS_ij;
}
}

DATA_BEATS_PIXELS = BLOCKS_PER_ARR * W_IN * C_IN;

WORDS_PIXELS_PER_ARR  =      DATA_BEATS_PIXELS  * UNITS_EDGES;
WORDS_WEIGHTS_PER_ITR = (S_WEIGHTS_WIDTH/8) + (BEATS_LRELU + C_IN*KH_IN) * COPIES * GROUPS * MEMBERS;
WORDS_WEIGHTS         = ITR * WORDS_WEIGHTS_PER_ITR;

if (IS_NOT_MAX && IS_MAX)
{
WORDS_OUT_PER_TRANSFER_ARR[0] = SUB_CORES * COPIES * GROUPS * UNITS_EDGES;
WORDS_OUT_PER_TRANSFER_ARR[1] =             COPIES * GROUPS * UNITS_EDGES;
WORDS_OUT_PER_TRANSFER_ARR[2] =             COPIES * GROUPS * UNITS_EDGES / MAX_FACTOR;

TRANSFERS_OUT_PER_ITR = BLOCKS/MAX_FACTOR * W_IN/MAX_FACTOR * (1 + 2 * SUB_CORES);
}
else
{
WORDS_OUT_PER_TRANSFER = SUB_CORES * COPIES * GROUPS * UNITS_EDGES / MAX_FACTOR;
TRANSFERS_OUT_PER_ITR  = BLOCKS/MAX_FACTOR * W_IN/MAX_FACTOR;
}
};

void set_config()
{
input_chunk_p->data_p[0] = (s8)(IS_NOT_MAX);
input_chunk_p->data_p[1] = (s8)(IS_MAX);
input_chunk_p->data_p[2] = (s8)(IS_LRELU);
input_chunk_p->data_p[3] = (s8)(KH_IN/2);

#ifdef DEBUG
for (int i=4; i<UNITS_EDGES; i++) input_chunk_p->data_p[i] = 0;
#endif
Xil_DCacheFlushRange((UINTPTR)input_chunk_p->data_p, UNITS_EDGES);
};

void set_out_params()
{
/* Next layer can be null (if this is last) or can have multiple next layers.
* We are interested in how to arrange the output values of this, to match the next
*/

OUT_W_IN   = W_IN / MAX_FACTOR;
OUT_BLOCKS = (H_IN / MAX_FACTOR) / UNITS;

OUT_MAX_FACTOR     = (NEXT_P == nullptr) ? 1 : NEXT_P->MAX_FACTOR;
OUT_BLOCKS_PER_ARR = OUT_BLOCKS/OUT_MAX_FACTOR;

OUT_KH = (NEXT_P == nullptr) ? KH_IN : NEXT_P->KH_IN;
}

inline s8* get_input_pixels_base_p()
{
return (s8*)(input_chunk_p->data_p) + UNITS_EDGES;
}
inline s8* get_output_pixels_base_p()
{
return (s8*)(output_chunk_p->data_p) + UNITS_EDGES;
}
};
auto build_yolo_mod()
{
std::array<Layer,21> layers = {
Layer(1,	H_RGB   ,W_RGB  ,    3,  32,   3,   3,false, true, true),
Layer(2,	H_RGB/2 ,W_RGB/2,   32,  64,   3,   3,false, true, true),
Layer(3,	H_RGB/4 ,W_RGB/4,   64, 128,   3,   3,true, false, true),
Layer(4,	H_RGB/4 ,W_RGB/4,  128,  64,   1,   1,true, false, true),
Layer(5,	H_RGB/4 ,W_RGB/4,   64, 128,   3,   3,false, true, true),
Layer(6,	H_RGB/8 ,W_RGB/8,  128, 256,   3,   3,true, false, true),
Layer(7,	H_RGB/8 ,W_RGB/8,  256, 128,   1,   1,true, false, true),
Layer(8,	H_RGB/8 ,W_RGB/8,  128, 256,   3,   3,false, true, true),
Layer(9,	H_RGB/16,W_RGB/16, 256, 512,   3,   3,true, false, true),
Layer(10,	H_RGB/16,W_RGB/16, 512, 256,   1,   1,true, false, true),
Layer(11,	H_RGB/16,W_RGB/16, 256, 512,   3,   3,true, false, true),
Layer(12,	H_RGB/16,W_RGB/16, 512, 256,   1,   1,true, false, true),
Layer(13,	H_RGB/16,W_RGB/16, 256, 512,   3,   3,false, true, true),
Layer(14,	H_RGB/32,W_RGB/32, 512,1024,   3,   3,true, false, true),
Layer(15,	H_RGB/32,W_RGB/32,1024, 512,   1,   1,true, false, true),
Layer(16,	H_RGB/32,W_RGB/32, 512,1024,   3,   3,true, false, true),
Layer(17,	H_RGB/32,W_RGB/32,  64, 128,1024, 512,true, false, true),
Layer(18,	H_RGB/32,W_RGB/32,  64, 128, 512,1024,true, false, true),
Layer(19,	H_RGB/32,W_RGB/32,1024,1024,   3,   3,true, false, true),
Layer(20,	H_RGB/32,W_RGB/32,1024,1024,   3,   3,true, false, true),
Layer(21,	H_RGB/32,W_RGB/32,1024,  45,   1,   1,true, false, false)
};

for (int i=0; i < N_LAYERS; i++)
{
if (i!=0         ) layers[i].PREV_P = &layers[i-1];
if (i!=N_LAYERS-1) layers[i].NEXT_P = &layers[i+1];
layers[i].set_out_params();
}
return layers;
}
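
To make the core-utilization arithmetic in the constructor easier to follow, here is a small Python port of just those lines. The values of `MEMBERS`, `COPIES` and `GROUPS` below are assumptions chosen for illustration (the real ones come from the hardware parameter header), and `core_utilization` is a hypothetical helper name.

```python
import math

# Assumed hardware parameters, for illustration only.
MEMBERS, COPIES, GROUPS = 12, 2, 2


def core_utilization(c_out, kw, is_max):
    """Mirror the SUB_CORES / EFF_CORES / ITR / COUT_* lines of the Layer constructor."""
    max_factor = 2 if is_max else 1
    sub_cores = MEMBERS // kw                      # kernels that fit side by side
    eff_cores = COPIES * GROUPS * sub_cores // max_factor
    itr = math.ceil(c_out / eff_cores)             # passes needed to cover C_OUT
    cout_valid = c_out % eff_cores or eff_cores    # valid channels in the partial pass
    return {
        "SUB_CORES": sub_cores,
        "EFF_CORES": eff_cores,
        "ITR": itr,
        "COUT_FPGA": eff_cores * itr,
        "COUT_VALID": cout_valid,
        "COUT_INVALID": eff_cores - cout_valid,
    }


# A 3x3 layer with max-pooling and 32 output channels.
print(core_utilization(c_out=32, kw=3, is_max=True))
# A 1x1 layer with 45 output channels and no max-pooling.
print(core_utilization(c_out=45, kw=1, is_max=False))
```

With these assumed parameters, the 1x1 layer with 45 output channels fits in a single pass over 48 effective cores, leaving 3 of them computing invalid (padding) channels.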

### 13.3. C++ Code to control multiple DMAs effectively

Next, I write C++ functions to reshape the output $$\hat{Y}$$ on the fly (after each small DMA packet) to generate the next layer's input $$\hat{X}$$. Also, configuration bits need to be calculated and appended to the packet to make it complete.

void restart_output()
{
static int i_w=0, i_w_flipped=0, i_blocks=0, i_bpa=0, i_arr=0, i_cout=0, i_itr=0, i_layers=i_layers_start;
static volatile s8 * write_p = layers[i_layers].get_output_pixels_base_p();
static bool is_new_layer=true;

static volatile s8 * write_p_old = 0;
Xil_DCacheFlushRange((UINTPTR)write_p_old, UNITS_EDGES);

if ((i_itr == 0 && i_blocks == 31) || (i_itr == 1 && i_blocks == 0)){
for (int i=0; i<UNITS_EDGES; i++){
PRINT(" %d,", write_p_old[i]);
}
PRINT("] \r\n");
PRINT("(%d,%d,%d,%d-%d,:) -- %p [", i_arr, i_bpa, i_w_flipped,i_itr,i_cout, write_p);
}
write_p_old = write_p;

// start transfer
dma_weights_im_out.s2mm_start(	(UINTPTR)write_p,
layers[i_layers].WORDS_OUT_PER_TRANSFER);

// set config
if (is_new_layer && i_layers != N_LAYERS-1)
{
layers[i_layers].NEXT_P->set_config();
layers[i_layers].NEXT_P->done_write = false;
is_new_layer = false;
}

// PREPARE NEXT INDICES
// blocks = 31 (a=1,bpa=15), w_f = 191 (w = 190), itr = 0
if (i_w < layers[i_layers].OUT_W_IN-1)
{
i_w += 1;
// Flip last KW-1 columns : flipped = 2w-(kw+iw)
// For max: kw <- kw-2
if (i_w > layers[i_layers].OUT_W_IN - layers[i_layers].KW_PAD)
i_w_flipped = 2 * layers[i_layers].OUT_W_IN - (i_w + layers[i_layers].KW_PAD);
else
i_w_flipped = i_w;
}
else
{
i_w = 0;
i_w_flipped = 0;

PRINT(" i_blocks: %d, write_p: %p \r\n", i_blocks, write_p);

if (i_blocks < layers[i_layers].OUT_BLOCKS-1)
{
i_blocks  += 1;
i_arr      = i_blocks % layers[i_layers].OUT_MAX_FACTOR;
i_bpa      = i_blocks / layers[i_layers].OUT_MAX_FACTOR;
}
else
{
i_blocks   = 0;
i_arr      = 0;
i_bpa      = 0;

PRINT(" i_itr: %d \r\n", i_itr);

if (i_itr >= layers[i_layers].ITR-1)
{
is_new_layer = true;
i_itr = 0;
i_cout= 0;

if (i_layers < N_LAYERS-1)
i_layers += 1;
else
{
i_layers = 0;
done = true;
PRINT("All Layers done \r\n");
}

/* Chaining*/
if (i_layers == N_LAYERS-1)
{
layers[0].input_chunk_p = &temp_in_chunk;
layers[i_layers].output_chunk_p = &temp_out_chunk;
}
else
{
layers[i_layers].output_chunk_p = get_chunk();
layers[i_layers].NEXT_P->input_chunk_p = layers[i_layers].output_chunk_p;
}
PRINT("Writing to new layer: chained_chunks (idx:%d -> idx:%d), data_p= %p \r\n",
layers[i_layers].idx, layers[i_layers].NEXT_P->idx,
layers[i_layers].output_chunk_p->data_p);

layers[i_layers].print_output_params();
}
else if (i_itr == 0)
{
i_itr += 1;
i_cout = layers[i_layers].COUT_VALID;
}
else
{
i_itr  += 1;
i_cout += layers[i_layers].EFF_CORES;
}
}
}
// blocks = 31 (a=1,bpa=15), w_f = 191, itr = 0
write_p = unravel_image_abwcu(layers[i_layers].get_output_pixels_base_p(),
i_arr,i_bpa,i_w_flipped,i_cout,0, i_layers);
}
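
The column-flipping rule in the listing above (`flipped = 2w-(kw+iw)`) is easier to see on a small example. The sketch below reproduces only that index calculation in Python; `flipped_columns` is a hypothetical helper, and `kw_pad` plays the role of `KW_PAD` from the layer object.

```python
def flipped_columns(out_w, kw_pad):
    """Return the write column for each output column index i_w."""
    cols = []
    for i_w in range(out_w):
        if i_w > out_w - kw_pad:
            # Last kw_pad-1 columns are written in reversed order.
            cols.append(2 * out_w - (i_w + kw_pad))
        else:
            cols.append(i_w)
    return cols


print(flipped_columns(out_w=8, kw_pad=3))
print(flipped_columns(out_w=8, kw_pad=1))
```

For a width of 8 and a padded kernel width of 3, only the last two columns swap places; with a kernel width of 1 nothing is flipped.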

## 15. Repeat 11-14

I spent weeks or months repeating 11-14 to finally get the hardware outputs to match the golden model, and hence the original DNNs. I once spent a month chasing a bug where the system worked perfectly in randomized simulations but produced wrong values for just 6 bytes out of 4 million. In the end, it turned out to be a bug in Vivado's compiler.

]]>
<![CDATA[Neural Chip Design [3/4: RTL Design & Verification]]]>https://aba-blog.xyz/dnn-to-chip-3/61f5176733068f34ce882ed8Sat, 29 Jan 2022 10:36:04 GMTThis is a series of articles [overview] outlining the  workflow of 15 steps, which I developed over the past few years through building my own DNN accelerator: Kraken [arXiv paper].

After building golden models and understanding the operations, it's time to design and implement digital circuits that can accelerate those operations with:

• low on-chip area
• high fmax (hence short critical paths)
• minimal multiplexers, registers, and SRAM usage

For this, I first design my modules in detail on a whiteboard. I spend days or weeks doing this: optimizing designs and mapping out state machines for each module. Once I'm satisfied, I sit down with VSCode and start writing synthesizable RTL. Once that is done, I generate test vectors from the golden model, write testbenches to read and compare them, and start debugging.

## Steps:

3. Whiteboard: Design hardware
4. RTL Design: SystemVerilog/Verilog for the whiteboard designs
5. Generate Test Vectors: using Python Notebooks
6. Testbenches: SystemVerilog OOP testbenches to read the input vector (txt file), randomly control the valid & ready signals and get output vectors (txt files)
7. Debug: Python notebooks to compare the expected output with simulation output and to find which dimensions have errors.
8. Microsoft Excel: I manually simulate the values in wires with Excel to debug
9. Repeat 3-8: For every module & every level of integration
10. ASIC Synthesis

## 3. Whiteboard

I almost always design my modules fully on a whiteboard before sitting down to write RTL. This helps to map out almost every register and multiplexer, get an idea of the critical paths, and also to reduce bugs.

## 4. RTL Design

### SystemVerilog / Verilog

I then start writing modules, state machines, etc. in synthesizable SystemVerilog, converting my whiteboard drawings into code. This is fairly straightforward. Below is a stripped-down example from my conv engine. The key things to note are:

• SystemVerilog - Verilog lacks a lot of features and has the potential to cause serious bugs. SystemVerilog is beautiful and a breeze to write and read.
• Multidimensional wires and ports - I use them a lot, to group ports meaningfully and connect with each other. I prefer packed over unpacked, so multidimensional SystemVerilog ports can be seamlessly connected to Verilog wrappers with flat ports, without having to manually flatten them.
• Readability - I take this seriously. Order does not matter in HDL, but I write in a way that the code top to bottom corresponds to left to right in my whiteboard diagram (the way signal flows).
• Macro Parameters - I put all the parameters, derived parameters in a common file and include it in all modules. That file itself is written through a tcl script. This way, the parameters of all files are guaranteed to be the same, avoiding bugs and also making the code readable.
• No always@clk - This might be a surprise. In my entire synthesizable codebase of 9000 lines, I have only one sequential always block: in a parametrized module named register.v. The register module has an optional clken, different types of reset, etc. All other modules instantiate it whenever needed. This helps me avoid bugs and visualize the signal flow, as it translates directly from my whiteboard to code.
• FPGA & ASIC - Using preprocessor directives (ifdef), I write code to suit both FPGA and ASIC. Registers have async reset in ASIC mode and sync reset in FPGA mode.
`timescale 1ns/1ps
`include "../include/params.v"

module conv_engine #(ZERO=0) (
clk            ,
clken          ,
resetn         ,
s_valid        ,
s_last         ,
s_user         ,
s_data_pixels  ,
s_data_weights ,
m_valid        ,
m_data         ,
m_last         ,
m_user
);
input  logic clk, clken, resetn;
input  logic s_valid, s_last;
output logic m_valid, m_last;
input  logic [TUSER_WIDTH_CONV_IN-1:0] s_user;
input  logic [COPIES-1:0][UNITS -1:0]                          [WORD_WIDTH_IN    -1:0] s_data_pixels;
input  logic [COPIES-1:0][GROUPS-1:0][MEMBERS-1:0]            [WORD_WIDTH_IN    -1:0] s_data_weights;
output logic [COPIES-1:0][GROUPS-1:0][MEMBERS-1:0][UNITS-1:0][WORD_WIDTH_OUT   -1:0] m_data;
output logic [TUSER_WIDTH_CONV_OUT-1:0] m_user;

// Code omitted
logic [KW_MAX/2:0][SW_MAX -1:0][MEMBERS -1:0] lut_sum_start;
logic [COPIES-1:0][GROUPS-1:0][MEMBERS-1:0][UNITS-1:0][WORD_WIDTH_IN*2-1:0] mul_m_data ;
logic [COPIES-1:0][GROUPS-1:0][MEMBERS-1:0][UNITS-1:0][WORD_WIDTH_OUT -1:0] acc_s_data ;
logic [COPIES-1:0][GROUPS-1:0][MEMBERS-1:0][UNITS-1:0][WORD_WIDTH_OUT -1:0] mux_s2_data;

generate
genvar c,g,u,m,b,kw2,sw_1;
// Code omitted
for (c=0; c < COPIES; c++)
for (g=0; g < GROUPS; g++)
for (u=0; u < UNITS; u++)
for (m=0; m < MEMBERS; m++)
if (m==0) assign mux_s2_data [c][g][m][u] = 0;
else      assign mux_s2_data [c][g][m][u] = m_data     [c][g][m-1][u];
assign mux_sel_next = mul_m_valid && mul_m_user[I_IS_CIN_LAST] && (mul_m_kw2 != 0);

register #(
.WORD_WIDTH     (1),
.RESET_VALUE    (0)
) MUX_SEL (
.clock          (clk   ),
.resetn         (resetn),
.clock_enable   (clken ),
.data_in        (mux_sel_next),
.data_out       (mux_sel )
);
assign clken_mul = clken && !mux_sel;

for (m=0; m < MEMBERS; m++) begin: Mb
for (kw2=0; kw2 <= KW_MAX/2; kw2++)
for (sw_1=0; sw_1 < SW_MAX; sw_1++) begin
localparam k = kw2*2 + 1;
localparam s = sw_1 + 1;
localparam j = k + s -1;

assign lut_sum_start[kw2][sw_1][m] = m % j < s; // m % 3 < 1 : 0,1
end
assign acc_m_sum_start [m] = lut_sum_start[acc_m_kw2][acc_m_sw_1][m] & acc_m_user[I_IS_SUM_START];
// Code omitted
end

for (c=0; c < COPIES; c++) begin: Ca
for (g=0; g < GROUPS; g++) begin: Ga
for (u=0; u < UNITS; u++) begin: Ua
for (m=0; m < MEMBERS; m++) begin: Ma
processing_element PROCESSING_ELEMENT (
.clk           (clk           ),
.clken         (clken         ),
.resetn        (resetn        ),
.clken_mul     (clken_mul     ),
.s_data_pixels (s_data_pixels [c]      [u]),
.s_data_weights(s_data_weights[c][g][m]   ),
.mul_m_data    (mul_m_data    [c][g][m][u]),
.mux_sel       (mux_sel       ),
.mux_s2_data   (mux_s2_data   [c][g][m][u]),
.bypass        (bypass        [m]),
.clken_acc     (clken_acc     [m]),
.acc_s_data    (acc_s_data    [c][g][m][u]),
.m_data        (m_data        [c][g][m][u])
);
end end end end

// Code omitted
assign m_user_base[I_IS_BOTTOM_BLOCK:I_IS_NOT_MAX] = acc_m_user[I_IS_BOTTOM_BLOCK:I_IS_NOT_MAX];
assign m_user  = {m_clr, m_shift_b, m_shift_a, m_user_base};
endgenerate
endmodule
module processing_element (
clk    ,
clken  ,
resetn ,

clken_mul,
s_data_pixels,
s_data_weights,
mul_m_data,

mux_sel,
mux_s2_data,
bypass,
clken_acc,
acc_s_data,
m_data
);
input  logic clk, clken, resetn;
input  logic clken_mul, mux_sel, bypass, clken_acc;
input  logic [WORD_WIDTH_IN  -1:0] s_data_pixels, s_data_weights;
input  logic [WORD_WIDTH_OUT -1:0] mux_s2_data;
output logic [WORD_WIDTH_IN*2-1:0] mul_m_data;
output logic [WORD_WIDTH_OUT -1:0] acc_s_data;
output logic [WORD_WIDTH_OUT -1:0] m_data;

`ifdef MAC_XILINX
multiplier MUL (
`else
multiplier_raw MUL (
`endif
.CLK    (clk),
.CE     (clken_mul),
.A      (s_data_pixels ),
.B      (s_data_weights),
.P      (mul_m_data    )
);
assign acc_s_data = mux_sel ? mux_s2_data  : WORD_WIDTH_OUT'(signed'(mul_m_data));

`ifdef MAC_XILINX
accumulator ACC (
`else
accumulator_raw ACC (
`endif
.CLK    (clk),
.bypass (bypass     ),
.CE     (clken_acc  ),
.B      (acc_s_data ),
.Q      (m_data     )
);
endmodule

## 5. Test Vector Generation

### Python

Next, I write Python functions to extract the weights and inputs of each layer from my custom framework model (2.2) and convert them into input test vectors. Their dimensions need to be split, reshaped, transposed and flattened to get the final input $$\hat{X}$$ and weights $$\hat{K}$$ packets, which can be understood by the hardware I designed. The output $$\hat{Y}$$ is also transformed to match the hardware's outputs. Also, configuration bits need to be calculated and appended to the packet to make it complete.

def get_weights(i_layers, i_itr, c):
    weights = c.LAYERS[f'{c.PREFIX_CONV}{i_layers}'].weights
    KH, KW, CIN, COUT = weights.shape
    max_factor = 2 if f'{c.PREFIX_MAX}{i_layers}' in c.LAYERS.keys() else 1
    print(f"get_weights - shape_in:(KH, KW, CIN, COUT) = {weights.shape}")

    '''
    Reshape
    '''
    weights = weights.transpose(3,0,1,2) #(COUT,KH,KW,CIN)
    weights = fill_invalid_scg(weights,KW=KW,max_factor=max_factor,c=c) #(ITR,EFF_CORES,KH,KW,CIN)
    ITR,EFF_CORES = weights.shape[0:2]
    weights = weights.transpose(0,4,2,1,3) #(ITR,CIN,KH,EFF_CORES,KW)

    '''
    * Data comes out of maxpool in the order: S,CGU
    * Data comes out of conv in the order   : CGMU and is transposed into S,CGU by hardware
    * Conv in takes weights in order        : CGM

    * Since system_out is SCG, first invalid should be filled that way, so that output data is continuous and cin matches cout
    * After filling, we transpose it to CGM
    '''
    SUB_CORES = c.MEMBERS//KW
    weights = weights.reshape((ITR,CIN,KH, SUB_CORES,c.COPIES//max_factor,c.GROUPS ,KW)) # EFF_CORES = (SCG)
    weights = weights.transpose(0,1,2, 4,5, 3,6) # CGS
    weights = weights.reshape((ITR,CIN,KH,1,c.COPIES//max_factor,c.GROUPS,SUB_CORES,KW)) # (CGS)
    weights = np.repeat(weights,repeats=max_factor,axis=3)
    weights = weights.reshape((ITR,CIN,KH,c.COPIES,c.GROUPS,SUB_CORES,KW))
    weights = weights.reshape((ITR,CIN,KH,c.COPIES,c.GROUPS,SUB_CORES*KW))
    zeros = np.zeros((ITR,CIN,KH,c.COPIES,c.GROUPS,c.MEMBERS), dtype=weights.dtype)
    zeros[:,:,:,:,:,0:SUB_CORES*KW] = weights
    weights = zeros

    KERNEL_BEATS = CIN*KH
    weights = weights.reshape(ITR,KERNEL_BEATS,c.COPIES,c.GROUPS,c.MEMBERS)

    '''
    '''
    lrelu = get_lrelu_config(i_layers=i_layers,c=c)

    LRELU_BEATS = lrelu.shape[1]
    weights_beats = np.concatenate([lrelu,weights], axis=1) # (ITR, LRELU_BEATS + KERNEL_BEATS, COPIES, GROUPS, MEMBERS)

    _,H,W,CIN = c.LAYERS[f'{c.PREFIX_CONV}{i_layers}'].in_data.shape
    BLOCKS    = H // (SH*max_factor*c.CONV_UNITS)

    bram_weights_addr_max = LRELU_BEATS + SW*KH*CIN-1

    weights_config = 0
    weights_config |= (KW//2)
    weights_config |= (KH//2)               << (BITS_KW2)
    weights_config |= SW-1                  << (BITS_KW2 + BITS_KH2)
    weights_config |= (CIN   -1)            << (BITS_KW2 + BITS_KH2 + BITS_SW)
    weights_config |= (W     -1)            << (BITS_KW2 + BITS_KH2 + BITS_SW + BITS_CIN_MAX)
    weights_config |= (BLOCKS-1)            << (BITS_KW2 + BITS_KH2 + BITS_SW + BITS_CIN_MAX + BITS_COLS_MAX)
    weights_config |= bram_weights_addr_max << (BITS_KW2 + BITS_KH2 + BITS_SW + BITS_CIN_MAX + BITS_COLS_MAX + BITS_BLOCKS_MAX)

    weights_config = np.frombuffer(np.uint64(weights_config).tobytes(),np.int8)
    weights_config = np.repeat(weights_config[np.newaxis,...],repeats=ITR,axis=0)

    '''
    '''
    weights_dma_beats = np.concatenate([weights_config,weights_beats.reshape(ITR,-1)], axis=1)

    assert weights_dma_beats.shape == (ITR, 8 + (LRELU_BEATS + CIN*KH*SW)*c.COPIES*c.GROUPS*c.MEMBERS)
    print(f"get_weights - weights_dma_beats.shape: (ITR, 8 + (LRELU_BEATS + CIN*KH*SW)*COPIES*GROUPS*MEMBERS) = {weights_dma_beats.shape}")

    np.savetxt(f"{c.DATA_DIR}{i_layers}_weights.txt", weights_dma_beats[i_itr].flatten(), fmt='%d')
    return weights_dma_beats
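
The `weights_config` lines above pack several small fields into one 64-bit word and then reinterpret it as eight signed bytes. A minimal standard-library sketch of that idea follows. The field widths in `FIELDS` are made-up placeholders (the real `BITS_*` constants depend on the hardware parameters), the helper names are hypothetical, and the byte order is assumed to be little-endian, which is what `np.uint64(...).tobytes()` yields on a little-endian host.

```python
import struct

# Hypothetical (name, width in bits) pairs, lowest field first.
FIELDS = [
    ("kw2", 3), ("kh2", 3), ("sw_1", 2), ("cin_1", 10),
    ("cols_1", 10), ("blocks_1", 6), ("bram_addr_max", 12),
]


def pack_config(values):
    """OR each field into a single integer at its running bit offset."""
    word, shift = 0, 0
    for name, bits in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        word |= value << shift
        shift += bits
    if shift > 64:
        raise ValueError("fields exceed 64 bits")
    return word


def unpack_config(word):
    """Inverse of pack_config, roughly what the hardware does on its side."""
    values, shift = {}, 0
    for name, bits in FIELDS:
        values[name] = (word >> shift) & ((1 << bits) - 1)
        shift += bits
    return values


def to_int8_beats(word):
    """View the 64-bit word as eight signed bytes, least significant first."""
    return list(struct.unpack("<8b", word.to_bytes(8, "little")))


example = {"kw2": 1, "kh2": 1, "sw_1": 0, "cin_1": 31,
           "cols_1": 95, "blocks_1": 7, "bram_addr_max": 100}
word = pack_config(example)
print(hex(word), to_int8_beats(word))
```

Unpacking the word again is a cheap sanity check that no field overflowed into its neighbour before the vector is handed to the simulator.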

## 6. Testbench & Simulation

### SystemVerilog

Next, I write testbenches for the modules. They are built around two custom SystemVerilog classes: AXIS_Slave, which reads a text file and loads data into an AXI stream port while conforming to the protocol, and AXIS_Master which reads data from a port and writes into a text file.

The control signals, valid and ready, are randomized. They get toggled according to a given probability, to simulate the effects of the memory bus freezing up and clearing.
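
The effect of randomized valid and ready can be modelled in a few lines of plain Python. This is a behavioural sketch and not the AXIS_Slave/AXIS_Master classes themselves: `simulate_handshake` is a hypothetical function, the probabilities are given as fractions rather than percentages, and the sketch assumes the usual AXI-Stream rule that valid stays high until the beat is accepted.

```python
import random


def simulate_handshake(words, valid_prob=0.7, ready_prob=0.7, seed=0, max_cycles=10_000):
    """Transfer `words` one per handshake; return (received words, cycles used)."""
    rng = random.Random(seed)
    received = []
    idx, cycles = 0, 0
    valid = False
    while idx < len(words):
        if cycles >= max_cycles:
            raise RuntimeError("handshake did not finish")
        # Source side: raise valid with the given probability, then hold it.
        if not valid:
            valid = rng.random() < valid_prob
        # Sink side: ready is drawn afresh every clock cycle.
        ready = rng.random() < ready_prob
        # A beat moves only in a cycle where both are high.
        if valid and ready:
            received.append(words[idx])
            idx += 1
            valid = False
        cycles += 1
    return received, cycles


sent = list(range(40))
got, cycles = simulate_handshake(sent)
print(got == sent, cycles)
```

However the stalls fall, the data must arrive complete and in order; only the cycle count changes. That is the property the randomized testbenches exercise.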

The following is an example of how the AXIS slave and master classes are used. Each module gets a testbench like this; some modules need multiple slaves and multiple masters.

module axis_tb_demo();
  timeunit 1ns;
  timeprecision 1ps;
  localparam CLK_PERIOD = 10;
  logic aclk;
  initial begin
    aclk = 0;
    forever #(CLK_PERIOD/2) aclk <= ~aclk;
  end

  localparam WORD_WIDTH        = 8;
  localparam WORDS_PER_PACKET  = 40;
  localparam WORDS_PER_BEAT    = 4;
  localparam ITERATIONS        = 6;
  localparam BEATS = int'($ceil(real'(WORDS_PER_PACKET)/real'(WORDS_PER_BEAT)));

  logic [WORDS_PER_BEAT-1:0][WORD_WIDTH-1:0] data;
  logic [WORDS_PER_BEAT-1:0] keep;
  logic valid, ready, last;

  string path     = "D:/cnn-fpga/data/axis_test.txt";
  string out_base = "D:/cnn-fpga/data/axis_test_out_";

  AXIS_Slave #(
    .WORD_WIDTH    (WORD_WIDTH    ),
    .WORDS_PER_BEAT(WORDS_PER_BEAT),
    .VALID_PROB    (70            )
  ) slave_obj = new(
    .file_path       (path),
    .words_per_packet(WORDS_PER_PACKET),
    .iterations      (ITERATIONS)
  );

  AXIS_Master #(
    .WORD_WIDTH    (WORD_WIDTH    ),
    .WORDS_PER_BEAT(WORDS_PER_BEAT),
    .READY_PROB    (70            ),
    .CLK_PERIOD    (CLK_PERIOD    ),
    .IS_ACTIVE     (1             )
  ) master_obj = new(
    .file_base       (out_base),
    .words_per_packet(-1),
    .packets_per_file(2)
  );

  initial forever slave_obj.axis_feed(aclk, ready, valid, data, keep, last);
  initial forever master_obj.axis_read(aclk, ready, valid, data, keep, last);

  initial begin
    @(posedge aclk);
    slave_obj.enable  <= 1;
    master_obj.enable <= 1;
  end

  int s_words, s_itr, m_words, m_itr, m_packets, m_packets_per_file;
  initial forever begin
    @(posedge aclk);
    s_words = slave_obj.i_words;
    s_itr   = slave_obj.i_itr;
    m_words = master_obj.i_words;
    m_itr   = master_obj.i_itr;
    m_packets          = master_obj.i_packets;
    m_packets_per_file = master_obj.packets_per_file;
  end
endmodule

## 7. Debugging

### Python Notebooks & SystemVerilog Simulations

I then run simulations, collect output vectors and compare them with expected vectors using python notebooks. Notebooks allow one to play around with data, quickly print and observe different dimensions...etc.

## 8. Debugging with Microsoft Excel

### Yep :-)

In some cases, the output from a module is garbage and does not match the expected output at all. Since it is a convolution over several values, it is near impossible to guess the bug by looking at such garbage numbers. In that case, I resort to Excel, where I manually transform a set of small input vectors through the logic, step by step to see what I should expect in every clock cycle.
I then compare it to the waveforms I see in the simulator to figure out where the bug is.

## 9. Repeat

I move back and forth between the whiteboard, RTL code, python code, and simulation to fix bugs one by one. Some take weeks and make me want to pull my hair out. I also do this for each module, then put them together hierarchically, write integration testbenches, and test that too.

## 10. ASIC Synthesis

Once the design is verified in randomized simulations, I write the scripts for ASIC synthesis. Our university uses Cadence tools, so the following script is for Cadence Genus, using 65nm CMOS PDK from TSMC.

set TOP axis_accelerator_asic
# set TOP axis_conv_engine

#--------- CONFIG
set RTL_DIR ../../rtl
set XILINX 0
source ../../tcl/config.tcl

set_db hdl_max_loop_limit 10000000
set TECH 65nm
set NUM_MACS [expr $MEMBERS*$UNITS*$GROUPS*$COPIES]
set REPORT_DIR ../report/${TECH}/${TOP}/${NUM_MACS}
exec mkdir -p $REPORT_DIR

#--------- LIBRARIES
set LIB_DIR ../../../tsmc/${TECH}/GP
set_db library [glob $LIB_DIR/cc_lib/noise_scadv10_cln65gp_hvt_tt_1p0v_25c.lib $LIB_DIR/cc_lib/scadv10_cln65gp_hvt_tt_1p0v_25c.lib]
set_db lef_library [glob $LIB_DIR/lef/tsmc_cln65_a10_6X1Z_tech.lef $LIB_DIR/lef/tsmc_cln65_a10_6X2Z_tech.lef $LIB_DIR/lef/tsmc65_hvt_sc_adv10_macro.lef]
set_db qrc_tech_file $LIB_DIR/other/icecaps.tch

# set LIB_DIR ../../../tsmc/${TECH}/LP
# set_db library [glob $LIB_DIR/lib/sc12_cln65lp_base_hvt_tt_typical_max_1p00v_25c.lib $LIB_DIR/lib/sc12_cln65lp_base_hvt_tt_typical_max_1p20v_25c.lib]
# set_db lef_library [glob $LIB_DIR/lef/sc12_cln65lp_base_hvt.lef]

read_hdl -mixvlog [glob $RTL_DIR/include/*]
read_hdl -mixvlog [glob $RTL_DIR/external/*]
read_hdl -mixvlog [glob $RTL_DIR/src/*]

#--------- ELABORATE & CHECK
set_db lp_insert_clock_gating true
elaborate $TOP
check_design > ${REPORT_DIR}/check_design.log
uniquify $TOP

#--------- CONSTRAINTS
set PERIOD [expr 1000.0/$FREQ_HIGH]
create_clock -name aclk -period $PERIOD [get_ports aclk]
set_dont_touch_network [all_clocks]
set_dont_touch_network [get_ports {aresetn}]

set design_inputs [get_ports {m_axis_tready s_axis_pixels_tvalid s_axis_pixels_tlast s_axis_pixels_tdata s_axis_pixels_tkeep s_axis_weights_tvalid s_axis_weights_tlast s_axis_weights_tdata s_axis_weights_tkeep}]
set design_outputs [get_ports {s_axis_pixels_tready  s_axis_weights_tready m_axis_tvalid m_axis_tlast m_axis_tdata m_axis_tkeep}]

set_input_delay  [expr $PERIOD * 0.6] -clock aclk $design_inputs
set_output_delay [expr $PERIOD * 0.6] -clock aclk $design_outputs

#--------- RETIME OPTIONS
set_db retime_async_reset true
set_db design:${TOP} .retime true

#--------- SYNTHESIZE
set_db syn_global_effort high
syn_generic
syn_map
syn_opt

#--------- NETLIST
write -mapped > ../output/${TOP}.v
write_sdc > ../output/${TOP}.sdc

#--------- REPORTS
report_area > ${REPORT_DIR}/area.log
report_gates > ${REPORT_DIR}/gates.log
report_timing -nworst 10 > ${REPORT_DIR}/timing.log
report_congestion > ${REPORT_DIR}/congestion.log
report_messages > ${REPORT_DIR}/messages.log
report_hierarchy > ${REPORT_DIR}/hierarchy.log
report_clock_gating > ${REPORT_DIR}/clock_gating.log

build_rtl_power_models -clean_up_netlist
report_power > ${REPORT_DIR}/power.log

## Next: ]]> <![CDATA[Neural Chip Design [2/4: Golden Model]]]>https://aba-blog.xyz/dnn-to-chip-2/61f5163733068f34ce882ebbSat, 29 Jan 2022 10:30:50 GMT

This is a series of articles [overview] outlining the workflow of 15 steps, which I developed over the past few years through building my own DNN accelerator: Kraken [arXiv paper].

Golden Models are essential to hardware (FPGA/ASIC) development. They model the expected behavior of a chip using a high-level language, such that they can be built relatively fast, with almost zero chance of error. The input and expected output test vectors for every RTL module are generated using them, and the simulation output from the testbench is compared against their 'gold standard.'

I first obtain pretrained DNNs from the PyTorch / Tensorflow model zoos, analyze them, then load them into the custom DNN inference framework I have built with the NumPy stack to ensure I fully understand each operation. I then generate test vectors from those golden models.

## Steps:

1. PyTorch/TensorFlow: Explore DNN models, quantize & extract weights
2. Golden Model in Python (NumPy stack): Custom OOP framework, process the weights, convert to custom datatypes

## 1. Tensorflow / PyTorch

Tensorflow (Google) and PyTorch (Facebook) are the two competing open source libraries used to build, train, quantize and deploy modern deep neural networks.

Both frameworks provide high-level, user-friendly classes and functions such as Conv2D and model.fit() to build & train networks. Each such high-level API is implemented using their own low-level tensor operations (matmul, einsum), which users can also call directly. Those operations are implemented in their C++ backends, accelerated by high-performance libraries like Eigen and CUDA. Once we define the models in Python, the C++ code underneath does the heavy lifting, making them fast as well as user-friendly.

### 1.1. Download & Explore Pretrained DNN Models

As the first step, I obtain the pretrained models from either Keras.Applications or PyTorch Model Zoo.

### 1.2. Build Models & Retrain if needed (PyTorch)

PyTorch is more intuitive, pythonic and a bliss to work with. I use it to build new models and train them if needed.

### 1.3. Convert Torch Models to Tensorflow

However, the support for int8 quantization in PyTorch is still experimental. Therefore, for most of my work, I use pretrained models from Tensorflow, whose quantization library (TFLite) is much superior.

Some models, like AlexNet, are not found in Keras.Applications. Therefore, I load them from PyTorch Model Zoo, convert them to ONNX (the common open-source format) and then load them in Tensorflow.

### 1.4. Quantize Models with TensorFlowLite

Following is an example of loading a float32 model (VGG16) from tensorflow's savedmodel format (1.1), testing it, quantizing it to int8, and testing & saving the quantized network.

from glob import glob
import pathlib

import cv2
import numpy as np
import tensorflow as tf

filenames = glob("dataset/*.jpg")

'''
LOAD AND TEST FLOAT32 MODEL
'''
prep_fn = tf.keras.applications.vgg16.preprocess_input
model = tf.keras.models.load_model(f'saved_model/vgg16')
h = model.input_shape[1]

def representative_data_gen():
    # Yields one preprocessed image at a time, as a single-element list
    for im_path in filenames:
        im = cv2.imread(im_path)
        im = cv2.resize(im, (h,h))
        im = im[None,:,:,::-1]   # add batch dimension, BGR -> RGB
        im = prep_fn(im)
        im = tf.convert_to_tensor(im)
        yield [im]

images = list(representative_data_gen())
predictions = np.zeros((len(images),), dtype=int)
for i, image in enumerate(images):
    output = model(image[0])[0]
    predictions[i] = output.numpy().argmax()
print(predictions)

'''
CONVERT AND SAVE INT8 MODEL (STATIC QUANTIZATION)
'''
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quant = converter.convert()

tflite_model_quant_file = pathlib.Path(f"tflite/vgg16.tflite")
tflite_model_quant_file.write_bytes(tflite_model_quant)

'''
LOAD AND TEST QUANTIZED MODEL
'''
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

images = list(representative_data_gen())
predictions = np.zeros((len(images),), dtype=int)
for i, image in enumerate(images):
    image = image[0]
    # Map the float input into the int8 domain of the quantized model
    input_scale, input_zero_point = input_details["quantization"]
    image = image / input_scale + input_zero_point
    test_image = image.numpy().astype(input_details["dtype"])

    interpreter.set_tensor(input_details["index"], test_image)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details["index"])[0]
    predictions[i] = output.argmax()
print(predictions)

### 1.5 Explore Model Architecture

Netron is a great tool for opening tensorflow's 32-bit models (savedmodel), tflite's int8 models (tflite), pytorch models (pt), ONNX models, and more, to observe the architecture and tensor names.

## 2. Golden Model

### Python (NumPy stack)

After obtaining the pretrained model, I need to 100% understand what operations are involved and how they are applied as data flows through the network. The best way to do this is to re-do it myself from scratch and obtain exactly the same results.

### 2.1. Custom Quantization Scheme

### 2.2. Custom Inference Framework (OOP, Python)

For this, I built a custom framework in Python. It is structured like Keras with the following classes, inheriting as follows:

• MyModel
• MyLayer
    • MyConv
    • MyLeakyReLU
    • MyMaxpool
    • MyConcat
    • MySpaceToDepth
    • MyFlatten

A MyModel object holds a list of objects from MyLayer's child classes. Its constructor extracts weights from tflite and sets them to the layers.
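
The class structure described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the framework's actual code: only the class names come from the list above, while the method names (`forward`, `get_output`, `predict`), the `prev_layer` attribute and the `alpha` parameter are hypothetical, and weight extraction from tflite is left out entirely.

```python
import numpy as np

class MyLayer:
    """Base class: each layer keeps a reference to the layer that feeds it."""
    def __init__(self, name, prev_layer=None):
        self.name = name
        self.prev_layer = prev_layer

    def forward(self, x):
        raise NotImplementedError

    def get_output(self, images):
        # Recursive call: first obtain the output of the previous layer (if any)
        x = images if self.prev_layer is None else self.prev_layer.get_output(images)
        return self.forward(x)

class MyLeakyReLU(MyLayer):
    def __init__(self, name, prev_layer=None, alpha=0.1):
        super().__init__(name, prev_layer)
        self.alpha = alpha

    def forward(self, x):
        return np.where(x > 0, x, self.alpha * x)

class MyFlatten(MyLayer):
    def forward(self, x):
        return x.reshape(x.shape[0], -1)

class MyModel:
    """Holds the list of layers; inference is one call on the last layer."""
    def __init__(self, layers):
        self.layers = layers

    def predict(self, images):
        return self.layers[-1].get_output(images)

relu = MyLeakyReLU("lrelu_1", alpha=0.1)
flat = MyFlatten("flatten_1", prev_layer=relu)
model = MyModel([relu, flat])

out = model.predict(np.array([[[-1.0, 2.0], [3.0, -4.0]]]))
print(out)
```

Because every layer pulls its input from its predecessor, asking the last layer for its output walks the whole chain, and the intermediate result of any layer can be inspected by calling `get_output` on that layer instead.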
A set of images can flow through the layers through a recursive call to the last layer. Following is the stripped-down version of the MyConv implementation.

### 2.3. Rebuilding the model & Debugging

I then rebuild the model using the above framework, pass data and tweak things until I get the exact same output. That tells me I have understood all the operations going on inside the model.

Once I've understood the model inside-out, I start designing the hardware on the whiteboard.

## Next: ]]> <![CDATA[Neural Chip Design [1/4: Overview]]]>https://aba-blog.xyz/dnn-to-chip-1/61f418fd33068f34ce882a77Fri, 28 Jan 2022 16:57:48 GMTKraken Engine

In March 2020, I started building Kraken as a personal, passion project, an engine capable of accelerating any kind of Deep Neural Network (DNN) with convolutional layers, fully-connected layers, and matrix products using a single uniform dataflow pattern while consuming remarkably low on-chip area. I synthesized it in 65-nm CMOS technology at 400 MHz, packing 672 MACs in 7.3 mm², with a peak performance of 537.6 Gops. When benchmarked on AlexNet [336.6 fps], VGG16 [17.5 fps], and ResNet-50 [64.2 fps] (yeah, those are old, but those are the widely used benchmarks), it outperforms the state-of-the-art ASIC architectures in terms of overall performance efficiency, DRAM accesses, arithmetic intensity, and throughput, with 5.8x more Gops/mm² and 1.6x more Gops/W.

I submitted the design as a journal paper to IEEE TCAS-1, the #3 journal in the field, and it is currently under review. You can find the paper at the following link.

## Technologies Used

Throughout the project, I wrote code, moved back and forth, and interfaced between several technologies, from different domains such as hardware (RTL), software, and machine vision.
• Python - Numpy, Tensorflow, PyTorch
• SystemVerilog - RTL, Testbenches
• TCL - Scripting the Vivado project, ASIC synthesis
• C++ - Firmware to control the system-on-chip
• Tools: Xilinx (Vivado, SDK), Cadence (Genus, Innovus)...

## My Workflow

Through the Kraken project, I developed a workflow of 15 steps, which helps me to move between golden models in python, RTL designs & simulations in SystemVerilog, and firmware in C++. I have written them in detail as three more blog posts with code examples.

### 2/4: Golden Model

1. PyTorch/TensorFlow: Explore DNN models, quantize & extract weights
2. Golden Model in Python (NumPy stack): Custom OOP framework, process the weights, convert to custom datatypes

### 3/4: RTL Design & Verification

3. Whiteboard: Design hardware modules, state machines
4. RTL Design: SystemVerilog/Verilog for the whiteboard designs
5. Generate Test Vectors: using Python Notebooks
6. Testbenches: SystemVerilog OOP testbenches to read the input vector (txt file), randomly control the valid & ready signals and get output vectors (txt files)
7. Debug: Python notebooks to compare the expected output with simulation output and to find which dimensions have errors.
8. Microsoft Excel: I manually simulate the values in wires with excel to debug
9. Repeat 3-8: For every module & every level of integration
10. ASIC Synthesis

### 4/4: System-on-Chip Integration & Firmware Development

11. SoC Block Design: Build FPGA projects with Vivado manually and synthesize
12. Automation: TCL scripts to automate the project building and configuration
13. C++ Firmware: To control the custom modules
14. Hardware Verification: Test on FPGA, compare output to golden model
15. Repeat 11-14

## Directory Structure

Simpler FPGA/ASIC projects can be done within a single folder. However, as the scope and complexity of the project grow, the need to work with multiple languages (Python, SystemVerilog, C, TCL...) and tools (Jupyter, Vivado, SDK, Genus, Innovus...)
can grow out of control. The project needs to be version-controlled (git) as well, to prevent data loss and to move between different stages of development like a time machine. However, FPGA and ASIC tools create a lot of internal files, which do not generalize between machines. Therefore, the building of such projects is automated via TCL scripts, and only such scripts and the source files are git-tracked.

The following is the structure I developed through my Kraken project.

kraken
│
├── hdl
│   ├── src        : rtl designs (SystemVerilog/Verilog)
│   ├── tb         : SystemVerilog testbenches
│   ├── include    : V/SV files with macros
│   └── external   : open-source SV/V libraries
│
├── fpga
│   ├── scripts    : TCL scripts to build & configure Vivado projects from source
│   ├── projects   : Vivado projects [not git tracked]
│   └── wave       : waveform scripts (wcfg)
│
├── asic
│   ├── scripts    : TCL scripts for synth, p&r from source
│   ├── reports    : reports from asic tools
│   ├── work       : working folder for ASIC tools, [not git tracked]
│   ├── log        : [not git tracked]
│   └── pdk        : technology node files, several GBs [not git tracked]
│
├── python
│   ├── dnns       : TensorFlow, Torch, TfLite extraction
│   ├── framework  : Custom framework
│   └── golden     : Golden models
│
├── data           : [not git tracked]
│   ├── input      : input test vectors, text files, generated by python scripts
│   ├── output_exp : expected output vectors
│   ├── output_sim : output vectors from hardware simulation
│   └── output_fpga: output vectors from FPGA
│
├── cpp
│   ├── src        : C++ firmware for the controller
│   ├── include    : header files
│   └── external   : external libraries
│
└── doc            : documentation: excel files, drawings...
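
With expected and simulated vectors kept side by side as text files (`output_exp` and `output_sim` above), the debug step of the workflow amounts to loading both and finding which dimensions contain errors. The following is a small sketch of that idea, not the actual notebooks: `find_mismatches` and `summarize_by_axis` are hypothetical helpers, the tensor shape is made up, and a temporary folder stands in for the `data` directory so that the example runs anywhere.

```python
import pathlib
import tempfile

import numpy as np

def find_mismatches(expected, simulated):
    """Return the multi-dimensional indices where two equally-shaped arrays differ."""
    assert expected.shape == simulated.shape
    return np.argwhere(expected != simulated)

def summarize_by_axis(mismatches, ndim):
    """For each axis, list the distinct indices that contain at least one error."""
    return [np.unique(mismatches[:, axis]).tolist() for axis in range(ndim)]

shape = (2, 3, 4)  # made-up tensor shape for the demo
expected = np.arange(24).reshape(shape)
simulated = expected.copy()
simulated[1, 2, 0] += 1   # inject two errors, as a buggy module might
simulated[1, 2, 3] -= 5

with tempfile.TemporaryDirectory() as data_dir:
    exp_path = pathlib.Path(data_dir) / "output_exp.txt"
    sim_path = pathlib.Path(data_dir) / "output_sim.txt"
    # Flattened, one integer per line, like the test vectors written with np.savetxt
    np.savetxt(exp_path, expected.flatten(), fmt="%d")
    np.savetxt(sim_path, simulated.flatten(), fmt="%d")
    exp = np.loadtxt(exp_path, dtype=int).reshape(shape)
    sim = np.loadtxt(sim_path, dtype=int).reshape(shape)

mismatches = find_mismatches(exp, sim)
axes = summarize_by_axis(mismatches, exp.ndim)
print(mismatches)
print(axes)
```

If all errors share one index along some axis (here, index 1 on axis 0 and index 2 on axis 1), that usually narrows the search to one iteration, row or channel of the design.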
## Next: ]]> <![CDATA[Vision-Based Adaptive Traffic Control on an MPSoC [ARM+FPGA]]]> ## Overview As a modular, mass-manufacturable and decentralized edge solution to the adaptive traffic control problem, we designed and implemented a custom CNN accelerator on FPGA, modified & trained YOLOv2 object detector with custom Sri Lankan data to detect ]]> https://aba-blog.xyz/vision-traffic-2019/61949a5547053101a188c58fFri, 28 Jan 2022 00:00:00 GMT ## Overview As a modular, mass-manufacturable and decentralized edge solution to the adaptive traffic control problem, we designed and implemented a custom CNN accelerator on FPGA, modified & trained YOLOv2 object detector with custom Sri Lankan data to detect vehicles in the day, night & rain, and wrote custom algorithms to track vehicles, measure weighted critical flow with 100% (day) & 80% (night) accuracy, and calculate traffic green times (delta algorithm) on the ARM processor core. ​ ​A patent for our system is currently under review at NIPO and the system is being further developed with funding from World Bank via AHEAD into a product by a multidisciplinary team of engineers through the Enterprise (Business Linkage Cell) of the University of Moratuwa, together with RDA and SD&CC.​ ## It's storytime... How it started: In my 3rd year (5th semester), by pure chance, we formed a team of four: Abrutech (Abarajithan, Rukshan, Tehara, Chinthana) for our processor project. Soon we figured out that we had great chemistry, we complemented each other's strengths and weaknesses, a team made in heaven. :-) I had a knack for digital architecture design and I was good at coming up with ideas and algorithms. Rukshan was the best verification engineer I have seen. With infinite patience and thoroughness, he never skipped a corner case. A Verilog module he wrote and tested is as good as formally verified. Tehara had the passion and patience to modify neural networks and train them for weeks. 
Chinthana was the street smart guy, the jack-of-all-trades, who could learn and do any task within hours.

By the end of that project, I had decided that I wanted this group forever, especially for the final year project. Also, the feeling I got when I designed architecture was nothing but pure bliss. It was addictive; I had never felt that before. I was in love.

In the next semester, we all started our internships. At CSIRO (Australia), as I worked on building CNNs with Tensorflow/Keras, training them and implementing them on edge devices (Jetson TX2), I itched to design an architecture to accelerate complex CNNs, such as object detectors. That was a perfect idea for our team. Rukshan was implementing a Maxpool engine for a simpler CNN at NTU (Singapore), Chinthana was building a python compiler for ML on ASIC at Wave Computing and Tehara was working on object detectors (RCNNs) at Zone 24x7. So, we collectively decided on this topic and were brainstorming remotely from three countries.

However, when we proposed the idea to a few scientists at CSIRO and lecturers in our department, they suggested that an FPGA implementation of a custom CNN engine would be a waste of time, as GPU-based systems were the popular ones for inference back then, and that we would not be able to publish this.

Disheartened, we searched for a staff-proposed project where this solution might make sense. We found Prof. Rohan's project, funded by the World Bank, titled Vision-Based Traffic Control. It was a topic that had been attempted a few times in the past and failed. One team had tried to fly hot air balloons. Another MSc team had tried a simpler approach with a Raspberry Pi: passing the images through an edge-detection kernel, counting white pixels and using a custom fully-connected network of 2 layers, trained on a custom dataset of a few hundred images, to estimate a traffic level (1-5) from the white pixel count.

We proposed to tackle this problem with our own mass-manufacturable, robust solution.
It combined a pre-trained and then fine-tuned object detector, running on our custom engine in an FPGA, with a custom algorithm, tested in VISSIM simulations, to control the traffic lights based on those detections.

This received a huge backlash from some staff members, who pointed out a few valid concerns: we might not have enough time to obtain government permission to demonstrate this on the road, and the CNNs might not work with Sri Lankan data. After some traumatizing back and forth via academic politics, we picked the project in February.

As a response to "Like everyone, you will start collecting data way too late... in August, and find your model doesn't work", I vowed to demonstrate it within ten days! Rukshan built a data collection device, powered by a power bank through a 40-foot wire, and programmed a python GUI to control its 2-axis servo and camera via Wi-Fi. We collected 750 images, Tehara ran pretrained YOLOv2 on them and we showed that we can detect ALL vehicles in both day and night time, acing the feasibility presentation within 10 days. With that start, we were good to go.

## Methodology

The loop detectors and radar-based traffic sensing methods lacked accuracy, especially for smaller vehicles like motorcycles, which jam the traffic in developing countries. Vision-based adaptive controls in developing countries primarily rely on processing in a central server, which requires a high setup cost. Research-level edge solutions based on processors use simple algorithms (like edge detection) and aren't robust enough in different lighting conditions. Ones that run on GPU-like systems aren't mass manufacturable.

Therefore, we came to the conclusion that our method of a state-of-the-art, robust object detector accelerated on a mass-manufacturable FPGA-based system best solves all these issues.

### 1. Machine Learning

#### 1.1 Inference Framework for Precision Experiments (Aba)

I built a Keras-like object-oriented inference framework using the multidimensional operations of NumPy. Layer types (Conv, Relu...etc) are implemented as subclasses of the common Layer class. They can be chained to build a model, which can forward propagate an image through multiple datapaths.

#### 1.2 Data Collection Devices (Rukshan, Chinthana) & Annotation (Tehara)

Built four remotely powered, wirelessly controlled data-collection devices. Collected, annotated and augmented traffic images to create a Sri Lankan traffic dataset (1500 images).

#### 1.3 CNN Architecture & Training (Tehara)

Optimized the architecture of the YOLOv2 object detection neural network for hardware implementation. Trained YOLOv2 and TinyYOLO.

• Fused batch normalization into convolution by modifying the weights and biases accordingly.
• Interchanged conv => leaky-relu => max-pool to conv => max-pool => leaky-relu to reduce power.
• Changed the output layer from 80 classes to 5 classes, by reusing weights of appropriate classes.
• Changed grid size from (13 x 13) to (12 x 8) and designed the sensing algorithm accordingly
• Trained with custom Sri Lankan Traffic Dataset (three-wheelers)

### 2. CNN Accelerator Design (Aba)

I designed our first engine, which in retrospect resembled ShiDianNao (ISCA’15). With 9-muxes x 24, 3-muxes x 48, 16-bit registers x 144, multipliers x 3 and accumulators x 3, each core performed twelve 3x3 convolutions in 9 clock cycles. Rukshan implemented it and demonstrated it in simulation.

But we couldn't fit enough processing elements of it in our FPGA to run YOLOv2 at any respectable speed. The LUT count exceeded what was available, thanks to the large 9-way multiplexers. Several registers stayed unused over most clocks. In addition, Rukshan hated the ad-hoc solutions we came up with to handle edge cases, calling them cello tape solutions.
Limited by the resources and the memory bandwidth of our FPGA, I went back to the whiteboard. On an arid afternoon of August that I vividly remember, I conceived Kraken’s cornerstone. Staying in Colombo to work on this without going home for mid vacations, staring at the whiteboard and sweating in the unbearable Colombo heat, it flashed to me. That I can separate the convolution into horizontal and vertical, and shift them independently, such that partial sums snake through the 3 PEs in a core to compute a full 3x3 convolution. Within an hour, I had the accelerator design ready, with data rates and clock cycles calculated. Rukshan hates my habit of changing the architecture, introducing new features every day, in the name of optimizing it. As he spends several weeks verifying designs without skipping a corner case, it is a nightmare when I change a verified design. Basically, I'm progressive and he's conservative, making us the ideal pair of designers. I spent a few days convincing him, showing that the core 2.0, was 4 times faster, used five times fewer 3-muxes, zero 9-muxes, about 20 times fewer registers (for the same speed), with 100% utilization of all multipliers and adders. He finally came around and started implementation. I designed some support modules and implemented them, and I started working on the PS-PL (ARM-FPGA) coordination. It took me a while to wrap my head around the Vivado design flow and I finally cracked it after I got some hands-on experience at the workshop organized by International Center for Theoretical Physics, Italy in Assam, India. Four of us stayed together at Tehara's home for a month-long sprint. The final system could run the 3x3, 1x1 layers of YOLOv2, implemented at 50 MHz on Xilinx Z706. The fmax was disappointing, and we knew why. 
Without any guidance in digital design, we implemented our AXI stream modules to handle the handshakes combinationally, resulting in an extremely long path from the back to the front of the pipe for the ready signal. Also, the enable signals of the 5-level nested counters were implemented combinationally, causing another long path. We didn't have time to redesign and verify it, so we decided to go with what we had.

### 3. Tracking Algorithm (Aba)

Meanwhile, I designed a lightweight, IOU-based, standalone (no libraries) object tracking algorithm, robust to broken tracks, double-counting...etc, and implemented it in both python and C (bare-metal on the ARM side of the ZYNQ FPGA).

• Near 97% vehicle counting accuracy in the daytime, 85% accuracy in the night, rainy time, on test data (on a road the CNN has never seen before)
• Object detector (YOLOv2) has less accuracy. But the tracking algorithm is designed to obtain near 100% accuracy in vehicle counting and identification

### 4. Delta Algorithm for Traffic timing (Aba), VISSIM verification (Chinthana)

With that, I also designed and tested 8 algorithms based on density, bounding box count, flow...etc. Chinthana figured out the VISSIM software, built a sophisticated intersection based on a real-world one, and tested the algorithms. None of them converged.

Our supervisor had suggested an algorithm, where the static time is changed a little, based on the ratio of the number of vehicles, which failed to converge on testing. Thinking about it, I figured out a way to put it into an equation, but with traffic flow. We named it the Delta Algorithm. During poor visibility, it naturally falls back to static timing. Chinthana tested it and found that it converges. We tuned the sensitivity parameter.

## Acknowledgement

We are forever indebted to the families of Tehara and Chinthana for hosting our team for weeks/months during strikes and study breaks, allowing us to work together. In addition, we thank our Supervisors: Prof. Rohan and Prof.
Saman Bandara for their support and assistance.

## Behind the Scenes

### My team: abrutech

We stayed for months at each other's homes to work together day and night. We became a part of each other's families, celebrating birthdays, night outs... the most iconic team in our batch, bonded for life! <3

### How I work

My whiteboard designs throughout the project. Some were not implemented. Just to show you how I work. :-) ]]> <![CDATA[My Paper-Writing Workflow [Inkscape, Python, Mendeley, VSCode, Git]]]>I'm quite used to LaTeX, thanks to our department encouraging students to make assignments and reports in LaTeX from the sophomore level. Through countless such submissions, I have tried multiple workflows: TexMaker, Overleaf, VSCode + Latex Workshop and more. This year, I began writing my first paper (Kraken), targeting ]]> https://aba-blog.xyz/paper-workflow/619680e634a827036b050634Sun, 21 Nov 2021 06:32:20 GMT

I'm quite used to LaTeX, thanks to our department encouraging students to make assignments and reports in LaTeX from the sophomore level. Through countless such submissions, I have tried multiple workflows: TexMaker, Overleaf, VSCode + Latex Workshop and more.

This year, I began writing my first paper (Kraken), targeting a journal. Since it was on my personal, passion project, I wrote the entire paper, drew all diagrams, made all figures and revised it several times, before handing it over to my advisor, who reviewed it. Because this is my first paper, I wanted everything: the diagrams, graphs, their text styles and all to be perfect. I experimented and learned a lot on the way to craft my own workflow for this and my future publications.

In a nutshell,

• I draw diagrams with Inkscape, save them as EPS (without text) + TEX (text only), such that text is rendered in latex for a uniform style.
• I use the python ecosystem to handle data. Pandas for spreadsheets, Matplotlib and Seaborn for beautiful plots. SciencePlots to conform to IEEE style.
• I manage my references with Mendeley and sync them across my devices. I use Resilio to sync my handwritten notes on papers between my tablet and laptop
• I write pure LaTeX with VSCode + LaTeX Workshop, compile it with MiKTeX, version control everything with git and sync it to Overleaf via GitHub for my supervisor.

## Drawings - Inkscape

Which one looks better?

Tools like draw.io lack precision. Also, the size and style of their fonts do not match the rest of the LaTeX text. Inkscape [get] is a professional vector graphics tool, a fully-functional, lightweight, free and open-source alternative to Adobe Illustrator. Therefore, I learnt it to draw the two diagrams in the paper.

First, find the column/text width of your LaTeX document with \the\columnwidth in points (1 pt = 1/72.27 inch). Then set the Inkscape canvas size in File > Document Properties > Page > Custom Size. You can create a grid, either rectangular or isometric, in File > Document Properties > Grid. You can start drawing with the Bezier Curve tool [B], organize the drawing by layers...etc

### Adding Text

Text can be added in multiple ways using the Text tool [T]. You can simply add text, but that will get rendered with the image, in whatever font and size you pick. But I wanted the text to be rendered in LaTeX, such that it has uniform size and style, to match the rest of the document.

For this, I type the text, with any required latex symbols as follows, and save a copy of the image as EPS+LaTeX. This creates an .eps file without text and a .tex file with text only. Make sure to keep the master copy of your drawing updated in the default SVG format (only save a copy in the EPS+TEX format) to avoid data loss.

Then add the tex file into your LaTeX document using either include or import:

Since this renders the text separately, the text might have moved slightly. Make the necessary adjustments, and recheck the text alignment to make it work.

The first figure (this one) took me over 10 days to make.
That time went into learning Inkscape from scratch, trying about five different text insertion methods and finally getting it right. The next figure (following one) took me just a few hours.

## Line, bar, pie charts - Python Ecosystem

If you prefer MATLAB, skip this. I, on the other hand, love the Python stack. NumPy is the best at manipulating 6- or 7-dimensional arrays, and the object-oriented, intuitive nature of Python libraries makes them a treat to use. I also found this awesome package: SciencePlots [get], which works on top of other packages and makes IEEE & Science style plots with ease.

To build the spreadsheet, I populate a pandas dataframe with some regular Python code. Yes, I had to spend a few days learning pandas for this, but it's a great investment. Then I use matplotlib to make plots and seaborn to make them prettier.

## Reference Management - Mendeley

I had to read and summarize about a hundred papers for a literature review. Keeping them organized is a nightmare with a regular file manager. Which one do you prefer among the following?

Mendeley is available on Web, Windows, Mac and iOS. It has tons of features, including the following.

• Add references as you browse the internet. Install the Mendeley extension and click it from arXiv, IEEE, ResearchGate, Springer, raw PDF...etc. It will fetch the data (author, journal, year, abstract, possibly the PDF) and save it to its Web version. You can sync that to all your devices.
• Group by labels, sort by author/year...etc.
• Search within your references fast.
• Highlight and add comments using the built-in PDF tool. You may take notes as well. Everything will be synced between all your devices.
• Easily export the bibliography as LaTeX. Select a set of references, right-click, copy as BibTeX, then paste into your .bib file.

I summarize the papers into a Google / Online Excel Sheet, so my supervisor can also have a look.
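One thing that can go wrong with the copy-as-BibTeX step is pasting the same reference into the .bib file twice, which LaTeX then complains about. Below is a minimal sketch of a check for that, using only the Python standard library. The helper names and the sample entries are made up for illustration; this is not a Mendeley feature, and the regular expression only handles plain `@type{key, ...}` entries.

```python
import re
from collections import Counter


def citation_keys(bibtex: str) -> list:
    """Return the citation key of every '@type{key,' entry, in order."""
    return re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bibtex)


def duplicate_keys(bibtex: str) -> list:
    """Return the sorted citation keys that appear more than once."""
    counts = Counter(citation_keys(bibtex))
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up entries standing in for text pasted from the reference manager.
sample = """
@article{doe2020cnn,
  title = {A Made-Up Paper},
  author = {Doe, Jane},
  year = {2020}
}
@inproceedings{roe2019fpga,
  title = {Another Made-Up Paper},
  author = {Roe, Richard},
  year = {2019}
}
@article{doe2020cnn,
  title = {A Made-Up Paper},
  author = {Doe, Jane},
  year = {2020}
}
"""

print(citation_keys(sample))   # every key found, in file order
print(duplicate_keys(sample))  # keys that were pasted more than once
```

To check a real file, read it with `open("refs.bib").read()` (the file name is a placeholder) and pass the text to `duplicate_keys`.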
## Sync Handwriting between Tablet and PC

I prefer drawing on the papers and writing stuff on them with my Android tablet and pen. Mendeley does not work with Android and I'm not sure if its iOS version supports handwriting. Therefore, to sync documents, I use Resilio Sync. It can be installed on both PC [get] and tablet [get]. Folders on the PC can be connected to those on the tablet, such that any changes made on either are reflected on the other within minutes.

I use Xodo [get], an excellent PDF manager, on my tablet to draw on the papers. These then reflect on the PC (and within Mendeley).

## Pure LaTeX, but faster! - VSCode

"What's wrong with Overleaf?" "Oh you're a simpleton happy with LyX!"

Nope, I write pure LaTeX. Overleaf is great, but I hate web-based tools. I prefer something local, more responsive (thanks to my shitty connection) and with proper version control. Since I wrote the entire journal paper alone, all 15 pages of it, and revised it like ten times, I had the freedom to develop my own workflow. When I finally passed it on to my supervisor for final revision, I synced it with Overleaf so he can work with it comfortably and his changes will be reflected on my local machine.

To compile LaTeX locally, you need a TeX distribution like MiKTeX: miktex.org.

### VSCode + LaTeX Workshop

Maintained by Microsoft, yet free and open-source, VSCode [get] is the best editor/IDE out there for almost any language, hands-down. Unless you are a vim user, of course. By installing the right extensions, you can make it into a super-powerful IDE for any language.

LaTeX Workshop [get] is the extension we need. Coupled with VSCode's native features, it provides a killer LaTeX experience. Check out some of its coolest features below. More features are listed here. The theme I use is One Dark Pro Monokai Darker [get].

### Version Control - git

Git is a pain to a lot of people. It was the same to me, until mid-2020.
When I was stranded in New Delhi due to COVID lockdowns amidst my solo backpacking, I started reading some books on git. Then I figured that it's what I have been missing all my life.

Git is a version control tool. Basically, a time machine that helps you to move to any point in the development, compare two points (diff), manage them collaboratively...etc. It guarantees that "commits" (aka snapshots) of your folder are immutable and very hard to lose, giving you the confidence to clean up your code and try new things after each commit.

It has a horrible interface. The commands barely make sense. But underneath, it has a beautiful data model. Once I understood that deeply, I was able to recall or google commands at will and use git like a pro. Now I'm addicted to the safety it provides. Even for a personal project, I start by tracking it with git. Note that git is a local tool. You don't need an internet connection to use it.

Okay, how do I use git for paper-writing? Well, simple. Once you understand git, start tracking all your TEX, EPS, SVG, BIB, Python and ipynb files using git. You will feel confidence and power. When you add something and save (recompile) successfully, make a commit with an appropriate commit message (eg: "Modify table horz_conv, to add circles around y").

### Overleaf Sync

Right, all good. But my supervisor is not familiar with git and Mendeley. What do I do? He is very familiar with Overleaf though. So, I set up sync to Overleaf via GitHub. I had to purchase a student account for $8/month.

First, create a GitHub repo (remote), connect Overleaf to that empty repo, push your local repo to the remote, then import the changes into Overleaf. When the supervisor changes something, I can push that from Overleaf to GitHub and pull it into the local repo. Following is a tutorial

## Conclusion

Yeah, I had to learn 80% of this over several weeks to set up my workflow. But I believe this is a worthy investment, as I can continue to write future papers with this easily.

If you think I am taking this too far, here's what I aspire to be. This mathematics freshman takes lecture notes with LaTeX using vim (yep) while drawing all mathematical drawings in real-time (as the prof draws on blackboard) on Inkscape.

]]>
<![CDATA[Afghanistan: A Sad Story]]>TLDR of some counterintuitive facts: ​

• 77% of Afghan people supported the US invasion.​
• The US invaded to avenge 9/11. There’s not enough oil there.​
• The US did war crimes, some serious development but failed to build a self-sustaining nation as they got distracted with
]]>
https://aba-blog.xyz/afghanistan-a-sad-story/61990b0234a827036b05065cSat, 07 Aug 2021 14:53:00 GMT

TLDR of some counterintuitive facts: ​

• 77% of Afghan people supported the US invasion.​
• The US invaded to avenge 9/11. There’s not enough oil there.​
• The US did war crimes, some serious development but failed to build a self-sustaining nation as they got distracted with Iraq in 2003.​
• Afghan people have been intercours'ed by the British, Soviets, local tribal leaders, Taliban and US, one after the other.
• Taliban means "students" in Pashto (from the Arabic word for student). They were the fatherless refugee children orphaned in the Soviet war, raised in Wahhabist schools of Pakistan funded by Saudi Arabia.
• Taliban committed at least 15 massacres and genocides targeting Shia and Hazara (Muslims) when they were in power. This is why people are desperate to flee.
• Taliban assassinated the one honest leader (Massoud) who promoted peaceful democracy, as they couldn't bribe him.​
• Trump cut a secret deal with the Taliban to withdraw by May 1st. ​
• The Afghan state failed because the military was “incompetent fools, corrupt to the patrol level”, 30% of the police became bandits and the public lost trust in the govt.

Afghanistan is made up of many different ethnicities and tribes living in rural areas scattered across mountains. The British drew its borders arbitrarily. The biggest ethnic group, the Pashtuns, is split between Pakistan and Afghanistan. So people don’t respect that border and keep moving back and forth. This allowed the Taliban to regroup and train in Pakistan easily. The Khyber Pass, an extremely strategic route through the Hindu Kush mountains by which Indo-Aryans, Genghis Khan, Persians, Mughals and the British invaded India, was later used by American troops and the Taliban to enter Afghanistan.

Following independence from Britain in 1919, the king tried to modernize the country by educating everyone (including women) and abolishing the women's face veil. These liberal reforms led to a civil war with tribal leaders. After a few more kings, in 1964 Afghanistan became more democratic, allowing multiple parties. One party (PDPA) took power in a 1978 coup and started implementing communism, with the support of Soviet Russia.

That "democratic party" banned forced marriages and promoted the education & job security of women. They also tortured and killed local Muslim leaders and forced atheism. This, along with the dependence on the Soviet Union, angered the people. Islamist riots by the Afghan Mujahedeen (armed tribal leaders) broke out. The Soviet Union invaded to quell the rebellion, killed millions and raped women. They systematically depopulated rural areas with landmines and even toy grenades targeted at children. In response, the US supplied anti-aircraft missiles to the Mujahedeen. After 9 years, in 1989, as the USSR was collapsing, the Soviets withdrew in defeat, much like the US does now.

People cheered, considering the Mujahedeen liberators. They then became warlords, tore up their areas and started a civil war among themselves. Half of Kabul was reduced to rubble. The UN tried to form a coalition govt of them in 1992 and failed. Fatherless children from the Soviet war grew up in refugee camps in Pakistan and were indoctrinated into Wahhabism by madrasas funded by Saudi Arabia. They called themselves Taliban, that is, students. Pakistan opened the refugee camps and, in 1994, they invaded Afghanistan. People welcomed them as liberators.

In 1996, the Taliban seized Kabul with the support of Saudi Arabia and Pakistan and imposed sharia rule. They forbade women from leaving their homes and studying. They committed at least 15 systematic massacres and genocides targeting the Shia and Hazara, torturing and killing 4000 at a time. Ahmed Shah Massoud, a respected tribal leader, rallied enough opposition, set up democratic institutions and promoted women's rights. The Taliban tried to bribe him by offering him the Prime Minister position, but he declined and asked for a democratic solution. He addressed the EU in 2001, stating that the Taliban and Al Qaeda had introduced "a very wrong perception of Islam". The Taliban assassinated him in 2001.

On 11 September 2001, Afghan-based Al-Qaeda attacked the World Trade Center. The Taliban gave Osama Bin Laden safe harbour and, in response, US-led forces invaded Afghanistan. There isn’t enough oil there though. In December 2001, the Taliban government was toppled. 77% of Afghan people supported the American invasion [source 1] [source 2].

NATO started building a government [source], training the Afghan army and police and implementing reforms. Women's education rose from 0% to 60% (2016), and free media and platforms for public debate were established. Infant mortality rates fell by half. In 2005, fewer than 1 in 4 Afghans had access to electricity. By 2019, nearly all did. The Afghan geography makes centralized control nearly impossible. With most of the country over 2000m in elevation, road networks barely exist. The US started rebuilding the Soviet-built circular highway with the help of the World Bank, Saudi Arabia and Iran (wow), to unify the country. Then the US got occupied with Iraq, and the Taliban strengthened themselves and started blowing up the highway and construction workers.

Afghan society is geared more towards family and tribe than towards a “unified Afghan” feeling. The popular people who got Afghan govt positions were corrupt tribal warlords. In the leaked secret documents, US military officials call Afghan soldiers “incompetent fools and corrupt to the patrol level” and blame themselves: “we moved too slowly initially when Taliban were defeated. When they rebounded, we trained too quickly” (and quality fell) [source]. The US built an Afghan military that dwarfs the military of even developed countries. The US spent 20 years and 2 trillion dollars there, much of which went to corrupt Afghan officers. 30% of the new police escaped with their weapons to put up their private checkpoints, becoming bandits and robbing people. The rural public lost trust in the centralized govt, police and military.

Meanwhile, the Taliban regrouped in Pakistan and started attacking and controlling the strategic choke points. The US committed war crimes, including the torture and killing of Taliban prisoners and suspects. The US killed 9 children in 2003 and 100 civilians (mostly children) in 2009. This turned international perception and rural Afghans against the US occupation. Of the civilian casualties, 40% were due to the US & Afghan govt and 60% were due to the Taliban.

After Osama was killed in 2011, the occupation became unpopular among the US public. They didn’t want their children being killed in an unrelated conflict. The US govt also did not have a clear objective after killing Osama. As a last attempt, Obama tried a "surge" of troop inflow, but the Taliban just surged their attacks. Obama started pulling out. Trump struck a secret deal with the Taliban, without even consulting the Afghan govt the US had built, that the US would pull out by May 1st 2021. He reduced the 15,500 troops to 2,500. Biden says the US succeeded in its objective: avenging 9/11 and making sure the region doesn't breed terrorism targeting mainland USA. Given Afghanistan is now a Taliban-controlled terrorist state, the US might have failed in that objective too.

An exit strategy wasn’t planned. When the Taliban attacked this year, with only 2500 US troops left, Biden had to either send more resources and restart the war or stop the war by pulling out immediately. He decided on the latter [source]. The US military shut off the lights at their airbase and slipped away disgracefully without informing the Afghan army. The president fled the country, betraying his people. The incompetent and corrupt Afghan govt and army collapsed like a house of cards and most of them switched sides to the Taliban for bribes.

Now people are desperate to flee through the borders and Kabul airport, fearing the extremist rule, genocides and massacres of the Taliban. The US has a deal with the Taliban to continue the evacuation of allies and special visa Afghans. The US & French ambassadors fled, but the UK ambassador stays back writing visas for the Afghan translators, helping them escape. China quickly joined hands with the Taliban and India is shocked to have a terrorist country on its doorstep. The future of Afghanistan and its people is as uncertain as ever.

​​Disclaimer: I'm no expert. Feel free to mention any missing points in the comments, will add them.

]]>
<![CDATA[How safe are our compilers?]]>An important argument placed against e-voting machines is that there is absolutely no way to prevent large-scale fraud. Even if the firmware is open-source, how do you verify it's the same code programmed into the machine? Even if it was compiled in front of your eyes, how do

]]>
https://aba-blog.xyz/compiler-vulnerabilities/61990ca534a827036b05067bThu, 05 Aug 2021 14:59:00 GMT

An important argument made against e-voting machines is that there is absolutely no way to prevent large-scale fraud. Even if the firmware is open-source, how do you verify it's the same code programmed into the machine? Even if it was compiled in front of your eyes, how do you know the compiler itself is safe?

GCC, one of the most popular C/C++ compilers, is itself written in C and C++. Then how is it compiled? Using older versions of itself! This kind of self-hosting was exploited by Ken Thompson, the legend who co-created UNIX, the OS that Linux (hence all Android phones, data centers, Mars rovers) is modelled on and that macOS (an overpriced piece of shit ;-) ) descends from.

When he wrote the 'login' program of UNIX (during the early days), he put in a backdoor for debugging purposes. That is, given his secret password, any UNIX machine would unlock. But anyone else who read the source code of login would notice this and panic. So, he hid that backdoor in the compiler, such that if the login program is compiled, it would insert the backdoor into the binary (machine code); otherwise it would compile other programs normally.

But wouldn't people read the compiler's source code? For that, he first built a second backdoor into the compiler, roughly as follows. If future compiler code is given to the compiler, it would generate a binary for a rigged compiler with the above backdoor. Otherwise, it would compile other programs normally.

He then compiled the compiler into binary and then removed the backdoor from the compiler's source code. Now the backdoor is practically invisible. It only exists as 1s and 0s in the binary. Anyone who reads the source code of the compiler sees it's perfectly fine. But when they add features to future versions of the compiler code & compile it with his rigged compiler, it generates a new, rigged compiler binary. When they compile the login source code with that rigged compiler in the future, it places the backdoor into the login binary.
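To make the two-stage trick concrete, here is a toy Python sketch. It is not Thompson's actual code: "source code" and "binaries" are just strings, and every name in it (`honest_compile`, `rigged_compile`, `run_compiler`, the `program:` tags) is made up for illustration. It only shows the logic: a rigged binary keeps reproducing itself from clean compiler source and keeps poisoning login.

```python
# Toy model of the compiler backdoor: all names and formats are hypothetical.

BACKDOOR = "accept the secret password"


def honest_compile(source: str) -> str:
    """Stand-in for real compilation: the 'binary' is just the tagged source."""
    return f"BINARY[{source}]"


def rigged_compile(source: str) -> str:
    """What the rigged compiler binary does, although no source code shows it."""
    binary = honest_compile(source)
    if "program: login" in source:
        # Rule 1: compiling login -> slip the backdoor into its binary.
        binary += f" + HIDDEN[{BACKDOOR}]"
    elif "program: compiler" in source:
        # Rule 2: compiling a clean compiler -> the output is rigged again.
        binary += " + HIDDEN[rigging rules 1 and 2]"
    return binary


def run_compiler(compiler_binary: str, source: str) -> str:
    """'Execute' a compiler binary: rigged binaries apply the hidden rules."""
    if "HIDDEN[rigging rules 1 and 2]" in compiler_binary:
        return rigged_compile(source)
    return honest_compile(source)


clean_compiler_src = "program: compiler v2 (no backdoor anywhere in this text)"
login_src = "program: login (checks the real password only)"
hello_src = "program: hello world"

# Generation 1: the attacker's rigged binary compiles the clean v2 source.
gen1 = rigged_compile(clean_compiler_src)
# Generation 2: gen1 compiles the same clean source again; still rigged.
gen2 = run_compiler(gen1, clean_compiler_src)

login_bin = run_compiler(gen2, login_src)
hello_bin = run_compiler(gen2, hello_src)

print(BACKDOOR in login_bin)  # backdoor is present although login_src is clean
print(BACKDOOR in hello_bin)  # other programs are compiled normally
```

Note that `clean_compiler_src` and `login_src` never mention the backdoor. It lives only in the "binaries", which is exactly why reading the source code does not help.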

Anyway, it was temporary, he never got caught and he revealed it when he received the Turing Award. However, this is an interesting phenomenon, and any kind of malware could be injected like this since we usually apt-get / download .exe of the compiler, which was in fact compiled by the older version of the same compiler.​

Btw, this is pretty common. Recently I heard on a podcast that a few years ago hackers infiltrated a company that makes network-security software. They injected their virus into the company's development toolchain. In the next release, it got into the software and was distributed to all their customers, who are big software companies and the US govt itself. It was found months later.

### First posted on Facebook:

]]>
<![CDATA[The Snowflake Generation: A Defense]]>"This is such a snowflake generation" has been the mainstream conservative opinion for ages. In every period of history, the elder generation has accused so. It is part of the "during my time, we had to defeat a balrog to get to school" mentality.​

]]>
https://aba-blog.xyz/snowflake-generation/61990db334a827036b05068eSun, 27 Jun 2021 15:01:00 GMT

"This is such a snowflake generation" has been the mainstream conservative opinion for ages. In every period of history, the elder generation has made the same accusation. It is part of the "during my time, we had to defeat a balrog to get to school" mentality.

This issue is similar to the memes about safety instructions. 50 years ago they didn't write "do not drink" on bleach bottles. Today we do. Does this mean we are dumber?​

No. In any given population, there's a fraction (say 0.001%) of people who are dumb enough to drink bleach. 50 years ago, they simply died. Today we have better consumer protection laws. If one guy dies, the media erupts and companies get sued into bankruptcy. By simply making a minor change, which doesn't affect the lives of 99.999% of the population, we have drastically reduced the number/percentage of the dumb people who actually die. Is that bad?

Same with body shaming/sexual harassment...etc. There's always a percentage who are unable to handle it. I don't think that percentage is increasing. By spreading awareness, we are reducing the number/percentage of people who will mentally break down, grow up to be deeply scarred adults, or even commit suicide. It doesn't affect us personally, but we can bear a small inconvenience, to save a few lives. This is good, we should be proud.​

I'm actually happy about this generation. We see kids being wholesome, having a common sense of what shouldn't be made fun of. A disabled kid, or one with a disfigurement, is less likely to get bullied at school today. It was shocking to talk to mom and hear what kids were bullied about in her time. She was unable to comprehend that we think differently and we don't even consider those things funny. That's a win for our generation. We are improving. Our younger generation will be better than us.

If you want to improve things further, teach your kids to stand up against injustice. Teach them to get into trouble in order to fearlessly defend themselves and the weak around them, regardless of the consequences. And when they do, tell them how proud you are. Teach them not to be blindly loyal to their morally corrupt friends. This is how we can build a resilient society that upholds justice.

]]>
<![CDATA[Tamil 101: For Non-Tamils​]]>TLDR:  Spoken Tamil dialects are just fancy ways of shortening written Tamil. There is one unified written form, with well-defined grammar. But since the verb packs a ton of information as suffixes, which get shortened in spoken dialects, it might be hard to notice that pattern for beginners. Extracted

]]>
https://aba-blog.xyz/tamil-101/61990ef134a827036b05069fWed, 09 Jun 2021 15:09:00 GMT

TLDR:  Spoken Tamil dialects are just fancy ways of shortening written Tamil. There is one unified written form, with well-defined grammar. But since the verb packs a ton of information as suffixes, which get shortened in spoken dialects, it might be hard to notice that pattern for beginners. Extracted from my convo with Kasun Withana.

Tolkappiyam (literally old epic), written 2200 years ago, defines the grammar of Tamil as used today. The first verse goes like "letters are 30. From a to na". So there goes the first misconception. Tamil doesn't have 247 letters. Only 30 prime ones.

12 vowels, 18 consonants. The vowels are the ones in English (a,e,i,o,u) + their longer counterparts (5) + 'ai' + 'ou'. 6 extra consonants are borrowed from Sanskrit (although Tamil is from a different family) to get fancy sounds: sa, sha, ja, ha, ksha, shree. My name has one. There goes the second misconception. Tamil does have fancy sounds.

Third one: 'ae', 'aae' vowels of Sinhala are also there. They are implicit, they arise naturally when compounding with some consonants. But f, z are not there.​

Unlike some other similarly ancient classical languages (Arabic, Chinese, Greek), the grammar of Tamil has not changed much or diverged into varieties in 2200 years. However, the alphabet has changed. Started as Tamil Brahmi (Brahmi was the common script in India), it changed gradually. After a massive reform in 1978, the letters and their combinations are very consistent now.​

The spoken dialects are based on written Tamil, whose verbs are a bit complicated. Tamil is an agglutinative language. Unlike spoken Sinhala and English, the tense (past/present/future), sex (male/female/non-conscious), voice (active/passive), number (singular/plural) are all added as suffixes to the verb. This follows a well-defined order and set of rules, and unlike English, exceptions are extremely rare. ​

For example:​

avan saappitt-aan = he ate​
avan saappittu-vitt-aan = he has eaten​
avan saappitt-irupp-aan = he would have eaten​
avan saappittu-kkond-irunth-irupp-aan = he would have been eating​
avan saappittu-kkond-irunth-irukka-maattaan = he would not have been eating​

Breaking the verb:​

saappidu = eat (base verb - command)​
-vittu- = perfect​
iru = would​
ttaan = male, past​

Fun fact: The word "செல்லாதிருப்பவர்" (cellaathiruppavar) is ranked 8th in The Most Untranslatable Word In The World. ​

The pattern of suffixes is hard to notice from spoken dialects. Tamil is harder to speak than Sinhala, precisely due to this. Spoken dialects are simply different ways of shortening written Tamil. Jaffna Tamil, SL Muslim Tamil, Kotahena slang, Madras Tamil... Tamil Nadu must have hundreds of dialects. If you speak the common written form, you'll sound like a newsreader or a modern poet. But it will be intelligible and people will love you for the effort.

Irukkinraaya? - Written = Are (you) there?​
irukkiya? - Indian / sms​
irukkiriya? - Jaffna​
eekkiyaa? - Muslim​ (Akurana)

Madras Tamil is fun. It's the most optimized form of Tamil. They condense so much information into so few syllables. ​

Iluthuththuk kondu po = pull this and go (take it away)​
isthukinupo - Madras Tamil​

So, yeah... If you would like to learn spoken Tamil, you don't need to learn the written one. But it helps to know that seemingly arbitrary changes in verbs come from well-defined grammatical rules of written Tamil. You can find a youtube tutorial video of a cute girl and learn to speak from there 😉.

]]>
<![CDATA[English Skills: Pride & Prejudice]]>Learning English is not about pride, it's just another language. I learnt it out of necessity since I was too shy to speak in Sinhala from Grade 7-ALs. But as an international language, English brings a lot of opportunities to get exposure.​

By the way, being

]]>
https://aba-blog.xyz/english-skills/619910c134a827036b0506d1Sun, 06 Jun 2021 00:00:00 GMT

Learning English is not about pride; it's just another language. I learnt it out of necessity since I was too shy to speak in Sinhala from Grade 7 up to my A/Ls. But as an international language, English brings a lot of opportunities to get exposure.

By the way, being fluent in English doesn't mean you have to let go of your first language. In my experience, my bilingual friends (and I) read a lot in both languages, watch good cinema in both languages and have an excellent grasp of the grammar and vocabulary of their native language as well. If your kid grows up speaking English only, the fault is yours, not of the kid or English. ​

To add to the list, you get to learn that:​​

• Romans had complex cities and an empire spanning Europe with drainage systems when Vijaya was landing in Sri Lanka (6th century BC). They even kept records of what happened in their courts from that time.​​
• Roman roads, designed as multi-layered, built in 300 BC are still being used. Around that time, Byzantium had multi-story apartment complexes which were rented.​​
• When Aryans were settling in Ganges valley (3000-4000 BC), Egypt and Indus valley had complex civilizations that were trading over the ocean. Egypt, Assyria and all had houses well planned and built for average citizens.

So yeah, Sri Lanka is neither the oldest nor the greatest civilization in the world. But we have a rich history of multiple cultures. We need to preserve that and do actual historical research, rather than boasting about legends. Meanwhile, we can appreciate other cultures and civilizations that are similarly great as well.

]]>
<![CDATA[Webinars: Modern C++ & Embedded]]>Following the success of SystemVerilog session, my friend and fellow junior lecturer Kithmin and few others got together to organize another Missing Semester series, for ROS. When that succeded beyond expectations, Kithmin, myself and Dr. Subodha joined hands with few final years students for Missing Semester 3: Embedded Systems

I

]]>
https://aba-blog.xyz/embedded-2021/6195a85e34a827036b050499Tue, 18 May 2021 02:24:00 GMT

Following the success of the SystemVerilog session, my friend and fellow junior lecturer Kithmin and a few others got together to organize another Missing Semester series, for ROS. When that succeeded beyond expectations, Kithmin, Dr. Subodha and I joined hands with a few final-year students for Missing Semester 3: Embedded Systems.

I taught Modern C++ (11+), its features and best practices. We then extended that series into a formal certificate course through the University of Moratuwa, and then into a wider, but shallower course for youngsters: Kickstarter on Embedded Systems.

In each of these series, I gave the introductory talk. I touched on product development as well. The audience greatly enjoyed my stories on the history of electronics and the original Skunk Works (Nighthawk and Blackbird aircraft design tradeoffs).

]]>