Neural Chip Design [1/4: Overview]

Technical Projects Jan 28, 2022

Kraken Engine

In March 2020, I started building Kraken as a personal, passion project, an engine capable of accelerating any kind of Deep Neural Network (DNN) with convolutional layers, fully-connected layers, and matrix products using a single uniform dataflow pattern while consuming remarkably low on-chip area. I synthesized it in 65-nm CMOS technology at 400 MHz, packing 672 MACs in 7.3 mm2, with a peak performance of 537.6 Gops.

When benchmarked on AlexNet [336.6 fps], VGG16 [17.5 fps], and ResNet-50 [64.2 fps] (yeah, those are old, but those are the widely used benchmarks), it outperforms the state-of-the-art ASIC architectures in terms of overall performance efficiency, DRAM accesses, arithmetic intensity, and throughput, with 5.8x more Gops/mm2 and 1.6x more Gops/W.

I submitted the design as a  journal paper to IEEE TCAS-1, the #3 journal in the field, and it is currently under review. You can find the paper at the following link.

Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks
Deep neural networks (DNNs) have been successfully employed in a multitude ofapplications with remarkable performance. As such performance is achieved at asignificant computational cost, several embedded applications demand fast andefficient hardware accelerators for DNNs. Previously proposed app…

Technologies Used

Throughout the project, I wrote code, moved back and forth, and interfaced between several technologies, from different domains such as hardware (RTL), software, and machine vision.

  • Python - Numpy, Tensorflow, PyTorch
  • SystemVerilog - RTL, Testbenches
  • TCL - Scripting the Vivado project, ASIC synthesis
  • C++ - Firmware to control the system-on-chip
  • Tools: Xilinx (Vivado, SDK), Cadence (Genus, Innovus)...

My Workflow

Through the Kraken project, I developed a workflow of 15 steps, which helps me to move between golden models in python, RTL designs & simulations in SystemVerilog, and firmware in C++.  I have written them in detail as three more blog posts with code examples.

2/4: Golden Model

  1. PyTorch/TensorFlow: Explore DNN models, quantize & extract weights
  2. Golden Model in Python (NumPy stack): Custom OOP framework, process the weights, convert to custom datatypes
Neural Chip Design [2/4: Golden Model]
I wrote golden models in python to model the intended behavior of the chip. This outlines how I got pretrained models, extracted weights, applied my own quantization...etc.

3/4: RTL Design & Verification

  1. Whiteboard: Design hardware modules, state machines
  2. RTL Design: SystemVerilog/Verilog for the whiteboard designs
  3. Generate Test Vectors: using Python Notebooks
  4. Testbenches: SystemVerilog OOP testbenches to read the input vector (txt file), randomly control the valid & ready signals and get output vectors (txt files)
  5. Debug: Python notebooks to compare the expected output with simulation output and to find which dimensions have errors.
  6. Microsoft Excel: I manually simulate the values in wires with excel to debug
  7. Repeat 3-8: For every module & every level of integration
  8. ASIC Synthesis
Neural Chip Design [3/4: Digital Design & Verification]
How I designed modules in the whiteboard, then wrote SystemVerilog RTL, built testbenches, prepared test vectors, and debugged them.

4/4: System-on-Chip Integration & Firmware Development

  1. SoC Block Design: Build FPGA projects with Vivado manually and synthesize
  2. Automation: TCL scripts to automate the project building and configuration
  3. C++ Firmware: To control the custom modules
  4. Hardware Verification: Test on FPGA, compare output to golden model
  5. Repeat 11-14
Neural Chip Design [4/4: SoC Integration & Firmware]
SoC design involves integrating other IPs, automating projects with TCL, writing C++ firmware, and testing on PSoC hardware.

Directory Structure

Simpler FPGA/ASIC projects can be done within a single folder. However, as the scope and complexity of the project grow, the need to work with multiple languages (Python, SystemVerilog, C, TCL...)  and tools (Jupyter, Vivado, SDK, Genus, Innovus...) can grow out of control.

The project needs to be version-controlled (git) as well, to prevent data loss and to move between different stages of development like a time machine. However, FPGA and ASIC tools create a lot of internal files, which do not generalize between machines. Therefore, the building of such projects is automated via TCL scripts, and only such scripts and the source files are git-tracked.

The following is the structure I developed through my Kraken project.

├── hdl
│   ├── src        : rtl designs (SystemVerilog/Verilog)
│   ├── tb         : SystemVerilog testbenches
│   ├── include    : V/SV files with macros
│   └── external   : open-source SV/V libraries
├── fpga
│   ├── scripts    : TCL scripts to build & configure Vivado projects from source
│   ├── projects   : Vivado projects [not git tracked]
│   └── wave       : waveform scripts (wcfg)
├── asic
│   ├── scripts    : TCL scripts for synth, p&r from source
│   ├── reports    : reports from asic tools
│   ├── work       : working folder for ASIC tools, [not git tracked]
│   ├── log        : [not git tracked]
│   └── pdk        : technology node files, several GBs [not git tracked]
├── python
│   ├── dnns       : TensorFlow, Torch, TfLite extraction
│   ├── framework  : Custom framework
│   └── golden     : Golden models
├── data           : [not git tracked]
│   ├── input      : input test vectors, text files, generated by python scripts
│   ├── output_exp : expected output vectors
│   ├── output_sim : output vectors from hardware simulation
│   └── output_fpga: output vectors from FPGA
├── cpp
│   ├── src        : C++ firmware for the controller
│   ├── include    : header files
│   └── external   : external libraries
└── doc            : documentation: excel files, drawings...


Neural Chip Design [2/4: Golden Model]
I wrote golden models in python to model the intended behavior of the chip. This outlines how I got pretrained models, extracted weights, applied my own quantization...etc.