# Large-Scale FPGA Implementations of Machine Learning Algorithms

Philip Leong (梁恆惠) | Computer Engineering Laboratory School of Electrical and Information Engineering, The University of Sydney





# Computer Engineering Laboratory

- > Focuses on how to use parallelism to solve demanding problems
  - Novel architectures, applications and design techniques using VLSI, FPGA and parallel computing technology
- > Research
  - Reconfigurable computing
  - Machine learning
  - Nanoscale interfaces





**Initial expectation**: Heralded single-photon rate should increase significantly without degrading the coincidence-to-accidental ratio (CAR)





# Time Multiplexing of Single Photons





# Cool Transistors (0.35 µm CMOS C35B4C3)

Purposes:

- To characterize CMOS transistors
- To evaluate the matching properties of CMOS transistors
- To test analog circuits: ADC, Level Shifter, Ring Oscillator, Beta Multiplier, Passive LC circuit, Metal tracks, ...



Layout of QNL2\_CMOS

IEEE Electron Device Letters, 38:847-850, 2017



# Wide-range Threshold Voltage Model





Modified Booth Radix-4 datapath is split into 2 sections, each with its own critical path

Non-zero encodings take $\overline{K}\tau$ and zero encodings take $\tau$
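As a rough illustration of why zero encodings matter, the sketch below recodes a two's-complement multiplier into radix-4 Booth digits and applies a simple cost model. The function names, the default 8-bit width and the cost factor `k` (standing in for $\overline{K}$) are assumptions made for illustration, not details of the actual datapath.

```python
def booth_radix4_digits(y, n_bits=8):
    """Recode a signed n_bits-wide integer into radix-4 Booth digits (LSB first)."""
    assert n_bits % 2 == 0, "radix-4 recoding consumes two bits per digit"
    assert -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1)), "y out of range"
    u = y & ((1 << n_bits) - 1)            # two's-complement bit pattern
    bits = [(u >> i) & 1 for i in range(n_bits)]
    digits, prev = [], 0                   # prev is the implicit bit b[-1] = 0
    for i in range(0, n_bits, 2):
        # Each digit lies in {-2, -1, 0, 1, 2}
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def multiply_time(y, k, tau=1.0, n_bits=8):
    """Hypothetical cost model: non-zero digits cost k*tau, zero digits cost tau."""
    return sum(k * tau if d != 0 else tau for d in booth_radix4_digits(y, n_bits))


digits = booth_radix4_digits(7, n_bits=4)              # [-1, 2]
value = sum(d * 4 ** i for i, d in enumerate(digits))  # reconstructs 7
```

Under this model, multipliers whose recoding contains many zero digits finish sooner, which is the effect the two critical paths exploit.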





- FPGAs offer an opportunity to provide ML algorithms with higher throughput and lower latency through
  - Exploration: easily try different ideas to arrive at a good solution
  - Parallelism: arrive at an answer faster
  - Integration: interfaces are not a bottleneck
  - Customisation: problem-specific designs improve efficiency
- > This talk describes our work on ML implementations that use these ideas



- > Exploration (Online kernel methods)
- > Parallelisation
- Integration
- > Customisation





### Throughput and Latency

Challenging measurement and control problems are becoming feasible

- ML algorithms have improved significantly, but implementations cannot keep up with data sources, e.g. a hyperspectral imager or wireless transceiver
- Need extremely high throughput



# Improvements in throughput and latency enable new applications!

- In control applications we need low latency e.g. triggering data collection in Large Hadron Collider
- > Need very low latency



#### Kernel Methods





- Choose high dimensional feature space (so easily separable)
- Use kernel trick to avoid computing the mapping (fast)
- Do regression/classification using

$$f(x_i) = \sum_{j=1}^N \alpha_j \kappa(x_i, v_j)$$
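The prediction equation can be sketched directly in NumPy. The Gaussian kernel, the two-entry dictionary and the weights below are assumptions chosen for illustration; any kernel function with the same signature could be substituted.

```python
import numpy as np


def gaussian_kernel(x, v, sigma=1.0):
    """Gaussian similarity between an input x and a dictionary entry v."""
    diff = np.asarray(x, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def predict(x, dictionary, alpha, kernel=gaussian_kernel):
    """f(x) = sum_j alpha_j * kappa(x, v_j) over the dictionary entries v_j."""
    return sum(a * kernel(x, v) for a, v in zip(alpha, dictionary))


# Hypothetical dictionary of N = 2 vectors with weights +1 and -1
dictionary = np.array([[0.0, 0.0], [1.0, 0.0]])
alpha = np.array([1.0, -1.0])

# A point equidistant from both entries receives cancelling contributions
y_mid = predict([0.5, 0.0], dictionary, alpha)
```

The cost of one prediction is N kernel evaluations of d-dimensional vectors, which is where the O(Nd) scaling of dictionary-based methods comes from.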

# Kernel Trick



- > Kernel is a similarity function
  - defined by an implicit mapping  $\phi$ , (original space to feature space)

$$\kappa(x,x') = \phi(x)^T \phi(x') = \left\langle \phi(x), \phi(x') \right\rangle$$

- e.g. Linear kernel  $\kappa(x,x') = \langle x,x' \rangle$
- e.g. Polynomial kernel  $\kappa(x,x')=(1+\langle x,x'\rangle)^d$ ; for d=2 the homogeneous term  $\langle x,x'\rangle^2$  corresponds to  $\phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)$
- e.g. Gaussian kernel (universal approximator)  $\kappa(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ 
  - $\phi(x)$  infinite in dimension!
- Modify linear ML techniques to kernel ones by replacing dot products with the kernel function (kernel trick)
  - e.g. linear discriminant analysis, logistic regression, perceptron, SOM, K-means, PCA, ICA, LMS, RLS, …
  - While we only describe prediction here, also applied to training equations
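A small numerical check makes the kernel trick concrete. The sketch below uses the homogeneous degree-2 polynomial kernel $\langle x,x'\rangle^2$ in two dimensions, whose explicit feature map is $(x_1^2, x_2^2, \sqrt{2}x_1x_2)$; the test vectors are arbitrary choices.

```python
import numpy as np


def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])


def poly2_kernel(x, y):
    """Same similarity computed in the original space, without forming phi."""
    return float(np.dot(x, y)) ** 2


x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(y)))   # dot product in feature space
implicit = poly2_kernel(x, y)              # kernel trick
```

Both routes give the same value, but the kernel evaluation never builds the feature vector, which is what makes very high (or infinite) dimensional feature spaces usable.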

### **Online Kernel Methods**





> "Kernel method"  $\rightarrow \kappa(x, x')$  with implicit map  $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^D$ , where  $D \gg d$ 

- > Dictionary  $\rightarrow$  subset of the input data of length N
- Computation and Memory scale O(Nd)
- > BUT... N scales linearly with the dataset size



#### Random Approximation (Rahimi and Recht, '07)

#### **Exact Kernel Methods**

$$f(x) = \sum_{i=1}^{N} \alpha_i \kappa(x, d_i)$$

#### **Random Kernel Expansion**

$$f(x) = \sum_{i=1}^{n} \alpha_i z_i(x)$$
$$z(x) = \frac{1}{\sqrt{n}} \cos(\mathbf{W}x)$$

\*\* Only for shift-invariant kernels, $\kappa(x,x') = \kappa(x-x',0)$

#### **Define z(x):**

- Approximates $\kappa(x, x')$
- Matrix-vector product (MV) + non-linear activation (i.e. like a multilayer perceptron)
- W is **fixed** and **random**
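The approximation can be tried numerically. The sketch below assumes the commonly used random Fourier feature form for the Gaussian kernel, with Gaussian-distributed rows of W, a uniform random phase `b` and a $\sqrt{2/n}$ scale; the phase term and scale are assumptions that differ slightly from the compact form shown on the slide, and the names `make_rff`, `sigma` and the test vectors are illustrative.

```python
import numpy as np


def make_rff(d, n, sigma, rng):
    """Return z(x), a fixed random feature map approximating a Gaussian kernel."""
    W = rng.normal(0.0, 1.0 / sigma, size=(n, d))   # fixed random projection
    b = rng.uniform(0.0, 2.0 * np.pi, size=n)       # fixed random phases

    def z(x):
        # Matrix-vector product followed by a non-linear activation
        return np.sqrt(2.0 / n) * np.cos(W @ x + b)

    return z


rng = np.random.default_rng(0)
z = make_rff(d=4, n=4096, sigma=1.0, rng=rng)

x = np.array([0.1, -0.2, 0.3, 0.0])
y = np.array([0.0, 0.1, 0.2, -0.1])

approx = float(z(x) @ z(y))                            # inner product of features
exact = float(np.exp(-np.sum((x - y) ** 2) / 2.0))     # Gaussian kernel, sigma = 1
```

Because W is fixed, only the weights $\alpha$ need to be learned, and the model size depends on n rather than on the number of training samples.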





Computes z(x) efficiently by replacing $\mathbf{W}x$ with combinations of random diagonal matrices and Hadamard transforms

$$z(x) = \frac{1}{\sqrt{n}} \cos(Vx), \quad \text{where } Vx = [Q_1 x, Q_2 x, \cdots, Q_h x]$$

$$Q_j x = S H G P H B x$$

\*\* Each $Q_j x$ is an independent $d \times d$ transform
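One block $Q_j x$ can be sketched with a fast Walsh-Hadamard transform. The sketch assumes that S, G and B are diagonal (stored as vectors), that P is a permutation and that H is the unnormalised Hadamard matrix; the random distributions used for S and the helper names are placeholders for illustration only.

```python
import numpy as np


def fwht(v):
    """Unnormalised fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = np.array(v, dtype=float)
    n = v.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b
            v[i + h:i + 2 * h] = a - b
        h *= 2
    return v


def fastfood_block(x, S, G, perm, B):
    """Compute S H G P H B x in O(d log d) without forming any d x d matrix."""
    return S * fwht(G * fwht(B * x)[perm])


def hadamard(d):
    """Dense Hadamard matrix, used only to check the fast version."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H


rng = np.random.default_rng(0)
d = 8
B = rng.choice([-1.0, 1.0], size=d)      # random signs
perm = rng.permutation(d)                # random permutation P
G = rng.normal(size=d)                   # random Gaussian diagonal
S = np.abs(rng.normal(size=d))           # placeholder scaling diagonal
x = rng.normal(size=d)

fast = fastfood_block(x, S, G, perm, B)
H = hadamard(d)
dense = np.diag(S) @ H @ np.diag(G) @ np.eye(d)[perm] @ H @ np.diag(B) @ x
```

Only the diagonal entries and the permutation need to be stored, so memory drops from $d^2$ to a few multiples of d per block, which suits on-chip implementation.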





# Systolic Array Architecture

- $\mathbf{V}\mathbf{x} = [\mathbf{Q}_1 \mathbf{x}, \mathbf{Q}_2 \mathbf{x}, \cdots, \mathbf{Q}_h \mathbf{x}]$
- > Block of **b** PEs (i.e.  $Q_q x$ )
- > General PE: 18-bit ALU, RAMs, Control Unit, LFSR





| Implementation            | dim. | n     | Bit width | Latency (cycles) | Fmax (MHz) | Exec (ns) | Throughput (Gb/s) |
|---------------------------|------|-------|-----------|------------------|------------|-----------|-------------------|
| NORMA, braiding (V7, '15) | 8    | 200   | 18        | 10               | 127        | 7.87      | 18.3              |
| KNLMS (V7, '15)           | 8    | 16    | 32        | 207              | 314        | 3.18      | 80.4              |
| CPU (Le et al., '13)      | 1024 | 16.4k | 32        |                  |            | 58e4      | 0.06              |
| FASTFOOD (V7)             | 1024 | 16.4k | 18        | 1893             | 432        | 23.7e3    | 7.77              |
| FASTFOOD (KU035)          | 8192 | 90.1k | 18        | 16930            | 508        | 17.2e3    | 8.57              |

- > Supports much larger problems
- > High speed design
- > 245x speed-up over a CPU



- > Exploration
- > Parallelisation (Low Precision Neural Network)
- Integration
- > Customisation





# Inference with Convolutional Neural Networks

Slides from Yaman Umuroglu et al., "FINN: A framework for fast, scalable binarized neural network inference," FPGA'17





### **Binarized Neural Networks**

- > The extreme case of quantization
  - Permit only two values: +1 and -1
  - Binary weights, binary activations
  - Trained from scratch, not truncated FP
- > Courbariaux and Hubara et al. (NIPS 2016)
  - Competitive results on three smaller benchmarks
  - Open source training flow
  - Standard "deep learning" layers
    - Convolutions, max pooling, batch norm, fully connected...

[Figure: sample images from the benchmark datasets, including handwritten digits and CIFAR-10 classes (cat, deer, dog, frog, horse, ship)]

|                              | MNIST  | SVHN   | CIFAR-10 |
|------------------------------|--------|--------|----------|
| Binary weights & activations | 0.96%  | 2.53%  | 10.15%   |
| FP weights & activations     | 0.94%  | 1.69%  | 7.62%    |
| BNN accuracy loss            | -0.02% | -0.84% | -2.53%   |

% classification error (lower is better)



# Advantages of BNNs

#### Vivado HLS estimates on Xilinx UltraScale+ MPSoC ZU19EG

- > Much smaller datapaths
  - Multiply becomes XNOR, addition becomes popcount
  - No DSPs needed, everything in LUTs
  - Lower cost per op = more ops every cycle
- > Much smaller weights
  - Large networks can fit entirely into onchip memory (OCM)
  - More bandwidth, less energy compared to off-chip
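The datapath simplification can be sketched in software: with +1 encoded as bit 1 and -1 as bit 0, a dot product of two n-element binary vectors equals 2 * popcount(XNOR(a, b)) - n. The bit encoding and helper names below are assumptions chosen for illustration.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two packed n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    matches = bin(~(a_word ^ b_word) & mask).count("1")   # positions where signs agree
    return 2 * matches - n                                # agreements minus disagreements


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]

reference = sum(x * y for x, y in zip(a, b))              # ordinary multiply-accumulate
result = xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a))
```

In hardware the XNOR and popcount map onto LUTs, which is why no DSP blocks are needed.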

| Precision | Peak TOPS | On-chip weights |
|-----------|-----------|-----------------|
| 1b        | ~66       | ~70 M           |
| 8b        | ~4        | ~10 M           |
| 16b       | ~1        | ~5 M            |
| 32b       | ~0.3      | ~2 M            |

> fast inference with large BNNs



# Comparison

| Source     | Benchmark, design     | Accuracy | FPS     | Power (chip) | Power (wall) | kFPS / Watt (chip) | kFPS / Watt (wall) | Precision |
|------------|-----------------------|----------|---------|--------------|--------------|--------------------|--------------------|-----------|
| FINN       | MNIST, SFC-max        | 95.8%    | 12.3 M  | 7.3 W        | 21.2 W       | 1693               | 583                | 1         |
| FINN       | MNIST, LFC-max        | 98.4%    | 1.5 M   | 8.8 W        | 22.6 W       | 177                | 69                 | 1         |
| FINN       | CIFAR-10, CNV-max     | 80.1%    | 21.9 k  | 3.6 W        | 11.7 W       | 6                  | 2                  | 1         |
| FINN       | SVHN, CNV-max         | 94.9%    | 21.9 k  | 3.6 W        | 11.7 W       | 6                  | 2                  | 1         |
| Prior work | MNIST, Alemdar et al. | 97.8%    | 255.1 k | 0.3 W        | -            | 806                | -                  | 2         |
| Prior work | CIFAR-10, TrueNorth   | 83.4%    | 1.2 k   | 0.2 W        | -            | 6                  | -                  | 1         |
| Prior work | SVHN, TrueNorth       | 96.7%    | 2.5 k   | 0.3 W        | -            | 10                 | -                  | 1         |

- Max accuracy loss: ~3%
- 10 – 100x better performance
- CIFAR-10/SVHN energy efficiency comparable to TrueNorth ASIC



#### **Issues with Low-Precision**

- > Who would be willing to incur a loss in accuracy?
- > Can we get better accuracy with a little more hardware?



# SYQ Quantisation

To compute quantised weights from FP weights

$$\boldsymbol{Q}_l = sign(\boldsymbol{W}_l) \odot \boldsymbol{M}_l$$

with,

$$M_{l_{i,j}} = \begin{cases} 1 & \text{if} \quad \left| W_{l_{i,j}} \right| \ge \eta_l \\ 0 & \text{if} \quad -\eta_l < W_{l_{i,j}} < \eta_l \end{cases}$$

$$sign(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

where *M* represents a masking matrix,  $\eta$  is the quantization threshold hyperparameter (0 for binarised)
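The quantisation rule above translates almost line for line into NumPy. The weight values and the threshold of 0.25 below are arbitrary illustrative choices.

```python
import numpy as np


def syq_quantise(W, eta):
    """Q = sign(W) * M, where M masks out weights with |W| < eta."""
    W = np.asarray(W, dtype=float)
    sign = np.where(W >= 0, 1.0, -1.0)          # sign(0) is defined as +1
    mask = (np.abs(W) >= eta).astype(float)     # 1 where |W| >= eta, else 0
    return sign * mask


W = np.array([[0.7, -0.05, 0.0],
              [-0.3, 0.2, -0.9]])

Q_ternary = syq_quantise(W, eta=0.25)   # small weights are zeroed
Q_binary = syq_quantise(W, eta=0.0)     # eta = 0 keeps every sign
```

With `eta=0` the mask is all ones and the result is a purely binary weight matrix, matching the "0 for binarised" note.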



- Make approximation  $W_l \approx \alpha_l Q_l, Q_l \in C$
- C is the codebook,  $C \in \{C_1, C_2, \ldots\}$  e.g.  $C = \{-1, +1\}$  for binary,  $C = \{-1, 0, +1\}$  for ternary
- A diagonal matrix  $\alpha_l$  is defined by the vector  $\alpha_l = [\alpha_l^1, ..., \alpha_l^m]$ :

$$\alpha_l = \operatorname{diag}(\alpha_l^1, \ldots, \alpha_l^m) := \begin{bmatrix} \alpha_l^{1} & 0 & \cdots & 0 \\ 0 & \alpha_l^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \alpha_l^{m} \end{bmatrix}$$

- Train by solving  $\alpha_{l} = \operatorname*{argmin}_{\alpha} E(\alpha, \mathbf{Q}) \quad s.t. \quad \alpha \ge 0, \ \mathbf{Q}_{l_{i,j}} \in C$ 





> More fine-grained quantisation can improve approximation of weights
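The claim can be illustrated with a small experiment comparing one scaling coefficient per layer against one per row. The sketch assumes a least-squares fit of the coefficients for a fixed binary Q, which is a simplification for illustration and not the training procedure; the weight matrix is synthetic, with rows of deliberately different magnitude.

```python
import numpy as np


def ls_scale(W, Q):
    """Least-squares scalar alpha minimising ||W - alpha * Q||^2 for fixed Q."""
    denom = np.sum(Q * Q)
    return float(np.sum(W * Q) / denom) if denom > 0 else 0.0


rng = np.random.default_rng(0)
# Synthetic 3x3 filter whose rows have very different magnitudes
W = rng.normal(size=(3, 3)) * np.array([[0.1], [1.0], [3.0]])
Q = np.where(W >= 0, 1.0, -1.0)                 # binary codebook {-1, +1}

# Layer-wise: a single coefficient for the whole matrix
alpha_layer = ls_scale(W, Q)
err_layer = float(np.sum((W - alpha_layer * Q) ** 2))

# Row-wise: one coefficient per row
alpha_rows = np.array([ls_scale(W[r], Q[r]) for r in range(W.shape[0])])
err_rows = float(np.sum((W - alpha_rows[:, None] * Q) ** 2))
```

The row-wise error can never exceed the layer-wise error, because using the layer-wise coefficient in every row is one of the choices available to the row-wise fit.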





For $K \times K$ filters, $I$ input feature maps of dimension $F \times F$ and $N$ output feature maps: $P = K^2 I N F^2$

| Method             | Scalars    | Ops |
|--------------------|------------|-----|
| Layer (DoReFa)     | 1          | P   |
| Row (SYQ)          | $K$        | P   |
| Pixel (SYQ)        | $K^2$      | P   |
| Asymmetric (TTQ)   | 2          | P+Z |
| Grouping (FGQ)     | $K^{2}N/4$ | P   |
| Channel (HWGQ/BWN) | $N$        | P   |

[Figure: MAC tree with scaling coefficient multiply, activation and accumulator stages]



 Full precision for 1<sup>st</sup> and last layers, CONV layers pixel-wise, FC layerwise

| Model     | Metric | 1-8  | 2-8  | Baseline | Reference |
|-----------|--------|------|------|----------|-----------|
| AlexNet   | Top-1  | 56.6 | 58.1 | 56.6     | 57.1      |
| AlexNet   | Top-5  | 79.4 | 80.8 | 80.2     | 80.2      |
| VGG       | Top-1  | 66.2 | 68.7 | 69.4     | -         |
| VGG       | Top-5  | 87.0 | 88.5 | 89.1     | -         |
| ResNet-18 | Top-1  | 62.9 | 67.7 | 69.1     | 69.6      |
| ResNet-18 | Top-5  | 84.6 | 87.8 | 89.0     | 89.2      |
| ResNet-34 | Top-1  | 67.0 | 70.8 | 71.3     | 73.3      |
| ResNet-34 | Top-5  | 87.6 | 89.8 | 89.1     | 91.3      |
| ResNet-50 | Top-1  | 70.6 | 72.3 | 76.0     | 76.0      |
| ResNet-50 | Top-5  | 89.6 | 90.9 | 93.0     | 93.0      |
Baseline is floating-point; references are https://github.com/facebook/fb.resnet.torch (ResNet) and https://github.com/BVLC/caffe (AlexNet)



# Results (AlexNet)

| Model           | Weights | Act. | Top-1 | Top-5 |
|-----------------|---------|------|-------|-------|
| DoReFa-Net [33] | 1       | 2    | 49.8  | -     |
| QNN [15]        | 1       | 2    | 51.0  | 73.7  |
| HWGQ [2]        | 1       | 2    | 52.7  | 76.3  |
| SYQ             | 1       | 2    | 55.4  | 78.6  |
| DoReFa-Net [33] | 1       | 4    | 53.0  | -     |
| SYQ             | 1       | 4    | 56.2  | 79.4  |
| BWN [24]        | 1       | 32   | 56.8  | 79.4  |
| SYQ             | 1       | 8    | 56.6  | 79.4  |
| SYQ             | 2       | 2    | 55.8  | 79.2  |
| FGQ [21]        | 2       | 8    | 49.04 | -     |
| TTQ [34]        | 2       | 32   | 57.5  | 79.7  |
| SYQ             | 2       | 8    | 58.1  | 80.8  |

# Results (ResNet)

ResNet-18

| Model    | Weights | Act. | Top-1 | Top-5 |
|----------|---------|------|-------|-------|
| BWN [24] | 1       | 32   | 60.8  | 83.0  |
| SYQ      | 1       | 8    | 62.9  | 84.6  |
| TWN [19] | 2       | 32   | 65.3  | 86.2  |
| INQ [32] | 2       | 32   | 66.0  | 87.1  |
| TTQ [34] | 2       | 32   | 66.6  | 87.2  |
| SYQ      | 2       | 8    | 67.7  | 87.8  |

ResNet-50

| Model    | Weights | Act. | Top-1 | Top-5 |
|----------|---------|------|-------|-------|
| HWGQ [2] | 1       | 2    | 64.6  | 85.9  |
| SYQ      | 1       | 4    | 68.8  | 88.7  |
| SYQ      | 1       | 8    | 70.6  | 89.6  |
| FGQ [21] | 2       | 4    | 68.4  | -     |
| SYQ      | 2       | 4    | 70.9  | 90.2  |
| FGQ [21] | 2       | 8    | 70.8  | -     |
| SYQ      | 2       | 8    | 72.3  | 90.9  |



- > Exploration
- > Parallelisation
- Integration (radio frequency machine learning)
- > Customisation





# Radio Frequency Machine Learning

- Processing radio frequency signals remains a challenge
  - high bandwidth and low latency difficult to achieve
- Autoencoder to do anomaly detection





#### Autoencoder

#### Train so $\tilde{x} \approx x$ (done in an unsupervised manner)





- > Anomaly if distance between autoencoder output and input large
- > FPGA has sufficiently high performance to process each sample of waveform at 200 MHz!
  - This minimises latency and maximises throughput
  - Weights are trained on a microprocessor (µP) and updated on the FPGA without affecting inference
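The detection rule can be sketched with a stand-in model. The example below swaps the trained neural-network autoencoder for a linear autoencoder obtained from an SVD, purely to keep the sketch short and deterministic; the synthetic data, the 2-D subspace, the threshold of 1.0 and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 8))                # hypothetical 2-D signal subspace in R^8
train = rng.normal(size=(200, 2)) @ basis      # "normal" training samples

# Linear autoencoder stand-in: top-2 right singular vectors act as encoder and decoder
_, _, Vt = np.linalg.svd(train, full_matrices=False)
encoder = Vt[:2]                               # shape (2, 8): compress to 2 values
decoder = Vt[:2].T                             # shape (8, 2): reconstruct 8 values


def anomaly_score(x):
    """L2 distance between the input and its reconstruction."""
    x_hat = decoder @ (encoder @ x)
    return float(np.linalg.norm(x - x_hat))


def is_anomaly(x, threshold=1.0):
    """Flag the sample when the reconstruction error exceeds the threshold."""
    return anomaly_score(x) > threshold


normal = np.array([0.5, -1.0]) @ basis         # lies in the learned subspace
anomalous = normal + 3.0 * Vt[2]               # adds a component the model never saw
```

Samples resembling the training data reconstruct almost perfectly, while the anomalous sample leaves a residual of norm 3, which the threshold stage flags.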





# Software Defined Radio Architecture

#### Implemented on Ettus X310 platform



#### Example







# Performance (XC7K410T)

#### Typical SDR latency >> 1 ms

| Module             | II (cycles) | Latency (cycles) | BRAM | DSP  | FF     | LUT   |
|--------------------|-------------|------------------|------|------|--------|-------|
| Windower           | 1   | 0                   | 0    | 0    | 1511   | 996   |
| FFT                | 1   | 8                   | 0    | 48   | 4698   | 2796  |
| NN                 | 1   | 17                  | 4    | 1280 | 213436 | 13044 |
| $L_2$ -Norm        | 1   | 4                   | 0    | 32   | 1482   | 873   |
| Thres              | 1   | 0                   | 0    | 0    | 3      | 21    |
| Weight Update      | 258 | 257                 | 0    | 0    | 21955  | 4528  |
| Inference (FFT+NN) | 1   | 37                  | 1068 | 1360 | 241522 | 45448 |
| Inference (NN)     | 1   | 29                  | 1068 | 1312 | 236824 | 42652 |
| Total              | N/A | N/A                 | 1068 | 1360 | 263477 | 49976 |
| Total Util.        | N/A | N/A                 | 67%  | 88%  | 51%    | 19%   |

| Operation         | Throughput | Latency |
|-------------------|------------|---------|
| Inference(FFT+NN) | 5ns      | 185ns      |
| Inference(NN)     | 5ns      | 105ns      |
| Weight Update     | 1290ns   | 1285ns     |
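The timing figures follow from cycle counts at the 200 MHz sample clock mentioned earlier. The helper below is a minimal sketch of that conversion; the function name is illustrative, and the 258-cycle figure is read from the first numeric column of the Weight Update row on the assumption that it is the interval between successive updates.

```python
def cycles_to_ns(cycles, clock_mhz=200.0):
    """Convert a cycle count to nanoseconds for a given clock frequency in MHz."""
    return cycles * 1e3 / clock_mhz


inference_fft_nn_ns = cycles_to_ns(37)          # Inference (FFT+NN) latency
weight_update_latency_ns = cycles_to_ns(257)    # Weight Update latency
weight_update_interval_ns = cycles_to_ns(258)   # time between weight updates
```

At 200 MHz one cycle is 5 ns, so a module that accepts a new sample every cycle has the 5 ns throughput figure shown for inference.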



- > Exploration
- > Parallelisation
- Integration
- > Customisation (Matrix Multiplication on Intel Harp v2)





#### Problem

- $C = \alpha \cdot \mathrm{op}(A) \cdot \mathrm{op}(B) + \beta \cdot C$
- > Xeon+FPGA
- Simple, software-based interface
- Extensions to efficiently support Machine Learning
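For reference, the operation being accelerated is the standard general matrix multiply. The NumPy sketch below takes op() to be an optional transpose, as in BLAS-style GEMM; the function signature and test matrices are assumptions for illustration and do not represent the accelerator's software interface.

```python
import numpy as np


def gemm(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    """Reference C = alpha * op(A) @ op(B) + beta * C, with op() an optional transpose."""
    op_a = A.T if trans_a else A
    op_b = B.T if trans_b else B
    return alpha * (op_a @ op_b) + beta * C


A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
C = np.ones((2, 2))

out = gemm(2.0, A, B, 0.5, C, trans_a=True)
```

A software model like this is useful as a golden reference when checking results returned by a hardware implementation.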

























FP32, INT16, INT8, INT4, Ternary, Binary

#### Feeder Blocks







### Memory Interleaving

[Figure: 4x4 blocks of A ($a_{00}$ to $a_{33}$) and B ($b_{00}$ to $b_{33}$) zero-padded to 6x6 (top), compared with the same blocks stored without padding (bottom)]

Inefficient with Static Partitioning

Fewer Computations Increase Bandwidth



# Memory Sharing between Feeder A and Feeder B

[Figure: a 5x4 block of A ($a_{00}$ to $a_{43}$) and a 4x1 block of B ($b_{00}$ to $b_{30}$), each shown zero-padded (top) and without padding (bottom)]

Inefficient with Static Partitioning

Minimising Bandwidth Increase Maximum block size restriction

# Training a Binarised Neural Network



| Layer | Type (Forward) | Type (Backward) |
|-------|----------------|-----------------|
| conv  | BINxBIN        | FPxBIN          |
| c&r   | INT            | STE             |
| relu  | INT            | FP              |
| norm  | FP             | FP              |
| pool  | FP             | FP              |
| fc    | FPxFP          | FPxFP           |
| prob  | FP             | FP              |

STE=straight through estimator
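A straight-through estimator can be sketched in a few lines. The version below assumes the common formulation from the BNN literature, in which the forward pass takes the sign and the backward pass lets the gradient through unchanged where $|x| \le 1$ and blocks it elsewhere; whether the design here uses exactly this variant is an assumption.

```python
import numpy as np


def binarise_forward(x):
    """Forward pass: quantise to +1/-1 (the sign function has zero gradient almost everywhere)."""
    return np.where(x >= 0, 1.0, -1.0)


def binarise_backward(x, grad_out):
    """Backward pass (STE): treat the quantiser as identity where |x| <= 1."""
    return grad_out * (np.abs(x) <= 1.0)


x = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
y = binarise_forward(x)                         # [-1, -1, 1, 1, 1]
g = binarise_backward(x, np.ones_like(x))       # [0, 1, 1, 1, 0]
```

This is why the backward pass in the table needs floating-point arithmetic even though the forward pass is binary or integer.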

#### **Optimisation: Dynamic Dot Product**





Supports two different precisions to avoid reconfiguration at runtime



#### **Optimisation: Fused Operations**





#### Measured Peak Performance





#### GEMM Memory Interleaving Results





#### Heterogeneous Load Balancing





#### Neural Network Memory Interleaving Results





#### **BNN Inference Performance**

| Network | FPGA TOPs | FPGA GOPs/W | FPGA IPS | FPGA IPS/W | GPU TOPs | GPU GOPs/W | GPU IPS | GPU IPS/W |
|---------|-----------|-------------|----------|------------|----------|------------|---------|-----------|
| AlexNet | 31.54     | 657.27      | 1610     | 33.54      | 37.60    | 568.09     | 1626    | 25.02     |
| VGGNet  | 31.18     | 649.67      | 114      | 2.39       | 35.85    | 522.59     | 121     | 1.78      |

|                      | [26]       | [11]             | [21]            | Our Work        |
|----------------------|------------|------------------|-----------------|-----------------|
| Platform             | Zynq z7045 | Kintex US KU115  | Arria 10 GX1150 | Arria 10 GX1150 |
| Logic Elements (LEs) | 350K       | 1,451K           | 1,150K          | 1,150K          |
| Power (W)            | 11.3       | 41               | -               | 48              |
| TOPs (Peak)          | 11.612     | 14.8             | 25              | 40.77           |
| MOPs / LE            | 33.17      | 10.19            | -               | 35.45           |
| GOPs / Watt          | 1027.68    | 360.97           | -               | 849.38          |



- > Exploration (Online kernel methods)
- > Parallelisation
- Integration
- > Customisation



## Conclusion



#### Exploration

 Kernel methods optimised using different algorithms, mathematical techniques, computer architectures, arithmetic

#### Parallelism

- Increase parallelism by reducing precision
- Keep weights on-chip to devote more hardware to arithmetic

#### Integration

 In radio frequency, this allows latency to be reduced by 4 orders of magnitude

#### Customisation

 Supplement conventional matrix multiplication to support DNN implementation

- FPGAs can greatly assist with the implementation of intelligent sensing
  - > Learning & inference at 70 Gbps
  - Learning & inference with 100 ns latency
  - > Image processing @ 12.3 Mfps
  - > Multimodal measurements
- > Radio frequency anomaly detector
  - We are using this to predict physical and media access layer protocols
  - Could also be used as a novel diagnostic instrument - monitor RF output of electronic equipment, detect anomalies



Thank you!



