Added docs, authored by Gregory Ling
The 5x8x9 MAC unit produces the sum of five 8-bit by 9-bit multiplies, taking five 8-bit inputs and five 9-bit inputs, in a single cycle (<50ns). While designing it, I found the Array Adder Libero block in the SmartGen Cores Reference Guide. It takes a configurable number of inputs of a configurable bit length and outputs two partial sums, A and B, which produce the final sum of all inputs when added together. The reference guide mentions that this array adder has significantly higher performance than multiple standard adders, and that, where possible, a design should not perform the final add until the end of the calculation.
The basic multiply used in this design is a simple conditional shift-and-sum built from one 8x16-bit array adder. Take 8 copies of the 9-bit input, one per bit of the 8-bit input. If the corresponding bit of the 8-bit input is 1, pass the 9-bit input shifted left by that bit's place value into the matching array adder entry; otherwise pass 0:
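To make the "two partial sums" idea concrete, here is a minimal Python sketch of carry-save reduction, which is one common way to reduce many operands to two values whose sum is the total. Whether the Libero Array Adder core is built this way is an assumption on my part, not something stated in the reference guide, and the function name `carry_save_reduce` is hypothetical.

```python
def carry_save_reduce(operands):
    """Reduce non-negative integers to two partial sums (a, b) with a + b == sum(operands).

    Assumption: this models a generic carry-save adder tree, not the actual
    internals of the Libero Array Adder core.
    """
    values = list(operands)
    while len(values) > 2:
        x, y, z = values.pop(), values.pop(), values.pop()
        partial_sum = x ^ y ^ z                      # per-bit sum, carries ignored
        carry = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carries, moved up one place
        values.extend([partial_sum, carry])          # three operands become two
    while len(values) < 2:
        values.append(0)                             # pad so two values are always returned
    return values[0], values[1]


a, b = carry_save_reduce([337, 338, 339, 340, 341])
# Only this last step needs a full carry-propagating add.
final_sum = a + b
```

No carry has to ripple across the word until the very last add, which is the usual reason for deferring the final add as long as possible.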
```
         11010010
x       101010100
-----------------
        000000000
       101010100
      000000000
     000000000
    101010100
   000000000
  101010100
 101010100
-----------------
```
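The conditional shift-and-sum above can be sketched in Python as follows. This is only a behavioural model under the assumption that plain integer addition stands in for the array adder; the function name `shift_and_sum_rows` is hypothetical.

```python
def shift_and_sum_rows(a8, b9):
    """Return the 8 adder operands for an 8-bit by 9-bit multiply.

    Operand k is b9 shifted left by k if bit k of a8 is set, otherwise 0.
    """
    assert 0 <= a8 < 2**8 and 0 <= b9 < 2**9
    return [(b9 << bit) if (a8 >> bit) & 1 else 0 for bit in range(8)]


# Same operands as the worked example above.
rows = shift_and_sum_rows(0b11010010, 0b101010100)

# Plain summation stands in for the array adder plus the final add.
product = sum(rows)
```

Every operand fits in 16 bits (a 9-bit value shifted by at most 7), while the summed product can need 17 bits.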
I also noticed that the result of an 8x9 multiply is 17 bits long. To perform five 8x9 multiplies this way, I would need five 8x16-bit array adders (one shift-and-sum multiply each), a 10x17-bit array adder to add the two partial sums from each of the 5 multipliers, and a final adder to sum the two outputs of that last array adder.
```
         00000001            00000010            00000011            00000100            00000101
x       101010001   x       101010010   x       101010011   x       101010100   x       101010101
-----------------   -----------------   -----------------   -----------------   -----------------
        101010001           000000000           101010011           000000000           101010101
       000000000           101010010           101010011           000000000           000000000
      000000000           000000000           000000000           101010100           101010101
     000000000           000000000           000000000           000000000           000000000
    000000000           000000000           000000000           000000000           000000000
   000000000           000000000           000000000           000000000           000000000
  000000000           000000000           000000000           000000000           000000000
 000000000           000000000           000000000           000000000           000000000
-----------------   -----------------   -----------------   -----------------   -----------------
00000000101010001   00000001010100100   00000001111111001   00000010101010000   00000011010101001

     00000000101010001
     00000001010100100
     00000001111111001
     00000010101010000
+    00000011010101001
----------------------
  00000001001111100111
```
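As a cross-check of the example above, this Python sketch models the multiply-then-add structure with ordinary integer arithmetic. The helper names are hypothetical, and full adds are used in place of the array adders' partial sums.

```python
# Operands taken from the worked example above.
A_INPUTS = [0b00000001, 0b00000010, 0b00000011, 0b00000100, 0b00000101]
B_INPUTS = [0b101010001, 0b101010010, 0b101010011, 0b101010100, 0b101010101]


def multiply_shift_and_sum(a8, b9):
    """One multiplier: sum the 8 conditionally shifted copies of b9."""
    rows = [(b9 << bit) if (a8 >> bit) & 1 else 0 for bit in range(8)]
    return sum(rows)


def mac_multiply_then_add(a_inputs, b_inputs):
    """Five complete multiplies first, then one add across the five products."""
    products = [multiply_shift_and_sum(a, b) for a, b in zip(a_inputs, b_inputs)]
    return products, sum(products)


products, total = mac_multiply_then_add(A_INPUTS, B_INPUTS)
print([format(p, "017b") for p in products])
print(format(total, "020b"))
```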
However, all of these partial results end up summed together anyway. Instead of performing each complete multiply and then adding the results, I can first sum the corresponding rows across the 5 multipliers, one sum per bit position of the 8-bit inputs, and then shift each of those sums by its place value and add them together. This form uses significantly smaller adder units, because the operands are narrower and the shift is performed later, but it uses three more of them. I did not verify that this is actually smaller than the method above, but it appeared to make more sense to implement. I'm curious whether the two are equivalent or whether using more, smaller adders is actually more efficient. Note: these examples show full adds. In the design, each of these add operations really produces two partial sums, so the final adder is a 16x18-bit array adder, not 8x18 as shown.
```
   101010001      000000000      000000000      000000000      000000000      000000000      000000000      000000000
   000000000      101010010      000000000      000000000      000000000      000000000      000000000      000000000
   101010011      101010011      000000000      000000000      000000000      000000000      000000000      000000000
   000000000      000000000      101010100      000000000      000000000      000000000      000000000      000000000
+  101010101   +  000000000   +  101010101   +  000000000   +  000000000   +  000000000   +  000000000   +  000000000
------------   ------------   ------------   ------------   ------------   ------------   ------------   ------------
001111111001   001010100101   001010101001   000000000000   000000000000   000000000000   000000000000   000000000000

          001111111001
         001010100101
        001010101001
       000000000000
      000000000000
     000000000000
    000000000000
+  000000000000
----------------------
  00000001001111100111
```
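The regrouping above can be checked numerically with a short Python sketch. It sums the unshifted 9-bit inputs per bit position first and applies the shifts afterwards, then compares the result against a direct multiply-accumulate. The function name `mac_by_place_value` is hypothetical, and this says nothing about which form is smaller in hardware.

```python
import random


def mac_by_place_value(a_inputs, b_inputs):
    """Sum b values per bit position of the a inputs, then shift and add the sums."""
    column_sums = []
    for bit in range(8):
        # One small adder per place value: add every b whose a has this bit set.
        column_sums.append(sum(b for a, b in zip(a_inputs, b_inputs) if (a >> bit) & 1))
    # The shift by place value is applied once per column, after the first adds.
    total = sum(s << bit for bit, s in enumerate(column_sums))
    return column_sums, total


# Operands taken from the worked example above.
a_inputs = [1, 2, 3, 4, 5]
b_inputs = [0b101010001, 0b101010010, 0b101010011, 0b101010100, 0b101010101]
column_sums, total = mac_by_place_value(a_inputs, b_inputs)

# Randomized comparison against the direct sum of products.
rng = random.Random(0)
all_match = True
for _ in range(1000):
    a_rand = [rng.randrange(2**8) for _ in range(5)]
    b_rand = [rng.randrange(2**9) for _ in range(5)]
    expected = sum(a * b for a, b in zip(a_rand, b_rand))
    all_match = all_match and mac_by_place_value(a_rand, b_rand)[1] == expected
```

The two orderings give the same total because the shift distributes over the sum; the sketch only confirms the arithmetic, not the relative area or timing.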
Configurable pipeline stages are built in between the 8 array adders and the last array adder, and between the last array adder and the final sum, so that timing can be adjusted later if needed.