Changes

Gregory Ling · 35511c5b
--- a/mac5x9x8.md
+++ b/mac5x9x8.md
+The 5x8x9 MAC unit produces the sum of five 8-bit by 9-bit multiplies, from five 8-bit inputs and five 9-bit inputs, in a single cycle (<50ns). While designing it, I found the Array Adder Libero block in the SmartGen Cores Reference Guide. It takes a configurable number of inputs of a configurable bit-length and outputs two partial sums, A and B, which when added together produce the final sum of all inputs. The reference guide mentions that this array adder has significantly higher performance than multiple standard adders, and that if possible a design should not perform the final add until the end of the calculation.
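
To make the "two partial sums" idea concrete, here is a minimal Python sketch. It assumes a generic carry-save model (a chain of 3:2 compressors); how the Libero core is actually built is not stated here, and `carry_save_reduce` is a hypothetical name used only for illustration.

```python
def carry_save_reduce(operands):
    """Compress non-negative integers into two partial sums (a, b)
    such that a + b == sum(operands), with no carry-propagating add."""
    a, b = 0, 0
    for x in operands:
        # Bitwise full adder (3:2 compressor) applied to a, b and x:
        # the per-bit sum stays in place, the per-bit carry moves up one bit.
        sum_bits = a ^ b ^ x
        carry_bits = ((a & b) | (a & x) | (b & x)) << 1
        a, b = sum_bits, carry_bits
    return a, b


if __name__ == "__main__":
    # Arbitrary example operands (assumption, not taken from the design).
    a, b = carry_save_reduce([3, 5, 6, 7])
    print(f"A = {a:b}, B = {b:b}, A + B = {a + b}")
```

Only the very last step (`a + b`) needs a real carry-propagating adder, which matches the guide's advice to defer the final add as long as possible.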
+
+The base form of multiply used in this design is a simple conditional shift-and-sum built from one 8x16-bit array adder. Take 8 copies of the 9-bit input, one per bit of the 8-bit input. If the corresponding bit of the 8-bit input is 1, pass the 9-bit input shifted left by that bit's place value into the matching entry of the array adder; otherwise pass 0:
+
+```
+         11010010
+      x 101010100
+-----------------
+        000000000
+       101010100
+      000000000
+     000000000
+    101010100
+   000000000
+  101010100
+ 101010100
+-----------------
+```
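
The same conditional shift-and-sum can be sketched in Python as a behavioral model. The function names are hypothetical, and a plain `sum` stands in for the array adder plus final add.

```python
def partial_products(a8, b9):
    """Return the 8 rows fed to the adder: b9 shifted left by i where
    bit i of a8 is 1, otherwise 0. Row 0 corresponds to the LSB of a8."""
    assert 0 <= a8 < 2**8 and 0 <= b9 < 2**9
    return [(b9 << i) if (a8 >> i) & 1 else 0 for i in range(8)]


def shift_and_sum_multiply(a8, b9):
    """Multiply by summing the conditionally shifted rows."""
    return sum(partial_products(a8, b9))


if __name__ == "__main__":
    a8, b9 = 0b11010010, 0b101010100  # operands from the example above
    for i, row in enumerate(partial_products(a8, b9)):
        print(f"bit {i}: {row:016b}")  # every row fits in 16 bits
    print(f"product: {shift_and_sum_multiply(a8, b9):017b}")
```

The widest possible row is a 9-bit value shifted left by 7, which is why 16-bit adder inputs are enough.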
+
+ An 8x9 multiply produces a 17-bit result. To perform five 8x9 multiplies this way, I would need five 8x16-bit array adders to do the multiplications (shifting and summing the 9-bit input under control of the 8-bit input), a 10x17-bit array adder to add the two partial sums from each of the 5 multipliers, and a final adder to sum the two outputs of that last array adder.
+ 
+ ```
+          00000001           00000010           00000011           00000100           00000101
+       x 101010001        x 101010010        x 101010011        x 101010100        x 101010101
+ -----------------  -----------------  -----------------  -----------------  -----------------
+         101010001          000000000          101010011          000000000          101010101
+        000000000          101010010          101010011          000000000          000000000 
+       000000000          000000000          000000000          101010100          101010101  
+      000000000          000000000          000000000          000000000          000000000   
+     000000000          000000000          000000000          000000000          000000000    
+    000000000          000000000          000000000          000000000          000000000     
+   000000000          000000000          000000000          000000000          000000000      
+  000000000          000000000          000000000          000000000          000000000       
+ -----------------  -----------------  -----------------  -----------------  -----------------
+ 00000000101010001  00000001010100100  00000001111111001  00000010101010000  00000011010101001
+ 
+     00000000101010001
+     00000001010100100
+     00000001111111001
+     00000010101010000
+     00000011010101001
+----------------------
+  00000001001111100111 
+ ```
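
As a rough behavioral model of this multiply-then-add arrangement, the sketch below computes each 17-bit product separately and then sums the five products. The function name is hypothetical, and ordinary integer addition stands in for the array adders and the final adder.

```python
def mac_multiply_then_add(a8s, b9s):
    """Return (products, total): one shift-and-sum product per input pair,
    then the sum of all products."""
    products = []
    for a8, b9 in zip(a8s, b9s):
        # One multiplier: 8 conditionally shifted copies of the 9-bit input.
        rows = [(b9 << i) if (a8 >> i) & 1 else 0 for i in range(8)]
        products.append(sum(rows))
    return products, sum(products)


if __name__ == "__main__":
    # Operands from the example above.
    a8s = [0b00000001, 0b00000010, 0b00000011, 0b00000100, 0b00000101]
    b9s = [0b101010001, 0b101010010, 0b101010011, 0b101010100, 0b101010101]
    products, total = mac_multiply_then_add(a8s, b9s)
    for p in products:
        print(f"{p:017b}")
    print(f"{total:020b}")
```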
+ 
+ However, all of these partial products end up summed together anyway. Instead of performing each complete multiply and then adding the results, I can first sum the rows for each bit position across the 5 multipliers, and then, once I have one sum per place value, shift and add those sums together. This form uses significantly smaller adder units, because the operands are narrower and the shift is performed later, but it needs three more of them (eight instead of five). I did not verify that this is actually smaller than the method above, but it appeared to make more sense to implement. I'm curious whether the two are equivalent or whether the larger number of smaller adders is actually more efficient. Note: these examples show full adds. In reality each add operation produces two partial sums, so the final adder is a 16x18-bit adder, not 8x18 as shown.
+ 
+```
+   101010001    000000000    000000000    000000000    000000000    000000000    000000000    000000000
+   000000000    101010010    000000000    000000000    000000000    000000000    000000000    000000000
+   101010011    101010011    000000000    000000000    000000000    000000000    000000000    000000000
+   000000000    000000000    101010100    000000000    000000000    000000000    000000000    000000000
+  101010101 +  000000000 +  101010101 +  000000000 +  000000000 +  000000000 +  000000000 +  000000000
+------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
+001111111001 001010100101 001010101001 000000000000 000000000000 000000000000 000000000000 000000000000
+
+
+         001111111001
+        001010100101
+       001010101001
+      000000000000
+     000000000000
+    000000000000
+   000000000000
+  000000000000
+---------------------
+ 00000001001111100111
+```
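
The regrouping can be checked numerically with a short Python sketch. It only shows that summing per bit position first gives the same result as the sum of products; it says nothing about area or timing. The function name is hypothetical, and plain integer sums stand in for the array adders.

```python
import random


def mac_column_first(a8s, b9s):
    """Return (column_sums, total): for each bit position i, sum the 9-bit
    inputs whose 8-bit partner has bit i set, then shift and add the sums."""
    column_sums = [
        sum(b9 for a8, b9 in zip(a8s, b9s) if (a8 >> i) & 1)
        for i in range(8)
    ]
    total = sum(s << i for i, s in enumerate(column_sums))
    return column_sums, total


if __name__ == "__main__":
    # Operands from the example above.
    a8s = [1, 2, 3, 4, 5]
    b9s = [0b101010001, 0b101010010, 0b101010011, 0b101010100, 0b101010101]
    column_sums, total = mac_column_first(a8s, b9s)
    for i, s in enumerate(column_sums):
        print(f"bit {i}: {s:012b}")
    print(f"total: {total:020b}")

    # Randomized comparison against the direct sum of products.
    rng = random.Random(0)
    for _ in range(10_000):
        xs = [rng.randrange(2**8) for _ in range(5)]
        ys = [rng.randrange(2**9) for _ in range(5)]
        assert mac_column_first(xs, ys)[1] == sum(x * y for x, y in zip(xs, ys))
```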
+
+Configurable pipeline stages are built in between the 8 array adders and the last array adder, as well as between the last array adder and the final sum, so that timing can be adjusted later if needed.
\ No newline at end of file