This 32x33-bit multiplier produces its product over 4 clock cycles. The multiplier's i\_valid bit must go high at most once every 4 cycles; asserting it more often corrupts the multiplier's output. The design is deliberately small and slow, since this stage of the acceleration pipeline does not need a fast multiplier.
It performs four 8x33-bit multiplies, shifted and summed over the 4 clock cycles. The partial product is initially 0. Each cycle, once the 8x33 multiply (plus the previous partial product) is computed, the least significant 8 bits of the result are saved into a separate shift register. This works because, when summing the 32 33-bit rows of the full multiply (one row per bit of the 32-bit input, each shifted according to its place value), the lowest 8 bits of the sum of the first 8 rows can never change: every later row is shifted left by at least 8 bits. Those bits are therefore final, and 8 more of them are saved into the shift register each cycle until the full product is computed. Removing them effectively shifts the running sum right by 8 bits relative to the remaining rows, so no additional shifting is required for the next computation cycle. The 32-bit input is likewise shifted 8 bits at a time, so the only logic used in this system is the adders, two shift registers, and a 9x40-bit array adder. Shifting the outputs and the 32-bit input in this way significantly simplifies the logic for the array adder's inputs.
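The cycle-by-cycle behavior described above can be sketched as a small software model. This is only an illustrative sketch, not the hardware description itself: the function name `multiply_32x33` and its variable names are hypothetical, both operands are assumed to be unsigned (the text does not state signedness), and the 8x33 multiply is modeled with a plain integer multiply rather than the array adder.

```python
def multiply_32x33(a: int, b: int) -> int:
    """Behavioral model of a 4-cycle, byte-serial 32x33-bit multiply.

    Assumptions (not stated in the original text): `a` is the 32-bit
    operand, `b` is the 33-bit operand, and both are unsigned.
    """
    assert 0 <= a < (1 << 32), "a must fit in 32 bits"
    assert 0 <= b < (1 << 33), "b must fit in 33 bits"

    a_shift = a    # 32-bit input shift register, consumed 8 bits per cycle
    low_bits = 0   # shift register collecting the finalized low product bits
    partial = 0    # running partial product, initially 0

    for cycle in range(4):
        byte = a_shift & 0xFF   # next 8 bits of the 32-bit input
        a_shift >>= 8           # shift the input for the following cycle

        # 8x33 multiply plus the previous partial product.
        total = byte * b + partial

        # The low 8 bits can no longer change, so save them.
        low_bits |= (total & 0xFF) << (8 * cycle)

        # Dropping those bits aligns the sum for the next cycle.
        partial = total >> 8

    # Upper bits come from the last partial product, lower 32 bits from
    # the saved-bits register.
    return (partial << 32) | low_bits
```

Comparing the model against Python's built-in multiplication for a few operands, including the all-ones corner case, is a quick way to check the argument that the saved low bits never change.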