FPU Accumulator
This is a design for a better FPU block for 8/16-bit access, which also integrates 32-bit integer processing. Although values are 32 bits wide, copying them around with the CPU is mostly avoided.
There are 256 32-bit zeropage-like registers and one 32-bit accumulator. Any of the registers can hold useful constants (including 0, 1, etc) for reference. It's similar to the 6502 accumulator and zeropage values.
(TBD - should the registers live in RAM or in the FPGA? RAM allows swapping in new register sets, but needs more complicated bus mastering and steals CPU cycles.)
Commands are issued by writing a byte to an I/O location. The location identifies the command to trigger, and the byte written is often a selector for which register number to use, sometimes a literal integer value, or ignored.
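A minimal sketch of the write-triggered command model, in Python for illustration only. The I/O offsets `CMD_LOAD`, `CMD_STORE` and `CMD_IADD` are hypothetical, since command numbers have not been assigned, and the `Accumulator` class is an assumed software model, not the hardware design.

```python
class Accumulator:
    """Toy model: a byte written to an I/O offset triggers the command there."""

    # Hypothetical I/O offsets; real command numbers are not yet assigned.
    CMD_LOAD = 0x00
    CMD_STORE = 0x01
    CMD_IADD = 0x02

    def __init__(self):
        self.regs = [0] * 256   # 256 32-bit registers
        self.a = 0              # 32-bit accumulator

    def write(self, offset, value):
        """Model a CPU store of one byte to the command location `offset`."""
        reg = value & 0xFF      # for these commands the byte selects a register
        if offset == self.CMD_LOAD:
            self.a = self.regs[reg]
        elif offset == self.CMD_STORE:
            self.regs[reg] = self.a
        elif offset == self.CMD_IADD:
            self.a = (self.a + self.regs[reg]) & 0xFFFFFFFF
        else:
            raise ValueError(f"unmapped command offset {offset:#x}")


fpu = Accumulator()
fpu.regs[1] = 1                      # register holding the constant 1
fpu.regs[2] = 41
fpu.write(Accumulator.CMD_LOAD, 2)   # A = Reg 2
fpu.write(Accumulator.CMD_IADD, 1)   # A = A + Reg 1
fpu.write(Accumulator.CMD_STORE, 3)  # Reg 3 = A, now 42
```

The CPU only ever moves single bytes here; the 32-bit values stay inside the block.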
Status bits (TBD) are available to read. They report various errors, plus Z/N/V/C flags for both integer and fp results, just like the 6502. V and C apply only to integer ops. Error conditions:
- Overflow/underflow (fp/int range, as well as f2i)
- Division by zero
- Int conversion lost fractional portion
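As a rough illustration of the integer flags, the sketch below computes 6502-style Z/N/V/C for a 32-bit add. The function name `iadd_flags` and the dictionary layout are assumptions made for this example; the actual status bit positions are still TBD.

```python
MASK32 = 0xFFFFFFFF


def iadd_flags(a, b, carry_in=0):
    """Return (result, flags) for a 32-bit add with 6502-style Z/N/V/C."""
    total = a + b + carry_in
    result = total & MASK32
    flags = {
        "Z": result == 0,                      # result is zero
        "N": bool(result & 0x80000000),        # top bit of the result
        "C": total > MASK32,                   # unsigned carry out of bit 31
        # Signed overflow: both operands share a sign the result does not.
        "V": bool(~(a ^ b) & (a ^ result) & 0x80000000),
    }
    return result, flags


overflow = iadd_flags(0x7FFFFFFF, 1)   # largest positive + 1 sets V and N
wrapped = iadd_flags(0xFFFFFFFF, 1)    # wraps to 0, setting Z and C
```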
The integer fixed point location can be set (TBD); it affects multiply, divide, and conversion to and from floats.
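One common fixed-point convention, sketched here under the assumption that the setting is a count of fractional bits and that values are non-negative. The helper names (`fx_mul`, `fx_div`, `float_to_fx`, `fx_to_float`) are hypothetical and only show why multiply, divide, and float conversion depend on the setting.

```python
def float_to_fx(x, frac_bits):
    """Convert a float to fixed point, truncating any extra fraction."""
    return int(x * (1 << frac_bits))


def fx_to_float(a, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return a / (1 << frac_bits)


def fx_mul(a, b, frac_bits):
    """The raw product carries 2 * frac_bits of fraction; shift half away."""
    return (a * b) >> frac_bits


def fx_div(a, b, frac_bits):
    """Pre-shift the dividend so the quotient keeps frac_bits of fraction."""
    return (a << frac_bits) // b


x = float_to_fx(1.5, 8)    # 384
y = float_to_fx(2.25, 8)   # 576
product = fx_to_float(fx_mul(x, y, 8), 8)    # 3.375
quotient = fx_to_float(fx_div(y, x, 8), 8)   # 1.5
```

Add and subtract need no adjustment, which is consistent with the note listing only multiply, divide, and conversion.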
TODO - should there be separate signed & unsigned variants of integer operations (including F2I), or a mode bit for signedness? Should all integer commands respect a single signedness control flag?
Scripting
To optimize applying a stream of mathematical ops in repetitive ways, you can save bytecodes in the FPGA registers. These are generally of the form <op> <param>, meaning a write of the byte <param> to the op location <op>. This is fine for very fixed pipelines with input data always at one location. For general use, however, scripts will need their own flow control and register indices to loop over sets of registers, indirect through computed register values, or receive "pointers" to registers and apply offsets.
Ops: run <reg> and exit, although runs could nest generally to support subroutines. Not sure how to handle the stack.
Control flow would be unconditional loops, with tests to exit the loop on Accumulator == 0 or offset == val.
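The scripting idea could be modelled roughly as follows. This is a sketch under assumptions: ops are shown as strings rather than I/O locations, `EXITZ` is a hypothetical name for the "exit when Accumulator == 0" test, and `max_steps` is a safety guard that exists only in this model.

```python
def run_script(script, regs, max_steps=10_000):
    """Replay (op, param) pairs in an endless loop until an exit test fires."""
    a = 0                                  # accumulator
    pc = 0
    for _ in range(max_steps):             # guard against runaway scripts
        op, param = script[pc]
        if op == "LOAD":
            a = regs[param]
        elif op == "STORE":
            regs[param] = a
        elif op == "IADD":
            a = (a + regs[param]) & 0xFFFFFFFF
        elif op == "ISUB":
            a = (a - regs[param]) & 0xFFFFFFFF
        elif op == "EXITZ":                # hypothetical: leave loop if A == 0
            if a == 0:
                return regs
        else:
            raise ValueError(f"unknown op {op!r}")
        pc = (pc + 1) % len(script)        # wrap around: unconditional loop
    raise RuntimeError("script did not terminate")


regs = [0] * 256
regs[0], regs[1], regs[2], regs[3] = 3, 1, 0, 10   # counter, constant 1, total, step
script = [
    ("LOAD", 2), ("IADD", 3), ("STORE", 2),   # total += step
    ("LOAD", 0), ("ISUB", 1), ("STORE", 0),   # counter -= 1
    ("EXITZ", 0),                             # exit once the counter hits 0
]
run_script(script, regs)                      # leaves total == 30
```

Note that every register number in this script is fixed, which is exactly the limitation that motivates the offset and indirection options.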
Option 1: Offset register
Add another 8-bit "offset" register, and an enable bit in config. Ops write the mask, set mask bits, or clear mask bits in the config register. When enabled, every register access has the offset added to its register number. Further ops set the offset, or add a signed 8-bit value to it. This allows scripting to advance through the register set.
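A small model of Option 1, assuming the effective register number wraps modulo 256 (the notes do not say what happens on overflow). `OffsetUnit` and its method names are hypothetical.

```python
class OffsetUnit:
    """Models the 8-bit offset register and its enable bit."""

    def __init__(self):
        self.offset = 0
        self.enabled = False

    def set_offset(self, value):
        self.offset = value & 0xFF

    def add_offset(self, byte):
        """Add a signed 8-bit value, given as a raw byte, to the offset."""
        signed = byte - 256 if byte & 0x80 else byte
        self.offset = (self.offset + signed) & 0xFF

    def effective(self, reg):
        """Register number actually accessed for a command parameter."""
        if not self.enabled:
            return reg & 0xFF
        return (reg + self.offset) & 0xFF   # assumed wrap-around


unit = OffsetUnit()
unit.enabled = True
unit.set_offset(16)
first = unit.effective(2)    # 18
unit.add_offset(0xFF)        # raw byte 0xFF is -1, so the offset becomes 15
second = unit.effective(2)   # 17
```

A script body written against registers 0..n can then be reused on successive register groups by adding n to the offset each pass.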
Option 2: Indirection
Have a mode bit for indirection. Any reg access will indirect through that register, but we'll need an offset for that as well? Annoying to swap between indirect and non-indirect, unless we move to 128 regs per bank?
TODO - directly support matrix multiplication and matrix addition, with width/height registers. Should hopefully be applicable to vector ops as well. BRAM is dual-ported, so we could read 2 registers to multiply per cycle, though multiplication is probably a couple of cycles. Could instantiate a couple of multipliers to really pipeline this hard.
The given register (offset ignored!) is input A, the "matrix param" is input B, and register A + offset is the output register? But then we can't iterate over matrices with the output register. We also likely can't overwrite a matrix as we compute, because we're using it as an input. Don't want to waste LUT space copying a matrix into a reserved area. We could use regs 0-8 as a temp 3x3 matrix? Nah, that's just another dummy matrix-copy operation. This could be a 4-byte op instead of 2: MATMUL, A, B, Result? Or have 3 registers, where writing to Result triggers the operation? Writing a register should capture whether it's in RAM and the register bank number.
Config registers

| Register | Bits | Field |
|---|---|---|
| Integer Control | 7 | Signed |
| | 6 | 64bit reg0 |
| | 5 | (unused) |
| | 4-0 | Fixed point frac bits (0-31) |
| Mode | 7 | RAM |
| | 6-0 | Regnum Hi (0-1 for fpga) |
| Register offset | 7-0 | Offset |
| Matrix Size | 7-0 | Rows, Columns (bit split not given) |
| Offset Enable | 7 | All non-matrix |
| | 6-3 | (unused) |
| | 2 | Result |
| | 1 | MATB |
| | 0 | MATA |
| Bitmask | 7-0 | Mask, for setting/resetting |
| RAM Start | | 24-bit address? Or in multiples of 4kB banks? |
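For illustration, the Integer Control byte could be packed and unpacked as below, reading the layout as bit 7 = Signed, bit 6 = 64bit reg0, bits 4-0 = fixed point frac bits. The function and key names are hypothetical, and the layout itself is still a draft.

```python
def pack_integer_control(signed, wide_reg0, frac_bits):
    """Build the Integer Control byte from its three fields."""
    if not 0 <= frac_bits <= 31:
        raise ValueError("frac_bits must fit in 5 bits (0-31)")
    return (int(signed) << 7) | (int(wide_reg0) << 6) | frac_bits


def unpack_integer_control(byte):
    """Split the Integer Control byte back into its fields."""
    return {
        "signed": bool(byte & 0x80),
        "wide_reg0": bool(byte & 0x40),
        "frac_bits": byte & 0x1F,
    }


ctrl = pack_integer_control(signed=True, wide_reg0=False, frac_bits=16)  # 0x90
fields = unpack_integer_control(ctrl)
```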
Register and mask commands

| Cmd # | Name | Parameter | Description |
|---|---|---|---|
| | LOAD | Reg | A = Reg |
| | STORE | Reg | Reg = A |
| | SWAP | Reg | (A, Reg) = (Reg, A) |
| | LEXTRA | Ignored | A = Extra |
| | MASKW | Reg | Write mask to register |
| | MASKS | Reg | Set mask bits in register |
| | MASKC | Reg | Clear mask bits in register |
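The three mask commands amount to write, OR, and AND-NOT on an 8-bit config register. The sketch below passes the mask as an argument for brevity, whereas in these notes it would come from the Bitmask config register. The index `OFFSET_ENABLE = 4` is hypothetical.

```python
def maskw(config, reg, mask):
    """MASKW: overwrite the config register with the mask."""
    config[reg] = mask & 0xFF


def masks(config, reg, mask):
    """MASKS: set the bits that are 1 in the mask."""
    config[reg] |= mask & 0xFF


def maskc(config, reg, mask):
    """MASKC: clear the bits that are 1 in the mask."""
    config[reg] &= ~mask & 0xFF


OFFSET_ENABLE = 4                             # hypothetical register index
config = [0] * 8
maskw(config, OFFSET_ENABLE, 0b0000_0011)     # MATA and MATB bits
masks(config, OFFSET_ENABLE, 0b1000_0000)     # also set "All non-matrix"
after_set = config[OFFSET_ENABLE]             # 0x83
maskc(config, OFFSET_ENABLE, 0b0000_0001)     # clear the MATA bit again
after_clear = config[OFFSET_ENABLE]           # 0x82
```

Set and clear let a script toggle one enable bit without having to know the other bits.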
Floating-point commands

| Cmd # | Name | Parameter | Description |
|---|---|---|---|
| | FADD | Reg | A = A + Reg |
| | FSUB | Reg | A = A - Reg |
| | FRSUB | Reg | A = Reg - A |
| | FMUL | Reg | A = A * Reg |
| | FDIV | Reg | A = A / Reg |
| | FRDIV | Reg | A = Reg / A |
| | F2I | Signed flag | A = Int(A) or UInt(A) |
| | FMIN | Reg | A = Min(A, Reg) |
| | FMAX | Reg | A = Max(A, Reg) |
| | FABS | Ignored | A = Abs(A) |
| | Maybe | | |
| | FADDB | Byte | A = A + Val |
| | FSUBB | Byte | A = A - Val |
| | FMULB | Byte | A = A * Val |
| | FDIVB | Byte | A = A / Val |
| | FCLR | Ignored | A = 0.0 (LOAD instead) |
| | FNEG | Ignored | A = -A (FRSUB instead) |
| | FRECIP | Ignored | A = 1.0/A (FRDIV instead) |
| | FPOW | Reg | A = A ^ Reg |
| | FSQRT | Ignored | A = Sqrt(A) |
| | FTRIG | 0 | A = Sin(A) |
| | | 1 | A = Cos(A) |
| | | 2 | A = Tan(A) |
| | FATAN | Reg | A = Atan(A, Reg) |
| | | | A = Sec(A) |
| | | | A = Csc(A) |
| | | | A = Cot(A) |
| | FACOT | Reg | A = Arccot(A, Reg) |
| | | | A = Arcsin(A) |
| | | | A = Arccos(A) |
| | FFLOOR | Reg | A = Floor(A), Reg = Frac(A) (options?) |
| | FCEIL | Reg | A = Ceil(A), Reg = Frac(A) (options?) |
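The "(FRSUB instead)" and "(FRDIV instead)" remarks rely on constant registers holding 0.0 and 1.0. A quick model of that substitution follows, assuming the 32-bit floats are IEEE 754 single precision (the notes do not state the format). `f32`, `frsub` and `frdiv` are names made up for this sketch.

```python
import struct


def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def frsub(a, reg):
    """FRSUB: A = Reg - A."""
    return f32(reg - a)


def frdiv(a, reg):
    """FRDIV: A = Reg / A."""
    return f32(reg / a)


ZERO, ONE = 0.0, 1.0            # constants kept in two registers
negated = frsub(2.5, ZERO)      # -2.5, standing in for FNEG
reciprocal = frdiv(4.0, ONE)    # 0.25, standing in for FRECIP
```

Two reserved constant registers therefore save three command slots (FCLR, FNEG, FRECIP) at the cost of a register parameter.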
Matrix commands

| Cmd # | Name | Parameter | Description |
|---|---|---|---|
| | MATA | Reg | Set param A |
| | MATB | Reg | Set param B |
| | MATMUL | Reg | Reg = MATA * MATB |
| | MATADD | Reg | Reg = MATA + MATB |
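A software sketch of what MATMUL might compute over the register file. It assumes square n x n matrices stored row-major in consecutive registers and a result area that does not overlap either input. None of these choices is settled in the notes, and integers are used only to keep the example exact.

```python
def matmul(regs, mat_a, mat_b, result, n):
    """Reg[result..] = Reg[mat_a..] * Reg[mat_b..] for row-major n x n matrices."""
    size = n * n
    # Writing in place would corrupt an input that is still being read.
    for base in (mat_a, mat_b):
        if result < base + size and base < result + size:
            raise ValueError("result overlaps an input matrix")
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += regs[mat_a + i * n + k] * regs[mat_b + k * n + j]
            regs[result + i * n + j] = acc


regs = [0] * 256
regs[0:4] = [1, 2, 3, 4]     # MATA at register 0: [[1, 2], [3, 4]]
regs[4:8] = [5, 6, 7, 8]     # MATB at register 4: [[5, 6], [7, 8]]
matmul(regs, mat_a=0, mat_b=4, result=8, n=2)
product = regs[8:12]         # [19, 22, 43, 50]
```

The inner loop reads one element of each input per step, which matches the observation that dual-ported BRAM could feed a multiplier two operands per cycle.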
Integer commands

| Cmd # | Name | Parameter | Description |
|---|---|---|---|
| | IADD | Reg | A = A + Reg |
| | IADDC | Reg | A = A + Reg + C |
| | ISUB | Reg | A = A - Reg |
| | ISUBC | Reg | A = A - Reg - !C |
| | IRSUB | Reg | A = Reg - A |
| | ISETC | 0 or 1 | C = Value |
| | IMUL | Reg | A = A * Reg (TBD - 64-bit result? Could auto-write Reg 0?) |
| | IDIV | Reg | A = A / Reg (TBD - combined DIVMOD? 64/32 DIV?) |
| | IMOD | Reg | A = A % Reg |
| | ICMP | Reg | Status = A - Reg |
| | INEG | 0 | A = -A |
| | IAND | Reg | A = A & Reg |
| | IOR | Reg | A = A \| Reg |
| | IXOR | Reg | A = A ^ Reg |
| | I2F | Signed flag | A = Float(Int(A)) or Float(UInt(A)) |
| | ISHL | Count | A = A << Count |
| | ISHR | Count | A = A >> Count (logical) |
| | ISSHR | Count | A = A >> Count (signed) |
| | IROLL | Count | A = (A << Count) \| (A >> (32 - Count)) |
| | IUMIN | Reg | A = Min(A, Reg), unsigned |
| | ISMIN | Reg | A = Min(A, Reg), signed |
| | IUMAX | Reg | A = Max(A, Reg), unsigned |
| | ISMAX | Reg | A = Max(A, Reg), signed |
| | IABS | Ignored | A = Abs(A) |
| | ISET0 | Byte | Set byte 0 of accumulator |
| | ISET1 | Byte | Set byte 1 of accumulator |
| | ISET2 | Byte | Set byte 2 of accumulator |
| | ISET3 | Byte | Set byte 3 of accumulator |
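To show how an 8-bit CPU could build a 32-bit literal with ISET0..ISET3 and then rotate it with IROLL, here is a short model. It assumes byte 0 is the least significant byte, which the notes do not specify. `iset` and `iroll` are illustrative helper names.

```python
MASK32 = 0xFFFFFFFF


def iset(a, index, byte):
    """ISETn: replace byte `index` of the accumulator (0 = lowest, assumed)."""
    shift = 8 * index
    return (a & ~(0xFF << shift) & MASK32) | ((byte & 0xFF) << shift)


def iroll(a, count):
    """IROLL: rotate the 32-bit accumulator left by `count` bits."""
    count &= 31
    return ((a << count) | (a >> (32 - count))) & MASK32


a = 0
for index, byte in enumerate((0x78, 0x56, 0x34, 0x12)):
    a = iset(a, index, byte)   # four byte writes build 0x12345678
rotated = iroll(a, 8)          # 0x34567812
```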