WF FPGA Ideas
Latest revision as of 01:46, 20 February 2026
Video enhancements
See WF16 Video Architecture.
Math coprocessor
See FPU Accumulator.
SDCard
Auto-tx on read. Could supersede this with auto-reading a 16-bit length to a storage pointer, running in the background, flag or interrupt when done. Loading straight into DDR3 would be good once audio/video can read from there.
Stream MP3 or MIDI files from disk (or DDR3?) straight to the sound chips.
Bootstrap and Machine Identity
Access to the cores SDCard. Name something better than the current hardcoded names.
Soft-boot into one of the other cores, instead of relying on jumpers to select the core. Have an extended PGZ header or new file format that can request such things. Note that PGZ should say how many and which 8k blocks or 64k banks it's hardcoded for. Should be another load format for "play nice" system tools that can be simultaneously loaded.
Identify cores by an 8 character ASCII string, instead of bit fields. Maybe have a version number (16 bit) per release of any named platform.
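A minimal sketch of what such an identity record could look like, assuming a space-padded name with no terminator. The type `core_id_t`, the helper names, and the padding rule are all hypothetical, not an existing format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical identity record: names and layout are illustrative only. */
typedef struct {
    char     name[8];   /* ASCII, space-padded, no terminator (assumption) */
    uint16_t version;   /* bumped per release of the named platform */
} core_id_t;

/* Fill a record from a C string, padding with spaces, truncating at 8 chars. */
static void core_id_set(core_id_t *id, const char *name, uint16_t version)
{
    size_t n = strlen(name);
    if (n > sizeof id->name) n = sizeof id->name;
    memset(id->name, ' ', sizeof id->name);
    memcpy(id->name, name, n);
    id->version = version;
}

/* Return 1 if the record names the given platform (exact, padded compare). */
static int core_id_is(const core_id_t *id, const char *name)
{
    core_id_t tmp;
    core_id_set(&tmp, name, 0);
    return memcmp(id->name, tmp.name, sizeof id->name) == 0;
}
```

Software would then compare the whole string instead of decoding bit fields, and only look at the version when it needs a feature from a particular release.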
Have different ROM loads for different cores? Or boot from RAM instead of ROM if some magic bytes are found there, though the ROM boot code itself would have to do that check, and what ISA would the code be?
RP2040 uses the Slave SelectMAX x8 config interface of the FPGA, and its 8 data pins should be GPIO for the FPGA to talk back to the RP after it's up and running (unless the pins are used for something else). The RP software needs updating for that.
Simplest comms would be splitting the 8-bit path into two 4-bit independent directional channels. Clock (toggle when new data is sent), Ack (receiver toggles when it's read), 2 data bits. All polled on the software end. The RP2040 could be crunching FAT32 stuff while the FPGA is waiting to write to it, no problem.
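One direction of that link can be modelled in software as below. This is only a sketch under assumptions: the wires are plain struct fields rather than real GPIO, the names (`chan4_t`, `chan4_try_send`, and so on) are made up, and sending the most significant bit pair first is an arbitrary choice. Both ends start at 0, so clock equal to ack means "nothing pending".

```c
#include <stdint.h>
#include <stdbool.h>

/* One direction of the link: 4 wires, modelled as plain fields. */
typedef struct {
    bool    clk;    /* sender toggles when new data is on the wires */
    bool    ack;    /* receiver toggles once it has read the data   */
    uint8_t data;   /* 2 data bits                                  */
} chan4_t;

/* Sender side: returns false if the previous pair is still unread. */
static bool chan4_try_send(chan4_t *c, uint8_t two_bits)
{
    if (c->clk != c->ack) return false;   /* receiver has not acked yet */
    c->data = two_bits & 0x3;             /* data first...              */
    c->clk = !c->clk;                     /* ...then toggle the clock   */
    return true;
}

/* Receiver side: returns false if nothing new is waiting. */
static bool chan4_try_recv(chan4_t *c, uint8_t *two_bits)
{
    if (c->clk == c->ack) return false;
    *two_bits = c->data;
    c->ack = !c->ack;
    return true;
}

/* Move one byte across the channel, MSB pair first, polling both ends. */
static uint8_t chan4_transfer_byte(chan4_t *c, uint8_t byte)
{
    uint8_t out = 0;
    for (int shift = 6; shift >= 0; shift -= 2) {
        uint8_t pair = 0;
        while (!chan4_try_send(c, (uint8_t)(byte >> shift))) { }
        while (!chan4_try_recv(c, &pair)) { }
        out = (uint8_t)((out << 2) | pair);
    }
    return out;
}
```

Because each side only ever toggles its own wire, either end can be arbitrarily slow: a sender that finds the previous pair unacknowledged simply polls again later.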
SRAM can be retained between resets:
char my_crash_info[100] __attribute__((section(".uninitialized_data")));
and with a macro:
#define __uninitialized_ram(group) __attribute__((section(".uninitialized_data." #group))) group
datetime_t __uninitialized_ram(persist_date);
Tentatively Proposed RP2040 Startup
- Read SD Card for magic `/MYCORE.TXT` file containing the full path of the core filename to boot.
  - If this exists, DIP switches can be some form of override. Need to be able to break in if something messes up. On configuring MYCORE, always tell the user to set their DIP switches to 00.
  - If this doesn't exist, oldsk00l DIP switches determine the core to boot.
- Load compressed core file from card and get the FPGA going.
  - Set the FPGA comm handshake pins to 0 before releasing the init/reset command to the FPGA.
- Load `/RPLIB` library and jump into it, if exists (else infinite loop). This runs whatever library functionality can talk to the FPGA, and is upgradable as a file through the loaded utilities.
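The core-selection step above could reduce to a small decision function. This sketch assumes that any non-zero DIP setting acts as the override, which is one reading of "some form of override" and not a settled rule; `boot_source_t` and `choose_boot_source` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical result of the boot decision. */
typedef enum { BOOT_FROM_MYCORE, BOOT_FROM_DIP } boot_source_t;

/*
 * mycore_path: contents of /MYCORE.TXT, or NULL if the file is absent.
 * dip:         raw DIP switch value; 0 means "no override" (assumption).
 */
static boot_source_t choose_boot_source(const char *mycore_path, unsigned dip)
{
    if (mycore_path == NULL || mycore_path[0] == '\0')
        return BOOT_FROM_DIP;      /* no usable file: old-style DIP selection */
    if (dip != 0)
        return BOOT_FROM_DIP;      /* file exists but the switches override it */
    return BOOT_FROM_MYCORE;
}
```

Treating an empty file like a missing one keeps a corrupted `/MYCORE.TXT` from locking the user out.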
The library should have functions to browse the cores sdcard, and select which file to boot with.
TODO - do the core selection DIP pins also hit the FPGA for flash banking? If so, then we might not be able to escape them.
Timers
Ensure that the 24-bit value latches when the LSB is read (or the MSB is written?) for consistent reads.
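A software model of the read-side latch, assuming the LSB-read variant: reading byte 0 snapshots the two upper bytes, so a counter that rolls over between the three reads still yields a consistent value. The register numbering and the names here are illustrative.

```c
#include <stdint.h>

/* Software model of a free-running 24-bit counter with a read latch. */
typedef struct {
    uint32_t count;     /* live counter, only the low 24 bits are used */
    uint8_t  latch_mid; /* bits 8..15 captured at the last LSB read    */
    uint8_t  latch_hi;  /* bits 16..23 captured at the last LSB read   */
} timer24_t;

static void timer24_tick(timer24_t *t, uint32_t cycles)
{
    t->count = (t->count + cycles) & 0xFFFFFFu;
}

/* Reading register 0 (LSB) snapshots the upper bytes for registers 1 and 2. */
static uint8_t timer24_read(timer24_t *t, unsigned reg)
{
    switch (reg) {
    case 0:
        t->latch_mid = (uint8_t)(t->count >> 8);
        t->latch_hi  = (uint8_t)(t->count >> 16);
        return (uint8_t)t->count;
    case 1:  return t->latch_mid;
    default: return t->latch_hi;
    }
}
```

Without the latch, a read of 0x00FFFF that straddles a tick could come back as 0x0100FF or 0x0000FF depending on timing.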
Wavetable Audio
Some form of wavetable audio is sorely missing for Amiga, TG16, SNES, Soundblaster era audio, especially sound effects. Basic multiple channels of stereo or panned mono, uncompressed samples, support some simple compression formats.
Pull small buffers of audio into the FPGA from RAM, keep 2 live at a time, one playing, one buffered. At 25MHz with a 48kHz sample rate, 1 sample lasts about 520 cycles (about 20.8µs), which could be fast enough to fetch the next sample while the last one is playing and just single-buffer it? Or even a rolling window. Should also support just-in-time software generation of each buffer, with IRQ notifications.
Stretch samples to whatever the output sampling rate is. Linear interpolation would be neat, but probably optional. Same with the decay-to-zero that many old synth chips had.
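For reference, linear-interpolated stretching can be sketched in software as below, using a 16.16 fixed-point phase accumulator. This is a model of the technique, not a proposed hardware design; the function name and the choice of 16 fractional bits are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Resample `in` (n_in samples at rate_in Hz) to rate_out Hz using
 * a 16.16 fixed-point phase and linear interpolation.
 * Returns the number of samples written to `out` (at most out_cap).
 */
static size_t resample_linear(const int16_t *in, size_t n_in, uint32_t rate_in,
                              int16_t *out, size_t out_cap, uint32_t rate_out)
{
    if (rate_out == 0) return 0;
    uint32_t step = (uint32_t)(((uint64_t)rate_in << 16) / rate_out);
    uint64_t phase = 0;
    size_t n = 0;
    while (n < out_cap) {
        size_t i = (size_t)(phase >> 16);
        if (i + 1 >= n_in) break;              /* need both neighbours */
        int64_t frac = (int64_t)(phase & 0xFFFF);
        int64_t a = in[i], b = in[i + 1];
        out[n++] = (int16_t)(a + ((b - a) * frac) / 65536);
        phase += step;
    }
    return n;
}
```

Dropping the `frac` term gives nearest-sample stretching, which is the cheaper option if interpolation stays optional.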
Sound Chip Instances
Instead of duplicating hardware instances of sound chips, multiplex all their registers and internal variables. Access would select 1 of them, and the hardware would run with that multiplexer active on all its values. Compare the size taken, depends on how complex the chip is.
Since sound chips don't need to be that fast, serially looping through and executing a cycle, accumulating their output, for the final audio sample would be fine. Some of them do this internally with their voices anyway.
Register to select which instance(s) to use, so programs can be agnostic to what sound context exists with other things. Probably expose 2 chips at a time through the IO registers, with a separate one to select which bank of 2 to use. At least for mono chips, or those which are commonly in pairs. Stereo chips could just have 1 register set exposed.
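The multiplexing idea can be illustrated with a toy model: one shared step function stands in for the single copy of the chip logic, and a bank of state records stands in for the multiplexed registers. The square-wave "chip" and every name below are invented for illustration; a real chip would carry far more state.

```c
#include <stdint.h>

#define NUM_INSTANCES 4   /* illustrative */

/* State of a toy square-wave "chip". */
typedef struct {
    uint16_t period;   /* half-cycle length in ticks, 0 = silent */
    uint16_t counter;
    int16_t  level;    /* current output, +volume or -volume */
    int16_t  volume;
} chip_state_t;

/* The single shared "hardware": advances whichever state it is handed. */
static int16_t chip_step(chip_state_t *s)
{
    if (s->period == 0) return 0;
    if (++s->counter >= s->period) {
        s->counter = 0;
        s->level = (s->level > 0) ? (int16_t)-s->volume : s->volume;
    }
    return s->level;
}

/* One output sample: run every instance once through the shared logic and sum. */
static int32_t mix_sample(chip_state_t bank[NUM_INSTANCES])
{
    int32_t acc = 0;
    for (int i = 0; i < NUM_INSTANCES; i++)
        acc += chip_step(&bank[i]);
    return acc;
}
```

The cost trade-off is visible even here: the logic in `chip_step` exists once, while only `chip_state_t` is replicated per instance.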
65816 support
Remap page 0 into any 64kB bank, allowing direct page, stack, etc, to have its own swappable space. However, need to consider how this goes for interrupts. An interrupt will probably have to remap to hardware bank 0, but then somehow restore the bank that used to be there. Keeping the upper 256 bytes locked somewhere (flash or bank 0 RAM) would solve that. The 8kB MMU can already be used for these purposes. Full bank swapping might be easier, but is likely not really necessary as long as the MMU is around.
FPGA-based CPU
65816 but with a genuine 16-bit data bus.
Keep 6809 in the same core with the 65816 running. Switch between either at any time (bus master), or alternate cycles. At core2x speed, they could each run at 6MHz alternately hitting the SRAM.
If the bus can be more dynamic, especially during vblank/hblank, the CPUs can run a lot faster there.
Bitstream readers/writers
Write a byte or word to an FPGA location; it takes a CPU cycle to write it, and bumps its pointer.
For a bit stream, a 32-bit bitpointer covers 4Gb = 512MB. A write would need to know the width to write. Maybe 16 registers, write a value to one of those to declare how many bits from the written value to write. This actually allows the index register to determine width dynamically, which is nice. Both read & write interfaces should use this. A complete hack but easier to use would be to have 18 regs. For any width >8 bits, a 2nd access to the next register would grab the high byte, even though it's technically the trigger for the next higher. But this would get confusing in 65816 16-bit mode, accessing lower lengths.
Automatically converts between 32-bit bitstream pointer and 24-bit byte pointer + bit offset. When writing to the byte pointer, it automatically zeros out the bit offset, for easier initialization from standard word boundaries. If loading a bit pointer from normal pointer + bit offset, set the bit offset last.
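The two views of the pointer are just a shift and a mask apart, as this software model shows. The names are hypothetical. Note that a 24-bit byte pointer plus a 3-bit offset only uses 27 of the 32 bit-pointer bits; the model masks the byte pointer to 24 bits to match the text.

```c
#include <stdint.h>

/* Software model of the pointer pair; field and function names are illustrative. */
typedef struct {
    uint32_t bitptr;   /* absolute bit address */
} bitctx_t;

static uint32_t bitctx_byte_ptr(const bitctx_t *c) { return c->bitptr >> 3; }
static uint8_t  bitctx_bit_off(const bitctx_t *c)  { return (uint8_t)(c->bitptr & 7u); }

/* Writing the byte pointer clears the bit offset. */
static void bitctx_set_byte_ptr(bitctx_t *c, uint32_t byte_ptr)
{
    c->bitptr = (byte_ptr & 0xFFFFFFu) << 3;
}

/* Writing the bit offset keeps the byte pointer, so it must be written last. */
static void bitctx_set_bit_off(bitctx_t *c, uint8_t off)
{
    c->bitptr = (c->bitptr & ~7u) | (off & 7u);
}
```

Setting the offset before the byte pointer loses it, which is why the load order matters.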
Separate read & write context, so copies, decompression, etc, can be done. Bit pointers can be directly read/written as well. Direction is always in the positive direction, though, at least for now.
Probably good to support 0-length, for dynamically computed lengths. So with 0-16 supported, that's 17 entry points, kinda messy.
Also, skip forward N bits without a read or write. Technically this could just be a read N bits and ignore the value, but this should be 0-65535 bits skipped. Could also just do a 32-bit add on the pointer register.
8bit interface: 2 bitpointers, then 8 byte locs for pointer 0, and 8 byte locs for pointer 1.
16bit interface: 2 bitpointers, then 16 word locs for pointer 0, 16 word locs for pointer 1. Writes triggered on high byte write. Reads trigger on low byte read, which readies the high byte.
This is a CPU-blocking interface for reads, buffered for writes.
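As a reference for what the hardware would do, here is a software model of the variable-width read, write, and skip operations over ordinary memory. It assumes bits are packed LSB-first within each byte, which the notes do not specify, and the names (`bitstream_t`, `bs_write`, `bs_read`, `bs_skip`) are made up. The `width` argument plays the role of the register index that selects the bit count.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t *mem;      /* backing memory */
    uint32_t bitptr;   /* next bit to read or write */
} bitstream_t;

/* Append the low `width` bits (0..16) of `value`, LSB first (assumed order). */
static void bs_write(bitstream_t *s, uint16_t value, unsigned width)
{
    for (unsigned i = 0; i < width; i++, s->bitptr++) {
        uint8_t mask = (uint8_t)(1u << (s->bitptr & 7u));
        if ((value >> i) & 1u) s->mem[s->bitptr >> 3] |= mask;
        else                   s->mem[s->bitptr >> 3] &= (uint8_t)~mask;
    }
}

/* Read `width` bits (0..16) and advance; width 0 returns 0 and moves nothing. */
static uint16_t bs_read(bitstream_t *s, unsigned width)
{
    uint16_t v = 0;
    for (unsigned i = 0; i < width; i++, s->bitptr++)
        if (s->mem[s->bitptr >> 3] & (1u << (s->bitptr & 7u)))
            v |= (uint16_t)(1u << i);
    return v;
}

/* Skip forward without touching memory: just an add on the pointer. */
static void bs_skip(bitstream_t *s, uint16_t nbits) { s->bitptr += nbits; }
```

Keeping two `bitstream_t` contexts, one for reading and one for writing, is what lets a copy or decompression loop run without reloading pointers.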
Optical Keyboard
Can this be sent as 9 events of 8bits, instead of 8 events of 9 bits (2 bytes)? However the self-describing row number might be useful to keep.
RLE Format(s)
RLE layers, DMA, and potentially sprites can use RLE encoding.
Span-based RLE formats

| bpp | Layout | Length | Max compression | Breakeven |
|---|---|---|---|---|
| 1 | `clllllll` | 1-128 | 16:1 byte | 8 pixels |
| 2 | `ccllllll` | 1-64 | 16:1 byte | 4 pixels |
| 4 | `ccccllll` | 1-16 | 8:1 byte | 2 pixels |
| 4 | `ccccCCCC llllllll llllllll` | 1-256 | 85:1 byte (512 pixels in 3 bytes) | 3+3 pixels |
| 8 | `cccccccc llllllll` | 1-256 | 128:1 byte | 2 pixels |
However, it would be useful to have spans of literal pixels as well, instead of just solid color span fills.
`0lllllll cccccccc` = span length L of color C, 0 = transparent
`1lllllll cccccccc...` = L count of individual pixels
For a bpp less than 8, probably require them to fill an even byte or word count.
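A decoder for the two packet types at 8bpp might look like the sketch below. It assumes the 7-bit field stores length minus one (so packets cover 1 to 128 pixels, in line with the 1-based ranges in the table); the notes do not pin this down. Color 0 is written through as-is, leaving transparency to whatever composites the result. The function name is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Decode `0lllllll cc` (fill) and `1lllllll cc...` (literal) packets at 8bpp.
 * Assumption: the 7-bit field stores length-1, so packets cover 1..128 pixels.
 * Returns pixels written, or 0 if the input is truncated or `out` is too small.
 */
static size_t rle_decode(const uint8_t *in, size_t in_len,
                         uint8_t *out, size_t out_cap)
{
    size_t ip = 0, op = 0;
    while (ip < in_len) {
        uint8_t head = in[ip++];
        size_t len = (size_t)(head & 0x7F) + 1;
        if (op + len > out_cap) return 0;
        if (head & 0x80) {                    /* literal run */
            if (ip + len > in_len) return 0;
            for (size_t i = 0; i < len; i++) out[op++] = in[ip++];
        } else {                              /* solid span */
            if (ip >= in_len) return 0;
            uint8_t color = in[ip++];
            for (size_t i = 0; i < len; i++) out[op++] = color;
        }
    }
    return op;
}
```

With this layout a solid span always costs 2 bytes, while a literal run costs its pixel count plus 1, so an encoder would only emit fills for runs of 2 or more identical pixels.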
For now, RLE layers should be simple length + 8bpp aligned words. RLE bitmaps would be something different, maybe it's too flexible so we should just leave that to the CPU. It would save a lot of bandwidth for bitmap overlay layers with large transparent windows, though.
DMA/Blitter
Maybe separate out 2d mode into its own blitter?
Flag to mask out the 'fill/mask' color (default 0)
Clip to output screen dimensions.
Xflip, yflip, maybe 90° rotation, but that means dest dimensions change? Scaling? Full affine transform?
Unpack RLE graphics, for better memory usage. Could still do x/y flip because this isn't raster-dependent. Must know the total x/y though if clipping is supported
Fields:
- bpp (could expand from src to dest given an offset?)
- src w/h/stride
- dest w/h/stride
TODO - V2
Clipping? Or should the src/dest be handled in software?
Ideally, there'd be a clip bounds defined at the dest address, w, h, stride, bpp. The source address is defined, and it's blitted into an x/y in the dest screen, automatically clipped. This could also be used as a pixel/stamp plotter.
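The automatic clipping described here comes down to intersecting the placed source rectangle with the destination bounds and carrying the trimmed amount back into the source. A minimal sketch, with `rect_t` and `blit_clip` as hypothetical names and the clip bounds taken to be the whole destination:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { int x, y, w, h; } rect_t;

/*
 * Clip a w*h source placed at (dx, dy) against a dest of dest_w*dest_h.
 * On success, *src is the visible sub-rectangle of the source and
 * *dst is where it lands. Returns false if nothing is visible.
 */
static bool blit_clip(int dx, int dy, int w, int h, int dest_w, int dest_h,
                      rect_t *src, rect_t *dst)
{
    int x0 = dx < 0 ? 0 : dx;
    int y0 = dy < 0 ? 0 : dy;
    int x1 = dx + w > dest_w ? dest_w : dx + w;
    int y1 = dy + h > dest_h ? dest_h : dy + h;
    if (x1 <= x0 || y1 <= y0) return false;
    *dst = (rect_t){ x0, y0, x1 - x0, y1 - y0 };
    *src = (rect_t){ x0 - dx, y0 - dy, x1 - x0, y1 - y0 };
    return true;
}
```

A 1x1 source makes this the pixel plotter mentioned above: the call either yields a single in-bounds destination pixel or reports that nothing is visible.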
If there end up being a large number of parameters for a src or dest, it would be nice to have multiple profiles. Either read src/dest profiles from ram, or have say 4 src & 4 dests saved, and blit from src N to dest M.
RLE graphics should probably save their w/h and mode implicitly as the first 2 words, as they are their own free-form shapes.
Since DMA currently only takes place during VBLANK, could be more efficient to have a DMA list to run when VSYNC hits, blasting those out as fast as possible.
For 1bpp (or maybe others, too?), AND/OR/NOR/NAND/XOR modes would be necessary.
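For 1bpp data these modes operate on 8 packed pixels at a time, so each is a single bitwise operation per byte. A sketch, with the enum and function names invented for illustration:

```c
#include <stdint.h>

typedef enum { ROP_AND, ROP_OR, ROP_NOR, ROP_NAND, ROP_XOR } rop_t;

/* Combine 8 packed 1bpp pixels from src into dst with the chosen operation. */
static uint8_t rop_apply(rop_t op, uint8_t dst, uint8_t src)
{
    switch (op) {
    case ROP_AND:  return dst & src;
    case ROP_OR:   return dst | src;
    case ROP_NOR:  return (uint8_t)~(dst | src);
    case ROP_NAND: return (uint8_t)~(dst & src);
    case ROP_XOR:  return dst ^ src;
    }
    return dst;   /* unknown mode: leave the destination untouched */
}
```

XOR is the classic choice for cursors and rubber-band lines, since applying it twice restores the original destination.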