WF FPGA Ideas

Video enhancements

Math coprocessor

SDCard

Auto-tx on read. This could be superseded by a background block read: give it a 16-bit length and a storage pointer, let it run in the background, and have it raise a flag or interrupt when done. Loading straight into DDR3 would be good once audio/video can read from there.

Stream an MP3 or MIDI file from disk (or DDR3?) straight to the chips.

Memtext Bitmap

Akin to the C64 hi-res bitmap mode, have a 640x480 1bpp bitmap with 8x8 color cells as defined by the existing memtext (or on-FPGA?) color attributes.

Font currently takes

Settings:

Address of bitmap data (same as memtext font pointer)
Choose cell-based (C64-like, follows 8x8 or 8x16 memtext selection) or linear pixel byte layout
FPGA color matrix (2x 16-color) or RAM color matrix (2x 256-color)

It's annoying that this has its own separate palette, which takes up a lot of space, but I get it that it's a separate parallel pipeline to the "graphics" section. However, I'd rather just have it select which CLUT to use for FG and BG.

Bootstrap and Machine Identity

Access to the cores' SD card. Use something better than the current hardcoded names.

Soft-boot into one of the other cores, instead of relying on jumpers to select the core. Have an extended PGZ header or new file format that can request such things. Note that PGZ should say how many and which 8k blocks or 64k banks it's hardcoded for. There should be another load format for "play nice" system tools that can be loaded simultaneously.

Identify cores by an 8 character ASCII string, instead of bit fields. Maybe have a version number (16 bit) per release of any named platform.
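A minimal sketch of what such an identity record could look like from software. The struct layout, the space padding of short names, and the function names are all assumptions for illustration, not an existing register map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical identity record: 8 ASCII chars plus a 16-bit version. */
typedef struct {
    char name[8];     /* ASSUMPTION: space-padded, not NUL-terminated */
    uint16_t version; /* bumped per release of the named platform */
} core_id_t;

/* Copy a C string into an 8-byte field, padding with spaces. */
static void core_name_pad(char out[8], const char *name) {
    assert(name != NULL);
    size_t n = strlen(name);
    if (n > 8) n = 8; /* longer names are truncated */
    memset(out, ' ', 8);
    memcpy(out, name, n);
}

/* True if the core has the wanted name and is at least min_version. */
static bool core_id_matches(const core_id_t *id, const char *name,
                            uint16_t min_version) {
    char want[8];
    core_name_pad(want, name);
    return memcmp(id->name, want, 8) == 0 && id->version >= min_version;
}
```

A program could then check `core_id_matches(&id, "MYCORE", 0x0100)` instead of decoding bit fields.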

Have different ROM loads for different cores? Or boot from RAM instead of ROM if some magic bytes occur there, though the ROM boot code would have to be what does that check, and what ISA would that code be?

RP2040 uses the Slave SelectMAP x8 config interface of the FPGA, and its 8 data pins should be GPIO for the FPGA to talk back to the RP after it's up and running (unless the pins are used for something else). The RP software needs updating for that.

Simplest comms would be splitting the 8-bit path into two independent 4-bit channels, one per direction. Each channel has Clock (sender toggles it when new data is sent), Ack (receiver toggles it when it has read the data), and 2 data bits. All polled on the software end. The RP2040 could be crunching FAT32 stuff while the FPGA is waiting to write to it, no problem.

SRAM can be retained between resets:

char my_crash_info[100] __attribute__((section(".uninitialized_data")));

and with a macro:

#define __uninitialized_ram(group) __attribute__((section(".uninitialized_data." #group))) group

datetime_t __uninitialized_ram(persist_date);

Tentatively Proposed RP2040 Startup

Read SD Card for magic /MYCORE.TXT file containing the full path of the core filename to boot.
- If this exists, DIP switches can be some form of override. Need to be able to break in if something messes up. On configuring MYCORE, always tell the user to set their DIP switches to 00.
- If this doesn't exist, oldsk00l DIP switches determine the core to boot.
Load compressed core file from card and get the FPGA going.
- Set the FPGA comm handshake pins to 0 before releasing the init/reset command to the FPGA.
Load /RPLIB library and jump into it, if exists (else infinite loop). This runs whatever library functionality can talk to the FPGA, and is upgradable as a file through the loaded utilities.

The library should have functions to browse the cores' SD card and select which file to boot with.

TODO - do the core selection DIP pins also hit the FPGA for flash banking? If so, then we might not be able to escape them.

Timers

Ensure that the 24-bit value latches when the LSB is read (or the MSB is written?) for consistent reads.
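A small software model of the read-side latch, assuming the variant where reading the LSB snapshots the upper two bytes. The register numbering (0 = low, 1 = middle, 2 = high) and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t count;    /* live 24-bit counter */
    uint8_t latch_mid; /* bits 8-15, captured when the LSB is read */
    uint8_t latch_hi;  /* bits 16-23, captured when the LSB is read */
} timer24_t;

/* Advance the live counter by one, wrapping at 24 bits. */
static void timer_tick(timer24_t *t) {
    t->count = (t->count + 1u) & 0xFFFFFFu;
}

/* Byte-wide register read: reg 0 latches, regs 1 and 2 return the latch. */
static uint8_t timer_read(timer24_t *t, unsigned reg) {
    assert(reg <= 2);
    switch (reg) {
    case 0:
        t->latch_mid = (uint8_t)(t->count >> 8);
        t->latch_hi = (uint8_t)(t->count >> 16);
        return (uint8_t)t->count;
    case 1:
        return t->latch_mid;
    default:
        return t->latch_hi;
    }
}
```

With the counter at 0x00FFFF, reading the low byte, letting one tick pass, and then reading the other two bytes still assembles 0x00FFFF rather than a torn value like 0x0100FF.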

Wavetable Audio

Some form of wavetable audio is sorely missing for Amiga, TG16, SNES, Soundblaster era audio, especially sound effects. Basic multiple channels of stereo or panned mono, uncompressed samples, support some simple compression formats.

Pull small buffers of audio into the FPGA from RAM, keep 2 live at a time, one playing, one buffered. At 25MHz with a 48kHz sample rate, 1 sample lasts about 520 cycles (roughly 20.8µs), which could be fast enough to fetch the next sample while the last one is playing and just single-buffer it. Or even use a rolling window. Should also support just-in-time software generation of each buffer, with IRQ notifications.
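The cycle budget is simple integer arithmetic, sketched here so other clock and sample rates can be tried. The function names are made up for this note.

```c
#include <assert.h>
#include <stdint.h>

/* Whole clock cycles available between two output samples. */
static uint32_t cycles_per_sample(uint32_t clock_hz, uint32_t sample_rate_hz) {
    assert(sample_rate_hz != 0);
    return clock_hz / sample_rate_hz;
}

/* Duration of one sample in nanoseconds, rounded down. */
static uint32_t sample_period_ns(uint32_t sample_rate_hz) {
    assert(sample_rate_hz != 0);
    return 1000000000u / sample_rate_hz;
}
```

For 25MHz and 48kHz this gives 520 cycles and 20833ns per sample, and that budget has to be shared between all active channels.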

Stretch samples to whatever the output sampling rate is. Linear interpolation would be neat, but probably optional. Same with the decay-to-zero that many old synth chips had.

Sound Chip Instances

Instead of duplicating hardware instances of sound chips, multiplex all their registers and internal variables. Access would select 1 of them, and the hardware would run with that multiplexer active on all its values. Whether this saves space over duplication depends on how complex the chip is, so compare the size taken.

Since sound chips don't need to be that fast, serially looping through and executing a cycle, accumulating their output, for the final audio sample would be fine. Some of them do this internally with their voices anyway.
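As a software analogy of the serial loop, here is one shared step function run over an array of per-instance state, with the outputs accumulated into one sample. The "chip" is a toy square-wave voice invented for this sketch, not any real sound chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-instance state: the only thing that is duplicated. */
typedef struct {
    uint16_t period;  /* steps between output toggles, minus one */
    uint16_t counter; /* counts down to the next toggle */
    bool high;        /* current square-wave phase */
    uint8_t volume;   /* output amplitude */
} chip_state_t;

/* The shared "hardware": runs one cycle on whichever state is selected. */
static int16_t chip_step(chip_state_t *s) {
    if (s->counter == 0) {
        s->counter = s->period;
        s->high = !s->high;
    } else {
        s->counter--;
    }
    return s->high ? (int16_t)s->volume : (int16_t)-(int16_t)s->volume;
}

/* Loop through all instances serially and accumulate one output sample. */
static int32_t mix_sample(chip_state_t *inst, size_t n) {
    assert(inst != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += chip_step(&inst[i]);
    }
    return acc;
}
```

In the FPGA the array would be a small RAM of state and the loop index would be the multiplexer select.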

Register to select which instance(s) to use, so programs can be agnostic to what sound context exists with other things. Probably expose 2 chips at a time through the IO registers, with a separate one to select which bank of 2 to use. At least for mono chips, or those which are commonly in pairs. Stereo chips could just have 1 register set exposed.

65816 support

Remap page 0 into any 64kB bank, allowing direct page, stack, etc, to have its own swappable space. However, need to consider how this goes for interrupts. An interrupt will probably have to remap to hardware bank 0, but then somehow restore the bank that used to be there. Keeping the upper 256 bytes locked somewhere (flash or bank 0 RAM) would solve that. The 8kB MMU can already be used for these purposes. Full bank swapping might be easier, but it's likely not really necessary as long as the MMU is around.

FPGA-based CPU

65816 but with a genuine 16-bit data bus.

Keep 6809 in the same core with the 65816 running. Switch between either at any time (bus master), or alternate cycles. At core2x speed, they could each run at 6MHz alternately hitting the SRAM.

If the bus can be more dynamic, especially during vblank/hblank, the CPUs can run a lot faster there.

Bitstream readers/writers

Write a byte or word to an FPGA location, it takes a CPU cycle to write it, and bumps its pointer.

For a bit stream, a 32-bit bitpointer covers 4Gb = 512MB. A write would need to know the width to write. Maybe 16 registers, write a value to one of those to declare how many bits from the written value to write. This actually allows the index register to determine width dynamically, which is nice. Both read & write interfaces should use this. A complete hack but easier to use would be to have 18 regs. For any width >8 bits, a 2nd access to the next register would grab the high byte, even though it's technically the trigger for the next higher. But this would get confusing in 65816 16-bit mode, accessing lower lengths.

Automatically converts between 32-bit bitstream pointer and 24-bit byte pointer + bit offset. When writing to the byte pointer, it automatically zeros out the bit offset, for easier initialization from standard word boundaries. If loading a bit pointer from normal pointer + bit offset, set the bit offset last.
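The pointer conversion is just a shift and a mask. A sketch of the register behavior, with hypothetical names, assuming the bit offset is the low 3 bits of the bit pointer:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t bitptr; /* 32-bit bit address: (byte address << 3) | bit offset */
} bitstream_ptr_t;

/* Byte-pointer view of the same register. */
static uint32_t bp_byte_addr(const bitstream_ptr_t *p) {
    return p->bitptr >> 3;
}

/* Bit-offset view (0..7) of the same register. */
static uint8_t bp_bit_offset(const bitstream_ptr_t *p) {
    return (uint8_t)(p->bitptr & 7u);
}

/* Writing the byte pointer zeros the bit offset. */
static void bp_set_byte_addr(bitstream_ptr_t *p, uint32_t byte_addr) {
    assert(byte_addr <= 0xFFFFFFu); /* 24-bit byte pointer */
    p->bitptr = byte_addr << 3;
}

/* Writing the bit offset leaves the byte pointer alone. */
static void bp_set_bit_offset(bitstream_ptr_t *p, uint8_t offset) {
    p->bitptr = (p->bitptr & ~7u) | (offset & 7u);
}
```

This also shows why the bit offset has to be set last: a later write to the byte pointer would wipe it.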

Separate read & write context, so copies, decompression, etc, can be done. Bit pointers can be directly read/written as well. Direction is always in the positive direction, though, at least for now.
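A software model of variable-width reads and writes with separate contexts, to pin down the intended semantics. The bit order (LSB first within each byte) and all names are assumptions; the notes above don't fix either.

```c
#include <assert.h>
#include <stdint.h>

/* One context: a memory base plus a forward-only bit pointer. */
typedef struct {
    uint8_t *mem;
    uint32_t bitptr;
} bitctx_t;

/* Write the low `width` bits (0..16) of value, then bump the pointer. */
static void bits_write(bitctx_t *w, uint16_t value, unsigned width) {
    assert(width <= 16);
    for (unsigned i = 0; i < width; i++) {
        uint32_t byte = w->bitptr >> 3;
        uint8_t mask = (uint8_t)(1u << (w->bitptr & 7u));
        if ((value >> i) & 1u) {
            w->mem[byte] |= mask;
        } else {
            w->mem[byte] &= (uint8_t)~mask;
        }
        w->bitptr++;
    }
}

/* Read `width` bits (0..16), then bump the pointer. Width 0 returns 0. */
static uint16_t bits_read(bitctx_t *r, unsigned width) {
    assert(width <= 16);
    uint16_t value = 0;
    for (unsigned i = 0; i < width; i++) {
        uint32_t byte = r->bitptr >> 3;
        unsigned bit = (r->mem[byte] >> (r->bitptr & 7u)) & 1u;
        value |= (uint16_t)(bit << i);
        r->bitptr++;
    }
    return value;
}
```

In the register interface, `width` would be implied by which of the width registers is accessed, so an index register can pick the width at run time.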

Probably good to support 0-length, for dynamically computed lengths. So with 0-16 supported, that's 17 entry points, kinda messy.

Also, skip forward N bits without a read or write. Technically this could just be a read N bits and ignore the value, but this should be 0-65535 bits skipped. Could also just do a 32-bit add on the pointer register.

8bit interface: 2 bitpointers, then 8 byte locs for pointer 0, and 8 byte locs for pointer 1.

16bit interface: 2 bitpointers, then 16 word locs for pointer 0, 16 word locs for pointer 1. Writes triggered on high byte write. Reads trigger on low byte read, which readies the high byte.

This is a CPU-blocking interface for reads, buffered for writes.

Optical Keyboard

Can this be sent as 9 events of 8 bits, instead of 8 events of 9 bits (2 bytes)? However, the self-describing row number might be useful to keep.

RLE Format(s)

RLE layers, DMA, and potentially sprites can use RLE encoding.

Span-based RLE formats
bpp	Layout	Length	Max compression	Breakeven
1	`clllllll`	1-128	16:1 byte	8 pixels
2	`ccllllll`	1-64	16:1 byte	4 pixels
4	`ccccllll`	1-16	8:1 byte	2 pixels
4	`ccccCCCC llllllll llllllll`	1-256	85:1 byte (512 pixels in 3 bytes)	3+3 pixels
8	`cccccccc llllllll`	1-256	128:1 byte	2 pixels

However, it would be useful to have spans of literal pixels as well, instead of just solid color span fills.

0lllllll cccccccc = span length L of color C, 0 = transparent

1lllllll cccccccc...= L count of individual pixels

For a bpp less than 8, probably require them to fill an even byte or word count
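A sketch of a decoder for the 8bpp span/literal format, mostly to check that the encoding is unambiguous. Several details are assumptions not settled above: the stored length is taken as count minus 1 (so 1-128, like the table), "0 = transparent" is read as color 0 leaving the destination untouched, and literal pixels are copied verbatim even when they are 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one RLE stream into an 8bpp destination.
 * Returns the number of pixels covered, or -1 on truncated input
 * or destination overflow. */
static long rle_decode(const uint8_t *src, size_t src_len,
                       uint8_t *dst, size_t dst_len) {
    assert(src != NULL && dst != NULL);
    size_t si = 0, di = 0;
    while (si < src_len) {
        uint8_t ctrl = src[si++];
        size_t count = (size_t)(ctrl & 0x7Fu) + 1; /* ASSUMPTION: length-1 */
        if (count > dst_len - di) return -1;
        if (ctrl & 0x80u) {
            /* 1lllllll: `count` literal pixels follow */
            if (count > src_len - si) return -1;
            memcpy(dst + di, src + si, count);
            si += count;
        } else {
            /* 0lllllll cccccccc: solid span, color 0 is transparent */
            if (si >= src_len) return -1;
            uint8_t color = src[si++];
            if (color != 0) memset(dst + di, color, count);
        }
        di += count;
    }
    return (long)di;
}
```

For example, the bytes `02 07 81 01 02 01 00` decode to three pixels of color 7, the literals 1 and 2, then two skipped (transparent) pixels.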

For now, RLE layers should be simple length + 8bpp aligned words. RLE bitmaps would be something different, maybe it's too flexible so we should just leave that to the CPU. It would save a lot of bandwidth for bitmap overlay layers with large transparent windows, though.

DMA/Blitter

Maybe separate out 2d mode into its own blitter?

Flag to mask out the 'fill/mask' color (default 0)

Clip to output screen dimensions.

Xflip, yflip, maybe 90° rotation, but that means dest dimensions change? Scaling? Full affine transform?

Unpack RLE graphics, for better memory usage. Could still do x/y flip because this isn't raster-dependent. Must know the total x/y though if clipping is supported

Fields:

bpp (could expand from src to dest given an offset?)
src w/h/stride
dest w/h/stride

TODO - V2

Clipping? Or should the src/dest be handled in software?

Ideally, there'd be a clip bounds defined at the dest address, w, h, stride, bpp. The source address is defined, and it's blitted into an x/y in the dest screen, automatically clipped. This could also be used as a pixel/stamp plotter.
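What the auto-clipped blit would compute, as a software reference. This sketch assumes 8bpp on both sides, treats the whole destination surface as the clip bounds, and folds in the optional mask color. `surface_t` and `blit_clipped` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t *pixels;  /* ASSUMPTION: 8bpp, one byte per pixel */
    int w, h, stride; /* stride in bytes */
} surface_t;

/* Blit src to (dx, dy) in dst, clipped to dst. Pixels equal to mask_color
 * are skipped; pass -1 to disable masking. Returns pixels written. */
static int blit_clipped(const surface_t *src, surface_t *dst,
                        int dx, int dy, int mask_color) {
    assert(src->stride >= src->w && dst->stride >= dst->w);
    /* Visible window of the source, in source coordinates. */
    int x0 = dx < 0 ? -dx : 0;
    int y0 = dy < 0 ? -dy : 0;
    int x1 = (dx + src->w > dst->w) ? dst->w - dx : src->w;
    int y1 = (dy + src->h > dst->h) ? dst->h - dy : src->h;
    int written = 0;
    for (int y = y0; y < y1; y++) {
        for (int x = x0; x < x1; x++) {
            uint8_t p = src->pixels[y * src->stride + x];
            if (mask_color >= 0 && p == (uint8_t)mask_color) continue;
            dst->pixels[(dy + y) * dst->stride + (dx + x)] = p;
            written++;
        }
    }
    return written;
}
```

A 1x1 source makes this a pixel plotter, and a small source a stamp plotter, with no bounds checks needed on the CPU side.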

If there end up being a large number of parameters for a src or dest, it would be nice to have multiple profiles. Either read src/dest profiles from ram, or have say 4 src & 4 dests saved, and blit from src N to dest M.

RLE graphics should probably save their w/h and mode implicitly as the first 2 words, as they are their own free-form shapes.

Since DMA currently only takes place during VBLANK, it could be more efficient to have a DMA list to run when VSYNC hits, blasting those out as fast as possible.

For 1bpp (or maybe others, too?), and/or/nor/nand/xor modes would be necessary.
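For reference, the logic modes applied to packed 1bpp data, 8 pixels per byte. The enum and function names are made up, and a plain copy mode is included as an assumption for completeness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ROP_COPY, ROP_AND, ROP_OR, ROP_XOR, ROP_NAND, ROP_NOR } rop_t;

/* Combine one byte (8 packed 1bpp pixels) of source with destination. */
static uint8_t rop_apply(rop_t op, uint8_t src, uint8_t dst) {
    switch (op) {
    case ROP_AND:  return src & dst;
    case ROP_OR:   return src | dst;
    case ROP_XOR:  return src ^ dst;
    case ROP_NAND: return (uint8_t)~(src & dst);
    case ROP_NOR:  return (uint8_t)~(src | dst);
    case ROP_COPY:
    default:       return src;
    }
}

/* Apply the same operation across a row of packed bytes, in place in dst. */
static void rop_row(rop_t op, const uint8_t *src, uint8_t *dst, size_t nbytes) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < nbytes; i++) {
        dst[i] = rop_apply(op, src[i], dst[i]);
    }
}
```

XOR is the handy one for cursors and rubber-band lines, since applying it twice restores the destination.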