Return of the Beast: Game On

Solving puzzles is rewarding. That's why the weekend newspapers have a big pullout section chock full of them. They want those endorphins to flow freely and influence your future newspaper buying habits.

Previously I wrote about finding a faster way of scrolling by using pre-shifted data. I was trying to solve a puzzle, one that might not have a solution, but I had made a little bit of progress, got a little bit of reward, and kept thinking about it.

One thing making it slow was having to OR the tile edges together. A way to mitigate this is to have larger tiles. Less ORing and more SToring. (If I can find a few more rhymes like that then this blog is pretty much going to write itself!)

How about tiles that are four bytes wide? What does that do to the single row code?

puls a,x,u   ; 10 cycles
ora ,y       ; 4
sta ,y       ; 4
stx 2,y      ; 6
stu 4,y      ; 6

We're now pointing to the tile data with s so that we can use u as a temporary register. The first tile row now takes 30 cycles, or 32 cycles for subsequent rows. For a PMODE3 or 4 screen, this needs to be repeated 6144 / 4 = 1536 times, taking 32 x 1536 = 49152 cycles per screen. That's about 18 fps and 50% faster than two byte wide tiles. A decent gain just by making the tiles larger. The thing is, I'm looking for a big speed increase. Time for a compromise.

Does the graphics mode really need to be PMODE3 or 4? It's almost ingrained that you use the highest graphics resolution available. Good money was paid for all those pixels so surely the higher the resolution, the better things will look?

A PMODE1 screen has a resolution of 128 x 96. Half the vertical resolution of PMODE3, but the pixels are actually square. At 3072 bytes the display now requires half the memory, and can be updated in half the time. Good Things have been done using that graphics mode, such as Glove, Flagon Bird and sixxie's vertical scroll demo.

So I've quickly talked myself into using PMODE1 and traded detail for speed. That doubles the rate to 36 fps. Things are starting to get interesting. I wonder how much memory all these pre-shifted tiles are going to take up?

A PMODE1 four byte wide square tile requires 64 bytes of storage. Shifted versions will be five bytes wide and require 80 bytes. There are four pixels in a PMODE1 byte, so to be able to scroll to any pixel we would need three shifted versions of each tile in addition to the unshifted version. That's 304 bytes per tile. It sounds excessive but you could fit 32 tiles in under 10K and have a chance at producing an interesting background.

I nearly decided that was the way to go, and I think that approach could find good uses, but I still wasn't convinced it would be fast enough for what I was trying to do. To run at 25 fps, there would only be around 10,000 cycles left over for the rest of the game and I haven't accounted for fetching the tile addresses or for drawing partial tiles at the screen edges.

So I started thinking along the lines of having ever larger tiles to reduce the tile overhead but somehow slowly building them on demand out of smaller tiles. It was that line of thinking that led to the solution I settled on: Just have four buffers, each containing the screen image with a different amount of horizontal shift. Drawing the background is reduced to a large copy operation from one of the buffers. Scrolling would be achieved by moving the buffer start position and by drawing a stripe of new image data into the buffer in the right place to appear at the edge of the screen.

This sounded pretty simple. How fast could it run? A large block copy can be done with a set of instructions like this:

pulu cc,dp,d,x,y   ; 13 cycles
pshs cc,dp,d,x,y   ; 13
leas 16,s          ; 5

The leas 16,s instruction is there to compensate for pul & psh working in opposite directions. It can actually be removed if adjustments are made to the code writing into the buffer, which gives a nice improvement in speed for a pure horizontal scroll, but there are complications with combined horizontal and vertical scrolling that offset some of the gain. Maybe a topic for the future.

Using the cc register seems a bit dangerous because it contains the interrupt masks but it's fine as long as interrupts are disabled at the hardware level by programming the PIAs.

Anyway, that's 8 bytes copied every 31 cycles or less than 12,000 cycles for a PMODE1 screen. The data that needs to be written into the buffers amounts to a byte-wide column (or equivalent row) of tile fragments per game loop. This is sounding much faster.

Memory usage is heavy, as we need 12K of shift buffers, but the simplicity of the method is very appealing. Admittedly this is better suited to 64K machines but I still feel I can do something for 32K machines as the buffer requirements can be reduced by not using the whole screen. Just reserving 8 lines of display for score and status reduces the buffers by 1K.

Game On!

Return of the Beast

Monday 6 March 2017

Game On

No comments:

Post a Comment