Tuesday, 18 April 2017

More details

At some point during development, it struck me how complicated the scroll engine was becoming. The initial concept could be summarised in a few words and was supposed to be a simple matter of moving pointers around circular buffers, copying data from one place to another. But now there seemed to be an endless list of details to consider.

In some ways, the scrolling really is simple. It does pretty much boil down to pointers and copying. Where it gets complicated is figuring out which pieces of tile to draw, and a lot of that complication comes from the interaction between horizontal and vertical scrolling. One direction messes up the other.

But I do like a challenge, and I really wanted to see this thing working...

For horizontal scrolling we need to draw a one byte wide vertical stripe of tile pieces. Right away there are a couple of problems. The first is caused by the tiles being two bytes wide and the second is caused by vertical scrolling.

The first problem is easily solved by modifying the 'tile zero' base address we use to calculate the tile image addresses. By adding one to the base address, we automatically access the right half of each tile. If the tile image data starts at an even address, then the least significant bit of the address can be set to achieve the same effect.

The information to decide if we are accessing the left or right half of a tile comes from the map column pointer. This pointer is incremented or decremented at the same time as the buffer pointer, meaning the least significant bit indicates which tile half is being pointed to.

The second problem is a bit more involved. It means we have to deal with partial tiles at the top and bottom of the display. This divides the vertical stripe into three sections: A partial tile at the top, a run of complete tiles (well, half tiles), and a partial tile at the bottom.

Sometimes the tiles will line up neatly with the top and bottom edges of the display, meaning no partial tiles, and fewer visible tiles. How does this work exactly? The screen is 12 tiles high, but if partially scrolled, 13 will be visible. We can define things a bit more precisely to help shape the algorithm:

  • The top tile will have some or all pixel rows visible.
  • The run of complete tiles will always be the same number of tiles (11 if part of a full height display).
  • The bottom tile will have some pixel rows visible, or none.

That puts the bottom tile in The Occasionally Disappearing 13th Row*. It simply gets skipped on those frames where the tiles have perfect vertical alignment with the screen.

Take it from the top

Let's look at the top tile first. When it is partially visible, the bit that is missing is the top part. We can use the vertical pixel counter to figure out the parameters. This counter is incremented each time we need to move the screen contents up one pixel. The tiles are eight pixels high, so we're interested in the bottom three bits of the counter, which gives us a number in the range zero to seven. When it is zero, all of the tile rows are visible, and when it seven, just the bottom row of the tile is visible. So what we need is to subtract this number from eight to give us the number of rows of pixels to draw.

The other thing we need to determine is an offset into the tile image data so that we draw the correct part of the tile. Just like we did for vertical scrolling, we can take the bottom three bits of the vertical counter and multiply by two to create an offset for the tile image data.

The list of hoops to jump through before we can start drawing looks like this:

  • Determine where in the buffer to start drawing using the logic discussed in Scrolling 101
  • Determine where in the map we will start reading using the map row and column pointers discussed in Details
  • Modify the tile image base address to select the left half or right half of each tile. (i.e. add one if the bottom bit of the map column pointer is set)
  • Determine the parameters for the partial top tile using the vertical counter

That sets us up for the top tile. We use the map pointer to give us the tile ID which in turn allows us to calculate the address of the image data. We can then copy tile image bytes to the buffer.

After each tile image byte is drawn into the buffer, we need to advance the buffer destination pointer to the next row and check that it hasn't crossed the end of the buffer. If it has, then the buffer size should be subtracted from the pointer so that drawing continues from the top of the buffer.

Take it to the bridge

After the top tile, we need to draw the run of full height tiles. These are relatively easy as they are all fixed height and can be drawn with two loops: An inner loop to output the fixed number of bytes per tile, and an outer loop to output the fixed number of tiles. We continue advancing and wrapping the buffer destination pointer for each byte written, and similarly advance and wrap the map pointer for each tile produced.

Having to check the buffer destination address for every byte written consumes a lot of cycles. It looks like this piece of code:

    cmpx #buffer_end
    blo no_adjust
    leax -buffer_size,x

As the end of the buffer can only be crossed once, this code does very little useful work. It nearly always executes just the cmpx and the blo, but that's still 96 x 7 = 672 cycles for a full height draw.

It would be nice to avoid as many of those checks as possible. The approach I've used is to check the buffer pointer before drawing each full tile. If there is room to draw a tile without reaching the end of the buffer then it draws the tile using an unrolled loop with no pointer check. Otherwise the tile is drawn byte by byte in a loop with the pointer check. That trims out a lot of cycles without adding a lot of complexity.

Throw it in the river

Finally we reach the partial tile at the bottom. This is easier to deal with than the top tile. Firstly, the part of the tile that is missing is the lower part of the tile, so there's no need to offset the image address. Secondly, the number of pixel rows we need to draw is simply the bottom three bits of the vertical pixel counter. If it's zero, we don't need to draw the tile at all as we've already reached the bottom of the screen.

What we have so far, is pixel-by-pixel vertical scrolling, but the horizontal scroll is still only byte-by-byte. To get fast horizontal scrolling working at the pixel level, we need to bring in additional buffers and expand the drawing routines to include pixel shifting. Another layer of complexity. But at least the scroll engine will then be complete. It couldn't get any more complicated than that. Could it? To be continued...

(Spoiler alert: Yeah, it could)

* The Occasionally Disappearing 13th Row is possibly a British movie of the "I say, that's inconvenient!" disaster movie sub-genre, starring Timothy Dalton as a cheesy airline boss; Bill Nighy, apparently legally required to be in every British movie; and Martin Freeman as Tim from The Office. Again.

Sunday, 2 April 2017


It's fair to warn you that this one could be heavy going. Time to figure out some details. It's just drawing tiles, so you'd think it would be easy, but there are complications. Oh the horror...

Problems, problems

For the moment, I'll focus on scrolling whole bytes rather than shifting pixels, to avoid getting too complicated, too quickly. I'm not smart enough to deal with the whole problem in one go and I find it's helpful to understand it in small byte-sized pieces, if you'll forgive the terrible pun. (And you've forgiven so much to have got this far). Here's a list to start thinking about:

  • We need to know which tiles need drawing. That implies some kind of pointer into the tile map to track movement on the screen.

  • We need to convert tile numbers from the map into addresses of graphics data for drawing.

  • For reasons that used to make sense, the tiles are two bytes wide. That's a complication. Sometimes we will be drawing the left half of a tile, and sometimes we will be drawing the right half.

  • We could reach the end of the buffer before we've finished drawing a strip of tile fragments. That means we need to keep control of the drawing address and wrap it when required.

  • We could fall off the edge of the world. I mean the map. We could fall off the edge of the map while drawing tiles. That means the map pointer may need adjustment between tiles.

  • The multidirectional scrolling means we will have variable sized pieces of tile on all sides of the screen. Oh joy. It didn't have to be this way. I could have done a nice side-scroller. That would have been much easier to program and document. But not me, nooOOOooo. Scrolling in one direction just wasn't good enough for Lady-La-De-Da-Fancy-Multiway-Scroller. Now everyone has to stay late and clear up the mess. Well, I hope I'm happy with myself. Hey where'd everyone go?

Solutions, solutions

The map is a two-dimensional structure, which suggests the use of separate row and column pointers. It's tempting to use a single pointer to access the map, just like we're doing with the buffer pointer, but there is a subtle difference: When we increment the address past the left or right hand edge of the map, we find ourselves on the other side of the map, but one row up or down. Not exactly what we want. We would like to appear on the other side of the map on the same row. That seems easier to handle (to me anyway) with separate pointers.

The row pointer could be the address of the first tile of the current map row, and it would need to be updated each time we've scrolled enough to bring a new row of tiles into view. i.e. every eight pixels. If the pointer goes off the top of the map then the map size can be added to get us back into bounds and a similar adjustment done for the other direction.

That leaves the column pointer to specify an offset for a particular tile in the row. If it is incremented or decremented at the same time as the buffer pointer then it represents a half tile. (Recall that the tiles are two bytes wide). This sounds pretty useful. The least significant bit can be shifted into the carry and used to select the left or right half of the tile, and the remaining bits can be added to the row pointer to give us the map location. As the map is a power of two wide, wrapping can be achieved by ANDing the pointer with a mask. (Otherwise it would need a compare and branch)

The tile that the map pointer is pointing to is the one that should appear in the top left corner of the screen. This is fine for scrolling left or up, because the new strip of graphics will include that tile. For scrolling right or down there's an extra step: An offset will need to be added to the column or row pointer to address the tiles on the other side of the screen. It's a similar concept to the one used to decide where to draw in the buffer for each scroll direction.

Each tile image takes up 16 bytes of memory, so all we need to do to find the address of  a tile image is to load the tile number from the map, multiply it by 16 and add the address of tile number zero. Using a lookup table would be a little faster, but I'll keep it simple and stick with a MUL instruction for now.

A closer look at vertical scrolling

Vertical scrolling involves drawing a horizontal strip one pixel high, running the width of the screen. A tile is eight pixels high, so we need to pick one of the eight pixel rows to draw. This can be handled by adding an offset to the tile zero address. The offset is simply two times the number of pixels of displacement required, because each row of the tile is two bytes.

That means we need to keep track of the vertical pixel offset. I'm using a vertical counter variable that is decremented or incremented for each pixel scrolled up or down. By ANDing the counter with #7, we can detect when we cross a tile boundary and therefore need to move the map pointer. After ANDing with #7, the value can be left-shifted (i.e. multiplied by two) to create the pixel row offset.

That works neatly for scrolling up, but there is a modification required for scrolling down. Firstly, we need a different offset. If you imagine we've scrolled into a position where the tiles are neatly lined up with the top and bottom edges of the screen, the counter ANDed with #7 will be zero. If we scrolled up into this position then we needed to draw the top edge of the tiles as the zero offset suggests. On the other hand, if we had scrolled down, then the bottom edge of the tiles has just appeared. That's a different offset and we get it by ADDing #7 to the counter value before we AND with #7.

When I first implemented that, something weird happened when scrolling down. Seven out of every eight lines drawn were correct, but the ones that appeared for vertical offset zero were wrong. This didn't make any intuitive sense and it took me a while to figure out: A screen full of tiles would be 12 tiles high, but in general you would be able to see 13 rows of tiles, thanks to the partial tiles at the top and bottom. So the bottom row is usually 12 map rows below the map pointer, but only 11 rows when the vertical offset is zero. I ended up doing a clunky little correction just for offset zero. There might be a better way but I haven't thought of one so far.

As if the vertical scroll wasn't complicated enough already, there are another two cases to consider. The tiles are two bytes wide, meaning the drawing starts on either the left or right hand half of a tile. When it starts with the left half of a tile, there will be 16 tiles across the screen, all drawn in the same kind of way. I call this even drawing. But if we start with the right half of a tile, then there will be three sections: The right half of the first tile, then 15 complete tiles, then the left half of a 17th tile. No prizes for guessing I call this odd drawing.

I handled this by having two drawing routines, one even and one odd, and the one called is determined by the least significant bit of the map column pointer.

Before I get into the drawing, here's a summary of the steps so far:

Scroll up:
  • Move buffer pointer up one line, wrapping if necessary
  • Decrement vertical pixel counter
  • If pixel counter AND #7 = 7 then move map row pointer up one row (and wrap)
  • Calculate the tile image base address: tile zero address + 2 x (pixel counter AND #7)
  • Buffer drawing start address = buffer pointer
  • Map source data = Map row pointer
  • Draw even or odd depending on map column offset LSB

Scroll down:
  • Move buffer pointer down one line, wrapping if necessary
  • Increment vertical pixel counter
  • If pixel counter AND #7 = 0 then move map row pointer down one row (and wrap)
  • Calculate the tile image base address: tile zero address + 2 x ((pixel counter + 7) AND #7)
  • Buffer drawing start address = old buffer pointer (the line above the current pointer)
  • Map source data = Map row pointer + number of visible rows - 1 row
  • Draw even or odd depending on map column offset LSB

Draw some tiles already

The even drawing routine looks something like the code below. On entry, u is the tile image base address and y points to the map row.

        lda map_col      ; map column pointer
        lsra             ; get the offset
        sta coffset      ; store offset directly in instruction

        lda #16          ; draw 16 tile pieces
        sta count        ;

coffset equ *+2          ; variable is part of instruction
        lda <0,y         ; get tile number from map
        ldb #16          ; calculate address of image data
        mul              ;
        ldd d,u          ; load the tile image bytes
        std ,x++         ; store tile bytes in buffer
        cmpx #BUF_END    ; check for end of buffer
        blo nowrap       ; not end of buffer
        leax -BUF_SIZE,x ; adjust back to start of buffer
nowrap lda coffset      ; move to next column in map
        inca             ;
        anda #(MAPWID-1) ; wrap map column
        sta coffset      ;

        dec count        ; do next tile
        bne loop         ;

There isn't all that much to it. I've used some self-modifying code to make life easier. The coffset variable is actually the 8 bit index mode offset used to load the tile number from the map. The < symbol tells the assembler to use an 8-bit offset so the instruction bytes are in the expected layout.

It's assumed that the two bytes of the tile image will never fall on either side of the end of the buffer. This is a safe assumption providing that the buffer size is a multiple of the tile width, and that the first tile is aligned with the start of the buffer.

Odd drawing is a bit more complicated. First there is the right half of the first tile:

        lda map_col      ; map column pointer
        lsra             ; get the offset
        lda a,y          ; get tile number
        ldb #16          ; calculate address of image data
        mul              ;
        ldd d,u          ; load the tile image bytes
        stb ,x+          ; store right half in buffer

It might seem strange to load both bytes of the tile to then throw away the left byte, but this is an easy way of accessing the right byte. If I wanted to load just one byte, then I would first have to add one to the address which would take longer.

After this we need to check and wrap the buffer address and then we can draw 15 tile pieces using the even drawing algorithm above. The only differences are we need to draw 15 pieces instead of 16, and we need to add one to the map column offset before we start.

The left half of the tile at the end is really easy. The registers already have the values we need thanks to the code that drew the 15 tiles:

        lda a,y
        ldb #16
        lda d,u
        sta ,x


Hey, what about horizontal scrolling?

Yeah, I know, we're having so much fun and it's hard to stop, but my fingers are all wore out, what with all that typing and everything. So next time we'll have a closer look at the horizontal scrolling...

Friday, 24 March 2017


At the moment I have a means of scrolling but nothing to actually scroll, other than some colourful stripes. Now I do like playing with pretty patterns, probably a little bit more than would be considered 'normal', but that's not getting the job done.

I mentioned tiles before as a way of drawing a background but didn't go into any detail. What are tiles and why do we need them?

Insert groutuitous tile pun here

A tile is simply a small graphic image, that can be drawn on screen with other tiles to build up a larger image. There's a brief discussion on Wikipedia here.

The advantage of using tiles is that it can dramatically reduce the memory requirement of a large image such as a background. This works best for images that contain many repeated elements.

To build the image, we need a map that specifies which tiles go where. This could be a simple two-dimensional array of numbers, with each location in the array corresponding to a location in the image, and the content of each location in the array specifying a particular tile from the tileset.

How big should the tiles be? There is a trade-off to be made between image size and memory use. What I'm trying to achieve is a background that's something like the one from Time Pilot '84, big enough to look interesting, and without eating up too much memory.

The Time Pilot '84 background is pretty big, 2048 x 2048 pixels judging by images I've found online. At an arcade speed of 60fps, it takes over 30 seconds to scroll all the way across, one pixel at a time. If the tiles were 8 x 8 pixels then the tile map would need to be 64K in size. 16 x 16 pixel tiles would bring that down to 16K, still too big as I would like to fit at least a demo into a 32K machine.

How much background could I fit into, say, 4K?

An 8 x 8 tile requires 16 bytes in PMODE1. We need a good selection of tiles to make an interesting image, so let's say 64 tiles for now. That would require 64 x 8 = 1K, leaving 3K for the tilemap, which lets us have an array of 64 x 48 bytes.

So the final background image would be 8 x 64 = 512 pixels wide and 8 x 48 = 384 pixels high. At the intended speed of 25fps, it would take about 20 seconds to scroll all the way across. That sounds like a reasonable size so I will go with that. If later in the project I find I have more memory available I will have the option of making the map bigger.

For initial testing, I generated the tileset and tilemap by hand, just by typing in fcb directives containing very simple patterns. Clearly this is not going to be very practical for anything complicated, so I would need to get hold of some tile editing software.

I looked around, not really sure what I was looking for. Eventually I settled on Tile Studio by Mike Wiering. For me, the killer feature is the scriptable output, allowing auto-generation of source files for any language. The visuals are a bit on the small size, making it less suitable for designing small tiles on HD displays, but other than that, I find it gets the job done and doesn't get in the way of what I want to do.

Of course, it doesn't matter what tools you use as long as you like using them. The best tools are the ones you know how to use. (And the ones you borrowed from work. They're the best ones as well.)

My artistic skills are somewhat limited but I was able to create a nice-looking background fairly quickly:

Cool. But what is it?

I've used around 90 different tiles, meaning I've overshot the budget a little, using up about 4.5K in total. Still pretty impressive though, considering that the uncompressed image would take up 48K of memory.

As the map is intended to wrap around both horizontally and vertically, the features at the edges have to line up with each other, so I've saved myself a bit of work by placing water across the left & right edges.

One problem with tiled landscapes is trying to avoid repeating patterns. For example if I had used the same 'dirt' tile throughout the cratered landscape, it wouldn't look very convincing as the eye easily picks out repeating features. The trick here is to have five or six different dirt tiles and randomly fill the area with them. That breaks up the repetition enough to hide any patterns. Tile Studio has a random fill function for this very purpose and I would imagine other tile editors have it too.

I should probably confess that I didn't draw the craters from scratch. These are simply colour-reduced-messed-about-with versions of something I liked the look of, just to get a quick result. It could all change when I get to spend more time on it.

Tile be back

OK, we now have several boxes of squeaky new tiles piled up in the hallway and there's just the small matter of sticking those bad boys to the wall. You will need adhesive, a spreader, tile cutter and beer. Three of those are metaphors...

Saturday, 18 March 2017

Copy that: The conclusion

If it aint broke, fix it anyway

Sometimes I just can't leave a piece of code alone. I'm thinking there must always be another optimisation just waiting to be found. Quite often I wouldn't find anything new, but on this occasion I would find one of my favourite optimisations ever.

The copy routine that I had developed so far transferred any number of bytes by splitting the task into two parts: Copy big chunks for speed, then copy byte by byte to hit the target count. It was pretty fast, about 10% slower than a hypothetical unrolled version but I wondered if I could get a bit more out of it. If I could grab a few hundred more cycles then that would buy time for an extra sprite or better sound effects.

My attention turned to the byte by byte copy because it was so slow. If it could be sped up enough then the optimal chunk size could be increased and it would go faster still. The code looks like this:

  lda ,u+  ; 6 cycles
  sta ,s+  ; 6
  decb     ; 2
  bne loop ; 3

Copying 32 bytes with this code takes 32*17 = 544 cycles. That's a long time in a game. The 6809 has 16-bit loads and stores, so why not take advantage? The problem is we need an even number of bytes to copy two bytes at a time. When the problem is stated like that, the solution almost suggests itself. The next version of the code deals with the odd byte first, then continues by copying two bytes at a time:

  lsrb       ; 2 cycles
  bcc stage2 ; 3
  lda ,u+    ; 6
  sta ,s+    ; 6
  tstb       ; 2
  beq done   ; 3
  pulu x     ; 7
  stx ,s++   ; 8
  decb       ; 2
  bne loop   ; 3

It's not quite as easy to figure out the cycle times for this version because the path through the code depends on the number of bytes to be copied. Time for another spreadsheet:

The two-stage code is slower for zero or one byte copied, but after that it's faster, and much faster for higher byte counts. That's cool, I thought to myself, but there's always another optimisation...

It suddenly occurred to me that copying an odd number of words is pretty much the same scenario as copying an odd number of bytes. If that could be evened out, then the way is clear for copying four bytes at a time. I added a third stage to the code:

  lsrb       ; 2 cycles
  bcc stage2 ; 3
  lda ,u+    ; 6
  sta ,s+    ; 6
  lsrb       ; 2
  bcc stage3 ; 3
  pulu x     ; 7
  stx ,s++   ; 8
  tstb       ; 2
  beq done   ; 3
  pulu x,y   ; 9
  stx ,s++   ; 8
  sty ,s++   ; 9
  decb       ; 2
  bne loop   ; 3

What does that do to the spreadsheet?

Another nice speed increase. It was at this point I realised a couple of things: One was that adding more stages would copy larger blocks of data ever faster, and the other thing was that I had turned the byte by byte copy problem into one of copying n bytes again, meaning the copy n bytes routine could be implemented more quickly with this multi-stage approach without even thinking about optimal chunk sizes.

I ended up with an eight stage routine, that copied 128 bytes per loop in the final stage, and was going to need a serious bit of number crunching to count the cycles. This time I created a spreadsheet that ran to over 3000 rows. The first few rows are shown below:

What I was trying to find out was the worst case run time for any value of buffer pointer. Recall that the copy routine has to copy the buffer in two goes to work around the end of the buffer, so I've arranged the spreadsheet as matching pairs of rows with the right hand total column showing the cycle count for a complete buffer copy.

The overhead value is the number of cycles taken by the routine even when there is no work to be done. It's the time spent manipulating and testing the byte count plus adjusting the s register when reaching the stage where octet-sized operations can be used for copying. Because the copy routine is run twice, the overhead has to be added twice. (Again, I haven't included any setup time for the copy, to allow a fair comparison with previous cycle counts)

A quick application of the MAX function revealed the worst case time to be 12342 cycles. That is more than 700 cycles faster than the bytes & chunks approach. I was really happy with that. It was hard to believe that such a weird looking bit of code could be so efficient but the numbers proved it. And sure enough, it looked fast too when running.

Interestingly, the best case run time was 12308 cycles, a difference of only 34 cycles. This means that the copy routine has a very consistent run time, which is a good thing to have in a game.

But what is it?

In summary: For an algorithm where the number of loops is not fixed, there is an efficient method of reducing loop overhead by doing an amount of work equal to the weight of each bit in the count variable. The best number of stages to implement is dependent on the average or maximum count required. (No point having the overhead of eight stages if you only want to copy at most 15 bytes)

I've tried to find out what this technique is called but have not been able to find any references. In my mind I call it 'binary loop decomposition' which sounds dangerously like technobabble. Has anyone else encountered anything like this?

Friday, 17 March 2017

Copy that

Some scenes have been created for entertainment purposes

First, I need to set something up. It would appear to be the fashion on 'reality' television nowadays, that if something unexpected happens or goes wrong in an amusing way, a comedy vinyl record scratch sound will be heard. Not sure how to write that down, so I'm just going to go with "verrrrp".

Anyway, having figured out the basics of the scrolling engine, I thought I would get the easy part out of the way first. All I needed to do was quickly copy a large block of data from the buffer to the display. Easy peasy lemon squeezy...

The fastest method I know of for copying on the 6809 makes use of stack instructions:

  pulu cc,dp,d,x,y  ; read 8 bytes from u
  pshs cc,dp,d,x,y  ; write 8 bytes to s-8
  leas 16,s         ; adjust s for next write

Using the dp register means we have to be careful to not use direct page addressing during the copy routine and using the cc register means we have to switch interrupts off in the PIAs. (Assuming you don't want to get interrupted that is; the outcome won't be good while that leas instruction is there)

The leas instruction could be removed by making the buffer write routines more complex, but I decided to leave that as an optimisation for the future.

Right, we know what we're doing, let's wrap this thing up, go home early and snack out before dinner. I don't mean the mint Oreos, they taste like toothpaste. I'm talking about the Hobnobs at the back of the cupboard that my wife thinks I don't know about.*

  copy stuff
  dec count  ; 7 cycles
  bne loop   ; 3

We need to run those instructions 3072/8 = 384 times. Dec and his unwelcome friend Benny need 10 cycles per loop, that's 3840 cycles just spent looping, ouch, so we'll need to unroll the loop a bit, and, oh wait, we need to wrap the read address round the circular buffer!


No Hobnobs

OK, so it wasn't going to be as easy as I thought. I would like to copy large chunks of data to reduce the loop overhead, but the buffer boundary could be anywhere within one of those chunks, meaning I could overrun the end of the buffer and mess things up.

If the pointer is at the very start of the buffer, we can copy the entire buffer without hitting the end. If the pointer is at the end of the buffer, we can only copy one byte before hitting the end. After hitting the end of the buffer, we need to reset the read address to the start of the buffer and then copy the balance.

The key observation here is that we will only hit the end of the buffer once during the copy and we can calculate in advance when that will happen. The steps could be broken down as follows:

  • Set destination pointer to start of screen
  • Set source pointer equal to buffer pointer
  • Copy n bytes where n = buffer size - buffer pointer
  • Set source pointer equal to buffer start
  • Destination pointer continues from where previous copy ended
  • Copy n bytes where n = buffer pointer - buffer start

So our new problem to solve is this: Find a way of quickly copying any number of bytes.

We would like to combine the speed of copying big chunks of data using stack instructions with the precision of byte by byte copying. So one approach might be to copy big chunks until there is less than one chunk left, then complete the operation byte by byte.

How big should the chunks be? Bigger chunks make the fast part of the copy faster at the cost of making the byte by byte part slower, due to there being more bytes left over after the fast copy. Is there an optimal chunk size that minimises the total time? Only one way to find out...


This looks way too tidy to be one of my spreadsheets...

Just to clarify some terminology: I'm calling a group of eight bytes an octet, the inspiration coming from the word septet describing seven bytes in this great write-up, with a chunk formed from a number of octets. (i.e. a multiple of eight bytes)

The idea behind this spreadsheet is the two copy operations will in effect copy the whole buffer chunk by chunk, except for the chunk containing the end of the buffer. This will have to be copied byte by byte. The spreadsheet calculates the number of cycles to copy the buffer for various sizes of chunk. Here it looks like a chunk size of 64 bytes is optimal.

The great thing about spreadsheets is that you can quickly see the effect of a change. For example, suppose I change my chunk loop code from dec/bne to cmps #/bne, reducing the overhead from 10 to 8 cycles:

Now a chunk size of 32 bytes looks more optimal. It can be surprising the effect a small change makes.

(I should point out that I'm not including any time for setting up the copy, just the time spent copying. This shouldn't affect the results significantly as it would be small, and be a similar amount for each case)

If I set the chunk loop cycles to zero, then the optimal chunk size would be 8, with a total cycle count of 12009. What use is that? We can't have a zero loop overhead. True, but it demonstrates the speed of an unrolled version of the code that has 383 consecutive octet copy operations, a calculated jmp into the middle of them to do just the right amount of copying, followed by 0-7 bytes copied individually. The instructions would take up more than 2K of memory, but it's 1000 cycles faster. That's a trade off that might be worth doing on a 64K machine.

But what if I said there's another way of copying that's a lot closer in speed to the unrolled, zero overhead code but instead of taking up over 2K, it takes around 300 bytes? That sounds too good to be true. I will have the pleasure of attempting to explain how it works in my next post...

*It turns out my wife knew all along I was on to the Hobnobs. Like all good zoo keepers, she hid them to enrich my environment.

Sunday, 12 March 2017

Scrolling 101

The scrolling idea I had developed so far could be broken down as follows:

  • Four screen-sized buffers, each containing the same image, but pixel-shifted different amounts.
  • A pointer that indicates the position of the top-left corner of the display within the buffers.
  • A tile-based map that provides the graphics.
  • Routines that read the map and draw new data into the buffers in the right places to appear at the screen edges as they scroll into view.
  • A fast copy routine that moves the data from one of the buffers to the screen.

I was fairly sure this would work, but I would need to start coding to test the idea. I decided to develop it one small step at a time and start with a single buffer, scrolling whole bytes instead of pixels. This would test that I was drawing into the right places in the buffer.

One problem with talking about scrolling is that the direction of scrolling is not clear. If I say 'scroll left', which way should the screen move? There's a sort of consensus on 'scrolling down' meaning the screen contents moving up, so that is the system I shall use. i.e. whatever direction we scroll in, the screen moves in the opposite direction.

For this sort of thing I find pictures are helpful in getting the numbers right. A thousand words might be a bit of a stretch, so here's a picture that paints 64 bytes: (oh dear...)

This represents a buffer. I have arranged it as an 8 x 8 square to correspond with a hypothetical 8 x 8 display. The real thing would be much larger, but this will serve to illustrate.

The numbers represent the memory addresses of the 64 locations making up the buffer and the highlighted square represents a pointer to one of those addresses. When we copy the buffer to the display, we use the pointer as the start position for the copy. In this case the display would end up looking exactly like the buffer:

Things get more interesting when the pointer starts moving around as this will make the screen contents appear to move. What happens if we add one to the pointer before copying?

The numbers illustrate where in the buffer the data came from. Compared to the previous display, everything has moved one byte to the left i.e. we've scrolled right. I've highlighted the right hand column to indicate where we expect new graphics to appear. These are the locations in the buffer where we should have written new information before copying the buffer to the screen, the start address given by pointer + width - 1.

Note the behaviour is that of a circular buffer. When the copy operation reaches the end of the buffer it continues from the start of the buffer. It will be the same with writing into the buffer. The addresses must be 'wrapped' to always stay in the buffer.

Let's now add the buffer width of eight to the pointer and copy to the display:

This time the display has moved up compared to the previous display and the new data needed to be written starting at pointer - width. Or, as I have just noticed, the old value of the pointer. (Which saves a calculation and a boundary check)

To scroll left we need to subtract one from the pointer and the new data needs to be written starting at the pointer:

That leaves scrolling up by subtracting eight  from the pointer. Again the new data will start at the pointer:

That covers the four main directions. Moving diagonally can be lazily achieved by performing separate horizontal and vertical scroll operations before copying. It's lazy because the data in a corner would get written twice where the horizontal and vertical sections overlap, but it should be possible to selectively draw in the corners depending on the situation.

These concepts can be tested fairly easily with some simple code. Four routines are required to handle each of the four scroll directions. These could be in response to key presses, and each routine could draw it's own colour or pattern to verify that data is being written to the right places.

The vertical scroll routines need to draw a horizontal stripe, with data at consecutive addresses. If the stripe reaches the end of the buffer before the full width has been drawn then it needs to continue from the start of the buffer.

The horizontal scroll routines need to draw a vertical stripe, with the address advancing by an amount equal to the buffer width. If the address falls outside of the buffer, then the buffer size needs to be subtracted to put the address back inside the buffer.

The copy routine can be a simple loop that transfers one byte per loop, checking for the end of the buffer and wrapping if necessary. It will be slow but it will work, or at least easy to debug.

It's a fairly gentle start, but I find all of this useful in understanding the problem properly, which will hopefully help with debugging later!

Monday, 6 March 2017

Game On

Solving puzzles is rewarding. That's why the weekend newspapers have a big pullout section chock full of them. They want those endorphins to flow freely and influence your future newspaper buying habits.

Previously I wrote about finding a faster way of scrolling by using pre-shifted data. I was trying to solve a puzzle, one that might not have a solution, but I had made a little bit of progress, got a little bit of reward, and kept thinking about it.

One thing making it slow was having to OR the tile edges together. A way to mitigate this is to have larger tiles. Less ORing and more SToring. (If I can find a few more rhymes like that then this blog is pretty much going to write itself!)

How about tiles that are four bytes wide? What does that do to the single row code?

  puls a,x,u   ; 10 cycles
  ora ,y       ; 4
  sta ,y       ; 4
  stx 2,y      ; 6
  stu 4,y      ; 6

We're now pointing to the tile data with s so that we can use u as a temporary register. The first tile row now takes 30 cycles, or 32 cycles for subsequent rows. For a PMODE3 or 4 screen, this needs to be repeated 6144 / 4 = 1536 times, taking 32 x 1536 = 49152 cycles per screen. That's about 18 fps and 50% faster than two byte wide tiles. A decent gain just by making the tiles larger. The thing is, I'm looking for a big speed increase. Time for a compromise.

Does the graphics mode really need to be PMODE3 or 4? It's almost ingrained that you use the highest graphics resolution available. Good money was paid for all those pixels so surely the higher the resolution, the better things will look?

A PMODE1 screen has a resolution of 128 x 96. Half the vertical resolution of PMODE3, but the pixels are actually square. At 3072 bytes the display now requires half the memory, and can be updated in half the time. Good Things have been done using that graphics mode, such as GloveFlagon Bird and sixxie's vertical scroll demo.

So I've quickly talked myself into using PMODE1 and traded detail for speed. That doubles the rate to 36 fps. Things are starting to get interesting. I wonder how much memory all these pre-shifted tiles are going to take up?

A PMODE1 four byte wide square tile requires 64 bytes of storage. Shifted versions will be five bytes wide and require 80 bytes. There are four pixels in a PMODE1 byte, so to be able to scroll to any pixel we would need three shifted versions of each tile in addition to the unshifted version. That's 304 bytes per tile. It sounds excessive but you could fit 32 tiles in under 10K and have a chance at producing an interesting background.

I nearly decided that was the way to go, and I think that approach could find good uses, but I still wasn't convinced it would be fast enough for what I was trying to do. To run at 25 fps, there would only be around 10,000 cycles left over for the rest of the game and I haven't accounted for fetching the tile addresses or for drawing partial tiles at the screen edges.

So I started thinking along the lines of having ever larger tiles to reduce the tile overhead but somehow slowly building them on demand out of smaller tiles. It was that line of thinking that led to the solution I settled on: Just have four buffers, each containing the screen image with a different amount of horizontal shift. Drawing the background is reduced to a large copy operation from one of the buffers. Scrolling would be achieved by moving the buffer start position and by drawing a stripe of new image data into the buffer in the right place to appear at the edge of the screen.

This sounded pretty simple. How fast could it run? A large block copy can be done with a set of instructions like this:

  pulu cc,dp,d,x,y   ; 13 cycles
  pshs cc,dp,d,x,y   ; 13
  leas 16,s          ; 5

The leas 16,s instruction is there to compensate for pul & psh working in opposite directions. It can actually be removed if adjustments are made to the code writing into the buffer, which gives a nice improvement in speed for a pure horizontal scroll, but there are complications with combined horizontal and vertical scrolling that offset some of the gain. Maybe a topic for the future.

Using the cc register seems a bit dangerous because it contains the interrupt masks but it's fine as long as interrupts are disabled at the hardware level by programming the PIAs.

Anyway, that's 8 bytes copied every 31 cycles or less than 12,000 cycles for a PMODE1 screen. The data that needs to be written into the buffers amounts to a byte-wide column (or equivalent row) of tile fragments per game loop. This is sounding much faster.

Memory usage is heavy, as we need 12K of shift buffers, but the simplicity of the method is very appealing. Admittedly this is better suited to 64K machines but I still feel I can do something for 32K machines as the buffer requirements can be reduced by not using the whole screen. Just reserving 8 lines of display for score and status reduces the buffers by 1K.

Game On!