Friday 17 March 2017

Copy that

Some scenes have been created for entertainment purposes


First, I need to set something up. It would appear to be the fashion on 'reality' television nowadays that if something unexpected happens or goes wrong in an amusing way, a comedy vinyl record scratch sound will be heard. Not sure how to write that down, so I'm just going to go with "verrrrp".

Anyway, having figured out the basics of the scrolling engine, I thought I would get the easy part out of the way first. All I needed to do was quickly copy a large block of data from the buffer to the display. Easy peasy lemon squeezy...

The fastest method I know of for copying on the 6809 makes use of stack instructions:

  pulu cc,dp,d,x,y  ; read 8 bytes from u (13 cycles)
  pshs cc,dp,d,x,y  ; write 8 bytes to s-8 (13 cycles)
  leas 16,s         ; adjust s for next write (5 cycles)
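
That's 31 cycles for eight bytes, a shade under 4 cycles per byte. For comparison, a straightforward lda ,u+ / sta ,x+ pair costs 12 cycles per byte before any loop overhead, so the stack trick is roughly three times faster.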

Using the dp register means we have to be careful not to use direct page addressing during the copy routine, and using the cc register means we have to switch interrupts off in the PIAs, as any interrupt mask we set in cc would be overwritten by every pulu. (Assuming you don't want to get interrupted, that is; the outcome won't be good while that leas instruction is there, as an interrupt would stack the machine state at wherever s happens to be pointing, squashing bytes we've just written.)

The leas instruction could be removed by making the buffer write routines more complex, but I decided to leave that as an optimisation for the future.
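
(For the curious, here's an untested sketch of one way that might work: walk the screen from its top address downwards, so that pshs's built-in decrement of s does the pointer adjustment for free. The catch is that consecutive octets then land at descending addresses, so the buffer write routines would have to lay the octets out in reverse order, hence the extra complexity.)

  pulu cc,dp,d,x,y  ; read 8 bytes, u moves up (13 cycles)
  pshs cc,dp,d,x,y  ; write 8 bytes, s moves down (13 cycles)
  pulu cc,dp,d,x,y  ; next octet straight away...
  pshs cc,dp,d,x,y  ; ...26 cycles per octet instead of 31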

Right, we know what we're doing, let's wrap this thing up, go home early and snack out before dinner. I don't mean the mint Oreos, they taste like toothpaste. I'm talking about the Hobnobs at the back of the cupboard that my wife thinks I don't know about.*

loop
  copy stuff
  dec count  ; 7 cycles
  bne loop   ; 3

We need to run those instructions 3072/8 = 384 times. Dec and his unwelcome friend Benny need 10 cycles per loop, that's 3840 cycles just spent looping, ouch, so we'll need to unroll the loop a bit, and, oh wait, we need to wrap the read address round the circular buffer!

verrrrp

No Hobnobs


OK, so it wasn't going to be as easy as I thought. I would like to copy large chunks of data to reduce the loop overhead, but the buffer boundary could be anywhere within one of those chunks, meaning I could overrun the end of the buffer and mess things up.

If the pointer is at the very start of the buffer, we can copy the entire buffer without hitting the end. If the pointer is at the end of the buffer, we can only copy one byte before hitting the end. After hitting the end of the buffer, we need to reset the read address to the start of the buffer and then copy the balance.

The key observation here is that we will only hit the end of the buffer once during the copy and we can calculate in advance when that will happen. The steps could be broken down as follows, with a code sketch after the list:

  • Set destination pointer to start of screen
  • Set source pointer equal to buffer pointer
  • Copy n bytes where n = buffer end - buffer pointer
  • Set source pointer equal to buffer start
  • Destination pointer continues from where previous copy ended
  • Copy n bytes where n = buffer pointer - buffer start
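
In 6809 terms, the setup might look something like this. It's just a sketch: buffer, bufptr and screen are placeholder names, and copynb is the yet-to-be-written "copy n bytes" routine, taking a count in d, a source in u and a destination in y (ignoring for now that the fast version will want to hijack s as the destination):

BUFSIZE equ 3072

  ldy #screen         ; destination: start of screen
  ldu bufptr          ; source: current buffer pointer
  ldd #buffer+BUFSIZE
  subd bufptr         ; n = buffer end - buffer pointer
  bsr copynb          ; first copy, up to the end of the buffer
  ldu #buffer         ; source wraps back to the buffer start
  ldd bufptr
  subd #buffer        ; n = buffer pointer - buffer start
  bsr copynb          ; second copy; y carries on where the first ended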

So our new problem to solve is this: Find a way of quickly copying any number of bytes.

We would like to combine the speed of copying big chunks of data using stack instructions with the precision of byte by byte copying. So one approach might be to copy big chunks until there is less than one chunk left, then complete the operation byte by byte.
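
As an untested sketch, that shape of routine might look like the following, with a 64 byte chunk size picked out of thin air for now. chunks and rest are placeholder memory locations holding n divided into whole chunks (n/64) and leftover bytes (n mod 64), u is the source and s the destination. Once s is hijacked there's no bsr/rts to be had, so in practice this would be jumped to or inlined, with interrupts already off:

copynb64
  tst chunks        ; any whole 64 byte chunks to copy?
  beq bytes
chunkloop
  pulu cc,dp,d,x,y  ; octet 1 of 8
  pshs cc,dp,d,x,y
  leas 16,s
  ; ...octets 2 to 7 go here, exactly the same...
  pulu cc,dp,d,x,y  ; octet 8 of 8
  pshs cc,dp,d,x,y
  leas 16,s
  dec chunks        ; 7 cycles
  bne chunkloop     ; 3 cycles
bytes
  ldb rest          ; 0-63 bytes left over
  beq done
byteloop
  lda ,u+           ; 6 cycles
  sta ,s+           ; 6 cycles
  decb              ; 2 cycles
  bne byteloop      ; 3 cycles
done                ; carry on with whatever comes next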

How big should the chunks be? Bigger chunks make the fast part of the copy faster at the cost of making the byte by byte part slower, due to there being more bytes left over after the fast copy. Is there an optimal chunk size that minimises the total time? Only one way to find out...

Spreadsheet!

[spreadsheet screenshot: cycles to copy the buffer for various chunk sizes]

This looks way too tidy to be one of my spreadsheets...

Just to clarify some terminology: I'm calling a group of eight bytes an octet, the inspiration coming from the word septet describing seven bytes in this great write-up, with a chunk formed from a number of octets (i.e. a multiple of eight bytes).

The idea behind this spreadsheet is that the two copy operations will, in effect, copy the whole buffer chunk by chunk, except for the chunk containing the end of the buffer, which will have to be copied byte by byte. The spreadsheet calculates the number of cycles to copy the buffer for various chunk sizes. Here it looks like a chunk size of 64 bytes is optimal.
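
(For anyone wanting to rebuild the spreadsheet: a cost model along these lines reproduces its numbers. Copying one octet costs 31 cycles as above, copying one byte in the slow part costs 17 cycles (lda ,u+ / sta ,s+ / decb / bne at 6 + 6 + 2 + 3), and each chunk pays the loop overhead once. For a 3072 byte buffer and a chunk size of c bytes:

  cycles = (3072/c - 1) * (31*c/8 + loop overhead) + 17*c

With the 10 cycle dec/bne overhead and c = 64, that's 47 * (248 + 10) + 1088 = 13214 cycles; the same sum gives 13274 for c = 32 and 13814 for c = 128, which is why 64 comes out on top.)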

The great thing about spreadsheets is that you can quickly see the effect of a change. For example, suppose I change my chunk loop code from dec/bne to cmps #/bne (comparing s against the address where the chunked part of the copy should finish, instead of counting chunks), reducing the overhead from 10 to 8 cycles:

[spreadsheet screenshot: the same cycle counts with an 8 cycle loop overhead]

Now a chunk size of 32 bytes looks more optimal. It can be surprising what a difference a small change makes.
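
For reference, the tail of the chunk loop would become something like this, where chunkend stands for whatever address the chunked part of this particular copy is due to finish at, presumably computed or patched in per copy:

  cmps #chunkend    ; has s reached the end of the chunked part? (5 cycles)
  bne chunkloop     ; (3 cycles)

The nice part is that s, already in use as the write pointer, doubles as the progress counter for free.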

(I should point out that I'm not including any time for setting up the copy, just the time spent copying. This shouldn't affect the results significantly, as the setup time would be small and much the same in each case.)

If I set the chunk loop cycles to zero, then the optimal chunk size would be 8, with a total cycle count of 12009. What use is that? We can't have a zero loop overhead. True, but it demonstrates the speed of an unrolled version of the code: 383 consecutive octet copy operations, a calculated jmp into the middle of them to do just the right amount of copying, followed by 0-7 bytes copied individually. The instructions would take up more than 2K of memory, but it's around 1000 cycles faster. That's a trade-off that might be worth making on a 64K machine.
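
A sketch of what I mean by a calculated jmp (untested; skip and temp are placeholder memory locations, with skip holding the 16 bit count of octet copies to jump over, and the multiply by 7 comes from each pulu/pshs/leas group assembling to 2 + 2 + 3 = 7 bytes, which is also where the "more than 2K" comes from: 383 * 7 = 2681 bytes):

  ldd skip          ; octet copies to jump over, 0-382
  std temp
  aslb              ; shift d left three times...
  rola
  aslb
  rola
  aslb
  rola              ; ...giving d = skip * 8
  subd temp         ; d = skip * 7 = byte offset into the unrolled code
  ldx #copies
  jmp d,x           ; land on exactly the right octet copy
copies
  pulu cc,dp,d,x,y  ; octet copy 1 of 383
  pshs cc,dp,d,x,y
  leas 16,s
  ; ...another 382 of those...

The 0-7 leftover bytes would then be copied individually at the end of the run.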

But what if I said there's another way of copying that's a lot closer in speed to the unrolled, zero overhead code but instead of taking up over 2K, it takes around 300 bytes? That sounds too good to be true. I will have the pleasure of attempting to explain how it works in my next post...


*It turns out my wife knew all along I was on to the Hobnobs. Like all good zoo keepers, she hid them to enrich my environment.

2 comments:

  1. Another great blog Stew! I really admire your meticulous approach to problem solving. Also the last line about `zoo keepers' is comedy gold! LOL :D

    1. Thanks! Good to know I'm not just amusing myself :)
