Saturday 18 March 2017

Copy that: The conclusion

If it ain't broke, fix it anyway


Sometimes I just can't leave a piece of code alone; I'm convinced there must always be another optimisation just waiting to be found. Quite often I don't find anything new, but on this occasion I found one of my favourite optimisations ever.

The copy routine that I had developed so far transferred any number of bytes by splitting the task into two parts: copy big chunks for speed, then copy byte by byte to hit the target count. It was pretty fast, about 10% slower than a hypothetical unrolled version, but I wondered if I could get a bit more out of it. If I could grab a few hundred more cycles, that would buy time for an extra sprite or better sound effects.

My attention turned to the byte-by-byte copy because it was so slow. If it could be sped up enough, the optimal chunk size could be increased and the whole copy would go faster still. The code looks like this:

loop
  lda ,u+  ; 6 cycles - read a byte from the source, advance u
  sta ,s+  ; 6        - write it to the destination, advance s
  decb     ; 2        - count down
  bne loop ; 3        - until b reaches zero

Copying 32 bytes with this code takes 32*17 = 544 cycles. That's a long time in a game. The 6809 has 16-bit loads and stores, so why not take advantage? The problem is we need an even number of bytes to copy two bytes at a time. When the problem is stated like that, the solution almost suggests itself. The next version of the code deals with the odd byte first, then continues by copying two bytes at a time:

  lsrb       ; 2 cycles - halve the count; the odd bit drops into carry
  bcc stage2 ; 3        - even count, no stray byte to deal with
  lda ,u+    ; 6        - copy the odd byte
  sta ,s+    ; 6
  tstb       ; 2        - retest b (lda/sta changed the Z flag)
stage2
  beq done   ; 3        - nothing left to do
loop
  pulu x     ; 7        - read two bytes from the source
  stx ,s++   ; 8        - write two bytes to the destination
  decb       ; 2        - count down the words
  bne loop   ; 3

It's not quite as easy to figure out the cycle times for this version because the path through the code depends on the number of bytes to be copied. Time for another spreadsheet:

[Spreadsheet: cycle counts for the original and two-stage versions at various byte counts]

The two-stage code is slower for zero or one byte copied, but after that it's faster, and much faster for higher byte counts. That's cool, I thought to myself, but there's always another optimisation...
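
To put a number on it, working through the cycle counts above for a 32-byte copy (ignoring setup, as before): the lsrb, bcc and beq cost 2 + 3 + 3 = 8 cycles, then the two-byte loop runs 16 times at 7 + 8 + 2 + 3 = 20 cycles each, for a total of 8 + 320 = 328 cycles against the 544 of the original loop.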

It suddenly occurred to me that copying an odd number of words is pretty much the same scenario as copying an odd number of bytes. If that could be evened out, then the way would be clear for copying four bytes at a time. I added a third stage to the code:

  lsrb       ; 2 cycles - bottom bit of the count into carry
  bcc stage2 ; 3        - no odd byte
  lda ,u+    ; 6        - copy the odd byte
  sta ,s+    ; 6
stage2
  lsrb       ; 2        - next bit of the count into carry
  bcc stage3 ; 3        - no odd word
  pulu x     ; 7        - copy the odd word
  stx ,s++   ; 8
  tstb       ; 2        - retest b for the beq below
stage3
  beq done   ; 3        - nothing left to do
loop
  pulu x,y   ; 9        - read four bytes from the source
  stx ,s++   ; 8        - write four bytes to the destination
  sty ,s++   ; 9
  decb       ; 2        - count down the 4-byte chunks
  bne loop   ; 3
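
Tracing the same 32-byte copy through this version by hand: the two lsrb/bcc pairs and the beq cost 2 + 3 + 2 + 3 + 3 = 13 cycles, then the four-byte loop runs 8 times at 9 + 8 + 9 + 2 + 3 = 31 cycles each, for 13 + 248 = 261 cycles in total, down from 328 with the two-stage code.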


What does that do to the spreadsheet?

[Spreadsheet: cycle counts with the three-stage version added]

Another nice speed increase. It was at this point that I realised a couple of things: one was that adding more stages would copy larger blocks of data ever faster; the other was that I had turned the byte-by-byte copy problem back into the problem of copying n bytes, meaning the copy n bytes routine itself could be implemented more quickly with this multi-stage approach, without even thinking about optimal chunk sizes.
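
Another way to look at it is that each stage handles one bit of the byte count: the first stage moves one byte if bit 0 is set, the second moves two bytes if bit 1 is set, and the final loop runs count/4 times (rounding down), moving four bytes each pass. A count of 13 (binary 1101), for example, copies one byte in the first stage, skips the second, then runs the four-byte loop three times: 1 + 12 = 13 bytes.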

I ended up with an eight-stage routine that copied 128 bytes per loop iteration in the final stage, and counting the cycles was going to need a serious bit of number crunching. This time I created a spreadsheet that ran to over 3000 rows. The first few rows are shown below:

[Spreadsheet excerpt: the first few rows of the full cycle count sheet]

What I was trying to find out was the worst-case run time for any value of the buffer pointer. Recall that the copy routine has to copy the buffer in two goes to work around the end of the buffer, so I've arranged the spreadsheet as matching pairs of rows, with the right-hand total column showing the cycle count for a complete buffer copy.

The overhead value is the number of cycles taken by the routine even when there is no work to be done. It's the time spent manipulating and testing the byte count, plus the adjustment of the s register on reaching the stage where octet-sized operations can be used for copying. Because the copy routine is run twice, the overhead has to be added twice. (Again, I haven't included any setup time for the copy, to allow a fair comparison with previous cycle counts.)

A quick application of the MAX function revealed the worst-case time to be 12342 cycles. That is more than 700 cycles faster than the bytes & chunks approach. I was really happy with that. It was hard to believe that such a weird-looking bit of code could be so efficient, but the numbers proved it. And sure enough, it looked fast when running, too.

Interestingly, the best-case run time was 12308 cycles, a difference of only 34 cycles. This means the copy routine has a very consistent run time, which is a good thing to have in a game.

But what is it?


In summary: for an algorithm where the number of loop iterations is not fixed, there is an efficient way of reducing loop overhead by doing an amount of work equal to the weight of each bit in the count variable. The best number of stages to implement depends on the average or maximum count required. (There's no point having the overhead of eight stages if you only ever want to copy at most 15 bytes.)
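
For anyone who'd rather see the shape of it outside 6809 assembly, here's a rough sketch in C of a three-stage copy. It's just an illustration of the structure, not the code from the game:

#include <stddef.h>
#include <stdint.h>

/* Sketch of a multi-stage copy, for illustration only.
   Three stages: bits 0 and 1 of the count are handled individually,
   then the loop moves four bytes per pass. */
void copy_n(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (n & 1) {                            /* stage 1: the odd byte */
        *dst++ = *src++;
    }
    if (n & 2) {                            /* stage 2: the odd word */
        *dst++ = *src++;
        *dst++ = *src++;
    }
    for (size_t i = n >> 2; i != 0; i--) {  /* stage 3: four bytes at a time */
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}

Scaled up to eight stages there are seven of those conditional steps (1, 2, 4, up to 64 bytes) before the loop that moves 128 bytes per pass, so a 200-byte copy, for instance, breaks down as 64 + 8 plus a single pass of the 128-byte loop.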

I've tried to find out what this technique is called but have not been able to find any references. In my mind I call it 'binary loop decomposition', which sounds dangerously like technobabble. Has anyone else encountered anything like this?
