Difference between revisions of "Z80:Advanced Math"
Xeda112358 (talk | contribs) |
Xeda112358 (talk | contribs) m (→Exponentiation) |
||
(3 intermediate revisions by the same user not shown) | |||
Line 97: | Line 97: | ||
ret | ret | ||
− | And you might realise that multiplying an 8-bit number by a 16-bit number can yield up to a 24-bit number, so to take care of that overflow by turning rlca into adc a,a which is the same size and speed. Note that adc a,a can be replaced with rla for the same effect. | + | And you might realise that multiplying an 8-bit number by a 16-bit number can yield up to a 24-bit number, so to take care of that overflow by turning rlca into adc a,a which is the same size and speed. Note that adc a,a can be replaced with rla for the same effect. Some overflow may occur from ''add hl,de'', so we take care of that, too |
.module DE_Times_A | .module DE_Times_A | ||
Line 103: | Line 103: | ||
ld hl,0 ; Use HL to store the product | ld hl,0 ; Use HL to store the product | ||
ld b,8 ; Eight bits to check | ld b,8 ; Eight bits to check | ||
+ | ld c,h | ||
_loop: | _loop: | ||
add hl,hl | add hl,hl | ||
− | adc a,a | + | adc a,a ; Check most-significant bit of accumulator |
jr nc,_skip ; If zero, skip addition | jr nc,_skip ; If zero, skip addition | ||
add hl,de | add hl,de | ||
+ | adc a,c ;add in overflow as needed. | ||
_skip: | _skip: | ||
djnz _loop | djnz _loop | ||
ret | ret | ||
− | |||
== Division == | == Division == | ||
Line 141: | Line 142: | ||
== Addition and Subtraction == | == Addition and Subtraction == | ||
As with integer arithmetic, addition and subtraction are trivial on the Z80. To add H.L+D.E, just use add hl,de, and for subtraction, 'or a \ sbc hl,de' will sufice. | As with integer arithmetic, addition and subtraction are trivial on the Z80. To add H.L+D.E, just use add hl,de, and for subtraction, 'or a \ sbc hl,de' will sufice. | ||
− | == Multiplication == | + | == Multiplication == |
+ | H.L*D.E is essentially HL*DE/65536 | ||
If we view HL as a 16 bit integer, then H.L is essentially HL/256. For multiplication, the math for H.L*D.E is then: | If we view HL as a 16 bit integer, then H.L is essentially HL/256. For multiplication, the math for H.L*D.E is then: | ||
(HL/256)(DE/256) = (HL*DE)/65536. | (HL/256)(DE/256) = (HL*DE)/65536. | ||
− | This means we can use a 16-bit multiplication routine that returns 32 bits and we now have a 16.16 fixed point result. To keep output as an 8.8 fixed point number, we need to grab the middle bytes around the decimal. If output was DE.HL, then E.H is a our 8.8 result. Overflow and underflow should be handled by the programmer depending on the task. | + | This means we can use a 16-bit '''signed''' multiplication routine that returns 32 bits and we now have a 16.16 fixed point result. To keep output as an 8.8 fixed point number, we need to grab the middle bytes around the decimal. If output was DE.HL, then E.H is a our 8.8 result. Overflow and underflow should be handled by the programmer depending on the task. |
'''note:''' Really, only a routine to return the lower 24-bits of the 16x16 multiplication is needed. | '''note:''' Really, only a routine to return the lower 24-bits of the 16x16 multiplication is needed. | ||
− | Allowing speed to take priority, here is a | + | Allowing speed to take priority, here is a 16-bit '''unsigned''' integer multiplication: |
BC_Times_DE: | BC_Times_DE: | ||
− | ; BHLA | + | ; BC*DE->BHLA |
ld a,b | ld a,b | ||
− | |||
ld hl,0 | ld hl,0 | ||
ld b,h | ld b,h | ||
− | + | add a,a \ jr nc,$+5 \ ld h,d \ ld l,e | |
− | add a,a | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b | |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b | |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b | |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b | |
− | add hl,hl | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b | |
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
push hl | push hl | ||
ld h,b | ld h,b | ||
Line 211: | Line 170: | ||
ld a,c | ld a,c | ||
ld c,h | ld c,h | ||
− | + | add a,a \ jr nc,$+5 \ ld h,d \ ld l,e | |
− | add a,a | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c | |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c | |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c | |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c | |
− | add hl,hl | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c |
− | + | add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c | |
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | add hl,hl | ||
− | |||
− | |||
− | |||
− | |||
− | |||
pop de | pop de | ||
− | |||
ld c,a | ld c,a | ||
ld a,l | ld a,l | ||
Line 268: | Line 186: | ||
ret nc | ret nc | ||
inc b | inc b | ||
− | |||
ret | ret | ||
− | + | == Division == | |
For division, we should notice that H.L/D.E evaluates as: | For division, we should notice that H.L/D.E evaluates as: | ||
(HL/256)/(DE/256)=HL/DE | (HL/256)/(DE/256)=HL/DE | ||
− | We could use a standard integer division routine, however, unlike with multiplication, the result can only be used a 16.0 integer. We need 8 more bits of accuracy for 16.8 output (of which we only use the lower 8-bits of the integer part). Since we are going to discard the upper 8 bits | + | We could use a standard signed integer division routine, however, unlike with multiplication, the result can only be used a 16.0 integer. We need 8 more bits of accuracy for 16.8 output (of which we only use the lower 8-bits of the integer part). Since we are going to discard the upper 8 bits of the acquired result, we can optimise the first 8 iterations to not store the bits anywhere. |
DE_Div_BC_88: | DE_Div_BC_88: | ||
Line 281: | Line 198: | ||
;Outputs: | ;Outputs: | ||
; DE is the 8.8 Fixed Point result (rounded to the least significant bit) | ; DE is the 8.8 Fixed Point result (rounded to the least significant bit) | ||
− | + | ld a,d | |
− | + | xor b | |
+ | push af | ||
+ | call m,negDE | ||
+ | call div8sub | ||
+ | pop af | ||
+ | ret p | ||
+ | negDE: | ||
+ | xor a | ||
+ | sub e | ||
+ | ld e,a | ||
+ | sbc a,a | ||
+ | sub d | ||
+ | ld d,a | ||
+ | ret | ||
+ | div8sub: | ||
+ | ld a,8 | ||
+ | ld hl,0 | ||
Loop1: | Loop1: | ||
− | + | rl d | |
− | + | adc hl,hl | |
− | + | sbc hl,bc | |
− | + | jr nc,$+3 | |
add hl,bc | add hl,bc | ||
− | + | dec a | |
− | + | jr nz,Loop1 | |
− | + | ld d,e | |
− | + | ld e,a | |
− | + | ld a,16 | |
− | + | jp $+6 | |
− | |||
− | |||
DivLoop: | DivLoop: | ||
− | + | add hl,bc | |
− | + | dec a | |
− | + | ret z | |
− | + | sla e | |
− | + | rl d | |
− | + | adc hl,hl | |
− | |||
jr c,overflow | jr c,overflow | ||
− | + | sbc hl,bc | |
− | + | jr c,DivLoop | |
− | + | inc e | |
− | + | jp DivLoop+1 | |
overflow: | overflow: | ||
or a | or a | ||
Line 501: | Line 431: | ||
== Exponentiation and Logarithms == | == Exponentiation and Logarithms == | ||
=== Exponentiation === | === Exponentiation === | ||
− | Here, exponentiation refers to 2^ | + | Here, exponentiation refers to <math>2^{x}</math>. |
− | Observe that 2^ | + | Observe that <math>2^{39/16}=2^{2+7/16} = 2^{2+0/2+1/4+1/8+1/16}=2^{2}2^{0/2}2^{1/4}2^{1/8}2^{1/16}</math> |
− | + | The observant reader will see that taking away the trivial integer part of the power, the fractional part can be represented as a sequence of square roots with 2s wherever there are 1s in the binary expansion. So to compute <math>2^{x}</math> where x is on [0,1), in pseudo code we have: | |
− | The observant reader will see that taking away the trivial integer part of the power, the fractional part can be represented as a sequence of square roots with 2s wherever there are 1s in the binary expansion. So to compute 2^ | ||
− | If | + | If x==0 |
− | + | return 1.0 | |
− | + | repeat until carry is set | |
− | + | x>>=1 ;shift X right | |
− | + | y=sqrt(2) | |
− | + | while x>0 | |
− | + | x>>=1 | |
− | + | if (carry is set) | |
− | + | y*=2 | |
− | + | y=sqrt(y) | |
− | |||
− | |||
− | |||
Converting that to assembly: | Converting that to assembly: |
Latest revision as of 00:36, 24 April 2016
Integer Math
Addition/Subtraction
To do addition or subtraction on an integer of arbitrary size, we use ADC.
Here we add two 32-bit integers:
LD HL, (word1) ; Get least-significant word of word1 LD DE, (word2) ; Get least-significant word of word2 ADD HL, DE ; Add them LD (result), HL ; Store least-significant result LD HL, (dword1 + 2) ; Get most-significant word of word1 LD DE, (dword2 + 2) ; Get most-significant word of word2 ADC HL, DE ; Add them with the carry from the previous addition LD (result + 2), HL ; Store most-significant result RET word1: .DB $B3, $90, $12, $32 ; Each word is stored with the least-significant word2: .DB $F1, $15, $06, $B8 ; bytes first. You could just as easily have stored result: .DB $00, $00, $00, $00 ; them in big-endian, but because of how registers are ; loaded from RAM, it wouldn't work.
But now, what if you wanted to add truly arbitrary sized integers? We might do something like this:
ld hl,Num1 ld de,Num2 ld bc,Size AddLoop: ;Inputs: ; HL points to Num1 (stored little-endian) ; DE points to Num2 (stored little-endian) ; BC is the size of the integers ;Outputs: ; BC is 0 ; HL points to the byte after Num1 ; DE points tot he byte after Num2 ; c flag is the carry ; z flag is set ; p/v flag reset ; Num1 is overwritten with the result ld a,(de) adc a,(hl) ld (hl),a inc de cpi ;increment HL, decrement BC, set parity flag of BC becomes 0 jp pe,AddLoop ret Num1: .db $B3,$90,$12,$32,$47 Num2: .db $F1,$15,$06,$B8,0
Similarly, one can perform arbitrary subtractions.
Multiplication
For multiplication, we could just add them in a DJNZ loop. But that is slow for multiplying big numbers. There is a much more efficient algorithm that you probably learned as a child in school that involves multiplying multi-digit numbers by hand. Similarly, we will do that here by using the binary representation of numbers and the astute reader will realise that binary digits are only 1 or 0 and multiplying by 1 or 0 is rather trivial!
To understand this algorithm, it is suggested that the reader try this by hand with some simple 4-bit numbers, then follow the algorithm or try to develop it in pseudocode.
Here is the first example:
.module DE_Times_A ;Note that we choose DE and A because of the readily available instructions that make this even more efficient. DE_Times_A: ; HL = DE × A LD HL, 0 ; Use HL to store the product LD B, 8 ; Eight bits to check _loop: RRCA ; Check least-significant bit of accumulator JR NC, _skip ; If zero, skip addition ADD HL, DE _skip: SLA E ; Shift DE one bit left RL D ;In binary, a shift left is multiplying by 2, just like a shift left in decimal is like multiplying by 10. ;We essentially added a 0 to the end of the number DJNZ _loop RET
Or alternatively for a slightly faster and smaller code:
.module DE_Times_A DE_Times_A: ; HL = DE × A ld hl,0 ; Use HL to store the product ld b,8 ; Eight bits to check _loop: add hl,hl rlca ; Check most-significant bit of accumulator jr nc,_skip ; If zero, skip addition add hl,de _skip: djnz _loop ret
And you might realise that multiplying an 8-bit number by a 16-bit number can yield up to a 24-bit number, so to take care of that overflow by turning rlca into adc a,a which is the same size and speed. Note that adc a,a can be replaced with rla for the same effect. Some overflow may occur from add hl,de, so we take care of that, too
.module DE_Times_A DE_Times_A: ; AHL = DE × A ld hl,0 ; Use HL to store the product ld b,8 ; Eight bits to check ld c,h _loop: add hl,hl adc a,a ; Check most-significant bit of accumulator jr nc,_skip ; If zero, skip addition add hl,de adc a,c ;add in overflow as needed. _skip: djnz _loop ret
Division
Division is also just like base-10 division as well, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (1) or not (0) ?"
Again, it is recommended that the reader try this with small examples to understand and possibly develop their own code.
.module Div_HL_D Div_HL_D: ; HL = HL ÷ D, A = remainder XOR A ; Clear upper eight bits of AHL LD B, 16 ; Sixteen bits in dividend _loop: ADD HL, HL ; Do a SLA HL. If the upper bit was 1, the c flag is set RLA ; This moves the upper bits of the dividend into A JR C, _overflow CP D ; Check if we can subtract the divisor JR C, _skip ; Carry means D > A _overflow: SUB D ; Do subtraction for real this time INC L ; Set the next bit of the quotient (currently bit 0) _skip: DJNZ _loop RET
Fixed Point Math
Fixed Point refers to a number format that includes a fixed size integer part, and a fixed size fractional part. For example, 8.8 fixed point refers to a format where the upper 8 bits form the integer part, the lower 8 bits form the fractional part. Fixed point math often provides a desired balance between the numerical precision of floating point and the computing speed of integer maths.
To elaborate a little more, to represent 3.0 as a fixed point number stored in HL, we might have HL=0300, where the decimal point is between H and L (03.00). Then HL=0341 would be equivalent to 3+65/256=3.25390625. If we were to square this, we get HL=0A96 where we have lost some bits of precision.
Addition and Subtraction
As with integer arithmetic, addition and subtraction are trivial on the Z80. To add H.L+D.E, just use add hl,de, and for subtraction, 'or a \ sbc hl,de' will sufice.
Multiplication
H.L*D.E is essentially HL*DE/65536 If we view HL as a 16 bit integer, then H.L is essentially HL/256. For multiplication, the math for H.L*D.E is then: (HL/256)(DE/256) = (HL*DE)/65536. This means we can use a 16-bit signed multiplication routine that returns 32 bits and we now have a 16.16 fixed point result. To keep output as an 8.8 fixed point number, we need to grab the middle bytes around the decimal. If output was DE.HL, then E.H is a our 8.8 result. Overflow and underflow should be handled by the programmer depending on the task. note: Really, only a routine to return the lower 24-bits of the 16x16 multiplication is needed.
Allowing speed to take priority, here is a 16-bit unsigned integer multiplication:
BC_Times_DE: ; BC*DE->BHLA ld a,b ld hl,0 ld b,h add a,a \ jr nc,$+5 \ ld h,d \ ld l,e add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b push hl ld h,b ld l,b ld b,a ld a,c ld c,h add a,a \ jr nc,$+5 \ ld h,d \ ld l,e add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c pop de ld c,a ld a,l ld l,h ld h,c add hl,de ret nc inc b ret
Division
For division, we should notice that H.L/D.E evaluates as: (HL/256)/(DE/256)=HL/DE We could use a standard signed integer division routine, however, unlike with multiplication, the result can only be used a 16.0 integer. We need 8 more bits of accuracy for 16.8 output (of which we only use the lower 8-bits of the integer part). Since we are going to discard the upper 8 bits of the acquired result, we can optimise the first 8 iterations to not store the bits anywhere.
DE_Div_BC_88: ;Inputs: ; DE,BC are 8.8 Fixed Point numbers ;Outputs: ; DE is the 8.8 Fixed Point result (rounded to the least significant bit) ld a,d xor b push af call m,negDE call div8sub pop af ret p negDE: xor a sub e ld e,a sbc a,a sub d ld d,a ret div8sub: ld a,8 ld hl,0 Loop1: rl d adc hl,hl sbc hl,bc jr nc,$+3 add hl,bc dec a jr nz,Loop1 ld d,e ld e,a ld a,16 jp $+6 DivLoop: add hl,bc dec a ret z sla e rl d adc hl,hl jr c,overflow sbc hl,bc jr c,DivLoop inc e jp DivLoop+1 overflow: or a sbc hl,bc inc e jp DivLoop
Inverse (1/x)
If we want to invert a number >1, we know that its result will be <1, so for 8.8 fixed point, this means there will only be 8 bits in the result. We can optimise out most of the cycles of the division, making it much faster. In the DE_Div_BC_88 routine, we initialise HL=0 and rotate a bit of DE into HL at each iteration. If DE=100h=1.0, then no bits will get rotated into HL until the 8th iteration. Then after that, HL=1, but we know BC>256, so we are guaranteed to have nothing happen for another 8 iterations until HL=256. At this point, we have only 8 more iterations to compute, and DE is 0, so we no longer need to even rotate DE into HL. So now at this point, we can just initialise HL=256, then rotate in a 0 each iteration. This can be done with add hl,hl, then we compare to BC to see if the remainder (here, HL). However, since we now only need two register pairs, lets use B for the iteration counter, and DE is the denominator. If we compare HL to DE, we can use sbc hl,de. If the C flag is set, it means HL<DE, so we need to rotate a 0 into our accumulator and then add DE to HL again. We could use sbc hl,de \ jr nc,$+3 \ add hl,de \ ccf \ rla, but if we note that we have a cpl instruction to invert the bits of a:
DEgt1_Inv: ;1/D.E->B.C ; 470 to 518 t-states ld hl,256 ld b,8 InvLoop: add hl,hl sbc hl,de jr nc,$+3 add hl,de rla djnz InvLoop cpl ld c,a ret
And to unroll, saving 102 t-states:
DEgt1_Inv: ;1/D.E->B.C ; 343 to 397 t-states ld hl,512 ld b,l InvLoop: or a \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ adc a,a cpl ld c,a ret
Note that this actually works as an HL_Div_DE_88 routine, by removing the ld hl,256 and making sure HL<DE.
Square Root
Using the integer square root of HL, sqrt(H.L) would be returned in 4.4 fixed point format. The upper 4 bits of the integer part will be 0 regardless, so we can essentially say that we have 8.4 Fixed Point, so we need to modify the algorithm to iterate 4 more times to get 8.8 fixed point output. The following is adapted from a speed-optimised routine used in a floating point library (which required 16 bits instead of 12):
SqrtHL_prec12: ;input: HL ;Output: HL ;183 bytes xor a ld b,a ld e,l ld l,h ld h,a add hl,hl add hl,hl cp h jr nc,$+5 dec h ld a,4 add hl,hl add hl,hl ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c ld l,e add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add hl,hl add hl,hl add a,a ld c,a sub h jr nc,$+6 cpl ld h,a inc c inc c ld a,c add a,a \ ld c,a add hl,hl add hl,hl jr nc,$+6 sub h \ jp $+6 sub h jr nc,$+6 inc c \ inc c cpl ld h,a ld a,l ld l,h add a,a ld h,a adc hl,hl adc hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;iteration 9 add hl,hl \ add hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a add hl,hl \ add hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a add hl,hl \ add hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a add hl,hl \ add hl,hl sll c \ rl b sbc hl,bc jr nc,$+3 add hl,bc sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a ;12th iteration completed ; output in BC srl b \ rr c ld h,b ld l,c ret
Exponentiation and Logarithms
Exponentiation
Here, exponentiation refers to <math>2^{x}</math>. Observe that <math>2^{39/16}=2^{2+7/16} = 2^{2+0/2+1/4+1/8+1/16}=2^{2}2^{0/2}2^{1/4}2^{1/8}2^{1/16}</math> The observant reader will see that taking away the trivial integer part of the power, the fractional part can be represented as a sequence of square roots with 2s wherever there are 1s in the binary expansion. So to compute <math>2^{x}</math> where x is on [0,1), in pseudo code we have:
If x==0 return 1.0 repeat until carry is set x>>=1 ;shift X right y=sqrt(2) while x>0 x>>=1 if (carry is set) y*=2 y=sqrt(y)
Converting that to assembly:
;Inputs: ; HL is the 8.8 fixed point number 'x' for 2^x ;Outputs: ; DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max. ld a,l or a push hl ;save H for later, H is the integer part of the power ld hl,1 jr z,integerpow scf ;set the carry flag so that a bit is rotated into a. This will act as our counter. ;wait until we come across the lowest bit. Also note that we rra jr nc,$-1 ld hl,2*256 powloop: push af call FPSqrtHL ;returns in HL pop af srl a jr z,integerpow jr nc,powloop add hl,hl jp powloop integerpow: pop bc ;Now b is the integer part for 2^x that we need to multiply HL by. ld de,0 ld a,b or a ret z add hl,hl rl e \ rl d \ jr c,wayoverflow djnz $-7 ret wayoverflow: ld hl,-1 ld d,h ld e,l ret
Note that in the spirit of optimisation, that we can significantly speed up the fixed point sqrtHL routine since HL is never greater than 4.0.
Logarithms
Here, logarithm refers to log,,2,, also written as lg(). Using logarithms will allow us to extend the exponential routine to other bases, such as 3^n without much difficulty. Here is a fast, fixed point log_2 routine, optimised for size:
Log_2_88_size: ;Inputs: ; HL is an unsigned 8.8 fixed point number. ;Outputs: ; HL is the signed 8.8 fixed point value of log base 2 of the input. ;Example: ; pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...) ;averages about 39 t-states slower than original ;62 bytes ex de,hl ld hl,0 ld a,d ld c,8 or a jr z,DE_lessthan_1 srl d jr z,logloop-1 inc l rr e jr $-7 DE_lessthan_1: ld a,e dec hl or a ret z inc l dec l add a,a jr nc,$-2 ld e,a inc d logloop: add hl,hl push hl ld h,d ld l,e ld a,e ld b,8 add hl,hl rla jr nc,$+5 add hl,de adc a,0 djnz $-7 ld e,h ld d,a pop hl rr a ;this is NOT supposed to be rra, we need the z flag affected jr z,$+7 srl d rr e inc l dec c jr nz,logloop ret
Arctangent and Natural Logarithm
Although these are not the fastest methods, the approach to the problem of computing these functions efficiently is interesting. As well, they can work faster than CORDIC or AGM methods. Given efficient routines for computing multiplication and division, the following are accurate to better than 8 bits of accuracy:
atan(x) : x(9+2x^2)/(9+5x^2) on [-1,1] ln(x+1) : x(8+x)/(8+5x) on [0,1]
For a more complete analysis of how these formulas were derived, see this document. What is convenient to note is that the constant multiplications can be reduced to a few shift and add instuctions. So to put the functions into code:
ArcTan_88: ;Inputs: DE is positive ld a,d or a jr z,ArcTan+2 inc a jr z,ArcTan+2 ld b,d ld c,e ld de,256 call DE_Div_BC_88 call ArcTan ld de,402 ;about pi/2 ex de,hl or a sbc hl,de ret ArcTan: ;Input: DE ;Output: HL push de ;save X for later ld b,d ld c,e call BC_Times_DE ;x^2 computed ld d,h ld e,l add hl,hl add hl,hl add hl,de ;5x^2 is now in HL ld a,9 add a,h ld h,a ;9+5x^2->HL ld a,9 sla e rl d add a,d ld d,a ;9+2x^2->DE ld b,h ld c,l call DE_Div_BC_88 ;->DE pop bc jp BC_Times_DE
For the following ln() routine, the approximating function is employed taking advantage of lots of optimisations, putting it between the computing speed of multiplication and division:
lognat: ;Input: H.L needs to be on [1,2] ;Output: H.L if z flag is set, else if nz, no result ;avg speed: 677 t-states dec h dec h jr nz,$+8 inc l \ dec l ret nz ld l,177 ret inc h ret nz ld b,h ld c,l ld e,l ld d,8 add hl,hl add hl,hl add hl,de ex de,hl add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ adc a,a ld h,a \ ld l,b sla h \ jr c,$+3 \ ld l,c add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc rl l ld a,h adc a,b ld l,a ld h,b cp a ret
But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):
lognat: ;Input: H.L needs to be on (0,128.0) ;Output: H.L if c flag set ; returns nc if input is negative (HL not modified) ;Error: ; The error on the outputs is as follows: ; 20592 inputs are exact ; 12075 inputs are off by 1/256 ; 100 inputs are off by 2/256 ; So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256. ;Size: 177 bytes ;Speed: average speed is less than 1250 t-states ld a,h \ or l \ jr nz,$+5 ld h,80h \ ret dec h dec h jr nz,$+9 inc l \ dec l jr nz,normalizeln-1 ld l,177 ret inc h jr nz,normalizeln ld b,h ld c,l ld e,l ld d,8 add hl,hl add hl,hl add hl,de ex de,hl call HL_Div_DE ld h,a \ ld l,b sla h \ jr c,$+3 \ ld l,c add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc add hl,hl \ jr c,$+3 \ add hl,bc rl l ld a,h adc a,b ld h,b ld l,a scf ret HL_Div_DE: add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a add hl,hl \ sbc hl,de \ adc a,a \ ret inc h normalizeln: xor a inc h \ ret m ld d,a \ ld e,a ld a,l jr z,toosmall inc e \ srl h \ rra \ jr nz,$-4 rla \ rl h dec e stepin: ld l,a push de call lognat pop de ;now multiply DE by 355, then divide by 2 (rounding) ld b,d \ ld c,e \ ld a,d ex de,hl add hl,hl add hl,hl ;4 add hl,bc ;5 add hl,hl ;10 add hl,bc ;11 add hl,hl ;22 add hl,hl add hl,hl add hl,hl add hl,bc add hl,hl add hl,bc sra h \ rr l adc hl,de scf ret toosmall: dec d dec e \ add a,a \ jr nc,$-2 inc h jp stepin
Now it averages about 1250 t-states, still faster than some implementations of division.
If you pass HL=768, in fixed point that is 3.0, so output is 282 which is 1.1015625 in fixed point.
Sine and Cosine
Unlike atan() and ln(), these have quickly converging Taylor series. From these, we can derive:
sin(x) : x-85x^3/512+x^5/128 is within 8 bits of accuracy on [-pi/2,pi/2] cos(x) : 1-x^2/2+5x^4/128 is within 8 bits of accuracy on [-pi/2,pi/2] for almost all values
For sine, noting that 85=17*5 will help with fast constant multiplication:
sine_88: ;Inputs: de push de sra d \ rr e ld b,d \ ld c,e call BC_Times_DE push hl ;x^2/4 sra h \ rr l ex de,hl ld b,d \ ld c,e call BC_Times_DE sra h \ rr l inc h ex (sp),hl ;x^4/128+1 is on stack, HL=x^2/4 xor a ld d,a ld b,h ld c,l add hl,hl \ rla add hl,hl \ rla add hl,bc \ adc a,d ld b,h ld c,l add hl,hl \ rla add hl,hl \ rla add hl,hl \ rla add hl,hl \ rla add hl,bc \ adc a,d ld e,l ld l,h ld h,a rl e adc hl,hl rl e jr nc,$+3 inc hl pop de ex hl,de or a sbc hl,de ex de,hl pop bc jp BC_Times_DE
Table-Based Fixed 8.8
When speed is priority, using table-based routines are often the fastest way to achieve a task.
Logarithms
Because we are working in binary, lg(), which is log,,2,,() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)-lg(2) = lg(x)-1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)-1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead--all of the odd numbers. And now we can build our algorithm:
lg_88: ;Input: HL is a fixed point number ;Output: lg(H.L)->H.L ;Speed: Avg: 340 ld de,lgLUT ld b,0 ld a,h or a ret m ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex) ld a,l jr z,$+8 inc b \ srl h \ rra \ jr nz,$-4 ;input >2, so we rotate, using b as the accumulator or a \ jr nz,$+6 ld hl,8000h \ ret ;If input was 0, use this as -inf rra \ inc b \ jr nc,$-2 ;input<1, so we rotate ;A is the element to look up in the LUT ld l,a ld c,h dec b add hl,hl add hl,de ld e,(hl) inc hl ld d,(hl) ex de,hl add hl,bc ret lglut: .dw $F800 .dw $F996 .dw $FA52 .dw $FACF .dw $FB2C .dw $FB76 .dw $FBB3 .dw $FBE8 .dw $FC16 .dw $FC3F .dw $FC64 .dw $FC86 .dw $FCA5 .dw $FCC1 .dw $FCDC .dw $FCF4 .dw $FD0B .dw $FD21 .dw $FD36 .dw $FD49 .dw $FD5C .dw $FD6D .dw $FD7E .dw $FD8E .dw $FD9D .dw $FDAC .dw $FDBA .dw $FDC8 .dw $FDD5 .dw $FDE2 .dw $FDEE .dw $FDFA .dw $FE06 .dw $FE11 .dw $FE1C .dw $FE26 .dw $FE31 .dw $FE3B .dw $FE44 .dw $FE4E .dw $FE57 .dw $FE60 .dw $FE69 .dw $FE71 .dw $FE7A .dw $FE82 .dw $FE8A .dw $FE92 .dw $FE9A .dw $FEA1 .dw $FEA9 .dw $FEB0 .dw $FEB7 .dw $FEBE .dw $FEC5 .dw $FECB .dw $FED2 .dw $FED8 .dw $FEDF .dw $FEE5 .dw $FEEB .dw $FEF1 .dw $FEF7 .dw $FEFD .dw $FF03 .dw $FF09 .dw $FF0E .dw $FF14 .dw $FF19 .dw $FF1E .dw $FF24 .dw $FF29 .dw $FF2E .dw $FF33 .dw $FF38 .dw $FF3D .dw $FF42 .dw $FF47 .dw $FF4B .dw $FF50 .dw $FF55 .dw $FF59 .dw $FF5E .dw $FF62 .dw $FF67 .dw $FF6B .dw $FF6F .dw $FF74 .dw $FF78 .dw $FF7C .dw $FF80 .dw $FF84 .dw $FF88 .dw $FF8C .dw $FF90 .dw $FF94 .dw $FF98 .dw $FF9B .dw $FF9F .dw $FFA3 .dw $FFA7 .dw $FFAA .dw $FFAE .dw $FFB2 .dw $FFB5 .dw $FFB9 .dw $FFBC .dw $FFC0 .dw $FFC3 .dw $FFC6 .dw $FFCA .dw $FFCD .dw $FFD0 .dw $FFD4 .dw $FFD7 .dw $FFDA .dw $FFDD .dw $FFE0 .dw $FFE4 .dw $FFE7 .dw $FFEA .dw $FFED .dw $FFF0 .dw $FFF3 .dw $FFF6 .dw $FFF9 .dw $FFFC .dw $FFFF
Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):
ln_88: ;Input: HL is a fixed point number ;Output: ln(H.L)->H.L ;Speed: Avg: 340+(325 worst case) call lg_88 ;now signed multiply HL by 355, then divide by 2 (rounding) ld de,0 bit 7,h jr z,$+9 dec e \ xor a \ sub l \ ld l,a sbc a,a \ sub h \ ld h,a ld b,h ld c,l xor a add hl,hl add hl,hl \ rla add hl,bc \ adc a,d add hl,hl \ rla add hl,bc \ adc a,d add hl,hl \ rla add hl,hl \ rla add hl,hl \ rla add hl,hl \ rla add hl,bc \ adc a,d add hl,hl \ rla add hl,bc \ adc a,d sra a \ rr h ld l,h ld h,a inc e ret nz xor a \ sub l \ ld l,a sbc a,a \ sub h \ ld h,a ret
Arctan
Arctan also can be computed on the whole real line just from the values on [0,1] by the identities: arctan(-x)=-arctan(x) and arctan(x)=pi/2-arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16-bit numbers, we can just use 8-bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:
arctan_88: ;Input: ; D.E ;Output: atan(D.E)->D.E push de ld a,d or a jp p,$+5 neg ld d,a dec a jr nz,checkneedinv inc e \ dec e \ jr nz,checkneedinv pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret checkneedinv: inc a call nz,DEgt1_Inv ;0.E is the value to atan ld hl,adjustatan push hl ld a,e cp 46 \ ret c dec a \ cp 42h \ ret c dec a \ cp 4Eh \ ret c dec a \ cp 57h \ ret c dec a \ cp 5Eh \ ret c dec a \ cp 64h \ ret c dec a \ cp 6Ah \ ret c dec a \ cp 6Fh \ ret c sub 6Fh \ ld e,a ld hl,atanlut add hl,de ld a,(hl) ret adjustatan: ld e,a pop bc ld a,b or a jp p,$+5 neg jr z,$+9 ld hl,402 or a sbc hl,de ex de,hl rl b ret nc xor a sub e ld e,a sbc a,a sub d ld d,a ret DEgt1_Inv: ;Works if DE>1 ld hl,256 ld b,8 InvLoop: add hl,hl sbc hl,de jr nc,$+3 add hl,de adc a,a djnz InvLoop cpl ld e,a ld d,b ret atanlut: .db $6F .db $6F .db $70 .db $71 .db $72 .db $73 .db $73 .db $74 .db $75 .db $76 .db $77 .db $77 .db $78 .db $79 .db $7A .db $7B .db $7B .db $7C .db $7D .db $7E .db $7F .db $7F .db $80 .db $81 .db $82 .db $82 .db $83 .db $84 .db $85 .db $85 .db $86 .db $87 .db $88 .db $88 .db $89 .db $8A .db $8B .db $8B .db $8C .db $8D .db $8E .db $8E .db $8F .db $90 .db $90 .db $91 .db $92 .db $93 .db $93 .db $94 .db $95 .db $95 .db $96 .db $97 .db $97 .db $98 .db $99 .db $9A .db $9A .db $9B .db $9C .db $9C .db $9D .db $9E .db $9E .db $9F .db $A0 .db $A0 .db $A1 .db $A2 .db $A2 .db $A3 .db $A3 .db $A4 .db $A5 .db $A5 .db $A6 .db $A7 .db $A7 .db $A8 .db $A9 .db $A9 .db $AA .db $AA .db $AB .db $AC .db $AC .db $AD .db $AD .db $AE .db $AF .db $AF .db $B0 .db $B0 .db $B1 .db $B2 .db $B2 .db $B3 .db $B3 .db $B4 .db $B5 .db $B5 .db $B6 .db $B6 .db $B7 .db $B7 .db $B8 .db $B9 .db $B9 .db $BA .db $BA .db $BB .db $BB .db $BC .db $BC .db $BD .db $BE .db $BE .db $BF .db $BF .db $C0 .db $C0 .db $C1 .db $C1 .db $C2 .db $C2 .db $C3 .db $C3 .db $C4 .db $C4 .db $C5 .db $C6 .db $C6 .db $C7 .db $C7 .db $C8 .db $C8 .db $C9
Random Numbers
16-bit Pseudo-Random Number Generator
The previous routine in this place was 1592cc and 280 bytes (compare to 291cc, 58 bytes). It worked well enough in terms of randomness, but there were some issues that merited a replacement. The following was tested and passed all tests at the CAcert Research Lab (see zrng2 by Zeda Thomas). The data that was output from this routine was from calling this routine to generate two bytes at a time to generate 16MiB of data.
Usage Proper use of this routine is to use the upper bits. The lower bits are not nearly as "random" as the upper bits. With this said, if you want to genreate a number on a range from [0,N), treat the output as a 0.16 fixed-point number, and multiply by N and take the integer part. In assembly, this translated to multiplying the output by N and taking the overflow (bits[16:]).
#ifndef smc ;\ #define smc ; |Remove or comment for Apps or if you *really* don't want SMC #end ;/ #ifndef smc ;\ #define seed1_0 8000h ; |If you are not using SMC, you'll need to define #define seed1_1 8002h ; |these vars yourself. #define seed2_0 8004h ; | #define seed2_1 8006h ; | #endif ;/ rand16: ;Inputs: ; seed1 ; seed2>0 ;Output: ; HL ;Destroys: ; A,BC,DE. Could all be used as 'less random' options, too. ;Notes: ; Tested and passes all CAcert Labs tests. ; Has a period of 18,446,744,069,414,584,320 (roughly 18.4 quintillion) ; This would take over 11.3 million years at 15MHz to finish a full period. ;291cc, +32cc if not using smc ;58 bytes, +2 if not using smc ;;lcg #ifdef smc seed1_0=$+1 ld hl,12345 seed1_1=$+1 ld de,6789 #else ld hl,(seed1_0) ld de,(seed1_1) #endif ld b,h ld c,l add hl,hl \ rl e \ rl d add hl,hl \ rl e \ rl d inc l add hl,bc ld (seed1_0),hl ld hl,(seed1_1) adc hl,de ld (seed1_1),hl ex de,hl ;;lfsr #ifdef smc seed2_0=$+1 ld hl,9876 seed2_1=$+1 ld bc,54321 #else ld hl,(seed2_0) ld bc,(seed2_1) #endif add hl,hl rl c rl b ld (seed2_1),bc sbc a,a and %11000101 xor l ld l,a ld (seed2_0),hl ex de,hl add hl,bc ret
Conclusion
For more info, see the TI-83 Asm in 28 Days, Day 15.
When multiplying by a constant, there are often more efficient ways than using these subprograms. See here for more information.
For ways to perform more accurate math (Floating Point), there are many prebuilt functions that operate on the OP operators. They can be found here]. We also have Floating Point routines that cover other formats that may be of more use in other situations.
Sources: ASM in 28 Days, Day 15 FP_Math