z80:Advanced Math

From Learn @ Cemetech
Revision as of 00:36, 24 April 2016 by Xeda112358 (talk | contribs) (→‎Exponentiation)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

Integer Math

Addition/Subtraction

To do addition or subtraction on an integer of arbitrary size, we use ADC.

Here we add two 32-bit integers:

   LD     HL, (word1)        ; Get least-significant word of word1
       LD     DE, (word2)        ; Get least-significant word of word2
       ADD    HL, DE              ; Add them
       LD     (result), HL        ; Store least-significant result
   
       LD     HL, (dword1 + 2)    ; Get most-significant word of word1
       LD     DE, (dword2 + 2)    ; Get most-significant word of word2
       ADC    HL, DE              ; Add them with the carry from the previous addition
       LD     (result + 2), HL    ; Store most-significant result
       RET
   
   word1:    .DB    $B3, $90, $12, $32    ; Each word is stored with the least-significant
   word2:    .DB    $F1, $15, $06, $B8    ; bytes first. You could just as easily have stored
   result:    .DB    $00, $00, $00, $00    ; them in big-endian, but because of how registers are
                                        ; loaded from RAM, it wouldn't work.

But now, what if you wanted to add truly arbitrary sized integers? We might do something like this:

   ld hl,Num1
        ld de,Num2
        ld bc,Size
   AddLoop:
   ;Inputs:
   ;     HL points to Num1 (stored little-endian)
   ;     DE points to Num2 (stored little-endian)
   ;     BC is the size of the integers
   ;Outputs:
   ;     BC is 0
   ;     HL points to the byte after Num1
   ;     DE points tot he byte after Num2
   ;     c flag is the carry
   ;     z flag is set
   ;     p/v flag reset
   ;     Num1 is overwritten with the result
          ld a,(de)
          adc a,(hl)
          ld (hl),a
          inc de
          cpi         ;increment HL, decrement BC, set parity flag of BC becomes 0
          jp pe,AddLoop
        ret
   Num1:
        .db $B3,$90,$12,$32,$47
   Num2:
        .db $F1,$15,$06,$B8,0

Similarly, one can perform arbitrary subtractions.

Multiplication

For multiplication, we could just add them in a DJNZ loop. But that is slow for multiplying big numbers. There is a much more efficient algorithm that you probably learned as a child in school that involves multiplying multi-digit numbers by hand. Similarly, we will do that here by using the binary representation of numbers and the astute reader will realise that binary digits are only 1 or 0 and multiplying by 1 or 0 is rather trivial!

To understand this algorithm, it is suggested that the reader try this by hand with some simple 4-bit numbers, then follow the algorithm or try to develop it in pseudocode.

Here is the first example:

   .module    DE_Times_A
   
   ;Note that we choose DE and A because of the readily available instructions that make this even more efficient.
   
   DE_Times_A:           ; HL = DE × A
       LD     HL, 0      ; Use HL to store the product
       LD     B, 8       ; Eight bits to check
   _loop:
       RRCA              ; Check least-significant bit of accumulator
       JR     NC, _skip  ; If zero, skip addition
       ADD    HL, DE
   _skip:
       SLA    E          ; Shift DE one bit left
       RL     D
   
   ;In binary, a shift left is multiplying by 2, just like a shift left in decimal is like multiplying by 10.
   ;We essentially added a 0 to the end of the number
   
       DJNZ   _loop
       RET

Or alternatively for a slightly faster and smaller code:

   .module    DE_Times_A
   DE_Times_A:      ; HL = DE × A
       ld hl,0      ; Use HL to store the product
       ld b,8       ; Eight bits to check
   _loop:
       add hl,hl
       rlca         ; Check most-significant bit of accumulator
       jr nc,_skip  ; If zero, skip addition
       add hl,de
   _skip:
       djnz _loop
       ret

And you might realise that multiplying an 8-bit number by a 16-bit number can yield up to a 24-bit number, so to take care of that overflow by turning rlca into adc a,a which is the same size and speed. Note that adc a,a can be replaced with rla for the same effect. Some overflow may occur from add hl,de, so we take care of that, too

   .module    DE_Times_A
   DE_Times_A:      ; AHL = DE × A
       ld hl,0      ; Use HL to store the product
       ld b,8       ; Eight bits to check
       ld c,h
   _loop:
       add hl,hl
       adc a,a      ; Check most-significant bit of accumulator
       jr nc,_skip  ; If zero, skip addition
       add hl,de
       adc a,c      ;add in overflow as needed.
   _skip:
       djnz _loop
       ret

Division

Division is also just like base-10 division as well, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (1) or not (0) ?"

Again, it is recommended that the reader try this with small examples to understand and possibly develop their own code.

   .module    Div_HL_D
   Div_HL_D:            ; HL = HL ÷ D, A = remainder
       XOR    A         ; Clear upper eight bits of AHL
       LD     B, 16     ; Sixteen bits in dividend
   _loop:
       ADD    HL, HL    ; Do a SLA HL. If the upper bit was 1, the c flag is set
       RLA              ; This moves the upper bits of the dividend into A
       JR     C, _overflow
       CP     D         ; Check if we can subtract the divisor
       JR     C, _skip  ; Carry means D > A
   _overflow:
       SUB    D         ; Do subtraction for real this time
       INC    L         ; Set the next bit of the quotient (currently bit 0)
   _skip:
       DJNZ   _loop
       RET

Fixed Point Math

Fixed Point refers to a number format that includes a fixed size integer part, and a fixed size fractional part. For example, 8.8 fixed point refers to a format where the upper 8 bits form the integer part, the lower 8 bits form the fractional part. Fixed point math often provides a desired balance between the numerical precision of floating point and the computing speed of integer maths.

To elaborate a little more, to represent 3.0 as a fixed point number stored in HL, we might have HL=0300, where the decimal point is between H and L (03.00). Then HL=0341 would be equivalent to 3+65/256=3.25390625. If we were to square this, we get HL=0A96 where we have lost some bits of precision.

Addition and Subtraction

As with integer arithmetic, addition and subtraction are trivial on the Z80. To add H.L+D.E, just use add hl,de, and for subtraction, 'or a \ sbc hl,de' will sufice.

Multiplication

H.L*D.E is essentially HL*DE/65536 If we view HL as a 16 bit integer, then H.L is essentially HL/256. For multiplication, the math for H.L*D.E is then: (HL/256)(DE/256) = (HL*DE)/65536. This means we can use a 16-bit signed multiplication routine that returns 32 bits and we now have a 16.16 fixed point result. To keep output as an 8.8 fixed point number, we need to grab the middle bytes around the decimal. If output was DE.HL, then E.H is a our 8.8 result. Overflow and underflow should be handled by the programmer depending on the task. note: Really, only a routine to return the lower 24-bits of the 16x16 multiplication is needed.

Allowing speed to take priority, here is a 16-bit unsigned integer multiplication:

   BC_Times_DE:
   ;  BC*DE->BHLA
   	ld a,b
   	ld hl,0
   	ld b,h
   	add a,a \ jr nc,$+5 \ ld h,d \ ld l,e
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,b
   	push hl
   	ld h,b
   	ld l,b
   	ld b,a
   	ld a,c
   	ld c,h
   	add a,a \ jr nc,$+5 \ ld h,d \ ld l,e
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
   	add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c    
   	pop de
   	ld c,a
   	ld a,l
   	ld l,h
   	ld h,c
   	add hl,de
   	ret nc
   	inc b
   	ret

Division

For division, we should notice that H.L/D.E evaluates as: (HL/256)/(DE/256)=HL/DE We could use a standard signed integer division routine, however, unlike with multiplication, the result can only be used a 16.0 integer. We need 8 more bits of accuracy for 16.8 output (of which we only use the lower 8-bits of the integer part). Since we are going to discard the upper 8 bits of the acquired result, we can optimise the first 8 iterations to not store the bits anywhere.

   DE_Div_BC_88:
   ;Inputs:
   ;     DE,BC are 8.8 Fixed Point numbers
   ;Outputs:
   ;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
       ld a,d
       xor b
       push af
       call m,negDE
       call div8sub
       pop af
       ret p
   negDE:
       xor a
       sub e
       ld e,a
       sbc a,a
       sub d
       ld d,a
       ret
   div8sub:
       ld a,8
       ld hl,0
   Loop1:
       rl d
       adc hl,hl
       sbc hl,bc
       jr nc,$+3
   	add hl,bc
       dec a
       jr nz,Loop1
       ld d,e
       ld e,a
       ld a,16
  	jp $+6
   DivLoop:
       add hl,bc
       dec a
       ret z
       sla e
       rl d
       adc hl,hl
   	jr c,overflow
       sbc hl,bc
       jr c,DivLoop
       inc e
       jp DivLoop+1
   overflow:
   	or a
   	sbc hl,bc
   	inc e
   	jp DivLoop

Inverse (1/x)

If we want to invert a number >1, we know that its result will be <1, so for 8.8 fixed point, this means there will only be 8 bits in the result. We can optimise out most of the cycles of the division, making it much faster. In the DE_Div_BC_88 routine, we initialise HL=0 and rotate a bit of DE into HL at each iteration. If DE=100h=1.0, then no bits will get rotated into HL until the 8th iteration. Then after that, HL=1, but we know BC>256, so we are guaranteed to have nothing happen for another 8 iterations until HL=256. At this point, we have only 8 more iterations to compute, and DE is 0, so we no longer need to even rotate DE into HL. So now at this point, we can just initialise HL=256, then rotate in a 0 each iteration. This can be done with add hl,hl, then we compare to BC to see if the remainder (here, HL). However, since we now only need two register pairs, lets use B for the iteration counter, and DE is the denominator. If we compare HL to DE, we can use sbc hl,de. If the C flag is set, it means HL<DE, so we need to rotate a 0 into our accumulator and then add DE to HL again. We could use sbc hl,de \ jr nc,$+3 \ add hl,de \ ccf \ rla, but if we note that we have a cpl instruction to invert the bits of a:

   DEgt1_Inv:
   ;1/D.E->B.C
   ; 470 to 518 t-states
   	ld hl,256
   	ld b,8
   InvLoop:
   	add hl,hl
   	sbc hl,de
   	jr nc,$+3
   	add hl,de
   	rla
   	djnz InvLoop
   	cpl
   	ld c,a
   	ret

And to unroll, saving 102 t-states:

   DEgt1_Inv:
   ;1/D.E->B.C
   ; 343 to 397 t-states
   	ld hl,512
   	ld b,l
   InvLoop:
   	or a       \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ adc a,a
   	cpl
   	ld c,a
   	ret

Note that this actually works as an HL_Div_DE_88 routine, by removing the ld hl,256 and making sure HL<DE.

Square Root

Using the integer square root of HL, sqrt(H.L) would be returned in 4.4 fixed point format. The upper 4 bits of the integer part will be 0 regardless, so we can essentially say that we have 8.4 Fixed Point, so we need to modify the algorithm to iterate 4 more times to get 8.8 fixed point output. The following is adapted from a speed-optimised routine used in a floating point library (which required 16 bits instead of 12):

   SqrtHL_prec12:
   ;input: HL
   ;Output: HL
   ;183 bytes
   	xor a
   	ld b,a
   
   	ld e,l
   	ld l,h
   	ld h,a
   
   	add hl,hl
   	add hl,hl
   	cp h
   	jr nc,$+5
   	dec h
   	ld a,4
   
   	add hl,hl
   	add hl,hl
   	ld c,a
   	sub h
   	jr nc,$+6
   	cpl
   	ld h,a
   	inc c
   	inc c
   
   	ld a,c
   	add hl,hl
   	add hl,hl
   	add a,a
   	ld c,a
   	sub h
   	jr nc,$+6
   	cpl
   	ld h,a
   	inc c
   	inc c
   
   	ld a,c
   	add hl,hl
   	add hl,hl
   	add a,a
   	ld c,a
   	sub h
   	jr nc,$+6
   	cpl
   	ld h,a
   	inc c
   	inc c
   
   	ld a,c
   	ld l,e
   
   	add hl,hl
   	add hl,hl
   	add a,a
   	ld c,a
   	sub h
   	jr nc,$+6
   	cpl
   	ld h,a
   	inc c
   	inc c
   
   	ld a,c
   	add hl,hl
   	add hl,hl
   	add a,a
   	ld c,a
   	sub h
   	jr nc,$+6
   	cpl
   	ld h,a
   	inc c
   	inc c
   
   	ld a,c
   	add a,a \ ld c,a
   	add hl,hl
   	add hl,hl
   	jr nc,$+6
   	sub h \ jp $+6
   	sub h
   	jr nc,$+6
   	inc c \ inc c
   	cpl
   	ld h,a
   
   
   	ld a,l
   	ld l,h
   	add a,a
   	ld h,a
   	adc hl,hl
   	adc hl,hl
   	sll c \ rl b
   	sbc hl,bc
   	jr nc,$+3
   	add hl,bc
   	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
   
   ;iteration 9
   	add hl,hl \ add hl,hl
   	sll c \ rl b
   	sbc hl,bc
   	jr nc,$+3
   	add hl,bc
   	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
   
   	add hl,hl \ add hl,hl
   	sll c \ rl b
   	sbc hl,bc
   	jr nc,$+3
   	add hl,bc
   	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
   
   	add hl,hl \ add hl,hl
   	sll c \ rl b
   	sbc hl,bc
   	jr nc,$+3
   	add hl,bc
   	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
   
   	add hl,hl \ add hl,hl
   	sll c \ rl b
   	sbc hl,bc
   	jr nc,$+3
   	add hl,bc
   	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
   ;12th iteration completed
   ; output in BC
       srl b \ rr c
       ld h,b
       ld l,c
       ret

Exponentiation and Logarithms

Exponentiation

Here, exponentiation refers to <math>2^{x}</math>. Observe that <math>2^{39/16}=2^{2+7/16} = 2^{2+0/2+1/4+1/8+1/16}=2^{2}2^{0/2}2^{1/4}2^{1/8}2^{1/16}</math> The observant reader will see that taking away the trivial integer part of the power, the fractional part can be represented as a sequence of square roots with 2s wherever there are 1s in the binary expansion. So to compute <math>2^{x}</math> where x is on [0,1), in pseudo code we have:


   If x==0
     return 1.0
   repeat until carry is set
     x>>=1    ;shift X right
   y=sqrt(2)
   while x>0
     x>>=1
     if (carry is set)
       y*=2
     y=sqrt(y)

Converting that to assembly:

   ;Inputs:
   ;     HL is the 8.8 fixed point number 'x' for 2^x
   ;Outputs:
   ;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.
        ld a,l
        or a
        push hl     ;save H for later, H is the integer part of the power
        ld hl,1
        jr z,integerpow
        scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter.
   ;wait until we come across the lowest bit. Also note that we 
        rra
        jr nc,$-1
        ld hl,2*256
   powloop:
        push af
        call FPSqrtHL    ;returns in HL
        pop af
        srl a
        jr z,integerpow
        jr nc,powloop
        add hl,hl
        jp powloop
   integerpow:
        pop bc
   ;Now b is the integer part for 2^x that we need to multiply HL by.
        ld de,0
        ld a,b
        or a
        ret z
   
        add hl,hl
        rl e \ rl d \ jr c,wayoverflow
        djnz $-7
        ret
   wayoverflow:
        ld hl,-1
        ld d,h
        ld e,l
        ret


Note that in the spirit of optimisation, that we can significantly speed up the fixed point sqrtHL routine since HL is never greater than 4.0.

Logarithms

Here, logarithm refers to log,,2,, also written as lg(). Using logarithms will allow us to extend the exponential routine to other bases, such as 3^n without much difficulty. Here is a fast, fixed point log_2 routine, optimised for size:

   Log_2_88_size:
   ;Inputs:
   ;     HL is an unsigned 8.8 fixed point number.
   ;Outputs:
   ;     HL is the signed 8.8 fixed point value of log base 2 of the input.
   ;Example:
   ;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...)
   ;averages about 39 t-states slower than original
   ;62 bytes
        ex de,hl
        ld hl,0
        ld a,d
        ld c,8
        or a
        jr z,DE_lessthan_1
        srl d
        jr z,logloop-1
        inc l
        rr e
        jr $-7
   DE_lessthan_1:
        ld a,e
        dec hl
        or a
        ret z
        inc l
        dec l
        add a,a
        jr nc,$-2
        ld e,a
   
        inc d
   logloop:
        add hl,hl
        push hl
        ld h,d
        ld l,e
        ld a,e
        ld b,8
   
        add hl,hl
        rla
        jr nc,$+5
          add hl,de
          adc a,0
        djnz $-7
   
        ld e,h
        ld d,a
        pop hl
        rr a         ;this is NOT supposed to be rra, we need the z flag affected
        jr z,$+7
          srl d
          rr e
          inc l
        dec c
        jr nz,logloop
        ret

Arctangent and Natural Logarithm

Although these are not the fastest methods, the approach to the problem of computing these functions efficiently is interesting. As well, they can work faster than CORDIC or AGM methods. Given efficient routines for computing multiplication and division, the following are accurate to better than 8 bits of accuracy:

   atan(x) : x(9+2x^2)/(9+5x^2) on [-1,1]
   ln(x+1) : x(8+x)/(8+5x) on [0,1]

For a more complete analysis of how these formulas were derived, see this document. What is convenient to note is that the constant multiplications can be reduced to a few shift and add instuctions. So to put the functions into code:

   ArcTan_88:
   ;Inputs: DE is positive
       ld a,d
       or a
       jr z,ArcTan+2
       inc a
       jr z,ArcTan+2
       ld b,d
       ld c,e
       ld de,256
       call DE_Div_BC_88
       call ArcTan
       ld de,402     ;about pi/2
       ex de,hl
       or a
       sbc hl,de
       ret
   ArcTan:
   ;Input: DE
   ;Output: HL
       push de    ;save X for later
       ld b,d
       ld c,e
       call BC_Times_DE ;x^2 computed
       ld d,h
       ld e,l
       add hl,hl
       add hl,hl
       add hl,de   ;5x^2 is now in HL
       ld a,9
       add a,h
       ld h,a      ;9+5x^2->HL
       ld a,9
       sla e
       rl d
       add a,d
       ld d,a      ;9+2x^2->DE
       ld b,h
       ld c,l
       call DE_Div_BC_88   ;->DE     
       pop bc
       jp BC_Times_DE

For the following ln() routine, the approximating function is employed taking advantage of lots of optimisations, putting it between the computing speed of multiplication and division:

   lognat:
   ;Input:  H.L needs to be on [1,2]
   ;Output: H.L if z flag is set, else if nz, no result
   ;avg speed: 677 t-states
   	dec h
      	dec h
   	jr nz,$+8
   	inc l \ dec l
   	ret nz
   	ld l,177
   	ret
   	inc h
   	ret nz
   	ld b,h
   	ld c,l
   	ld e,l
   	ld d,8
   	add hl,hl
   	add hl,hl
   	add hl,de
   	ex de,hl
   
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ adc a,a
   
   	ld h,a \ ld l,b
   	sla h \ jr c,$+3 \ ld l,c
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	rl l
   	ld a,h
   	adc a,b
   	ld l,a
   	ld h,b
   	cp a
   	ret

But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):

   lognat:
   ;Input:  H.L needs to be on (0,128.0)
   ;Output: H.L if c flag set
   ;	 returns nc if input is negative (HL not modified)
   ;Error:
   ;	The error on the outputs is as follows:
   ;		20592 inputs are exact
   ;		12075 inputs are off by 1/256
   ;		100 inputs are off by 2/256
   ;	So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
   ;Size: 177 bytes
   ;Speed: average speed is less than 1250 t-states
   
   	ld a,h \ or l \ jr nz,$+5
   	ld h,80h \ ret
   	dec h
   	dec h
   	jr nz,$+9
   	inc l \ dec l
   	jr nz,normalizeln-1
   	ld l,177
   	ret
   	inc h
   	jr nz,normalizeln
   	ld b,h
   	ld c,l
   	ld e,l
   	ld d,8
   	add hl,hl
   	add hl,hl
   	add hl,de
   	ex de,hl
   	call HL_Div_DE
   	ld h,a \ ld l,b
   	sla h \ jr c,$+3 \ ld l,c
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	add hl,hl \ jr c,$+3 \ add hl,bc
   	rl l
   	ld a,h
   	adc a,b
   	ld h,b
   	ld l,a
   	scf
   	ret
   HL_Div_DE:
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
   	add hl,hl \ sbc hl,de \ adc a,a \ ret
   
   	inc h
   normalizeln:
   	xor a
   	inc h \ ret m
   	ld d,a \ ld e,a
   	ld a,l
   	jr z,toosmall
   	inc e \ srl h \ rra \ jr nz,$-4
   	rla \ rl h
   	dec e
   stepin:
   	ld l,a
   	push de
   	call lognat
   	pop de
   	;now multiply DE by 355, then divide by 2 (rounding)
   	ld b,d \ ld c,e \ ld a,d
   	ex de,hl
   	add hl,hl
   	add hl,hl	;4
   	add hl,bc	;5
   	add hl,hl	;10
   	add hl,bc	;11
   	add hl,hl	;22
   	add hl,hl
   	add hl,hl
   	add hl,hl
   	add hl,bc
   	add hl,hl
   	add hl,bc
   	sra h \ rr l
   	adc hl,de
   	scf
   	ret
   toosmall:
   	dec d
   	dec e \ add a,a \ jr nc,$-2
   	inc h
   	jp stepin

Now it averages about 1250 t-states, still faster than some implementations of division.

If you pass HL=768, in fixed point that is 3.0, so output is 282 which is 1.1015625 in fixed point.

Sine and Cosine

Unlike atan() and ln(), these have quickly converging Taylor series. From these, we can derive:

   sin(x) : x-85x^3/512+x^5/128 is within 8 bits of accuracy on [-pi/2,pi/2]
   cos(x) : 1-x^2/2+5x^4/128 is within 8 bits of accuracy on [-pi/2,pi/2] for almost all values

For sine, noting that 85=17*5 will help with fast constant multiplication:

   sine_88:
   ;Inputs: de
       push de
       sra d \ rr e
       ld b,d \ ld c,e
       call BC_Times_DE
       push hl     ;x^2/4
       sra h \ rr l
       ex de,hl
       ld b,d \ ld c,e
       call BC_Times_DE
       sra h \ rr l
       inc h
       ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4
       xor a
       ld d,a
       ld b,h
       ld c,l
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,bc \ adc a,d
       ld b,h
       ld c,l
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,bc \ adc a,d
       ld e,l
       ld l,h
       ld h,a
       rl e
       adc hl,hl
       rl e
       jr nc,$+3
       inc hl
   
       pop de
       ex hl,de
       or a
       sbc hl,de
       ex de,hl
       pop bc
       jp BC_Times_DE

Table-Based Fixed 8.8

When speed is priority, using table-based routines are often the fastest way to achieve a task.

Logarithms

Because we are working in binary, lg(), which is log,,2,,() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)-lg(2) = lg(x)-1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)-1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead--all of the odd numbers. And now we can build our algorithm:

   lg_88:
   ;Input: HL is a fixed point number
   ;Output: lg(H.L)->H.L
   ;Speed: Avg: 340
    ld de,lgLUT
    ld b,0
    ld a,h
    or a
    ret m     ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex)
    ld a,l
    jr z,$+8
    inc b \ srl h \ rra \ jr nz,$-4   ;input >2, so we rotate, using b as the accumulator
    or a \ jr nz,$+6
    ld hl,8000h \ ret     ;If input was 0, use this as -inf
    rra \ inc b \ jr nc,$-2      ;input<1, so we rotate
   ;A is the element to look up in the LUT
    ld l,a
    ld c,h
    dec b
    add hl,hl
    add hl,de
    ld e,(hl)
    inc hl
    ld d,(hl)
    ex de,hl
    add hl,bc
    ret
   lglut:
   .dw $F800
   .dw $F996
   .dw $FA52
   .dw $FACF
   .dw $FB2C
   .dw $FB76
   .dw $FBB3
   .dw $FBE8
   .dw $FC16
   .dw $FC3F
   .dw $FC64
   .dw $FC86
   .dw $FCA5
   .dw $FCC1
   .dw $FCDC
   .dw $FCF4
   .dw $FD0B
   .dw $FD21
   .dw $FD36
   .dw $FD49
   .dw $FD5C
   .dw $FD6D
   .dw $FD7E
   .dw $FD8E
   .dw $FD9D
   .dw $FDAC
   .dw $FDBA
   .dw $FDC8
   .dw $FDD5
   .dw $FDE2
   .dw $FDEE
   .dw $FDFA
   .dw $FE06
   .dw $FE11
   .dw $FE1C
   .dw $FE26
   .dw $FE31
   .dw $FE3B
   .dw $FE44
   .dw $FE4E
   .dw $FE57
   .dw $FE60
   .dw $FE69
   .dw $FE71
   .dw $FE7A
   .dw $FE82
   .dw $FE8A
   .dw $FE92
   .dw $FE9A
   .dw $FEA1
   .dw $FEA9
   .dw $FEB0
   .dw $FEB7
   .dw $FEBE
   .dw $FEC5
   .dw $FECB
   .dw $FED2
   .dw $FED8
   .dw $FEDF
   .dw $FEE5
   .dw $FEEB
   .dw $FEF1
   .dw $FEF7
   .dw $FEFD
   .dw $FF03
   .dw $FF09
   .dw $FF0E
   .dw $FF14
   .dw $FF19
   .dw $FF1E
   .dw $FF24
   .dw $FF29
   .dw $FF2E
   .dw $FF33
   .dw $FF38
   .dw $FF3D
   .dw $FF42
   .dw $FF47
   .dw $FF4B
   .dw $FF50
   .dw $FF55
   .dw $FF59
   .dw $FF5E
   .dw $FF62
   .dw $FF67
   .dw $FF6B
   .dw $FF6F
   .dw $FF74
   .dw $FF78
   .dw $FF7C
   .dw $FF80
   .dw $FF84
   .dw $FF88
   .dw $FF8C
   .dw $FF90
   .dw $FF94
   .dw $FF98
   .dw $FF9B
   .dw $FF9F
   .dw $FFA3
   .dw $FFA7
   .dw $FFAA
   .dw $FFAE
   .dw $FFB2
   .dw $FFB5
   .dw $FFB9
   .dw $FFBC
   .dw $FFC0
   .dw $FFC3
   .dw $FFC6
   .dw $FFCA
   .dw $FFCD
   .dw $FFD0
   .dw $FFD4
   .dw $FFD7
   .dw $FFDA
   .dw $FFDD
   .dw $FFE0
   .dw $FFE4
   .dw $FFE7
   .dw $FFEA
   .dw $FFED
   .dw $FFF0
   .dw $FFF3
   .dw $FFF6
   .dw $FFF9
   .dw $FFFC
   .dw $FFFF

Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):

   ln_88:
   ;Input: HL is a fixed point number
   ;Output: ln(H.L)->H.L
   ;Speed: Avg: 340+(325 worst case)
    call lg_88
    ;now signed multiply HL by 355, then divide by 2 (rounding)
       ld de,0
       bit 7,h
       jr z,$+9
       dec e \ xor a \ sub l \ ld l,a
       sbc a,a \ sub h \ ld h,a
       ld b,h
       ld c,l
       xor a
       add hl,hl
       add hl,hl \ rla
       add hl,bc \ adc a,d
       add hl,hl \ rla
       add hl,bc \ adc a,d
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,hl \ rla
       add hl,bc \ adc a,d
       add hl,hl \ rla
       add hl,bc \ adc a,d
       sra a \ rr h
       ld l,h
       ld h,a
       inc e
       ret nz
       xor a \ sub l \ ld l,a
       sbc a,a \ sub h \ ld h,a
       ret

Arctan

Arctan also can be computed on the whole real line just from the values on [0,1] by the identities: arctan(-x)=-arctan(x) and arctan(x)=pi/2-arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16-bit numbers, we can just use 8-bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:

   arctan_88:
   ;Input:
   ;	D.E
   ;Output: atan(D.E)->D.E
   	push de
   	ld a,d
   	or a
   	jp p,$+5
   	neg
   	ld d,a
   	dec a
   	jr nz,checkneedinv
   	inc e \ dec e \ jr nz,checkneedinv
   	pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
   checkneedinv:
   	inc a
   	call nz,DEgt1_Inv
   ;0.E is the value to atan
   	ld hl,adjustatan
   	push hl
   	ld a,e
   	cp 46 \ ret c
   	dec a \ cp 42h \ ret c
   	dec a \ cp 4Eh \ ret c
   	dec a \ cp 57h \ ret c
   	dec a \ cp 5Eh \ ret c
   	dec a \ cp 64h \ ret c
   	dec a \ cp 6Ah \ ret c
   	dec a \ cp 6Fh \ ret c
   	sub 6Fh \ ld e,a
   	ld hl,atanlut
   	add hl,de
   	ld a,(hl)
   	ret
   adjustatan:
   	ld e,a
   	pop bc
   	ld a,b
   	or a
   	jp p,$+5
   	neg
   	jr z,$+9
   	ld hl,402
   	or a
   	sbc hl,de
   	ex de,hl
   	rl b
   	ret nc
   	xor a
   	sub e
   	ld e,a
   	sbc a,a
   	sub d
   	ld d,a
   	ret
   DEgt1_Inv:
   ;Works if DE>1
   	ld hl,256
   	ld b,8
   InvLoop:
   	add hl,hl
   	sbc hl,de
   	jr nc,$+3
   	add hl,de
   	adc a,a
   	djnz InvLoop
       cpl
   	ld e,a
       ld d,b
       ret
   atanlut:
   .db $6F
   .db $6F
   .db $70
   .db $71
   .db $72
   .db $73
   .db $73
   .db $74
   .db $75
   .db $76
   .db $77
   .db $77
   .db $78
   .db $79
   .db $7A
   .db $7B
   .db $7B
   .db $7C
   .db $7D
   .db $7E
   .db $7F
   .db $7F
   .db $80
   .db $81
   .db $82
   .db $82
   .db $83
   .db $84
   .db $85
   .db $85
   .db $86
   .db $87
   .db $88
   .db $88
   .db $89
   .db $8A
   .db $8B
   .db $8B
   .db $8C
   .db $8D
   .db $8E
   .db $8E
   .db $8F
   .db $90
   .db $90
   .db $91
   .db $92
   .db $93
   .db $93
   .db $94
   .db $95
   .db $95
   .db $96
   .db $97
   .db $97
   .db $98
   .db $99
   .db $9A
   .db $9A
   .db $9B
   .db $9C
   .db $9C
   .db $9D
   .db $9E
   .db $9E
   .db $9F
   .db $A0
   .db $A0
   .db $A1
   .db $A2
   .db $A2
   .db $A3
   .db $A3
   .db $A4
   .db $A5
   .db $A5
   .db $A6
   .db $A7
   .db $A7
   .db $A8
   .db $A9
   .db $A9
   .db $AA
   .db $AA
   .db $AB
   .db $AC
   .db $AC
   .db $AD
   .db $AD
   .db $AE
   .db $AF
   .db $AF
   .db $B0
   .db $B0
   .db $B1
   .db $B2
   .db $B2
   .db $B3
   .db $B3
   .db $B4
   .db $B5
   .db $B5
   .db $B6
   .db $B6
   .db $B7
   .db $B7
   .db $B8
   .db $B9
   .db $B9
   .db $BA
   .db $BA
   .db $BB
   .db $BB
   .db $BC
   .db $BC
   .db $BD
   .db $BE
   .db $BE
   .db $BF
   .db $BF
   .db $C0
   .db $C0
   .db $C1
   .db $C1
   .db $C2
   .db $C2
   .db $C3
   .db $C3
   .db $C4
   .db $C4
   .db $C5
   .db $C6
   .db $C6
   .db $C7
   .db $C7
   .db $C8
   .db $C8
   .db $C9


Random Numbers

16-bit Pseudo-Random Number Generator

The previous routine in this place was 1592cc and 280 bytes (compare to 291cc, 58 bytes). It worked well enough in terms of randomness, but there were some issues that merited a replacement. The following was tested and passed all tests at the CAcert Research Lab (see zrng2 by Zeda Thomas). The data that was output from this routine was from calling this routine to generate two bytes at a time to generate 16MiB of data.

Usage Proper use of this routine is to use the upper bits. The lower bits are not nearly as "random" as the upper bits. With this said, if you want to genreate a number on a range from [0,N), treat the output as a 0.16 fixed-point number, and multiply by N and take the integer part. In assembly, this translated to multiplying the output by N and taking the overflow (bits[16:]).

   #ifndef smc     ;\
   #define smc     ; |Remove or comment for Apps or if you *really* don't want SMC
   #end            ;/
   #ifndef smc             ;\
   #define seed1_0 8000h   ; |If you are not using SMC, you'll need to define
   #define seed1_1 8002h   ; |these vars yourself.
   #define seed2_0 8004h   ; |
   #define seed2_1 8006h   ; |
   #endif                  ;/
   rand16:
   ;Inputs:
   ;   seed1
   ;   seed2>0
   ;Output:
   ;   HL
   ;Destroys:
   ;   A,BC,DE. Could all be used as 'less random' options, too.
   ;Notes:
   ;   Tested and passes all CAcert Labs tests.
   ;   Has a period of 18,446,744,069,414,584,320 (roughly 18.4 quintillion)
   ;   This would take over 11.3 million years at 15MHz to finish a full period.
   ;291cc, +32cc if not using smc
   ;58 bytes, +2 if not using smc
   ;;lcg
   #ifdef smc
   seed1_0=$+1
       ld hl,12345
   seed1_1=$+1
       ld de,6789
   #else
       ld hl,(seed1_0)
       ld de,(seed1_1)
   #endif
       ld b,h
       ld c,l
       add hl,hl \ rl e \ rl d
       add hl,hl \ rl e \ rl d
       inc l
       add hl,bc
       ld (seed1_0),hl
       ld hl,(seed1_1)
       adc hl,de
       ld (seed1_1),hl
       ex de,hl
   ;;lfsr
   #ifdef smc
   seed2_0=$+1
       ld hl,9876
   seed2_1=$+1
       ld bc,54321
   #else
       ld hl,(seed2_0)
       ld bc,(seed2_1)
   #endif
       add hl,hl
       rl c
       rl b
       ld (seed2_1),bc
       sbc a,a
       and %11000101
       xor l
       ld l,a
       ld (seed2_0),hl
       ex de,hl
       add hl,bc
       ret

Conclusion

For more info, see the TI-83 Asm in 28 Days, Day 15.

When multiplying by a constant, there are often more efficient ways than using these subprograms. See here for more information.


For ways to perform more accurate math (Floating Point), there are many prebuilt functions that operate on the OP operators. They can be found here]. We also have Floating Point routines that cover other formats that may be of more use in other situations.

Sources: ASM in 28 Days, Day 15 FP_Math