# Integer Math

To do addition or subtraction on an integer of arbitrary size, we use ADC.

Here we add two 32-bit integers:

```   LD     HL, (word1)        ; Get least-significant word of word1
LD     DE, (word2)        ; Get least-significant word of word2
LD     (result), HL        ; Store least-significant result

LD     HL, (dword1 + 2)    ; Get most-significant word of word1
LD     DE, (dword2 + 2)    ; Get most-significant word of word2
LD     (result + 2), HL    ; Store most-significant result
RET

word1:    .DB    \$B3, \$90, \$12, \$32    ; Each word is stored with the least-significant
word2:    .DB    \$F1, \$15, \$06, \$B8    ; bytes first. You could just as easily have stored
result:    .DB    \$00, \$00, \$00, \$00    ; them in big-endian, but because of how registers are
; loaded from RAM, it wouldn't work.
```

But now, what if you wanted to add truly arbitrary sized integers? We might do something like this:

```   ld hl,Num1
ld de,Num2
ld bc,Size
;Inputs:
;     HL points to Num1 (stored little-endian)
;     DE points to Num2 (stored little-endian)
;     BC is the size of the integers
;Outputs:
;     BC is 0
;     HL points to the byte after Num1
;     DE points tot he byte after Num2
;     c flag is the carry
;     z flag is set
;     p/v flag reset
;     Num1 is overwritten with the result
ld a,(de)
ld (hl),a
inc de
cpi         ;increment HL, decrement BC, set parity flag of BC becomes 0
ret
Num1:
.db \$B3,\$90,\$12,\$32,\$47
Num2:
.db \$F1,\$15,\$06,\$B8,0
```

Similarly, one can perform arbitrary subtractions.

## Multiplication

For multiplication, we could just add them in a DJNZ loop. But that is slow for multiplying big numbers. There is a much more efficient algorithm that you probably learned as a child in school that involves multiplying multi-digit numbers by hand. Similarly, we will do that here by using the binary representation of numbers and the astute reader will realise that binary digits are only 1 or 0 and multiplying by 1 or 0 is rather trivial!

To understand this algorithm, it is suggested that the reader try this by hand with some simple 4-bit numbers, then follow the algorithm or try to develop it in pseudocode.

Here is the first example:

```   .module    DE_Times_A

;Note that we choose DE and A because of the readily available instructions that make this even more efficient.

DE_Times_A:           ; HL = DE × A
LD     HL, 0      ; Use HL to store the product
LD     B, 8       ; Eight bits to check
_loop:
RRCA              ; Check least-significant bit of accumulator
JR     NC, _skip  ; If zero, skip addition
_skip:
SLA    E          ; Shift DE one bit left
RL     D

;In binary, a shift left is multiplying by 2, just like a shift left in decimal is like multiplying by 10.
;We essentially added a 0 to the end of the number

DJNZ   _loop
RET
```

Or alternatively for a slightly faster and smaller code:

```   .module    DE_Times_A
DE_Times_A:      ; HL = DE × A
ld hl,0      ; Use HL to store the product
ld b,8       ; Eight bits to check
_loop:
rlca         ; Check most-significant bit of accumulator
jr nc,_skip  ; If zero, skip addition
_skip:
djnz _loop
ret
```

And you might realise that multiplying an 8-bit number by a 16-bit number can yield up to a 24-bit number, so to take care of that overflow by turning rlca into adc a,a which is the same size and speed. Note that adc a,a can be replaced with rla for the same effect. Some overflow may occur from add hl,de, so we take care of that, too

```   .module    DE_Times_A
DE_Times_A:      ; AHL = DE × A
ld hl,0      ; Use HL to store the product
ld b,8       ; Eight bits to check
ld c,h
_loop:
adc a,a      ; Check most-significant bit of accumulator
jr nc,_skip  ; If zero, skip addition
_skip:
djnz _loop
ret
```

## Division

Division is also just like base-10 division as well, but instead of checking if the next digit is from 0 to 9, we only need to check if it is 0 or 1 which amounts to the question of "is our remainder smaller than the divisor (1) or not (0) ?"

Again, it is recommended that the reader try this with small examples to understand and possibly develop their own code.

```   .module    Div_HL_D
Div_HL_D:            ; HL = HL ÷ D, A = remainder
XOR    A         ; Clear upper eight bits of AHL
LD     B, 16     ; Sixteen bits in dividend
_loop:
ADD    HL, HL    ; Do a SLA HL. If the upper bit was 1, the c flag is set
RLA              ; This moves the upper bits of the dividend into A
JR     C, _overflow
CP     D         ; Check if we can subtract the divisor
JR     C, _skip  ; Carry means D > A
_overflow:
SUB    D         ; Do subtraction for real this time
INC    L         ; Set the next bit of the quotient (currently bit 0)
_skip:
DJNZ   _loop
RET
```

# Fixed Point Math

Fixed Point refers to a number format that includes a fixed size integer part, and a fixed size fractional part. For example, 8.8 fixed point refers to a format where the upper 8 bits form the integer part, the lower 8 bits form the fractional part. Fixed point math often provides a desired balance between the numerical precision of floating point and the computing speed of integer maths.

To elaborate a little more, to represent 3.0 as a fixed point number stored in HL, we might have HL=0300, where the decimal point is between H and L (03.00). Then HL=0341 would be equivalent to 3+65/256=3.25390625. If we were to square this, we get HL=0A96 where we have lost some bits of precision.

As with integer arithmetic, addition and subtraction are trivial on the Z80. To add H.L+D.E, just use add hl,de, and for subtraction, 'or a \ sbc hl,de' will sufice.

## Multiplication

H.L*D.E is essentially HL*DE/65536 If we view HL as a 16 bit integer, then H.L is essentially HL/256. For multiplication, the math for H.L*D.E is then: (HL/256)(DE/256) = (HL*DE)/65536. This means we can use a 16-bit signed multiplication routine that returns 32 bits and we now have a 16.16 fixed point result. To keep output as an 8.8 fixed point number, we need to grab the middle bytes around the decimal. If output was DE.HL, then E.H is a our 8.8 result. Overflow and underflow should be handled by the programmer depending on the task. note: Really, only a routine to return the lower 24-bits of the 16x16 multiplication is needed.

Allowing speed to take priority, here is a 16-bit unsigned integer multiplication:

```   BC_Times_DE:
;  BC*DE->BHLA
ld a,b
ld hl,0
ld b,h
add a,a \ jr nc,\$+5 \ ld h,d \ ld l,e
push hl
ld h,b
ld l,b
ld b,a
ld a,c
ld c,h
add a,a \ jr nc,\$+5 \ ld h,d \ ld l,e
pop de
ld c,a
ld a,l
ld l,h
ld h,c
ret nc
inc b
ret
```

## Division

For division, we should notice that H.L/D.E evaluates as: (HL/256)/(DE/256)=HL/DE We could use a standard signed integer division routine, however, unlike with multiplication, the result can only be used a 16.0 integer. We need 8 more bits of accuracy for 16.8 output (of which we only use the lower 8-bits of the integer part). Since we are going to discard the upper 8 bits of the acquired result, we can optimise the first 8 iterations to not store the bits anywhere.

```   DE_Div_BC_88:
;Inputs:
;     DE,BC are 8.8 Fixed Point numbers
;Outputs:
;     DE is the 8.8 Fixed Point result (rounded to the least significant bit)
ld a,d
xor b
push af
call m,negDE
call div8sub
pop af
ret p
negDE:
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret
div8sub:
ld a,8
ld hl,0
Loop1:
rl d
sbc hl,bc
jr nc,\$+3
dec a
jr nz,Loop1
ld d,e
ld e,a
ld a,16
jp \$+6
DivLoop:
dec a
ret z
sla e
rl d
jr c,overflow
sbc hl,bc
jr c,DivLoop
inc e
jp DivLoop+1
overflow:
or a
sbc hl,bc
inc e
jp DivLoop
```

### Inverse (1/x)

If we want to invert a number >1, we know that its result will be <1, so for 8.8 fixed point, this means there will only be 8 bits in the result. We can optimise out most of the cycles of the division, making it much faster. In the DE_Div_BC_88 routine, we initialise HL=0 and rotate a bit of DE into HL at each iteration. If DE=100h=1.0, then no bits will get rotated into HL until the 8th iteration. Then after that, HL=1, but we know BC>256, so we are guaranteed to have nothing happen for another 8 iterations until HL=256. At this point, we have only 8 more iterations to compute, and DE is 0, so we no longer need to even rotate DE into HL. So now at this point, we can just initialise HL=256, then rotate in a 0 each iteration. This can be done with add hl,hl, then we compare to BC to see if the remainder (here, HL). However, since we now only need two register pairs, lets use B for the iteration counter, and DE is the denominator. If we compare HL to DE, we can use sbc hl,de. If the C flag is set, it means HL<DE, so we need to rotate a 0 into our accumulator and then add DE to HL again. We could use sbc hl,de \ jr nc,\$+3 \ add hl,de \ ccf \ rla, but if we note that we have a cpl instruction to invert the bits of a:

```   DEgt1_Inv:
;1/D.E->B.C
; 470 to 518 t-states
ld hl,256
ld b,8
InvLoop:
sbc hl,de
jr nc,\$+3
rla
djnz InvLoop
cpl
ld c,a
ret
```

And to unroll, saving 102 t-states:

```   DEgt1_Inv:
;1/D.E->B.C
; 343 to 397 t-states
ld hl,512
ld b,l
InvLoop:
or a       \ sbc hl,de \ jr nc,\$+3 \ add hl,de \ adc a,a
cpl
ld c,a
ret
```

Note that this actually works as an HL_Div_DE_88 routine, by removing the ld hl,256 and making sure HL<DE.

## Square Root

Using the integer square root of HL, sqrt(H.L) would be returned in 4.4 fixed point format. The upper 4 bits of the integer part will be 0 regardless, so we can essentially say that we have 8.4 Fixed Point, so we need to modify the algorithm to iterate 4 more times to get 8.8 fixed point output. The following is adapted from a speed-optimised routine used in a floating point library (which required 16 bits instead of 12):

```   SqrtHL_prec12:
;input: HL
;Output: HL
;183 bytes
xor a
ld b,a

ld e,l
ld l,h
ld h,a

cp h
jr nc,\$+5
dec h
ld a,4

ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld l,e

ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
ld c,a
sub h
jr nc,\$+6
cpl
ld h,a
inc c
inc c

ld a,c
jr nc,\$+6
sub h \ jp \$+6
sub h
jr nc,\$+6
inc c \ inc c
cpl
ld h,a

ld a,l
ld l,h
ld h,a
sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

;iteration 9
sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

sll c \ rl b
sbc hl,bc
jr nc,\$+3
sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a
;12th iteration completed
; output in BC
srl b \ rr c
ld h,b
ld l,c
ret
```

## Exponentiation and Logarithms

### Exponentiation

Here, exponentiation refers to [itex]2^{x}[/itex]. Observe that [itex]2^{39/16}=2^{2+7/16} = 2^{2+0/2+1/4+1/8+1/16}=2^{2}2^{0/2}2^{1/4}2^{1/8}2^{1/16}[/itex] The observant reader will see that taking away the trivial integer part of the power, the fractional part can be represented as a sequence of square roots with 2s wherever there are 1s in the binary expansion. So to compute [itex]2^{x}[/itex] where x is on [0,1), in pseudo code we have:

```   If x==0
return 1.0
repeat until carry is set
x>>=1    ;shift X right
y=sqrt(2)
while x>0
x>>=1
if (carry is set)
y*=2
y=sqrt(y)
```

Converting that to assembly:

```   ;Inputs:
;     HL is the 8.8 fixed point number 'x' for 2^x
;Outputs:
;     DEHL is the 24.8 fixed point result. If there was overflow exceeding 2^24, then this value is set to the max.
ld a,l
or a
push hl     ;save H for later, H is the integer part of the power
ld hl,1
jr z,integerpow
scf      ;set the carry flag so that a bit is rotated into a. This will act as our counter.
;wait until we come across the lowest bit. Also note that we
rra
jr nc,\$-1
ld hl,2*256
powloop:
push af
call FPSqrtHL    ;returns in HL
pop af
srl a
jr z,integerpow
jr nc,powloop
jp powloop
integerpow:
pop bc
;Now b is the integer part for 2^x that we need to multiply HL by.
ld de,0
ld a,b
or a
ret z

rl e \ rl d \ jr c,wayoverflow
djnz \$-7
ret
wayoverflow:
ld hl,-1
ld d,h
ld e,l
ret
```

Note that in the spirit of optimisation, that we can significantly speed up the fixed point sqrtHL routine since HL is never greater than 4.0.

### Logarithms

Here, logarithm refers to log,,2,, also written as lg(). Using logarithms will allow us to extend the exponential routine to other bases, such as 3^n without much difficulty. Here is a fast, fixed point log_2 routine, optimised for size:

```   Log_2_88_size:
;Inputs:
;     HL is an unsigned 8.8 fixed point number.
;Outputs:
;     HL is the signed 8.8 fixed point value of log base 2 of the input.
;Example:
;     pass HL = 3.0, returns 1.58203125 (actual is ~1.584962501...)
;averages about 39 t-states slower than original
;62 bytes
ex de,hl
ld hl,0
ld a,d
ld c,8
or a
jr z,DE_lessthan_1
srl d
jr z,logloop-1
inc l
rr e
jr \$-7
DE_lessthan_1:
ld a,e
dec hl
or a
ret z
inc l
dec l
jr nc,\$-2
ld e,a

inc d
logloop:
push hl
ld h,d
ld l,e
ld a,e
ld b,8

rla
jr nc,\$+5
djnz \$-7

ld e,h
ld d,a
pop hl
rr a         ;this is NOT supposed to be rra, we need the z flag affected
jr z,\$+7
srl d
rr e
inc l
dec c
jr nz,logloop
ret
```

## Arctangent and Natural Logarithm

Although these are not the fastest methods, the approach to the problem of computing these functions efficiently is interesting. As well, they can work faster than CORDIC or AGM methods. Given efficient routines for computing multiplication and division, the following are accurate to better than 8 bits of accuracy:

```   atan(x) : x(9+2x^2)/(9+5x^2) on [-1,1]
ln(x+1) : x(8+x)/(8+5x) on [0,1]
```

For a more complete analysis of how these formulas were derived, see this document. What is convenient to note is that the constant multiplications can be reduced to a few shift and add instuctions. So to put the functions into code:

```   ArcTan_88:
;Inputs: DE is positive
ld a,d
or a
jr z,ArcTan+2
inc a
jr z,ArcTan+2
ld b,d
ld c,e
ld de,256
call DE_Div_BC_88
call ArcTan
ex de,hl
or a
sbc hl,de
ret
ArcTan:
;Input: DE
;Output: HL
push de    ;save X for later
ld b,d
ld c,e
call BC_Times_DE ;x^2 computed
ld d,h
ld e,l
add hl,de   ;5x^2 is now in HL
ld a,9
ld h,a      ;9+5x^2->HL
ld a,9
sla e
rl d
ld d,a      ;9+2x^2->DE
ld b,h
ld c,l
call DE_Div_BC_88   ;->DE
pop bc
jp BC_Times_DE
```

For the following ln() routine, the approximating function is employed taking advantage of lots of optimisations, putting it between the computing speed of multiplication and division:

```   lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
dec h
dec h
jr nz,\$+8
inc l \ dec l
ret nz
ld l,177
ret
inc h
ret nz
ld b,h
ld c,l
ld e,l
ld d,8
ex de,hl

ld h,a \ ld l,b
sla h \ jr c,\$+3 \ ld l,c
rl l
ld a,h
ld l,a
ld h,b
cp a
ret
```

But that only works on [1,2], so we have to reduce speed a little to extend it to all positive inputs (0,128):

```   lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;	 returns nc if input is negative (HL not modified)
;Error:
;	The error on the outputs is as follows:
;		20592 inputs are exact
;		12075 inputs are off by 1/256
;		100 inputs are off by 2/256
;	So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

ld a,h \ or l \ jr nz,\$+5
ld h,80h \ ret
dec h
dec h
jr nz,\$+9
inc l \ dec l
jr nz,normalizeln-1
ld l,177
ret
inc h
jr nz,normalizeln
ld b,h
ld c,l
ld e,l
ld d,8
ex de,hl
call HL_Div_DE
ld h,a \ ld l,b
sla h \ jr c,\$+3 \ ld l,c
rl l
ld a,h
ld h,b
ld l,a
scf
ret
HL_Div_DE:

inc h
normalizeln:
xor a
inc h \ ret m
ld d,a \ ld e,a
ld a,l
jr z,toosmall
inc e \ srl h \ rra \ jr nz,\$-4
rla \ rl h
dec e
stepin:
ld l,a
push de
call lognat
pop de
;now multiply DE by 355, then divide by 2 (rounding)
ld b,d \ ld c,e \ ld a,d
ex de,hl
sra h \ rr l
scf
ret
toosmall:
dec d
dec e \ add a,a \ jr nc,\$-2
inc h
jp stepin
```

Now it averages about 1250 t-states, still faster than some implementations of division.

If you pass HL=768, in fixed point that is 3.0, so output is 282 which is 1.1015625 in fixed point.

## Sine and Cosine

Unlike atan() and ln(), these have quickly converging Taylor series. From these, we can derive:

```   sin(x) : x-85x^3/512+x^5/128 is within 8 bits of accuracy on [-pi/2,pi/2]
cos(x) : 1-x^2/2+5x^4/128 is within 8 bits of accuracy on [-pi/2,pi/2] for almost all values
```

For sine, noting that 85=17*5 will help with fast constant multiplication:

```   sine_88:
;Inputs: de
push de
sra d \ rr e
ld b,d \ ld c,e
call BC_Times_DE
push hl     ;x^2/4
sra h \ rr l
ex de,hl
ld b,d \ ld c,e
call BC_Times_DE
sra h \ rr l
inc h
ex (sp),hl    ;x^4/128+1 is on stack, HL=x^2/4
xor a
ld d,a
ld b,h
ld c,l
ld b,h
ld c,l
ld e,l
ld l,h
ld h,a
rl e
rl e
jr nc,\$+3
inc hl

pop de
ex hl,de
or a
sbc hl,de
ex de,hl
pop bc
jp BC_Times_DE
```

## Table-Based Fixed 8.8

When speed is priority, using table-based routines are often the fastest way to achieve a task.

### Logarithms

Because we are working in binary, lg(), which is log,,2,,() is the easiest and most practical to compute. From this, we can derive other bases, too. Now to use some identities, note that lg(x/2)=lg(x)-lg(2) = lg(x)-1. So then we have lg(x)=1+lg(x/2). So what we will do is make a routine that works on [1,2] and extend that to all other values. If our input is less than 1, we have lg(x)=lg(x*2)-1, so we double our input, and subtract from an accumulator. If input is >1, we will divide it by 2 and add to the accumulator. Then, we only have 256 values to worry about in our table. But, if our input is now even, we can divide by 2 without losing any accuracy (no bits lost) and just increment the accumulator. We can keep doing this until we finally have an odd number (which means we need to take care of input=0 as a separate case). Then we have a table of 128 elements instead--all of the odd numbers. And now we can build our algorithm:

```   lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
ld de,lgLUT
ld b,0
ld a,h
or a
ret m     ;If input is negative, this algorithm won't suffice (this is only for the Reals, not Complex)
ld a,l
jr z,\$+8
inc b \ srl h \ rra \ jr nz,\$-4   ;input >2, so we rotate, using b as the accumulator
or a \ jr nz,\$+6
ld hl,8000h \ ret     ;If input was 0, use this as -inf
rra \ inc b \ jr nc,\$-2      ;input<1, so we rotate
;A is the element to look up in the LUT
ld l,a
ld c,h
dec b
ld e,(hl)
inc hl
ld d,(hl)
ex de,hl
ret
lglut:
.dw \$F800
.dw \$F996
.dw \$FA52
.dw \$FACF
.dw \$FB2C
.dw \$FB76
.dw \$FBB3
.dw \$FBE8
.dw \$FC16
.dw \$FC3F
.dw \$FC64
.dw \$FC86
.dw \$FCA5
.dw \$FCC1
.dw \$FCDC
.dw \$FCF4
.dw \$FD0B
.dw \$FD21
.dw \$FD36
.dw \$FD49
.dw \$FD5C
.dw \$FD6D
.dw \$FD7E
.dw \$FD8E
.dw \$FD9D
.dw \$FDAC
.dw \$FDBA
.dw \$FDC8
.dw \$FDD5
.dw \$FDE2
.dw \$FDEE
.dw \$FDFA
.dw \$FE06
.dw \$FE11
.dw \$FE1C
.dw \$FE26
.dw \$FE31
.dw \$FE3B
.dw \$FE44
.dw \$FE4E
.dw \$FE57
.dw \$FE60
.dw \$FE69
.dw \$FE71
.dw \$FE7A
.dw \$FE82
.dw \$FE8A
.dw \$FE92
.dw \$FE9A
.dw \$FEA1
.dw \$FEA9
.dw \$FEB0
.dw \$FEB7
.dw \$FEBE
.dw \$FEC5
.dw \$FECB
.dw \$FED2
.dw \$FED8
.dw \$FEDF
.dw \$FEE5
.dw \$FEEB
.dw \$FEF1
.dw \$FEF7
.dw \$FEFD
.dw \$FF03
.dw \$FF09
.dw \$FF0E
.dw \$FF14
.dw \$FF19
.dw \$FF1E
.dw \$FF24
.dw \$FF29
.dw \$FF2E
.dw \$FF33
.dw \$FF38
.dw \$FF3D
.dw \$FF42
.dw \$FF47
.dw \$FF4B
.dw \$FF50
.dw \$FF55
.dw \$FF59
.dw \$FF5E
.dw \$FF62
.dw \$FF67
.dw \$FF6B
.dw \$FF6F
.dw \$FF74
.dw \$FF78
.dw \$FF7C
.dw \$FF80
.dw \$FF84
.dw \$FF88
.dw \$FF8C
.dw \$FF90
.dw \$FF94
.dw \$FF98
.dw \$FF9B
.dw \$FF9F
.dw \$FFA3
.dw \$FFA7
.dw \$FFAA
.dw \$FFAE
.dw \$FFB2
.dw \$FFB5
.dw \$FFB9
.dw \$FFBC
.dw \$FFC0
.dw \$FFC3
.dw \$FFC6
.dw \$FFCA
.dw \$FFCD
.dw \$FFD0
.dw \$FFD4
.dw \$FFD7
.dw \$FFDA
.dw \$FFDD
.dw \$FFE0
.dw \$FFE4
.dw \$FFE7
.dw \$FFEA
.dw \$FFED
.dw \$FFF0
.dw \$FFF3
.dw \$FFF6
.dw \$FFF9
.dw \$FFFC
.dw \$FFFF
```

Adding a little overhead code, we can get an ln() routine. Note that to convert, we have ln(x)=lg(x)/lg(e), but 1/lg(e) is a constant, so we can use constant multiplication. (it is approximately 355/512 for a good estimate):

```   ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
call lg_88
;now signed multiply HL by 355, then divide by 2 (rounding)
ld de,0
bit 7,h
jr z,\$+9
dec e \ xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ld b,h
ld c,l
xor a
sra a \ rr h
ld l,h
ld h,a
inc e
ret nz
xor a \ sub l \ ld l,a
sbc a,a \ sub h \ ld h,a
ret
```

### Arctan

Arctan also can be computed on the whole real line just from the values on [0,1] by the identities: arctan(-x)=-arctan(x) and arctan(x)=pi/2-arctan(1/x). Unlike ln() where we need an LUT (Look Up Table) of 16-bit numbers, we can just use 8-bit numbers. Plus, since arctan is pretty close to a linear function for a while, a large chunk of the LUT can be omitted by checking if the input is within a certain range:

```   arctan_88:
;Input:
;	D.E
;Output: atan(D.E)->D.E
push de
ld a,d
or a
jp p,\$+5
neg
ld d,a
dec a
jr nz,checkneedinv
inc e \ dec e \ jr nz,checkneedinv
pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
inc a
call nz,DEgt1_Inv
;0.E is the value to atan
push hl
ld a,e
cp 46 \ ret c
dec a \ cp 42h \ ret c
dec a \ cp 4Eh \ ret c
dec a \ cp 57h \ ret c
dec a \ cp 5Eh \ ret c
dec a \ cp 64h \ ret c
dec a \ cp 6Ah \ ret c
dec a \ cp 6Fh \ ret c
sub 6Fh \ ld e,a
ld hl,atanlut
ld a,(hl)
ret
ld e,a
pop bc
ld a,b
or a
jp p,\$+5
neg
jr z,\$+9
ld hl,402
or a
sbc hl,de
ex de,hl
rl b
ret nc
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
ret
DEgt1_Inv:
;Works if DE>1
ld hl,256
ld b,8
InvLoop:
sbc hl,de
jr nc,\$+3
djnz InvLoop
cpl
ld e,a
ld d,b
ret
atanlut:
.db \$6F
.db \$6F
.db \$70
.db \$71
.db \$72
.db \$73
.db \$73
.db \$74
.db \$75
.db \$76
.db \$77
.db \$77
.db \$78
.db \$79
.db \$7A
.db \$7B
.db \$7B
.db \$7C
.db \$7D
.db \$7E
.db \$7F
.db \$7F
.db \$80
.db \$81
.db \$82
.db \$82
.db \$83
.db \$84
.db \$85
.db \$85
.db \$86
.db \$87
.db \$88
.db \$88
.db \$89
.db \$8A
.db \$8B
.db \$8B
.db \$8C
.db \$8D
.db \$8E
.db \$8E
.db \$8F
.db \$90
.db \$90
.db \$91
.db \$92
.db \$93
.db \$93
.db \$94
.db \$95
.db \$95
.db \$96
.db \$97
.db \$97
.db \$98
.db \$99
.db \$9A
.db \$9A
.db \$9B
.db \$9C
.db \$9C
.db \$9D
.db \$9E
.db \$9E
.db \$9F
.db \$A0
.db \$A0
.db \$A1
.db \$A2
.db \$A2
.db \$A3
.db \$A3
.db \$A4
.db \$A5
.db \$A5
.db \$A6
.db \$A7
.db \$A7
.db \$A8
.db \$A9
.db \$A9
.db \$AA
.db \$AA
.db \$AB
.db \$AC
.db \$AC
.db \$AE
.db \$AF
.db \$AF
.db \$B0
.db \$B0
.db \$B1
.db \$B2
.db \$B2
.db \$B3
.db \$B3
.db \$B4
.db \$B5
.db \$B5
.db \$B6
.db \$B6
.db \$B7
.db \$B7
.db \$B8
.db \$B9
.db \$B9
.db \$BA
.db \$BA
.db \$BB
.db \$BB
.db \$BC
.db \$BC
.db \$BD
.db \$BE
.db \$BE
.db \$BF
.db \$BF
.db \$C0
.db \$C0
.db \$C1
.db \$C1
.db \$C2
.db \$C2
.db \$C3
.db \$C3
.db \$C4
.db \$C4
.db \$C5
.db \$C6
.db \$C6
.db \$C7
.db \$C7
.db \$C8
.db \$C8
.db \$C9
```

# Random Numbers

## 16-bit Pseudo-Random Number Generator

The previous routine in this place was 1592cc and 280 bytes (compare to 291cc, 58 bytes). It worked well enough in terms of randomness, but there were some issues that merited a replacement. The following was tested and passed all tests at the CAcert Research Lab (see zrng2 by Zeda Thomas). The data that was output from this routine was from calling this routine to generate two bytes at a time to generate 16MiB of data.

Usage Proper use of this routine is to use the upper bits. The lower bits are not nearly as "random" as the upper bits. With this said, if you want to genreate a number on a range from [0,N), treat the output as a 0.16 fixed-point number, and multiply by N and take the integer part. In assembly, this translated to multiplying the output by N and taking the overflow (bits[16:]).

```   #ifndef smc     ;\
#define smc     ; |Remove or comment for Apps or if you *really* don't want SMC
#end            ;/
#ifndef smc             ;\
#define seed1_0 8000h   ; |If you are not using SMC, you'll need to define
#define seed1_1 8002h   ; |these vars yourself.
#define seed2_0 8004h   ; |
#define seed2_1 8006h   ; |
#endif                  ;/
rand16:
;Inputs:
;   seed1
;   seed2>0
;Output:
;   HL
;Destroys:
;   A,BC,DE. Could all be used as 'less random' options, too.
;Notes:
;   Tested and passes all CAcert Labs tests.
;   Has a period of 18,446,744,069,414,584,320 (roughly 18.4 quintillion)
;   This would take over 11.3 million years at 15MHz to finish a full period.
;291cc, +32cc if not using smc
;58 bytes, +2 if not using smc
;;lcg
#ifdef smc
seed1_0=\$+1
ld hl,12345
seed1_1=\$+1
ld de,6789
#else
ld hl,(seed1_0)
ld de,(seed1_1)
#endif
ld b,h
ld c,l
add hl,hl \ rl e \ rl d
add hl,hl \ rl e \ rl d
inc l
ld (seed1_0),hl
ld hl,(seed1_1)
ld (seed1_1),hl
ex de,hl
;;lfsr
#ifdef smc
seed2_0=\$+1
ld hl,9876
seed2_1=\$+1
ld bc,54321
#else
ld hl,(seed2_0)
ld bc,(seed2_1)
#endif
rl c
rl b
ld (seed2_1),bc
sbc a,a
and %11000101
xor l
ld l,a
ld (seed2_0),hl
ex de,hl