N64 Programming

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/N64_Programming

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

CPU overview

 
The NEC VR4300 at the heart of the N64.

Processor: 93.75 MHz NEC VR4300, based on the MIPS R4300i-series 64-bit RISC CPU.

Registers

There are 32 general registers, which are referred to by the following conventional names.

R0 = always zero. Any attempt to modify this register silently fails.
T0-T9 = scratch (temporary) registers. Not preserved across function calls.
S0-S7 = saved registers. A function that uses them is expected to restore them before it returns. Trash at will if you know how.
A0-A3 = parameter passing to subroutines. Formal but not rigid.
RA = return address from subroutine. Written by JAL and held in a register, not pulled from a 'stack'. Change at convenience.
V0-V1 = arithmetic values, function return values.
SP = stack pointer. Informal.
AT = assembler temporary. Free to use as long as the assembler is not also using it to expand pseudo-instructions.

These are software conventions and are not enforced by the hardware, save for R0, which is hardwired to zero.


Instructions are WORD sized (32 bits).
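
Because every instruction is a single 32-bit word with fields at fixed bit positions, pulling an instruction apart takes only shifts and masks. The sketch below is an illustration, not part of any official SDK: the names `mips_fields`, `decode_fields` and `reg_name` are made up here, and the register table follows the usual MIPS assembler numbering (it includes K0-K1, GP and S8, which the list above leaves out).

```c
#include <assert.h>
#include <stdint.h>

/* Conventional names for general registers 0..31. */
static const char *const reg_names[32] = {
    "r0", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7",
    "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8", "t9", "k0", "k1", "gp", "sp", "s8", "ra"
};

/* Fields of a register-type (R-type) instruction word. */
typedef struct {
    unsigned opcode; /* bits 31..26 */
    unsigned rs;     /* bits 25..21, first source register  */
    unsigned rt;     /* bits 20..16, second source register */
    unsigned rd;     /* bits 15..11, destination register   */
    unsigned shamt;  /* bits 10..6,  shift amount           */
    unsigned funct;  /* bits  5..0,  operation selector     */
} mips_fields;

/* Split a 32-bit instruction word into its R-type fields. */
static mips_fields decode_fields(uint32_t word)
{
    mips_fields f;
    f.opcode = (word >> 26) & 0x3F;
    f.rs     = (word >> 21) & 0x1F;
    f.rt     = (word >> 16) & 0x1F;
    f.rd     = (word >> 11) & 0x1F;
    f.shamt  = (word >> 6)  & 0x1F;
    f.funct  = word & 0x3F;
    return f;
}

/* Look up the conventional name of a register number. */
static const char *reg_name(unsigned index)
{
    assert(index < 32);
    return reg_names[index];
}
```

For instance, decoding the word 0x01202021 gives opcode 0, funct 0x21 (ADDU), rd = a0, rs = t1 and rt = r0, which matches the ADDU line in the Mario Golf listing further down.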

Coprocessors

 
The RCP, also known as COP2.

The CPU is accompanied by three coprocessors.

  • COP0 = system control coprocessor, which includes the memory management unit (MMU). Better known for providing 'virtual memory'.
  • COP1 = floating-point unit (FPU).
  • COP2 = video coprocessor (RCP).

Branch delays

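A branch takes effect one instruction late: the instruction immediately following it (the 'delay slot') is executed before control reaches the branch target. This means a branch instruction such as beq r0,r0,8006D234h also executes the instruction following it. Not every opcode can be placed in the delay slot; in particular, another branch or jump must not be put there.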

Note that 'beq r0,r0,TARGET' is effectively 'bra TARGET'.


Examples

The following machine code is found in Mario Golf.

[1120:0027] 800B0130: BEQ     t1[800FBBD0],r0[00000000],800B01D0h
[0000:0000] 800B0134: NOP

The NOP (no operation) in the delay slot is executed before the instruction at the branch target (or, if the branch is not taken, the next instruction in sequence) is executed.

[0c02:c0d7] 800B01B4: JAL     800B035C
[0120:2021] 800B01B8: ADDU    a0[00000038],t1[800FBBD0],r0[00000000]

An ADDU (add unsigned) is executed as the program counter is set to address 800B035C.

For our hobbyist purposes, it is much safer to always place a NOP (no operation) in the delay slot, even though this wastes an instruction, and to optimize later if necessary.
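
The target addresses in the two listings can be recomputed from the raw instruction words shown in brackets. The following is a minimal sketch based on the standard MIPS encodings; `branch_target` and `jump_target` are names invented for this illustration. Both calculations start from the address of the delay slot (PC + 4), which is one reason the delay slot matters when reading disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a conditional branch (BEQ, BNE, ...): the low 16 bits of the
   word are a signed offset in instructions, relative to the delay slot. */
static uint32_t branch_target(uint32_t pc, uint32_t word)
{
    int32_t offset = (int32_t)(word & 0xFFFF);

    assert((pc & 3) == 0);          /* instructions are word aligned */
    if (offset & 0x8000)
        offset -= 0x10000;          /* sign-extend the 16-bit field */
    return pc + 4 + (uint32_t)(offset * 4);
}

/* Target of J/JAL: the low 26 bits are a word index that replaces the
   low 28 bits of the delay slot's address. */
static uint32_t jump_target(uint32_t pc, uint32_t word)
{
    assert((pc & 3) == 0);
    return ((pc + 4) & 0xF0000000u) | ((word & 0x03FFFFFFu) << 2);
}
```

With the values from the listings, `branch_target(0x800B0130, 0x11200027)` gives 0x800B01D0 and `jump_target(0x800B01B4, 0x0C02C0D7)` gives 0x800B035C.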

Signed Addition

Care is required when using the unsigned family of MIPS instructions, as the signedness refers only to whether the instruction generates a trap on overflow. MIPS sign-extends the 16-bit immediate operand regardless of whether ADDI or ADDIU is used:

LUI $A0, 0x8013 ;A0 = 0x80130000
ADDIU $A0, $A0, 0xFFFF ;immediate is sign-extended to -1, so A0 = 0x8012FFFF

Essentially, when adding an immediate greater than 0x7FFF, even with the unsigned addition instructions, the effect is as if the value were negative. This also affects relative addressing, e.g.:

LUI $A0, 0x8013 ;A0 = 0x80130000
LW $A1, 0xFFFC($A0) ;offset is sign-extended to -4, so the word is loaded from 0x8012FFFC
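
The arithmetic in the two examples above can be modelled in a few lines of C. This is only a sketch of the 32-bit behaviour; the helper names (`sign_extend16`, `addiu`, `split_address`) are invented for this illustration. `split_address` shows the usual consequence for anyone building an address from an upper and a lower half: when the lower half is 0x8000 or more, the upper half has to be one higher than the address suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 16-bit immediate field to 32 bits. */
static int32_t sign_extend16(uint32_t imm)
{
    int32_t value = (int32_t)(imm & 0xFFFF);
    return (value & 0x8000) ? value - 0x10000 : value;
}

/* Model of ADDIU (32-bit view): rs + sign-extended immediate.
   The sum wraps silently; no overflow trap is raised. */
static uint32_t addiu(uint32_t rs, uint32_t imm)
{
    return rs + (uint32_t)sign_extend16(imm);
}

/* Split an address into an upper half (for LUI) and a lower half
   (for ADDIU or a load/store offset), compensating for sign extension. */
static void split_address(uint32_t addr, uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    *lo = addr & 0xFFFF;
    *hi = ((addr + 0x8000u) >> 16) & 0xFFFF;
}
```

For example, `addiu(0x80130000, 0xFFFF)` yields 0x8012FFFF, and splitting 0x8012FFFF yields an upper half of 0x8013 with a lower half of 0xFFFF.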


Video coprocessor

The RCP (VDP,PPU,video controller) is mainly interesting for polygon crunching and post-filter effects.

It is formed from two components: the RSP (Reality Signal Processor) and the RDP (Reality Display Processor).

Textures and framebuffers live in RDRAM, which is shared with the CPU.

The RDP's actual texture memory is only 4KB, so it can only operate on this amount of texture data at a time.

RSP(Reality Signal Processor)

RSP is your transform and lighting unit (TnL). It manipulates world data and textures.

Mathematically, this means lots of matrices to transform from local data -> world space -> view space (projection w/ z-perspective correction).

Transform means scaling, translating and rotating for polygons, lighting normals, and texture UVs.

It creates lists of primitives (triangles and lines) for the RDP to render.
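
To make the 'local data -> world space' step concrete, here is a small host-side sketch of combining a scale and a translation into one matrix and applying it to a vertex. It is an illustration only: the types and function names are invented here, floats are used for readability (the fixed-point formats a real microcode would use are not shown), and matrices are stored row-major with column vectors.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4]; } mat4;   /* m[row][column] */

static mat4 mat4_identity(void)
{
    mat4 r = { { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} } };
    return r;
}

/* Matrix that moves a point by (tx, ty, tz). */
static mat4 mat4_translate(float tx, float ty, float tz)
{
    mat4 r = mat4_identity();
    r.m[0][3] = tx;
    r.m[1][3] = ty;
    r.m[2][3] = tz;
    return r;
}

/* Matrix that scales uniformly by s. */
static mat4 mat4_scale(float s)
{
    mat4 r = mat4_identity();
    r.m[0][0] = s;
    r.m[1][1] = s;
    r.m[2][2] = s;
    return r;
}

/* Combine two transforms; the result applies b first, then a. */
static mat4 mat4_mul(const mat4 *a, const mat4 *b)
{
    mat4 r;
    int i, j, k;

    assert(a && b);
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            r.m[i][j] = 0.0f;
            for (k = 0; k < 4; k++)
                r.m[i][j] += a->m[i][k] * b->m[k][j];
        }
    }
    return r;
}

/* Apply a transform to a vertex. */
static vec4 mat4_transform(const mat4 *m, vec4 v)
{
    const float in[4] = { v.x, v.y, v.z, v.w };
    float out[4];
    vec4 r;
    int i, k;

    assert(m);
    for (i = 0; i < 4; i++) {
        out[i] = 0.0f;
        for (k = 0; k < 4; k++)
            out[i] += m->m[i][k] * in[k];
    }
    r.x = out[0]; r.y = out[1]; r.z = out[2]; r.w = out[3];
    return r;
}
```

Scaling by 2 and then translating by (1, 2, 3) sends the local vertex (1, 1, 1) to (3, 4, 5). Concatenating the matrices once and reusing the result for every vertex is the reason the transform stage is expressed with matrices at all.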

RDP(Reality Display Processor)

RDP is the display unit.

Rasterizer, fog, environmental, color blending. Anti-aliasing effects.

Lower-level pixel handler.

uCodes

The RCP has its own language, dubbed 'uCodes' (256 microcodes).

Think of modern vertex and fragment shaders from OpenGL or vertex and pixel shaders from Direct3D - both stages combined.

Each uCode is a string of ASM instructions run by the RSP (R4000 coprocessor, COP2). It also sets up the RDP batch renderer.

A display list is a sequence of uCodes defined by the game. It is fed to the RSP.


Note: Emulator authors choose to translate uCodes into higher-level languages.

The programmers can define their own vertex / texture formats, lighting methods, overdraw detection and other flexible wizardry.

The microcodes are uploaded to the RCP at run-time, so each game has its own library of drawing functions (some are like DSP1 -> DSP1A -> DSP1B and others are akin to DSP2,DSP3).

Texture variations

Games use custom formats. Furthermore, linear bitmaps can be any size (320x8, 48x13). 4/8/16/32-bpp is the norm.

The microcode libraries tend to define several 'accepted' formats.

This is a sample list from video plugin authors.

  • 16-bit RGBA = 5551. Red,green,blue,alpha (transparency).
  • 32-bit RGBA = 8888.
  • 4-bit IA = 31. Grayscale intensity (luminosity,brightness) + alpha.
  • 8-bit IA = 44.
  • 16-bit IA = 88.
  • 4-bit I = 40. Grayscale only.
  • 8-bit I = 80. Grayscale.
  • 16-bit I = (16)(0).
  • 4-bit CI = 40. Palette lookup --> 16-bit RGBA or 16-bit IA.
  • 8-bit CI = 80. Palette lookup --> 16-bit RGBA or 16-bit IA.

YUV = a luma/chroma color format. Output is RGB (888) + full alpha.
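
As an illustration of how a video plugin might expand two of the formats listed above to 32-bit RGBA, here is a hedged sketch. The bit layouts are assumptions based on the digit notation in the list (5551 read as red, green, blue, alpha from the top bit down; 44 read as intensity in the high nibble and alpha in the low nibble), and all names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba8888;

/* Expand a 5-bit channel to 8 bits by replicating its top bits. */
static uint8_t expand5(unsigned c)
{
    assert(c < 32);
    return (uint8_t)((c << 3) | (c >> 2));
}

/* Expand a 4-bit channel to 8 bits (0x0..0xF maps to 0x00..0xFF). */
static uint8_t expand4(unsigned c)
{
    assert(c < 16);
    return (uint8_t)(c * 17);
}

/* 16-bit RGBA 5551: RRRRRGGG GGBBBBBA (assumed layout). */
static rgba8888 decode_rgba5551(uint16_t p)
{
    rgba8888 out;
    out.r = expand5((p >> 11) & 0x1F);
    out.g = expand5((p >> 6) & 0x1F);
    out.b = expand5((p >> 1) & 0x1F);
    out.a = (p & 1) ? 255 : 0;
    return out;
}

/* 8-bit IA 44: IIIIAAAA (assumed layout); intensity fills R, G and B. */
static rgba8888 decode_ia44(uint8_t p)
{
    rgba8888 out;
    uint8_t i = expand4(p >> 4);
    out.r = i;
    out.g = i;
    out.b = i;
    out.a = expand4(p & 0x0F);
    return out;
}
```

Under these assumptions, 0xF801 decodes to opaque pure red and the IA byte 0x8F decodes to a fully opaque mid grey (136, 136, 136, 255).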


Memory mapping

CPU  Main Memory (RDRAM) (2)           CDROM (8)
 |        |                               |
 ------------------------------------------
    |                    |
 Registers (1)      Cartridge (ROM) (4)


Naturally, we'd like to keep our data in registers (1 clock cycle).

Having the CPU execute from CDROM would be slow (8 cycles). Even cartridge memory is faster.

Most likely, the game will copy the important code and data to RDRAM to decrease load times and improve execution speed.

This is done via direct memory access (DMA), a fast way to transfer blocks of data.

Note that you may also find self-modifying code since we're running from RAM.

Because the code runs from RAM, it is linked for RAM addresses and not ROM addresses.

Think of 8-bit NES/SMS/GB page offsets.

Examples

[0361:d824] 8002A14C: AND     k1[0000FF03],k1[0000FF03],at[FFFF00FF]
[0369:d825] 8002A150: OR      k1[00000003],k1[00000003],t1[0000FF00]
[3c09:a430] 8002A154: LUI     t1[0000FF00],FFFFA430h

This code, which executes at RAM address 8002A14C, is stored at ROM offset $554C (Fushigi no Dungeon 2).

Memory map

Our memory map covers $0000:0000-$FFFF:FFFF.

COP0 can touch $2000:0000 upwards, meaning it can change which physical addresses the virtual addresses point at. It does so in pages ranging from 16 MB (24-bit offsets) down to 4 KB.

Generally,

$0000:0000 = ROM.
$1000:0000 = ROM.
$8000:0000 = RDRAM (cached). Code.
$A000:0000 = RDRAM (uncached). Data.
$A400:0000 = RCP registers, including PI and SI. DMA registers.
$B000:0000 = ROM (DMA, LD).

You'll see COP0's address remapping referred to as the 'translation lookaside buffer' (TLB).
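
The $8000:0000, $A000:0000 and $B000:0000 entries do not go through the TLB at all. Assuming the standard MIPS layout, where $8000:0000-$9FFF:FFFF and $A000:0000-$BFFF:FFFF are two fixed windows onto the same physical addresses, the conversion is a simple mask. The helper names below are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* True for addresses in the two fixed (non-TLB) windows. */
static int is_direct_mapped(uint32_t vaddr)
{
    return vaddr >= 0x80000000u && vaddr < 0xC0000000u;
}

/* Virtual address in a fixed window -> physical address. */
static uint32_t virt_to_phys(uint32_t vaddr)
{
    assert(is_direct_mapped(vaddr));
    return vaddr & 0x1FFFFFFFu;
}

/* Physical address -> virtual address in the $8000:0000 window. */
static uint32_t phys_to_8000(uint32_t paddr)
{
    return (paddr & 0x1FFFFFFFu) | 0x80000000u;
}

/* Physical address -> virtual address in the $A000:0000 window. */
static uint32_t phys_to_a000(uint32_t paddr)
{
    return (paddr & 0x1FFFFFFFu) | 0xA0000000u;
}
```

This ties several entries of the list together: `virt_to_phys(0xB0000000)` gives $1000:0000 (ROM), $8000:0000 and $A000:0000 both map to physical address 0 (RDRAM), and `virt_to_phys(0xA4600000)` gives 0x04600000, the PI registers used in the next section.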

PI DMA

DMA is a feature of all modern systems that quickly moves data between different components without involving the CPU. It is important to note that PI DMA memory transfers are done in 64-bit (8-byte) blocks. We can DMA from:

  • RDRAM to/from Cartridge (ROM,SRAM,FlashRAM,..)
  • RDRAM to/from RCP

DMA registers:

$A460:0000 = RAM address (address & 0x00FFFFFF)
$A460:0004 = ROM address (address & 0x1FFFFFFF)
$A460:0008 = Transfer size (from RAM to cartridge)
$A460:000C = Transfer size (from cartridge to RAM)
$A460:0010 = DMA Status

The two transfer size registers accept the DMA copy length (minus 1, as with a BPL-style countdown loop), and writing to one of them starts the transfer. Which length register you write to depends on the transfer direction you want. Once the transfer has been initiated, its progress can be checked by reading the status register. The following flags can be used to check the status:

  • DMA_BUSY = 0x00000001
  • DMA_ERROR = 0x00000008

Below is some example C code to utilize the N64's DMA functions:

/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **
** N64 DMA                         **
** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

#include <stdint.h>

typedef uint32_t u32;

typedef struct
{
	/* Addresses (32-bit hardware registers, not C pointers) */
	u32 ramp;
	u32 romp;
	
	/* Transfer sizes (8-byte aligned) */
	u32 size_ramrom; /* RAM -> ROM */
	u32 size_romram; /* RAM <- ROM */
	
	/* Status register */
	u32 status;
} DMA_REG;

/* DMA status flags */
enum
{
	DMA_BUSY  = 0x00000001,
	DMA_ERROR = 0x00000008
};

/* DMA registers ptr (assumes an ABI with 32-bit pointers) */
static volatile DMA_REG * const dmaregs = (volatile DMA_REG *)0xA4600000;

/* Copy data from ROM to RAM */
int dma_write_ram ( void *ram_ptr, void *rom_ptr, u32 length )
{
	/* Check that DMA is not busy already */
	while( dmaregs->status & DMA_BUSY );
	
	/* Write addresses */
	dmaregs->ramp = (u32)(uintptr_t)ram_ptr & 0x00FFFFFF; /* ram pointer */
	dmaregs->romp = (u32)(uintptr_t)rom_ptr & 0x1FFFFFFF; /* rom pointer */
	
	/* Write size (this starts the transfer) */
	dmaregs->size_romram = length - 1;
	
	/* Wait for transfer to finish */
	while( dmaregs->status & DMA_BUSY );
	
	/* Return size written */
	return length & 0xFFFFFFF8;
}
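
The register values that dma_write_ram computes can be checked on a development machine before anything is run on hardware. The sketch below repeats only the arithmetic (the two address masks and the 'length minus 1' rule) with no hardware access; `pi_dma_request`, `pi_dma_prepare` and `pi_dma_busy` are hypothetical names introduced for this example, and the requirement that the length be a non-zero multiple of 8 is an assumption drawn from the 64-bit block size mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Values destined for the three PI registers used by a ROM -> RAM copy. */
typedef struct {
    uint32_t ram_addr;     /* for $A460:0000 */
    uint32_t rom_addr;     /* for $A460:0004 */
    uint32_t length_field; /* for $A460:000C */
} pi_dma_request;

/* Compute the register values for a transfer of 'length' bytes. */
static pi_dma_request pi_dma_prepare(uint32_t ram_vaddr, uint32_t rom_vaddr,
                                     uint32_t length)
{
    pi_dma_request req;

    /* Assumption: transfers are whole 8-byte blocks. */
    assert(length >= 8 && (length & 7) == 0);

    req.ram_addr     = ram_vaddr & 0x00FFFFFFu;
    req.rom_addr     = rom_vaddr & 0x1FFFFFFFu;
    req.length_field = length - 1;
    return req;
}

/* Interpret a value read from the status register. */
static int pi_dma_busy(uint32_t status)
{
    return (status & 0x00000001u) != 0;   /* DMA_BUSY */
}
```

For a 4 KB copy from $B000:1000 to $8030:0000, this yields a RAM address of 0x00300000, a ROM address of 0x10001000 and a length field of 0xFFF.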


Compiling

 
The error handling routine of an application coded from the ground up in C

The N64 is by no means resource limited, so writing software for it in C is perfectly reasonable. One thing you must keep in mind, though: coding for the N64 requires extensive knowledge of both C and MIPS R4K assembly. However, assembly is only needed in the small routines that initialize the N64 or handle exceptions. You also have to be familiar with the GNU toolchain (namely binutils and gcc).

Initial Steps

Choose a directory where you want the compiled binaries to be installed, set $PREFIX to it, and ensure that it exists. For example,

export PREFIX=/opt/n64
mkdir -p $PREFIX

Next, set $GCC to the compiler you are going to use. For example,

export GCC=gcc

Building binutils

Download the binutils source code (we will use version 2.35.1), and extract it.

Create an out-of-tree build directory, and change into it:

mkdir -p build-binutils
cd build-binutils

Then run the following commands:

../binutils-2.35.1/configure \
  --target=mips64-elf --prefix=$PREFIX \
  --program-prefix=mips64- --with-cpu=vr4300 \
  --with-sysroot --disable-nls --disable-werror
make CC=$GCC
make install

Hopefully everything went well, and now you'll have a binutils package targeting MIPS. Move back to your working directory and now it is time to build GCC.

Compiling GCC

GCC needs to use some of the binaries that you compiled above, so do the following:

export PATH=$PREFIX/bin:$PATH

This allows the GCC build to find them.

As with binutils, download the gcc source (we will use version 10.2.0) and extract it.

Create an out-of-tree build directory, and change into it:

mkdir -p build-gcc
cd build-gcc

Then run the following commands:

../gcc-10.2.0/configure \
  --target=mips64-elf --prefix=$PREFIX \
  --program-prefix=mips64- --with-arch=vr4300 \
  --enable-languages=c,c++ --disable-threads \
  --disable-nls --without-headers
make all-gcc CC=$GCC
make all-target-libgcc CC=$GCC
make install-gcc
make install-target-libgcc

The mips64 toolchain should now be installed in your chosen $PREFIX/bin.

Coding examples

While coding C for the N64 is no different than any other platform (except that you don't have any libraries at your disposal), you may, at times, have to write to memory mapped registers. The following example (which utilizes DMA) demonstrates this:

/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **
** N64 DMA                         **
** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

#include <stdint.h>

typedef uint32_t u32;

typedef struct
{
	/* Addresses (32-bit hardware registers, not C pointers) */
	u32 ramp;
	u32 romp;
	
	/* Transfer sizes (8-byte aligned) */
	u32 size_ramrom; /* RAM -> ROM */
	u32 size_romram; /* RAM <- ROM */
	
	/* Status register */
	u32 status;
} DMA_REG;

/* DMA status flags */
enum
{
	DMA_BUSY  = 0x00000001,
	DMA_ERROR = 0x00000008
};

/* DMA registers ptr (assumes an ABI with 32-bit pointers) */
static volatile DMA_REG * const dmaregs = (volatile DMA_REG *)0xA4600000;

/* Copy data from ROM to RAM */
int dma_write_ram ( void *ram_ptr, void *rom_ptr, u32 length )
{
	/* Check that DMA is not busy already */
	while( dmaregs->status & DMA_BUSY );
	
	/* Write addresses */
	dmaregs->ramp = (u32)(uintptr_t)ram_ptr & 0x00FFFFFF; /* ram pointer */
	dmaregs->romp = (u32)(uintptr_t)rom_ptr & 0x1FFFFFFF; /* rom pointer */
	
	/* Write size (this starts the transfer) */
	dmaregs->size_romram = length - 1;
	
	/* Wait for transfer to finish */
	while( dmaregs->status & DMA_BUSY );
	
	/* Return size written */
	return length & 0xFFFFFFF8;
}

You may also have to use inline assembly a fair bit. The function below sets a breakpoint on a region of memory:

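#include <stdint.h>

typedef uint32_t u32;
typedef uint8_t  u8;

/* WatchLo flag bits: bit 1 = trap on read, bit 0 = trap on write */
enum
{
        BREAKPOINT_WRITE = 1,
        BREAKPOINT_READ  = 2
};

/* Set breakpoint */
void bp_set ( u32 addr, u8 flags )
{
        addr &= 0x3FFFF8; /* assuming lower 4MB, also doubleword aligned */
        flags &= 0x03; /* only lower two bits */
        addr |= flags;

        /* volatile keeps the compiler from dropping or moving the writes */
        asm volatile("mtc0 %0, $18\n" /* WatchLo */
        "mtc0 $zero, $19\n" /* WatchHi */
        ::"r"(addr));
}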