"How Do I Write an Emulator?",  Part 1, R1.00
by Daniel Boris (dboris@home.com)
October 17, 1999


1.0 Introduction
I have often seen people ask the question "How do I write an emulator?" This is 
a very difficult question to answer since it is a very complex topic. In this 
article I will attempt to teach the basics of emu programming. This article will 
not turn you into an emulator expert nor will it give step by step instructions 
on how to write a specific emulator. It will teach the basic concepts needed to 
understand emulation and give you a good place to start.  What I will be 
teaching here is how "I" write an emulator. These techniques are not the only 
way of doing things but they will show you the basic concepts, which you can 
build on and improve.

1.1 Prerequisites
I will attempt to keep this article as basic as possible, but I do have to 
assume that you (the reader) have a basic level of starting knowledge. First, 
you should know how to program in some language. I really can't teach 
programming in general in this article and attempting to learn emu programming 
and general programming at the same time is a very difficult task. If you do not 
know how to program I recommend learning that first with some simple projects, 
then moving on to learning emu programming. I am going to try to keep this as 
non-language-specific as possible but I will eventually have to get into some 
code examples, in which case I will use my native language, C. C is a popular 
language for writing emulators, it's platform independent, and it's easy to find 
information on. I will try to explain things clearly enough so that even if you 
don't know C you will still understand what the code is doing.

The other prerequisite is that you understand the binary and hexadecimal 
numbering systems and how to convert between binary, decimal and hex. When you 
are working at the hardware level everything is numbers, so it is very important 
to understand how these numbering systems work. I will be using all three 
systems liberally in this article.

1.2 What is an emulator?
Before we discuss how to write an emulator we really need to know what an 
emulator is. An emulator is a program that runs on a specific platform or 
platforms (PC, Mac, Unix, etc.) and allows you to run software written for a 
different platform (arcade game, console system, computer, etc.). For clarity we 
will call the system the emulator is running on the host system and the system 
that is being emulated the target system. The emulator is basically a program 
that simulates the behavior of the target system's hardware, which allows the 
host system to run software written specifically for the target system.

For example if I want to run the arcade game Pac-Man on a PC I would write an 
emulator on the PC that simulates the hardware in the arcade game Pac-Man. I can 
then load the software that runs on the Pac-Man hardware into the emulator and 
run it on the PC just like it was running on the real hardware.


2.0 Hardware Basics
Before we can get into the discussion of how to write an emulator we need to 
understand the basics of how microprocessor based hardware works. When it comes 
to writing emulators the topics of hardware and software are inextricably tied 
together. You really need a good understanding of both to be able to effectively 
write emulators.

Every processor-based system has three major components: the processor, memory, 
and IO hardware.


2.1 The Processor
The heart of the system is the microprocessor. The processor reads instructions 
from memory and does what these instructions tell it to do. An instruction may 
tell the processor to read a number from memory, add two numbers together, 
compare one number to another, etc. The processor will execute these 
instructions sequentially: it will read an instruction, execute it, read the 
next, execute it, and so on.

There are many different types of processors and most are identified by a 
number. Some common processors you might have heard of are the 6502, Z80, 6809, 
68000, etc. Each processor does the same basic thing as I described above but 
each does it in a different way. We also sometimes refer to processor 
"families". These are groups of processors, usually made by the same company, 
which are all very similar. For example the 68K processor family from Motorola 
includes the 68000, 68010, and 68020. Each of these processors is similar but 
each is slightly more advanced than the previous.

2.2 Processor Registers
Every processor has a series of internal registers that are used to store data, 
addresses and to control the processor. 

Program Counter
The most common register that you will find on all processors is the Program 
Counter (PC). The PC holds the address where the next instruction will be loaded 
from memory. The PC is initialized to some known state when the processor is 
reset and increments as each byte of each instruction is read. The PC can also 
be changed using jump and branch type instructions.

Working Registers
Processors have 1 or more "working registers" which are used to hold data that 
the processor needs to operate on. The 6502 for example has three working 
registers, the Accumulator, the X register and the Y register. The accumulator 
is used to hold data used in mathematical operations and also receives the 
result of the operations. The X and Y registers can also be used to hold general 
data, but they also have the special purpose of being used as counters.

Stack Pointer
Most processors have a special area of memory called the stack. The processor 
accesses the stack using what is called the LIFO method, Last In First Out. This 
means that the last piece of data to be put (or pushed) onto the stack will be 
the first piece to be retrieved (or pulled) from the stack. The stack is very 
handy for handling things like subroutine calls. For example, when the processor 
encounters a subroutine call it will push the current program counter onto the 
stack, then jump to the subroutine. When the subroutine ends and the processor 
needs to return, it pulls the old PC off the stack thus picking up where it left 
off. Processors usually have instructions which allow the programmer to manually 
push and pull values from the stack.

The Stack Pointer (SP) is used to keep track of the current position of the 
stack. For example the stack on the 6502 is at memory locations $1ff-$100; it 
starts at $1ff and works its way down towards $100. The stack pointer is 8 bits 
wide so it would start out at $ff (the processor knows it really means $1ff). 
When a value is pushed onto the stack it will be put at memory location $1ff and 
then the SP will be decremented so it points to $1fe. When data is pulled off 
the stack, the SP is incremented, then the data is read from that memory 
location.
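
As a rough sketch of the push and pull sequence just described, the C fragment 
below models the 6502 stack with a plain array. The names ram, sp, stack_push 
and stack_pull are made up for this illustration, and a starting value of $ff 
for the stack pointer is assumed, as in the example above.

```c
#include <stdint.h>

/* Hypothetical names for illustration: a flat 64K memory and an 8-bit SP. */
uint8_t ram[0x10000];
uint8_t sp = 0xFF;               /* $ff really means address $1ff */

/* Push: store the byte at $0100+SP, then decrement SP. */
void stack_push(uint8_t value)
{
    ram[0x0100 + sp] = value;
    sp--;                        /* 8-bit arithmetic: wraps from $00 to $ff */
}

/* Pull: increment SP first, then read the byte at $0100+SP. */
uint8_t stack_pull(void)
{
    sp++;
    return ram[0x0100 + sp];
}
```

Because sp is an 8-bit variable, the stack can never leave the $100-$1ff page: 
decrementing it at $00 simply wraps around to $ff.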

Status Register
The status register(s) usually serve two purposes. First they allow you to 
control certain aspects of the processor. For example there may be a bit in the 
status register that you can write to that enables or disables interrupts.

The other important part of the status register is the set of status flags. 
When instructions are executed they will often affect the state of one or more 
flag bits in the status register. For example the 6502 has a flag called the 
Zero Flag. Whenever the execution of an instruction results in a 0 this flag 
will be set to 1, and if an instruction results in anything else this flag is 
set to 0.

2.3 Memory
Memory is where the instructions that the processor executes and the data that 
these instructions act on are stored. There are 2 major types of memory, RAM and 
ROM. RAM stands for Random Access Memory and can be both written to and read 
from by the processor. ROM stands for Read Only Memory and can only be read 
from, not written to. 

2.4 IO
IO is the hardware that allows the processor to access the outside world. It 
allows it to get input from the user and to output results back to the user. IO 
includes things like sound circuitry, video circuits, controller inputs, and 
communication chips that communicate with external devices such as disk drives 
and printers. IO also includes things like timer circuits, which allow the 
processor to keep track of "real world" time.

2.5 Buses
For the processor, memory and IO to work together there needs to be some sort of 
interconnection between them. This is where buses come in. Buses are basically a 
group of wires that connect the devices in a system together. For example the 
data bus carries data between the processor, memory and IO devices. Each line in 
a bus carries 1 bit of information. So if a processor needs to move data 8 bits 
at a time it would need a bus that is 8 bits wide. There are three types of 
buses in a processor-based system, the data bus, the address bus, and the 
control bus. You can think of these buses as the what, where and how of moving 
data around in the system. The data bus tells what to move, the address bus 
tells where to move it and the control bus tells how to move it.

2.5.1 The Data Bus
The data bus is the path that data takes between the processor and the RAM and 
IO circuits. The data bus is bi-directional meaning that the same bus is used to 
send data from the processor to memory as is used to transfer data from memory 
back to the processor. The data bus is usually either 8 bits (1 byte), 16 bits 
(1 word), or 32 bits (1 longword) wide, although there are some exceptions to 
this.

2.5.2 The Address Bus
The address bus is used by the processor to tell the hardware where it wants 
data to go to or where it wants to get data from. So if the processor wants to 
write something out to memory it puts the data on the data bus and the address 
it wants to write to on the address bus. Every processor can access a limited 
number of memory addresses depending on how big the processor's address bus is. 
If the processor has a 16 bit address bus then it can access 65536 memory 
locations. These locations are numbered 0 - 65535 ($0 - $FFFF in hex). The 
Memory Map for a system tells you what is at each of those locations. For 
example addresses $0000-$0FFF might be working RAM, $1000-$1FFF might be video 
RAM, and $2000 might be an IO port that reads the position of a joystick. The 
circuitry in the system that actually implements the memory map is called an 
address decoder. This circuit looks at the addresses coming from the processor 
and activates the appropriate chip based on that address. This is important 
since the data and address bus might be connected to many different chips in the 
system and you only want one of these activated at any one time.
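
The job of the address decoder can be sketched in software as a function that 
maps an address to exactly one device. The device names and the 
decode_address() function below are hypothetical, and the address ranges are 
simply the ones from the example memory map above.

```c
#include <stdint.h>

/* Devices from the example memory map (names are hypothetical). */
typedef enum {
    DEV_WORK_RAM,    /* $0000-$0FFF */
    DEV_VIDEO_RAM,   /* $1000-$1FFF */
    DEV_JOYSTICK,    /* $2000       */
    DEV_NONE         /* nothing mapped at this address */
} device_t;

/* Software stand-in for the address decoder: select exactly one device. */
device_t decode_address(uint16_t addr)
{
    if (addr <= 0x0FFF) return DEV_WORK_RAM;
    if (addr <= 0x1FFF) return DEV_VIDEO_RAM;
    if (addr == 0x2000) return DEV_JOYSTICK;
    return DEV_NONE;
}
```

Since the tests are made in order and each one returns immediately, only one 
device is ever selected for a given address, which is the same guarantee the 
hardware decoder provides.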

2.5.3 The Control Bus
These signals aren't always referred to as a bus, but it is convenient to group 
them this way. As I said before the Control Bus is the "how" portion of the data 
transfer. The most important part of the control bus is the Read/Write 
signal(s). This signal is generated by the processor and indicates to the 
external hardware if the processor wants to write data to memory or read data 
from memory. This is obviously important for something like RAM which can be 
read or written, but it's also important for IO devices since the address 
decoding could have a read from a specific address do something different than a 
write to that same address. For example in an arcade game a read from address 
$2000 might read the state of a joystick, but a write may turn on some lights on 
the control panel. There are usually other signals on the control bus besides 
R/W and these will vary from processor to processor. 
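
To illustrate how the Read/Write signal can select different behavior at one 
address, here is a small sketch with separate read and write routines. The 
names bus_read, bus_write, joystick_state and panel_lights are invented for 
this example, and returning $FF for unmapped reads is an assumption, not a 
property of any particular machine.

```c
#include <stdint.h>

#define PANEL_PORT 0x2000     /* address used in the example above */

uint8_t joystick_state = 0;   /* hypothetical: filled in by host input code */
uint8_t panel_lights   = 0;   /* hypothetical: last value sent to the lights */
uint8_t ram[0x2000];          /* $0000-$1FFF, ordinary read/write memory */

/* The processor signals a read. */
uint8_t bus_read(uint16_t addr)
{
    if (addr == PANEL_PORT) return joystick_state;
    if (addr < sizeof ram)  return ram[addr];
    return 0xFF;              /* assumption: unmapped addresses read as $FF */
}

/* The processor signals a write. */
void bus_write(uint16_t addr, uint8_t value)
{
    if (addr == PANEL_PORT)     panel_lights = value;
    else if (addr < sizeof ram) ram[addr] = value;
    /* writes to unmapped addresses are ignored */
}
```

A write to $2000 followed by a read from $2000 does not return the value 
written, because the two accesses reach different pieces of hardware.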

2.6 Microcontrollers
You will sometimes hear about a special type of microprocessor called a 
microcontroller. A microcontroller is a microprocessor with RAM, ROM, and/or IO 
built into the same chip. It's very possible to have a microcontroller with RAM, 
ROM code and I/O ports all built in so it needs almost no external circuits to 
operate.

2.7 Interrupts
Interrupts are external signals that come into the processor and interrupt the 
normal flow of a program. When an interrupt signal is activated the processor 
stops what it is currently doing, saves some information about where it 
currently is in the program, and then jumps to a specific address in memory and 
executes an "interrupt handler" routine. When this routine is finished executing 
a special instruction tells the processor that the interrupt handler is done and 
to resume what it was doing when the interrupt occurred. The exact details of 
how interrupts are caused and handled will vary from CPU to CPU.

Some processors also have what are called exceptions. Exceptions are similar to 
interrupts but are usually caused by something inside the processor. For example 
a processor that has opcodes used for division will probably have a divide by 
zero exception since dividing by zero is mathematically invalid. So if a program 
tried to divide by zero the processor would jump to an exception handler routine 
for divide by zero.

2.8 Memory Mapped IO / Port Mapped IO
There are two ways that processors can access IO devices, memory mapped IO and 
port mapped IO. 

With memory mapped IO, the IO devices are accessed in the same way that RAM and 
ROM are accessed. The address decoding circuitry determines if the processor is 
accessing memory or an IO device and enables the appropriate device. This is the 
way that the 6502 processor (among others) accesses IO.

With port mapped IO, the processor has special instructions that are used to 
access IO devices. The instructions will activate a signal output from the 
processor which tells the external hardware that it is trying to do an IO access 
as opposed to a memory access. Port mapped IO is found on the Z80 and Intel 
80x86 processors among others.

Any processor can do memory mapped IO, even if it also supports port mapped IO; 
it all depends on how the external hardware is configured.
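
In an emulator the difference usually shows up as two separate sets of access 
routines. The sketch below assumes a processor with a 16-bit memory address 
space and a separate 256-entry port space; the array and function names are 
hypothetical.

```c
#include <stdint.h>

uint8_t memory[0x10000];   /* memory space, reached by normal instructions */
uint8_t io_ports[0x100];   /* separate IO space (assumed 8-bit port numbers) */

/* Ordinary load/store instructions go through these. */
uint8_t mem_read(uint16_t addr)             { return memory[addr]; }
void    mem_write(uint16_t addr, uint8_t v) { memory[addr] = v; }

/* The special IO instructions go through these instead. */
uint8_t port_in(uint8_t port)               { return io_ports[port]; }
void    port_out(uint8_t port, uint8_t v)   { io_ports[port] = v; }
```

Memory address $10 and port $10 are unrelated locations here, just as they are 
on hardware that uses a separate IO signal.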

2.9 Big/Little Endian
Another issue that is important to emulation is "endianness". Endianness 
determines how a processor handles multi-byte numbers. Big endian processors 
store the most significant byte first and the least significant byte last. 
Little endian processors store the bytes in the opposite order. Here is an 
example: let's say we want to store the hex number $1234 at memory location 
$1000. In a big endian processor it will be stored like this:

$1000  $12
$1001  $34

In a little endian processor it will be stored like this:

$1000  $34
$1001  $12

This also applies to 32 bit numbers. For example let's store $11223344 at 
location $1000.

Big endian:                 Little endian:
$1000  $11                  $1000  $44
$1001  $22                  $1001  $33
$1002  $33                  $1002  $22
$1003  $44                  $1003  $11

Each processor has a specific endianness. For example the 6502 is little endian 
and the 68000 series is big endian. There are also a few processors that can be 
configured to work either way.
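
The byte orders shown above translate into code like this. The helper names 
read16_le and read16_be are invented for this sketch; each assembles a 16-bit 
value from two consecutive bytes of a memory array.

```c
#include <stdint.h>

/* Read a 16-bit value stored low byte first (little endian, e.g. 6502). */
uint16_t read16_le(const uint8_t *mem, uint16_t addr)
{
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

/* Read a 16-bit value stored high byte first (big endian, e.g. 68000). */
uint16_t read16_be(const uint8_t *mem, uint16_t addr)
{
    return (uint16_t)((mem[addr] << 8) | mem[addr + 1]);
}
```

Because the bytes are combined with shifts rather than by casting a pointer, 
these helpers give the same result no matter what the endianness of the host 
system is.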


3.0 The CPU Core
Just as the CPU is the heart of a system, the CPU core is the heart of an 
emulator. It is the CPU core's job to read the instructions from memory and 
simulate their behavior.

The first question to ask about a CPU core is whether you want to write your own 
or use a pre-existing core. Most of the popular processors have publicly 
available CPU cores which can save you the trouble of writing your own. Writing 
a CPU core is a very tedious and time consuming process, and CPU cores are 
notorious for being difficult to debug.

3.1 Processor Registers
The first thing you need in a CPU core is to define variables for the various 
internal registers in the CPU. So for example the 6502 CPU has 6 internal 
registers; the program counter, the stack pointer, the status register, the X,Y 
registers and the accumulator. The program counter is 16-bits wide and the 
others are all 8-bits so they could be defined in C like this:

unsigned short program_counter;   /* 16 bits wide on most compilers */
unsigned char stack_pointer,status_register,x_reg,y_reg,accumulator;

The status register is composed of a series of 1 bit flags. For example if the 
result of an instruction is zero then the zero flag is set otherwise it is 
cleared. The individual flags are used extensively by the CPU, but they are 
rarely used in the form of a complete 8-bit number so it is more efficient to 
handle each flag as a separate variable:

int    zero_flag;
int    sign_flag;
int    overflow_flag;
int    break_flag;
int    decimal_flag;
int    interrupt_flag;
int    carry_flag;

In those few cases when the whole status byte is needed we can call a routine to 
assemble these back into a complete byte.
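
A minimal sketch of such a routine, together with its inverse, might look like 
this. The function names pack_status and unpack_status are assumptions for 
illustration; the bit positions are the standard 6502 layout (N V - B D I Z C 
from bit 7 down to bit 0).

```c
#include <stdint.h>

/* One variable per flag, as in the text. */
int zero_flag, sign_flag, overflow_flag, break_flag;
int decimal_flag, interrupt_flag, carry_flag;

/* Assemble the separate flags into one status byte. */
uint8_t pack_status(void)
{
    return (uint8_t)((sign_flag      ? 0x80 : 0) |
                     (overflow_flag  ? 0x40 : 0) |
                     0x20 |                        /* unused bit reads as 1 */
                     (break_flag     ? 0x10 : 0) |
                     (decimal_flag   ? 0x08 : 0) |
                     (interrupt_flag ? 0x04 : 0) |
                     (zero_flag      ? 0x02 : 0) |
                     (carry_flag     ? 0x01 : 0));
}

/* The inverse: split a status byte back into the separate flag variables. */
void unpack_status(uint8_t p)
{
    sign_flag      = (p & 0x80) != 0;
    overflow_flag  = (p & 0x40) != 0;
    break_flag     = (p & 0x10) != 0;
    decimal_flag   = (p & 0x08) != 0;
    interrupt_flag = (p & 0x04) != 0;
    zero_flag      = (p & 0x02) != 0;
    carry_flag     = (p & 0x01) != 0;
}
```

The unpacking direction is needed whenever an instruction loads the whole 
status register at once, for example when it is pulled back off the stack.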

3.1 CPU Reset
The next routine we need is one to simulate a reset of the CPU. When a system 
starts up it usually holds the processor in reset for a short period of time; 
this is called a Power On Reset (POR). The POR will force the internal registers 
in the processor to a known state. The data sheet for a processor will usually 
specify what all the registers are set to during a reset. Accurately simulating 
a reset is usually not important since a good programmer should set all the 
registers to a known state at the start of his program, but there are times that 
this is not done and the programmer relies on the reset state to be something 
specific. I ran into this situation on a few occasions while working on an Atari 
2600 console emulator. The reset routine for the 6502 could look like this:

1  void reset_cpu(void)
2  {

3  status_register = 0x20;       
4  zero_flag = sign_flag = overflow_flag = break_flag = 0;
5  decimal_flag = interrupt_flag = carry_flag = 0;
6  stack_pointer = 0xFF;
7  program_counter = (memory[0xFFFD] << 8) | memory[0xFFFC];
8  clk=0;
9  accumulator=x_reg=y_reg=0;
10 }

In line 3 we set the initial state of the status register. Bit 5 of the status 
register is unused in the 6502 and always reads as a 1. In lines 4 and 5 we set 
all the individual flag registers to 0. Line 6 sets the initial value for the 
stack pointer. Line 7 sets the initial value of the program_counter. The array 
memory[] represents the memory space of our processor. The starting address for 
a 6502 program is stored at location $FFFC and $FFFD in memory. The 6502 stores 
addresses in low byte/hi byte format, so $FFFD contains the upper 8 bits of the 
address and $FFFC the lower 8 bits. This line assembles the 2 bytes into a 
16-bit address. Don't worry about line 8 for now; we will talk about that more 
later. Finally line 9 sets the initial value of the 3 CPU working registers: X, 
Y and the accumulator.

3.2 Execution
The next thing we need in the CPU core is the actual command execution routine. 
In this routine we will read the opcodes from memory and call the appropriate 
routine to simulate the function of that instruction. In C the execution routine 
could be implemented with a switch/case statement like this:

1 switch (memory[program_counter++])  {
2 case 0:
3        /* Execute opcode 0 here */
4        break;
5 case 1:
6        /* Execute opcode 1 here */
7        break;
        .
        .
        .
        etc..
}

The address in program_counter tells us where the next opcode to be executed is 
so we use that to read the opcode from the memory array in line 1. The "++" 
after program_counter means to increment the value in program_counter after we 
have used it. So if program counter contains $1000 before this line, the line 
would read the opcode at location $1000 then increment program counter by 1 so 
it would contain $1001 when this line is done. Line 2 begins the code for opcode 
"0". Line 5 begins the code for opcode "1" and this would continue for each 
opcode.

Let's now look at a sample opcode routine. Let's take the 6502 instruction LDA 
#$55. This instruction loads the hex value 55 into the accumulator. This 
instruction is stored in memory as: $A9,$55. The $A9 is the opcode for LDA and 
the second byte, $55, is the value to be loaded into the accumulator. The code 
for this would look like:

1 case 0xA9:  /* LDA immediate */
2        accumulator = memory[program_counter];
3        program_counter++;   /* same as program_counter = program_counter + 1 */
4        sign_flag = accumulator & 0x80;
5        zero_flag = !(accumulator);
6        break;


Line 1 starts our opcode 0xA9 routine. The comment at the end of the line makes 
it clear which instruction this routine emulates. Line 2 is the actual meat of 
the instruction. program_counter at this point is pointing to the second byte of 
the instruction which, as I said above, contains the data to be loaded into the 
accumulator. So this line just copies that data from memory to the variable 
accumulator. Line 3 advances the program counter so it will now be pointing 
to the next instruction in memory. Line 4 evaluates the 6502's sign flag. The 
sign flag is always the same as bit 7 of the result of an instruction. So we 
just use a bitwise AND to get bit 7 of the accumulator; the variable ends up 
holding 0 or $80, which is fine because it is only ever tested for zero or 
non-zero. Line 5 evaluates the 6502's zero flag. The zero flag will be 1 if the 
result of an operation is 0, otherwise the zero flag will be 0. This line uses a 
logical NOT to accomplish this.

This routine demonstrates why emulators can sometimes be very slow. This simple 
6502 instruction required 4 lines of C code to execute and when this is 
converted to assembly language by the compiler it will probably require quite a 
few assembly instructions to simulate 1 6502 instruction.

Let's look at another instruction, the JMP $F000 instruction. This 6502 
instruction tells the CPU to jump to address $F000 and continue executing the 
program there. In memory this instruction would look like: $4C,$00,$F0. The $4C 
is the opcode, the $00,$F0 is the address to jump to in low byte/high byte 
format. The code for this instruction would look like:

case 0x4c: /* JMP absolute */
      program_counter = (memory[program_counter+1] << 8) |
                        memory[program_counter];
      break;

This instruction is pretty simple. We first read the high byte of the new 
address from memory, shift it up 8 bits, then use a bitwise OR to combine it 
with the lower 8 bits. This assembles the two 8-bit parts of the address into a 
16-bit address. Notice we don't need to increment the program counter at all 
here since we are explicitly changing it to a new value.

Another example, LDA $1000. This instruction tells the processor to load the 
byte that is at memory location $1000 into the accumulator. In memory it looks 
like: $AD,$00,$10. Here is the code:

1 case 0xAD: /* LDA absolute */
2    addr = (memory[program_counter+1] << 8) | memory[program_counter];
3    accumulator = memory_read(addr);
4    program_counter += 2;  /* same as program_counter = program_counter + 2 */
5    sign_flag = accumulator & 0x80;
6    zero_flag = !(accumulator);
7    break;

This instruction is a little more complicated. In line 2 we get the address that 
the data is going to be read from. This works the same way as in the JMP 
instruction, but this time we store it in a temporary variable addr. Line 3 
reads the data byte from memory that is at the address stored in addr, in our 
example this would be address $1000. Notice that we do not read the byte 
directly from our memory array, but instead we call a routine called 
memory_read(). The reason for this is that we don't know if the byte we are 
reading is coming from normal RAM/ROM or from an IO port; maybe $1000 is the IO 
port that reads the joystick. If it does happen to be an 
IO port we will need to execute some extra code so that we can go out and read 
the status of the real joystick on the host system. So instead of reading 
directly from memory we call memory_read() which will deal with situations like 
this. We will talk more about memory_read() in the section on memory. You may 
wonder why we don't call this routine to read opcodes. The reason for this is 
that opcodes will always come from RAM or ROM, never from an IO address so we 
can safely read these from the memory[] array.

This shows the basics of how the CPU opcode emulation is written. The actual 
details will vary from processor to processor but this shows some of the things 
you will encounter.
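
Putting the three sample opcodes together, a stripped-down core that executes 
one instruction per call could be sketched as follows. The step() and 
operand16() names are hypothetical, memory_read() is reduced to a plain array 
read as a placeholder, and only the three opcodes discussed above are handled.

```c
#include <stdint.h>

uint8_t  memory[0x10000];     /* emulated 64K address space */
uint16_t program_counter;
uint8_t  accumulator;
int      sign_flag, zero_flag;

/* Placeholder: a full emulator would check for IO addresses here. */
uint8_t memory_read(uint16_t addr)
{
    return memory[addr];
}

/* Read the 16-bit operand at the program counter, low byte first. */
uint16_t operand16(void)
{
    uint8_t lo = memory[program_counter];
    uint8_t hi = memory[(uint16_t)(program_counter + 1)];
    return (uint16_t)((hi << 8) | lo);
}

/* Execute one instruction. Returns 1 on success, 0 on an unknown opcode. */
int step(void)
{
    switch (memory[program_counter++]) {
    case 0xA9:                                  /* LDA immediate */
        accumulator = memory[program_counter++];
        break;
    case 0xAD:                                  /* LDA absolute */
        accumulator = memory_read(operand16());
        program_counter += 2;
        break;
    case 0x4C:                                  /* JMP absolute */
        program_counter = operand16();
        return 1;                               /* JMP leaves the flags alone */
    default:
        return 0;
    }
    sign_flag = accumulator & 0x80;             /* both LDA forms set N and Z */
    zero_flag = !accumulator;
    return 1;
}
```

Loading the bytes $A9,$00,$AD,$00,$10,$4C,$00,$F0 at $F000, setting 
program_counter to $F000 and calling step() three times runs the two loads and 
then jumps back to $F000.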

3.3 Timing
The next thing we need in our CPU core is a way of tracking the passage of time 
in our emulated system. In the real hardware the CPU is controlled by a clock of 
a specific frequency. Each instruction that the CPU can execute will take 1 or 
more of these clock cycles to execute. In our CPU core we are going to do things 
in reverse, instead of the clock driving the CPU core we are going to have the 
CPU core drive the clock. For example the LDA immediate instruction we talked 
about above takes 2 CPU clock cycles to execute. So let's say our CPU input 
clock is 2 MHz: 1 / 2 MHz = 0.0000005 seconds (0.5us) per CPU cycle, so our LDA 
instruction will take 1us to execute. Thus we can say that 1us of emulated time 
has passed during the execution of that instruction.

This timing will be used for various things in our emulator, for example it can 
be used for video timing. Most video displays update every 1/60sec, so we may 
want to run our CPU for 1/60sec, update the display, run the next 1/60sec, 
update the display again, etc.
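
The arithmetic above can be captured in two small helpers. The function names 
are hypothetical, and the frame rate of 60 per second is just the figure used 
in this section.

```c
/* Emulated time, in microseconds, taken by a number of CPU cycles. */
double cycles_to_us(long cycles, long clock_hz)
{
    return (double)cycles * 1000000.0 / (double)clock_hz;
}

/* Whole CPU cycles that fit in one video frame (any fraction is dropped). */
long cycles_per_frame(long clock_hz, int frames_per_second)
{
    return clock_hz / frames_per_second;
}
```

For the 2 MHz example, cycles_to_us(2, 2000000) gives 1.0 and 
cycles_per_frame(2000000, 60) gives 33333, which is the number of cycles we 
would ask the CPU core to run between display updates.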

Most CPU cores are implemented to execute for a specific number of clock cycles 
so we could set our CPU_execute routine up like this:

1 int CPU_execute(int cycles) {
2     int cycle_count;

3     cycle_count = cycles;
4     do {

5         /* OPCODE execution here */

6     } while(cycle_count > 0);

7     return cycles - cycle_count;
  }

In line 1 we define our routine CPU_execute() which is passed the number of 
machine cycles we want the core to execute, which is stored in the variable 
cycles. In line 3 we copy the number of cycles we want to execute into the 
variable cycle_count; you will see why in line 7. In line 4 we start a loop. 
Line 5 is where our switch/case statement that executes the CPU opcodes would 
be. It's not shown here but in each of these opcode routines we need to 
decrement cycle_count by the number of cycles that instruction would take. So 
in our routine for "LDA immediate" we would put:

cycle_count -= 2; 

In line 6 we check whether cycle_count has dropped to 0 or below, which would 
indicate that we have executed all the requested machine cycles. Finally in 
line 7 we exit from the routine and return the actual number of machine cycles 
that were executed. This becomes important when we are writing an emulator that 
requires very accurate timing. The reason for this is that the CPU core could 
very easily run for more machine cycles than we requested it to. Let's take a 
very simple example: let's say we ask the CPU core to execute 6 cycles. The 
first instruction it executes takes 5 cycles, so we now have 1 cycle left. If 
the next instruction takes 4 cycles to execute then that means the CPU core will 
run for 3 more cycles than we requested. By returning the actual number of 
cycles executed the main emulator routine can compensate for this.
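
One possible way for the main routine to compensate is to subtract the previous 
overshoot from the next request. In the sketch below CPU_execute() is replaced 
by a stand-in in which every instruction costs 4 cycles, purely so the example 
can run by itself; run_frames() is a hypothetical name.

```c
/* Stand-in for the real core: every "instruction" costs 4 cycles. */
int CPU_execute(int cycles)
{
    int cycle_count = cycles;

    do {
        cycle_count -= 4;            /* pretend we executed one opcode */
    } while (cycle_count > 0);

    return cycles - cycle_count;     /* cycles actually executed */
}

/* Run a number of frames, carrying any overshoot into the next frame.
   Returns the total number of cycles actually executed. */
long run_frames(int frames, int cycles_per_frame)
{
    long total = 0;
    int  overshoot = 0;
    int  i;

    for (i = 0; i < frames; i++) {
        int wanted = cycles_per_frame - overshoot;
        int ran    = CPU_execute(wanted);

        overshoot = ran - wanted;    /* extra cycles run this time */
        total    += ran;
        /* the display update would go here */
    }
    return total;
}
```

With a request of 6 cycles per frame, the first frame runs 8 cycles, so the 
second frame is only asked for 4. The accumulated error therefore never grows 
beyond the length of one instruction.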

3.4 Interrupts

As mentioned earlier interrupts are something that "interrupts" the normal flow 
of a program running on a microprocessor. Dealing with interrupts in an emulator 
can sometimes be very tricky. In a real system interrupts will occur independent 
of the processor; in an emulator this is not really possible to do. In an 
emulator we have to be actively looking for the event that causes an interrupt 
and when it occurs we then call a routine which causes the processor to handle 
an interrupt call. Before we get to the actual interrupt routine let's define a 
couple of C macros to make our life easier.

#define PUSH(b)    memory[stack_pointer+0x100]=(b); stack_pointer--
#define PULL()     memory[(++stack_pointer)+0x100]
#define GET_SR()   ((sign_flag ? 0x80 : 0) |\
                    (zero_flag ? 0x02 : 0) |\
                    (carry_flag ? 0x01 : 0) |\
                    (interrupt_flag ? 0x04 : 0) |\
                    (decimal_flag ? 0x08 : 0) |\
                    (overflow_flag ? 0x40 : 0) |\
                    (break_flag ? 0x10 : 0) | 0x20)

Macros are an easy way of defining code that we will use a lot in our programs. 
Anytime the C compiler encounters a macro in your program it will replace it 
with the code in the macro definition. For example, if the compiler encountered 
this piece of code:

PUSH(accumulator);

It would replace it with:

memory[stack_pointer+0x100]=(accumulator); stack_pointer--;

The first macro we define is called PUSH and it pushes a value onto the stack. 
First it calculates the current address of the top of the stack by adding $100 
to the stack pointer (SP). Remember the stack in the 6502 is from $100-$1FF so 
we have to add the $100 to get the correct address. Once it has this it puts the 
data at that address. Finally it decrements the stack pointer (SP). We decrement 
because the stack starts at $1FF and works down to $100. 

The second macro we define is called PULL and it pulls a value off the stack. If 
you are not familiar with C this line might look a bit confusing, but what it 
does is increment the stack pointer (SP), add $100 to it, then retrieve the 
value at that memory location.

The final macro is something I talked about earlier. For speed and convenience we 
are keeping each of the processor flags in a separate variable. Occasionally we 
will need these assembled back into a single byte and that's what this macro does. 
Once again, if you don't understand C you might not understand the macro but 
trust me on what it does.

Now we can look at the interrupt routine:

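void IRQ(void) {
    if (!interrupt_flag) {                      /* ignore if IRQs are masked */
        PUSH((program_counter & 0xFF00) >> 8);  /* return address, high byte */
        PUSH(program_counter & 0xFF);           /* return address, low byte */
        PUSH(GET_SR());                         /* status register */
        interrupt_flag = 1;                     /* mask further IRQs */
        program_counter = (memory[0xFFFF] << 8) | memory[0xFFFE];
        cycle_count -= 7;   /* 7 cycles; cycle_count must be visible here */
    }
}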


4.0 Memory
The next thing we need to know how to emulate is memory. 

4.1 Allocating Memory
The most straightforward way of handling memory is to allocate a block of memory 
the full size of the memory space for each processor you are emulating. For 
example a 6502 processor has a 65536-byte memory space, so in C we would 
allocate it like this:

unsigned char *memory;
memory = (unsigned char *)malloc(65536);

The first line creates a pointer called memory. We make it a pointer to 
unsigned char so that we can access this memory block 1 byte at a time. The 
second line allocates 64K of RAM and points the pointer 'memory' to that block. 
(malloc() and free() are declared in stdlib.h.)

We can now use this block of memory like the processor's address space. For 
example if we needed to put the value $55 at memory location $1000 we would 
write:

memory[0x1000] = 0x55;

When we are ready to exit from the emulator we need to free up this memory:

free(memory);
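
In practice it is worth checking that the allocation succeeded. The following 
sketch wraps the calls above in two hypothetical helper functions, 
alloc_memory() and free_memory(); clearing the block to zero is an assumption 
made for repeatability, since real RAM does not necessarily power up as zeros.

```c
#include <stdlib.h>
#include <string.h>

#define MEMORY_SIZE 0x10000   /* 65536 bytes: the full 6502 address space */

unsigned char *memory;

/* Allocate the emulated memory. Returns 0 on success, 1 on failure. */
int alloc_memory(void)
{
    memory = (unsigned char *)malloc(MEMORY_SIZE);
    if (memory == NULL)
        return 1;
    memset(memory, 0, MEMORY_SIZE);   /* assumption: start with all zeros */
    return 0;
}

/* Release the emulated memory when the emulator exits. */
void free_memory(void)
{
    free(memory);
    memory = NULL;
}
```

The main emulator routine would call alloc_memory() once at startup and refuse 
to continue if it returns 1.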


4.2 Loading memory
All processor systems must have some sort of permanent memory to at least get 
them started. This usually comes in the form of a ROM or ROMS. Since these have 
to be present at startup we need a way to load them into memory before the 
emulation is started. Here is a simple example of loading a ROM in C:

1  int load_roms(void) {
2      FILE *fp;

3      fp=fopen("game.rom","rb");
4      if (!fp) {
5          printf("Error loading game.rom\n");
6          return 1;
7      }
8      fread(&memory[0xF000],1,0x1000,fp);
9      fclose(fp);
10     return 0;
11 }
	
Line 1 starts our ROM load routine. We declare it as an int so we can return a 
value which indicates whether the load was successful or not. In line 2 we 
create a C file pointer. In line 3 we open the file we want to load, in this 
case "game.rom". In line 4 we check if line 3 actually succeeded in opening the 
ROM file. If the file was missing, or named wrong we want to catch this and 
display an error which is what we do in line 5. Line 6 immediately exits the 
routine if the ROM failed to open. The "1" in line 6 is returned to the calling 
routine and in our case indicates an error loading the file, this allows the 
main emulator routine to take appropriate action if the ROMs can't be loaded. In 
line 8 we actually load the data into the emulator's memory space. In this case 
we are assuming we have a $1000 byte ROM that starts at memory location $F000. 
In line 9 we close the file. In line 10 we return from the routine and return a 
0 to indicate success.

This is a very simple example of loading a ROM into memory. This works best with 
fixed length ROMs like the ones used for BIOS ROMs or in arcade machines. 
Loading console game ROMs can get trickier for a few reasons. First, some 
console ROM dumps have headers attached to the ROM which aren't part of the 
actual data. In these cases the header data will have to be loaded separately, 
then the data from the ROM can be loaded into the emulator's memory space. 
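
As a rough sketch of the header case, the routine below reads the header into 
its own buffer and then reads the ROM data that follows it. The name 
load_headered_rom and the 16 byte HEADER_SIZE are made up for this example; the 
real header length depends on the dump format you are dealing with. It also 
checks the value returned by fread, so a file that is too short is reported as 
an error.

```c
#include <assert.h>
#include <stdio.h>

#define HEADER_SIZE 16  /* hypothetical header length; check your dump format */

/* Reads a headered ROM dump: the first HEADER_SIZE bytes go into header[],
   the next rom_size bytes go into dest. Returns 0 on success, 1 on error. */
int load_headered_rom(const char *path, unsigned char *header,
                      unsigned char *dest, size_t rom_size) {
    FILE *fp;
    int ok;

    assert(header != NULL && dest != NULL);
    fp = fopen(path, "rb");
    if (!fp) {
        printf("Error loading %s\n", path);
        return 1;
    }
    /* fread returns the number of items read, so a short file is detected */
    ok = fread(header, 1, HEADER_SIZE, fp) == HEADER_SIZE &&
         fread(dest, 1, rom_size, fp) == rom_size;
    fclose(fp);
    return ok ? 0 : 1;
}
```

For the ROM from the earlier example, dest would be &memory[0xF000] and 
rom_size would be 0x1000.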

Another problem with console ROMs is that they sometimes have variable lengths. 
With these ROMs it will first be necessary to determine the length of the ROM 
file before you can actually load it. These types of ROMs are also very often 
"bank switched", meaning that the entire ROM does not get loaded into the 
emulator's memory space at the start. Some of it will be loaded into the memory 
space and part will be loaded into some temporary memory buffers. The details of 
bank switching are best left for another time.
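
One common way to find the length of a ROM file in C is to seek to the end of 
the file and ask for the current position. The sketch below does that; the 
name rom_file_size is my own. Seeking to the end of a binary file works on the 
usual desktop systems, but the C standard does not strictly guarantee it, so 
treat this as a starting point.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the length of an open binary file in bytes, or -1 on error.
   The file position is rewound to the start afterwards. */
long rom_file_size(FILE *fp) {
    long size;

    assert(fp != NULL);
    if (fseek(fp, 0L, SEEK_END) != 0) return -1;
    size = ftell(fp);              /* ftell itself returns -1 on failure */
    if (fseek(fp, 0L, SEEK_SET) != 0) return -1;
    return size;
}
```

Once the size is known you can allocate a buffer of that size for the whole ROM 
image and read the file into it.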


4.3 Memory Handlers
As I said in the section on the CPU, we need a couple of routines to handle 
memory accesses by the CPU core. Whenever the CPU core needs to read data from 
memory it will call a read handler, and whenever it needs to write data to 
memory it will call a write handler. 

Before we write the handlers let's talk about memory maps. As I said before, 
each device in a system resides at a certain series of addresses in the 
processor's memory space. A memory map tells you what addresses each device is 
at. Here is a sample memory map:

$0000 - $0FFF  R/W   RAM
$1000 - $1FFF  R/W   Video RAM
$2000          R     Read Joystick
$3000 - $300F  W     Sound chip
$E000 - $FFFF  R     ROM

Each line lists a range of memory locations, what is at those locations, and 
whether the locations are read only (R), write only (W) or read/write (R/W). 
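
If it helps, the same memory map can also be written down as data in the 
emulator. This is only a sketch of one way to do it; the struct map_entry, the 
ACCESS_ flags and the routine find_region are names I made up for this example. 
A linear search like this is handy for debugging and logging, but it would be 
too slow to use for every memory access.

```c
#include <assert.h>
#include <stddef.h>

#define ACCESS_R 1
#define ACCESS_W 2

struct map_entry {
    unsigned int start, end;   /* inclusive address range */
    int access;                /* ACCESS_R, ACCESS_W, or both OR'd together */
    const char *name;
};

static const struct map_entry memory_map[] = {
    { 0x0000, 0x0FFF, ACCESS_R | ACCESS_W, "RAM" },
    { 0x1000, 0x1FFF, ACCESS_R | ACCESS_W, "Video RAM" },
    { 0x2000, 0x2000, ACCESS_R,            "Read Joystick" },
    { 0x3000, 0x300F, ACCESS_W,            "Sound chip" },
    { 0xE000, 0xFFFF, ACCESS_R,            "ROM" },
};

/* Returns the entry covering address, or NULL if the address is unmapped. */
const struct map_entry *find_region(unsigned int address) {
    size_t i;

    assert(address <= 0xFFFF);
    for (i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        if (address >= memory_map[i].start && address <= memory_map[i].end)
            return &memory_map[i];
    }
    return NULL;
}
```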

From the information in the memory map we can write our memory handlers. The 
read handler might look something like this:

1 unsigned char read_memory(unsigned int address) {

2    if (address < 0x1000 || address > 0xDFFF) return memory[address];
3    if (address < 0x2000) return vidram[address - 0x1000];
4    if (address == 0x2000) return read_joystick();
5    return 0xFF;
}

In line 1 we declare our read_memory routine. It will return 1 byte so we 
declare it as an unsigned char. It will be passed the address that the CPU 
core wants to read from and this will be stored in the variable address. 
In line 2 we check if we are reading from RAM (address < 0x1000) or if we are 
reading from ROM (address > 0xDFFF) and return the appropriate value from our 
memory array. 
In line 3 we handle the video memory in a slightly different way. Video memory 
is from $1000 to $1FFF. Line 2 has already handled addresses under $1000 so 
these will never make it to line 3, so we only need to see if the address is 
less than $2000. If it is, then we return a value from an array set aside just 
for video memory, which you may want to do for various reasons. We would have 
allocated the array vidram[] to be $1000 bytes long elsewhere in our emulator. 
Since our vidram[] array is only $1000 bytes long and video memory starts at 
location $1000 in memory, we need to subtract $1000 from address to get the 
correct location in vidram[]. 
In line 4 we handle a read of the joystick IO port. From our memory map we see 
that this is at only one address, so we check for only one address and not a 
range. We then call a routine called read_joystick() which takes care of reading 
the real joystick on the host system. 
In line 5 we return $FF if the address that was being read wasn't in the 
memory map. Different hardware will return different results on an undefined 
memory access. Emulating this usually isn't important, although sometimes it 
is. While you are developing an emulator it might be good to put a statement 
like this at the end of the routine, just before the return 0xFF:

printf("Error undefined read at %x\n",address);

This will let you know that the processor is accessing an undefined address so 
you can try to figure out why. You may also want to open up a log file and 
print this to a file so it's easier to keep track of.
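
Here is a minimal sketch of the log file idea. The routine names open_log, 
log_undefined_read and close_log are my own inventions, and falling back to 
the screen when no log file is open is just one possible choice.

```c
#include <assert.h>
#include <stdio.h>

static FILE *log_fp;   /* NULL until open_log() succeeds */

/* Opens the log file; returns 0 on success, 1 on failure. */
int open_log(const char *path) {
    log_fp = fopen(path, "w");
    return log_fp ? 0 : 1;
}

/* Records one undefined read; falls back to stdout if no log is open. */
void log_undefined_read(unsigned int address) {
    assert(address <= 0xFFFF);
    fprintf(log_fp ? log_fp : stdout, "Error undefined read at %x\n", address);
}

/* Closes the log file if it is open. */
void close_log(void) {
    if (log_fp) {
        fclose(log_fp);
        log_fp = NULL;
    }
}
```

You would call open_log() once at startup, log_undefined_read(address) just 
before the return 0xFF in the read handler, and close_log() at exit.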

The write handler is done in pretty much the same way:

1 void write_memory(unsigned int address,unsigned char data) {

2     if (address < 0x1000) {
3         memory[address] = data;
4         return;
5     }
6     if (address < 0x2000) {
7         vidram[address - 0x1000] = data;
8         return;
9     }
10    if (address > 0x2FFF && address < 0x3010) write_sound(address,data);
11 }

In line 1 we start the routine. It's declared as a void because we are not 
returning a value from it, and we pass it the address to write to and the data 
to be written. Line 2 checks if we are in the RAM range and if so line 3 writes 
that data into the memory array. Line 4 exits from the routine. The advantage to 
this is that we can exit the routine as soon as we have found the address; we 
don't have to go through the rest of the address checks.
In lines 6-9 we handle writes to the video RAM just like writes to the normal 
RAM. In line 10 we handle writes to the sound chip. We check if the address is 
within the range of addresses for the sound chip, then call the routine 
write_sound() to handle the write. Writes to any other address, including the 
ROM, fall through and are ignored.
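
To try the two handlers out on their own, they can be put together with some 
stand-in routines. In the sketch below the bodies of read_joystick() and 
write_sound() are placeholders I made up, and the memory and vidram arrays are 
plain static arrays rather than malloc'ed blocks to keep the example short.

```c
#include <assert.h>
#include <stdio.h>

static unsigned char memory[0x10000];  /* RAM and ROM live here */
static unsigned char vidram[0x1000];   /* separate video RAM array */

/* Stub: a real emulator would poll the host joystick here. */
static unsigned char read_joystick(void) { return 0x00; }

/* Stub: a real emulator would update its sound chip state here. */
static void write_sound(unsigned int address, unsigned char data) {
    printf("sound register %X = %02X\n", address - 0x3000, (unsigned int)data);
}

unsigned char read_memory(unsigned int address) {
    assert(address <= 0xFFFF);
    if (address < 0x1000 || address > 0xDFFF) return memory[address];
    if (address < 0x2000) return vidram[address - 0x1000];
    if (address == 0x2000) return read_joystick();
    return 0xFF;                        /* not in the memory map */
}

void write_memory(unsigned int address, unsigned char data) {
    assert(address <= 0xFFFF);
    if (address < 0x1000) { memory[address] = data; return; }
    if (address < 0x2000) { vidram[address - 0x1000] = data; return; }
    if (address > 0x2FFF && address < 0x3010) write_sound(address, data);
    /* anything else, including ROM, is silently ignored */
}
```

With this in place, a write of $42 to location $0010 followed by a read of 
$0010 gives back $42, while a write to $E000 leaves the ROM area unchanged.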

4.4 Optimizing Memory Handlers
Memory handlers can have a big impact on the speed of your emulator. The 
examples I gave in the last section are very basic handlers and are not very 
efficient. The memory handlers are going to be called a lot by the CPU core, 
especially when emulating 8-bit processors, which have fewer internal registers 
to work with. In a high level language like C, when a call is made to a routine, 
registers of the host machine typically have to be saved and then restored at 
the end of the routine. This takes time, so we want to avoid calling out of the 
CPU core as much as possible. 

We have already taken one step to help this by not calling the memory handler to 
read opcodes. We know that opcodes are always going to come from RAM or ROM so 
we can read them directly from the memory array instead of having to do all the 
decoding.

Another possibility is to eliminate the read and/or write handlers completely, 
but this can only be done in certain situations. For example, let's say that the 
only input that a system has is a register that contains the status of the 
joystick input. To get around using a read handler in this case we could 
periodically read the joystick on the host system and write this information 
into the appropriate location in the memory array. Now whenever the processor 
needs to read the joystick port it can just read it from the memory array 
instead of having to call a routine to read the host joystick port. 
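
A minimal sketch of that idea follows. The routine names poll_input and 
read_host_joystick are hypothetical, the value the stub returns is arbitrary, 
and how often you poll (once per video frame, for example) is up to you.

```c
#include <assert.h>

#define JOYSTICK_PORT 0x2000   /* joystick address from the sample memory map */

static unsigned char memory[0x10000];

/* Stub standing in for whatever host-specific joystick code you have. */
unsigned char read_host_joystick(void) { return 0x10; }

/* Call this periodically, for example once per emulated video frame. */
void poll_input(void) {
    memory[JOYSTICK_PORT] = read_host_joystick();
}

/* With the port mirrored into the array, a read needs no decoding at all. */
unsigned char read_memory(unsigned int address) {
    assert(address <= 0xFFFF);
    return memory[address];
}
```

The price is that the emulated program only sees the joystick state as of the 
last poll, which is usually fine for a joystick.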

The write handlers can be a little trickier to get rid of. If the system you 
are emulating just writes data to output registers that don't need to be acted 
on immediately, then you may be able to get rid of the write handler. For 
example, maybe the system writes to a port in the video controller chip that 
sets the background color of the screen. The CPU core can put this directly into 
the memory array since you won't actually need it until you draw the screen. 
Unfortunately it's not always this easy. Some systems will have "trigger" 
addresses. When written to, these addresses trigger something to happen 
immediately regardless of what data is written to them. Since the data may not 
change with each write, it would not be possible to tell how many times the 
register was written to if the writes went directly to the memory array.
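
The difference is easy to see in a small experiment. In this sketch the 
trigger address $3005 and all of the names are made up, and a simple counter 
stands in for whatever the trigger would really do.

```c
#include <assert.h>

#define TRIGGER_ADDR 0x3005   /* hypothetical trigger register */

unsigned char memory[0x10000];
unsigned int trigger_count;   /* how many times the trigger has fired */

/* Direct store: the value lands in the array, but the event is lost. */
void write_direct(unsigned int address, unsigned char data) {
    assert(address <= 0xFFFF);
    memory[address] = data;
}

/* Handler: every write to the trigger address is acted on immediately. */
void write_handled(unsigned int address, unsigned char data) {
    assert(address <= 0xFFFF);
    if (address == TRIGGER_ADDR) {
        trigger_count++;      /* stand-in for the real side effect */
        return;
    }
    memory[address] = data;
}
```

Two calls to write_direct() with the same data leave the memory array looking 
exactly as it did after the first call, while two calls to write_handled() are 
counted as two separate events.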

Another way that this can be optimized is to do some of the address decoding in 
the CPU core so that calls don't have to be made out of the core every time a 
memory access happens. One technique for doing this in C is to declare a second 
array the same size as the memory (let's call it mem_type[] for example). For 
each location in memory that is IO and needs decoding, put a 1 in the mem_type 
array and leave all the others at 0. In your CPU core put a routine that looks 
like this:


static inline void mem_write(unsigned int address, unsigned char data) {
    if (mem_type[address])
        memory_write_handler(address, data);  /* IO: needs full decoding */
    else
        memory[address] = data;               /* plain memory: store directly */
}

Every time you need to write data in your CPU core, call this routine. By 
declaring it inline we ask the compiler to substitute the whole block of code 
wherever it comes across a call to mem_write (the compiler is free to ignore 
the request). The routine checks whether mem_type for that address is a 1. If 
it is, it jumps out to a traditional memory handler; if it's a 0, it puts the 
data directly into the memory array. Being inline prevents the CPU core from 
having to constantly jump out to another routine when it does a memory access. 
The downside to using inline is that it can quickly inflate the size of your 
code if you are not careful.
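
The mem_type array has to be filled in before the emulation starts. Below is 
one possible way to set it up for the sample memory map from section 4.3. The 
routine init_mem_type and the counter handler_calls are my own additions, and 
the handler here is only a stand-in. Note that I chose to flag everything 
outside of plain RAM: the IO ports, the video RAM (because it lives in its own 
array) and the ROM (so that a stray write can't overwrite it).

```c
#include <assert.h>

unsigned char memory[0x10000];
unsigned char mem_type[0x10000];  /* 1 = needs the handler, 0 = plain store */
unsigned int handler_calls;       /* illustration only: counts slow writes */

/* Stand-in for the full write handler from section 4.3. */
void memory_write_handler(unsigned int address, unsigned char data) {
    (void)address;
    (void)data;
    handler_calls++;
}

/* Mark everything outside plain RAM ($0000-$0FFF) as needing the handler. */
void init_mem_type(void) {
    unsigned int a;

    for (a = 0; a <= 0xFFFF; a++)
        mem_type[a] = (a < 0x1000) ? 0 : 1;
}

static inline void mem_write(unsigned int address, unsigned char data) {
    assert(address <= 0xFFFF);
    if (mem_type[address])
        memory_write_handler(address, data);
    else
        memory[address] = data;
}
```

After init_mem_type() has run, a write to $0005 goes straight into the memory 
array, while a write to $3001 ends up in the handler.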

Still another option for optimizing is to write the CPU core in assembly 
language. Since you have finer control of the code in assembly you can integrate 
the memory handlers a little more closely into the CPU core thus making things 
more efficient. 

These are just a few ideas on optimizing the memory handlers and there are still 
other approaches to doing this. You will have to determine what works best for 
the specific system you are emulating.


5.0 Conclusion
Well, this concludes the first part of my emulator how-to. I have touched on 
some of the basic concepts for writing the core of an emulator, but there is 
still a lot to be covered. Look for future installments that explain some more 
emulation concepts.