ARM7TDMI emulation core rewrite overview 

Joined: 2014-09-25 13:52
Posts: 8294
 ARM7TDMI emulation core rewrite overview
Now that I've finished emulating new systems in higan, there won't be an opportunity to talk about new processor cores again. And now that I've finished the last major rewrite of a CPU core, this will be my last chance to write one of these up. So, let's go for it.

My previous ARM core dates from approximately March of 2012, possibly earlier. It started out as a requirement to run the last unemulated SNES coprocessor, the ST018, which is in fact most likely an ARM60 (yet incredibly, even with a decapped die scan, we have been unable to determine this for sure.)

Unsurprisingly, my coding style has evolved quite a bit over the past 5+ years, and the code no longer reflects the design aesthetics of the rest of higan's code. It would've been too much work to try and patch it in place, so I opted for a full rewrite -- although most of the core logic carried over, so perhaps it's somewhere in the middle.

Every design decision for a CPU core depends upon the architecture's individual characteristics: RISC vs CISC easily has the biggest impact, but so do Z80 instruction prefixes, x86 ModRM, 68K EffectiveAddress encodings, etc. So don't take this as "how to design any CPU core", or hell, even "how to design an ARM CPU interpreter." This is just my own preference.




Registers

One of the most important things that sets precedent for a CPU core is the registers. They're always the first thing to consider when designing a core.

The old core had a rather complex class design with overloaded operators.

Code:
struct GPR {
  inline operator uint32_t() const { return data; }
  inline auto operator=(uint32_t n) { data = n; if(modify) modify(); return *this; }
  inline auto operator=(const GPR& source) { return operator=(source.data); }

  inline auto operator &=(uint32_t n) { return operator=(data  & n); }
  inline auto operator |=(uint32_t n) { return operator=(data  | n); }
  inline auto operator ^=(uint32_t n) { return operator=(data  ^ n); }
  inline auto operator +=(uint32_t n) { return operator=(data  + n); }
  inline auto operator -=(uint32_t n) { return operator=(data  - n); }
  inline auto operator *=(uint32_t n) { return operator=(data  * n); }
  inline auto operator /=(uint32_t n) { return operator=(data  / n); }
  inline auto operator %=(uint32_t n) { return operator=(data  % n); }
  inline auto operator<<=(uint32_t n) { return operator=(data << n); }
  inline auto operator>>=(uint32_t n) { return operator=(data >> n); }

  uint32_t data = 0;
  function<void ()> modify;
};


(side tangent: Cydrak spotted a long-standing memory allocation issue that went unnoticed since v095 with the above code. Can you spot it?)

I'm never a huge fan of having to implement operator overloads for every possible mathematical operation.

The new design avoids all of these overloads, in favor of more explicit
value = value OPERATOR argument
coding.

Code:
  struct GPR {
    inline operator uint32_t() const {
      return data;
    }

    inline auto operator=(uint32 value) -> GPR& {
      data = value;
      if(modify) modify();
      return *this;
    }

    inline auto operator=(const GPR& value) -> GPR& {
      data = value.data;
      if(modify) modify();
      return *this;
    }

    uint32 data;
    function<void ()> modify;
  };
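
(To make the style difference concrete -- this is illustrative only, not a quote from either core:)

Code:
  r(d) += r(m);        //old style: arithmetic operator overloads on GPR
  r(d) = r(d) + r(m);  //new style: read, compute, then assign through operator=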


However, there is still a GPR class. The reason I still need it is that modifying PC forces an extra pipeline reload step. See, the ARM7 uses a three-stage pipeline: fetch, decode, execute. And most ARM instructions (and some THUMB instructions) allow you to directly manipulate the PC register.

GPR is meant as a way to avoid having to add if(register==PC) checks to every single ARM instruction that modifies any register -- where register is a value from 0-15, and PC is 15.

Because of the function binding, we do have to overload operator=(const GPR&) to prevent copying. I may be able to just =delete the function, but it's only three lines of code.
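
As a concrete example of why the callback exists: something along these lines (a sketch under my assumptions, not a quote of the actual setup code) is enough to make every write to r15 schedule a pipeline reload, no matter which instruction performed the write.

Code:
  //hypothetical wiring; the real member names may differ
  processor.r15.modify = [&] { pipeline.reload = true; };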




Flags

Flags are about as important as registers when it comes to core design.

The old implementation used a union/template design to access processor flags individually, where writing to them would change the entire flag value in-place.

Code:
struct PSR {
  union {
    uint32_t data = 0;
    BooleanBitField<uint32_t, 31>    n;  //negative
    BooleanBitField<uint32_t, 30>    z;  //zero
    BooleanBitField<uint32_t, 29>    c;  //carry
    BooleanBitField<uint32_t, 28>    v;  //overflow
    BooleanBitField<uint32_t,  7>    i;  //irq
    BooleanBitField<uint32_t,  6>    f;  //fiq
    BooleanBitField<uint32_t,  5>    t;  //thumb
    NaturalBitField<uint32_t,  4, 0> m;  //mode
  };

  PSR() = default;
  PSR(const PSR& value) { data = value.data; }

  inline operator uint() const { return data & 0xf00000ff; }
  inline auto& operator=(uint value) { return data = value, *this; }
  inline auto& operator=(const PSR& value) { return data = value.data, *this; }
};


Although I find the *BitField class to be a very interesting idea, in practice I really dislike the fact that BitField<31> is a different type than BitField<30>, which makes things like passing these by reference to functions an incredible pain.

So I moved back to my earlier design aesthetic of individual boolean values, and simply constructing the value on the rare instances the entire value needs to be read or written.

Code:
  struct PSR {
    inline operator uint32_t() const {
      return m << 0 | t << 5 | f << 6 | i << 7 | v << 28 | c << 29 | z << 30 | n << 31;
    }

    inline auto operator=(uint32 data) -> PSR& {
      m = data.bits(0,4);
      t = data.bit(5);
      f = data.bit(6);
      i = data.bit(7);
      v = data.bit(28);
      c = data.bit(29);
      z = data.bit(30);
      n = data.bit(31);
      return *this;
    }

    uint5 m;    //mode
    boolean t;  //thumb
    boolean f;  //fiq
    boolean i;  //irq
    boolean v;  //overflow
    boolean c;  //carry
    boolean z;  //zero
    boolean n;  //negative
  };


Now I know ... operator uint32_t() is going to be slower than before. But seriously, MSR/MRS instructions (that access the full PSR) are quite uncommon. And in return, getting and setting individual flags, which is *way* more common, is easier. The BitFields in the previous version have to do &~bit | (value ? bit : 0) on each assignment, and &bit on read-out.
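
To make the payoff concrete, here is roughly what setting NZCV after an addition looks like with individual boolean members (a minimal sketch, not the core's actual ADD helper; given uint32_t source, operand, and carryIn, using the standard carry/overflow formulas):

Code:
  uint64_t wide = (uint64_t)source + operand + carryIn;
  uint32_t result = (uint32_t)wide;
  cpsr().n = result >> 31;                                     //negative
  cpsr().z = result == 0;                                      //zero
  cpsr().c = wide >> 32;                                       //carry out of bit 31
  cpsr().v = (~(source ^ operand) & (source ^ result)) >> 31;  //signed overflow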

Once again, the goal is to get away from trying to be too "clever", which I feel just adds needless complexity.




Instruction decoding

For RISC architectures, decoding instructions is brutal. The ARM7 isn't nearly as bad as the 68K with overlapping instructions, but it definitely still has them, e.g. MRS/MSR taking the place of the comparison instructions (TST, TEQ, CMP, CMN) when the S flag is clear (those encodings would otherwise be no-ops, so it makes sense at least.)

For the old core, I came up with an (instruction&mask)==value system. But identifying MULL/MLAL via
if((instruction & 0x0f8000f0) == 0x00800090)
is as clear as mud, and so I abused some template magic to turn strings into compile-time constants to more clearly define the bit-fields in question.

Code:
  #define decode(pattern, execute) if( \
    (instruction() & std::integral_constant<uint32_t, bit::mask(pattern)>::value) \
    == std::integral_constant<uint32_t, bit::test(pattern)>::value \
  ) return arm_op_ ## execute()

  decode("???? 0001 0010 ++++ ++++ ++++ 0001 ????", branch_exchange_register);
  decode("???? 0000 00?? ???? ???? ???? 1001 ????", multiply);
  decode("???? 0000 1??? ???? ???? ???? 1001 ????", multiply_long);
  decode("???? 0001 0?00 ++++ ???? ---- 0000 ----", move_to_register_from_status);
  decode("???? 0001 0?00 ???? ???? ---- 1001 ????", memory_swap);
  decode("???? 0001 0?10 ???? ++++ ---- 0000 ????", move_to_status_from_register);
  decode("???? 0011 0?10 ???? ++++ ???? ???? ????", move_to_status_from_immediate);
  decode("???? 000? ?0?1 ???? ???? ---- 11?1 ????", load_register);
  decode("???? 000? ?1?1 ???? ???? ???? 11?1 ????", load_immediate);
  decode("???? 000? ?0?? ???? ???? ---- 1011 ????", move_half_register);
  decode("???? 000? ?1?? ???? ???? ???? 1011 ????", move_half_immediate);
  decode("???? 000? ???? ???? ???? ???? ???0 ????", data_immediate_shift);
  decode("???? 000? ???? ???? ???? ???? 0??1 ????", data_register_shift);
  decode("???? 001? ???? ???? ???? ???? ???? ????", data_immediate);
  decode("???? 010? ???? ???? ???? ???? ???? ????", move_immediate_offset);
  decode("???? 011? ???? ???? ???? ???? ???0 ????", move_register_offset);
  decode("???? 100? ???? ???? ???? ???? ???? ????", move_multiple);
  decode("???? 101? ???? ???? ???? ???? ???? ????", branch);
  decode("???? 1111 ???? ???? ???? ???? ???? ????", software_interrupt);


Code:
  decode("0001 10?? ???? ????", adjust_register);
  decode("0001 11?? ???? ????", adjust_immediate);
  decode("000? ???? ???? ????", shift_immediate);
  decode("001? ???? ???? ????", immediate);
  decode("0100 00?? ???? ????", alu);
  decode("0100 0111 0??? ?---", branch_exchange);
  decode("0100 01?? ???? ????", alu_hi);
  decode("0100 1??? ???? ????", load_literal);
  decode("0101 ???? ???? ????", move_register_offset);
  decode("0110 ???? ???? ????", move_word_immediate);
  decode("0111 ???? ???? ????", move_byte_immediate);
  decode("1000 ???? ???? ????", move_half_immediate);
  decode("1001 ???? ???? ????", move_stack);
  decode("1010 ???? ???? ????", add_register_hi);
  decode("1011 0000 ???? ????", adjust_stack);
  decode("1011 ?10? ???? ????", stack_multiple);
  decode("1100 ???? ???? ????", move_multiple);
  decode("1101 1111 ???? ????", software_interrupt);
  decode("1101 ???? ???? ????", branch_conditional);
  decode("1110 0??? ???? ????", branch_short);
  decode("1111 0??? ???? ????", branch_long_prefix);
  decode("1111 1??? ???? ????", branch_long_suffix);


Spiffy. But it's also a chain of ~20 condition expressions in a row.
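
As an aside, each pattern string boils down to two compile-time constants: a mask of the significant bits and the value those bits must take. Here is a rough sketch of the idea (nall's real bit::mask/bit::test may handle the '+' and '-' markers differently; this version treats everything except '0' and '1' as a don't-care):

Code:
  #include <cstdint>

  constexpr auto patternMask(const char* s, uint32_t sum = 0) -> uint32_t {
    return *s == 0   ? sum
         : *s == ' ' ? patternMask(s + 1, sum)
         : *s == '0' || *s == '1' ? patternMask(s + 1, sum << 1 | 1)
         : patternMask(s + 1, sum << 1);
  }

  constexpr auto patternTest(const char* s, uint32_t sum = 0) -> uint32_t {
    return *s == 0   ? sum
         : *s == ' ' ? patternTest(s + 1, sum)
         : *s == '1' ? patternTest(s + 1, sum << 1 | 1)
         : patternTest(s + 1, sum << 1);
  }

  //the MULL/MLAL pattern reduces to exactly the mask/value pair shown earlier:
  static_assert(patternMask("???? 0000 1??? ???? ???? ???? 1001 ????") == 0x0f8000f0, "");
  static_assert(patternTest("???? 0000 1??? ???? ???? ???? 1001 ????") == 0x00800090, "");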

For the new core, I reused a design I came up with for the 68K core: use lambdas!

For the ARM portion, it's clear that we only really need to care about 12 bits (bits 20-27 and 4-7) to select which instruction to use -- with the exception of MRS/MSR, of course. So I created a table of 4096 lambdas, and I generate the table at program startup.

Code:
  #define bind(id, name, ...) { \
    uint index = (id & 0x0ff00000) >> 16 | (id & 0x000000f0) >> 4; \
    assert(!armInstruction[index]); \
    armInstruction[index] = [&](uint32 opcode) { return armInstruction##name(arguments); }; \
    armDisassemble[index] = [&](uint32 opcode) { return armDisassemble##name(arguments); }; \
  }

  #define pattern(s) \
    std::integral_constant<uint32_t, bit::test(s)>::value

  #define arguments \
    opcode.bits( 0, 3),  /* m */ \
    opcode.bits( 5, 6),  /* type */ \
    opcode.bits( 7,11),  /* shift */ \
    opcode.bits(12,15),  /* d */ \
    opcode.bits(16,19),  /* n */ \
    opcode.bit (20),     /* save */ \
    opcode.bits(21,24)   /* mode */
  for(uint2 type : range(4))
  for(uint1 shiftLo : range(2))
  for(uint1 save : range(2))
  for(uint4 mode : range(16)) {
    if(mode >= 8 && mode <= 11 && !save) continue;  //TST, TEQ, CMP, CMN
    auto opcode = pattern(".... 000? ???? ???? ???? ???? ???0 ????") | type << 5 | shiftLo << 7 | save << 20 | mode << 21;
    bind(opcode, DataImmediateShift);
  }
  #undef arguments


And here, I no longer have to declare the table in a certain order. Instead, I can detect the compare-sans-S conditions and skip them.

THUMB is a bit more evil. Technically, only nine bits are needed to distinguish each instruction, but I have further plans for THUMB in the future. And so I made a full 65536-entry lambda table as with the 68K CPU core. My rationale was, "the ARM7 is usually running in THUMB mode, so we might as well optimize THUMB more."

Code:
  #define bind(id, name, ...) { \
    assert(!thumbInstruction[id]); \
    thumbInstruction[id] = [=] { return thumbInstruction##name(__VA_ARGS__); }; \
    thumbDisassemble[id] = [=] { return thumbDisassemble##name(__VA_ARGS__); }; \
  }

  #define pattern(s) \
    std::integral_constant<uint16_t, bit::test(s)>::value

  for(uint3 d : range(8))
  for(uint3 n : range(8))
  for(uint3 m : range(8))
  for(uint1 mode : range(2)) {
    auto opcode = pattern("0001 10?? ???? ????") | d << 0 | n << 3 | m << 6 | mode << 9;
    bind(opcode, AdjustRegister, d, n, m, mode);
  }





Instruction arguments

This is part of decoding, but I felt it deserved a separate section.

In the original core, the bit-test matching does not decode arguments, so this logic has to go into every instruction:

Code:
auto ARM::arm_op_multiply() {
  uint1 accumulate = instruction() >> 21;
  uint1 save = instruction() >> 20;
  uint4 d = instruction() >> 16;
  uint4 n = instruction() >> 12;
  uint4 s = instruction() >> 8;
  uint4 m = instruction();

  if(accumulate) idle();

  r(d) = mul(accumulate ? r(n) : 0u, r(m), r(s));
}


Code:
auto ARM::thumb_op_shift_immediate() {
  uint2 opcode = instruction() >> 11;
  uint5 immediate = instruction() >> 6;
  uint3 m = instruction() >> 3;
  uint3 d = instruction() >> 0;

  switch(opcode) {
  case 0: r(d) = bit(lsl(r(m), immediate)); break;
  case 1: r(d) = bit(lsr(r(m), immediate == 0 ? 32u : (uint)immediate)); break;
  case 2: r(d) = bit(asr(r(m), immediate == 0 ? 32u : (uint)immediate)); break;
  }
}


Note I'm being "clever" here again and using my Natural<T> classes. Usually you'd have e.g.
uint d = (instruction >> 16) & 15;
but instead I said
uint4 d = instruction >> 16;
which has the same implicit masking effect, and also conveys the integer size more clearly than not having it. This matters even more when you see the THUMB core using 3-bit (r0-r7) truncated register ranges.
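
In case the implicit masking isn't obvious, a tiny stand-in for Natural<T> behaves like this (illustrative only; nall's real class does considerably more):

Code:
  #include <cstdint>

  template<int Bits> struct Natural {
    uint64_t data;
    Natural(uint64_t value = 0) : data(value & ((1ull << Bits) - 1)) {}
    operator uint64_t() const { return data; }
  };
  using uint4 = Natural<4>;

  uint32_t instruction = 0x00812392;
  uint4 d = instruction >> 16;  //stores (0x0081 & 15) == 1, same as the manual mask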

Sometimes, this decoding got extra annoying, when fields were split up in the instruction encoding.

Code:
  uint1 pre = instruction() >> 24;
  uint1 up = instruction() >> 23;
  uint1 writeback = instruction() >> 21;
  uint1 l = instruction() >> 20;
  uint4 n = instruction() >> 16;
  uint4 d = instruction() >> 12;
  uint4 ih = instruction() >> 8;
  uint4 il = instruction();

  uint32 rn = r(n);
  uint32 rd = r(d);
  uint8 immediate = (ih << 4) + (il << 0);


Further, this busy work had to be repeated again inside the disassembler -- which is an absolute requirement if you ever want to have any hope of fixing your CPU bugs in games.

Code:
  if((instruction & 0x0f8000f0) == 0x00800090) {
    uint4 condition = instruction >> 28;
    uint1 signextend = instruction >> 22;
    uint1 accumulate = instruction >> 21;
    uint1 save = instruction >> 20;
    uint4 rdhi = instruction >> 16;
    uint4 rdlo = instruction >> 12;
    uint4 rs = instruction >> 8;
    uint4 rm = instruction;

    output.append(signextend ? "s" : "u", accumulate ? "mlal" : "mull", conditions[condition], save ? "s " : " ");
    output.append(registers[rdlo], ",", registers[rdhi], ",", registers[rm], ",", registers[rs]);

    return output;
  }


This violates DRY to me.

The nice thing I learned with the 68K core is that we can handle the decoding for both the instruction and the disassembler implementations with only one block of decoding per instruction. And even better, it's done at run-time and the decoded results are cached in the lambdas.

For the ARM core, only some of the instruction state is captured in the lambda: 12 bits, to be precise, plus 4 bits for the condition code that get discarded. So I opted to simply perform the remaining bit masking and combining inside the lambda invocations.

Code:
  #define arguments \
    opcode.bits( 0, 3) << 0 | opcode.bits( 8,11) << 4,  /* immediate */ \
    opcode.bit ( 5),     /* half */ \
    opcode.bits(12,15),  /* d */ \
    opcode.bits(16,19),  /* n */ \
    opcode.bit (21),     /* writeback */ \
    opcode.bit (23),     /* up */ \
    opcode.bit (24)      /* pre */
  for(uint1 half : range(2))
  for(uint1 writeback : range(2))
  for(uint1 up : range(2))
  for(uint1 pre : range(2)) {
    auto opcode = pattern(".... 000? ?1?1 ???? ???? ???? 11?1 ????") | half << 5 | writeback << 21 | up << 23 | pre << 24;
    bind(opcode, LoadImmediate);
  }
  #undef arguments


Yay, the weird split immediate value is merged back into a full 8-bit value, although it is done at run-time.

Aside: it took me several hours to come up with a way to specify the decoding once, and get it inside two separate lambdas: one for the instruction table, one for the disassemble table.

THUMB is nicer, and the decoding is done entirely in advance. This gets us something really special that I'll talk about later: template instructions.

Also really nice: the instructions and disassembler no longer need all that nasty decoding.

Code:
auto ARM7TDMI::armInstructionMultiply
(uint4 m, uint4 s, uint4 n, uint4 d, uint1 save, uint1 accumulate) -> void {
  if(accumulate) idle();
  r(d) = MUL(accumulate ? r(n) : 0, r(m), r(s));
}

auto ARM7TDMI::armInstructionLoadImmediate
(uint8 immediate, uint1 half, uint4 d, uint4 n, uint1 writeback, uint1 up, uint1 pre) -> void {
  ...  //immediate is already in one piece


Code:
auto ARM7TDMI::thumbInstructionShiftImmediate
(uint3 d, uint3 m, uint5 immediate, uint2 mode) -> void {
  switch(mode) {
  case 0: r(d) = BIT(LSL(r(m), immediate)); break;  //LSL
  case 1: r(d) = BIT(LSR(r(m), immediate ? (uint)immediate : 32)); break;  //LSR
  case 2: r(d) = BIT(ASR(r(m), immediate ? (uint)immediate : 32)); break;  //ASR
  }
}


Code:
auto ARM7TDMI::armDisassembleMultiplyLong
(uint4 m, uint4 s, uint4 l, uint4 h, uint1 save, uint1 accumulate, uint1 sign) -> string {
  return {sign ? "s" : "u", accumulate ? "mlal" : "mull", _c, _s, " ",
    _r[l], ",", _r[h], ",", _r[m], ",", _r[s]};
}





Template instructions

See how thumbInstructionShiftImmediate still has three implementations inside of it? One each for LSL, LSR, and ASR. Every time the instruction is executed, that's an extra conditional branch.

Well, as with the 68K core, it's easy to make this a template:

Code:
template<> auto ARM7TDMI::thumbInstructionShiftImmediate<0>
(uint3 d, uint3 m, uint5 immediate) -> void {
  r(d) = BIT(LSL(r(m), immediate));
}

template<> auto ARM7TDMI::thumbInstructionShiftImmediate<1>
(uint3 d, uint3 m, uint5 immediate) -> void {
  r(d) = BIT(LSR(r(m), immediate ? (uint)immediate : 32));
}

template<> auto ARM7TDMI::thumbInstructionShiftImmediate<2>
(uint3 d, uint3 m, uint5 immediate) -> void {
  r(d) = BIT(ASR(r(m), immediate ? (uint)immediate : 32));
}


Even nicer, the disassembler version can still use the switch case for the more compact code, although templating does duplicate the instruction function in the binary.

Oh, and we could also just have the one instruction function with the switch table as it is now. The compiler would eliminate the code that couldn't be accessed anyway. I just like using template overloads.

The way this will work with the table building is like so ... before:

Code:
  for(uint3 d : range(8))
  for(uint3 m : range(8))
  for(uint5 immediate : range(32))
  for(uint2 mode : range(4)) {
    if(mode == 3) continue;  //0001 1xxx encodings belong to AdjustRegister/AdjustImmediate
    auto opcode = pattern("000? ???? ???? ????") | d << 0 | m << 3 | immediate << 6 | mode << 11;
    bind(opcode, ShiftImmediate, d, m, immediate, mode);
  }


After:

Code:
  for(uint3 d : range(8))
  for(uint3 m : range(8))
  for(uint5 immediate : range(32)) {
    auto opcode = pattern("000? ???? ???? ????") | d << 0 | m << 3 | immediate << 6;
    bind(opcode | 0 << 11, ShiftImmediate<0>, d, m, immediate);
    bind(opcode | 1 << 11, ShiftImmediate<1>, d, m, immediate);
    bind(opcode | 2 << 11, ShiftImmediate<2>, d, m, immediate);
  }


I haven't done this for the THUMB code yet. It's a real trade-off between code-size and speed gains. You wouldn't want to templatize everything and have literally 65536 functions. But it certainly helps to get rid of some conditional tests, especially mode and size selectors.




New instruction execution

So how do we invoke instructions now with this table? Simple enough.

Code:
  opcode = pipeline.execute.instruction;
  if(!cpsr().t) {
    if(!TST(opcode.bits(28,31))) return;
    uint12 index = (opcode & 0x0ff00000) >> 16 | (opcode & 0x000000f0) >> 4;
    armInstruction[index](opcode);
  } else {
    thumbInstruction[(uint16)opcode]();
  }


This replaces the decode list from before. ARM masks out the 12-bit index into its table, and THUMB can just use the instruction directly, since all 16 bits are represented.
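
The TST() call above is the condition-code check: every ARM instruction is predicated on the flags. For reference, here is roughly what it has to evaluate (a sketch following the standard ARM condition semantics; the actual helper in the core may be written differently):

Code:
auto ARM7TDMI::TST(uint4 mode) -> bool {
  switch(mode) {
  case  0: return cpsr().z;                           //EQ (equal)
  case  1: return !cpsr().z;                          //NE (not equal)
  case  2: return cpsr().c;                           //CS (carry set)
  case  3: return !cpsr().c;                          //CC (carry clear)
  case  4: return cpsr().n;                           //MI (negative)
  case  5: return !cpsr().n;                          //PL (positive or zero)
  case  6: return cpsr().v;                           //VS (overflow)
  case  7: return !cpsr().v;                          //VC (no overflow)
  case  8: return cpsr().c && !cpsr().z;              //HI (unsigned higher)
  case  9: return !cpsr().c || cpsr().z;              //LS (unsigned lower or same)
  case 10: return cpsr().n == cpsr().v;               //GE (signed greater or equal)
  case 11: return cpsr().n != cpsr().v;               //LT (signed less than)
  case 12: return !cpsr().z && cpsr().n == cpsr().v;  //GT (signed greater than)
  case 13: return cpsr().z || cpsr().n != cpsr().v;   //LE (signed less or equal)
  case 14: return true;                               //AL (always)
  case 15: return false;                              //NV (reserved on ARMv4; treated as never here)
  }
  unreachable;
}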




Register lookup

The ARM7 has this "fun" feature where there are processor modes, and each mode can override certain registers with alternate, banked versions.

My old approach was to hold a pointer table to registers, and rebuild the table on each change to the processor mode.

Code:
  GPR* r[16] = {nullptr};
  PSR* spsr = nullptr;


Code:
auto ARM::Processor::setMode(Mode mode) -> void {
  cpsr.m = 0x10 | (uint)mode;

  if(mode == Mode::FIQ) {
    r[ 8] = &fiq.r8;
    r[ 9] = &fiq.r9;
    r[10] = &fiq.r10;
    r[11] = &fiq.r11;
    r[12] = &fiq.r12;
  } else {
    r[ 8] = &usr.r8;
    r[ 9] = &usr.r9;
    r[10] = &usr.r10;
    r[11] = &usr.r11;
    r[12] = &usr.r12;
  }

  switch(mode) {
  case Mode::FIQ: r[13] = &fiq.sp; r[14] = &fiq.lr; spsr = &fiq.spsr; break;
  case Mode::IRQ: r[13] = &irq.sp; r[14] = &irq.lr; spsr = &irq.spsr; break;
  case Mode::SVC: r[13] = &svc.sp; r[14] = &svc.lr; spsr = &svc.spsr; break;
  case Mode::ABT: r[13] = &abt.sp; r[14] = &abt.lr; spsr = &abt.spsr; break;
  case Mode::UND: r[13] = &und.sp; r[14] = &und.lr; spsr = &und.spsr; break;
  default:        r[13] = &usr.sp; r[14] = &usr.lr; spsr = nullptr;   break;
  }
}


Then to access individual registers:

Code:
alwaysinline auto r(uint n) -> GPR& { return *processor.r[n]; }
alwaysinline auto cpsr() -> PSR& { return processor.cpsr; }
alwaysinline auto spsr() -> PSR& { return *processor.spsr; }


I didn't like having to make sure Processor::setMode was called after every modification to CPSR.m, so I moved to a switch table with the new design.

Code:
auto ARM7TDMI::r(uint4 index) -> GPR& {
  switch(index) {
  case  0: return processor.r0;
  case  1: return processor.r1;
  case  2: return processor.r2;
  case  3: return processor.r3;
  case  4: return processor.r4;
  case  5: return processor.r5;
  case  6: return processor.r6;
  case  7: return processor.r7;
  case  8: return processor.cpsr.m == PSR::FIQ ? processor.fiq.r8  : processor.r8;
  case  9: return processor.cpsr.m == PSR::FIQ ? processor.fiq.r9  : processor.r9;
  case 10: return processor.cpsr.m == PSR::FIQ ? processor.fiq.r10 : processor.r10;
  case 11: return processor.cpsr.m == PSR::FIQ ? processor.fiq.r11 : processor.r11;
  case 12: return processor.cpsr.m == PSR::FIQ ? processor.fiq.r12 : processor.r12;
  case 13: switch(processor.cpsr.m) {
    case PSR::FIQ: return processor.fiq.r13;
    case PSR::IRQ: return processor.irq.r13;
    case PSR::SVC: return processor.svc.r13;
    case PSR::ABT: return processor.abt.r13;
    case PSR::UND: return processor.und.r13;
    default: return processor.r13;
  }
  case 14: switch(processor.cpsr.m) {
    case PSR::FIQ: return processor.fiq.r14;
    case PSR::IRQ: return processor.irq.r14;
    case PSR::SVC: return processor.svc.r14;
    case PSR::ABT: return processor.abt.r14;
    case PSR::UND: return processor.und.r14;
    default: return processor.r14;
  }
  case 15: return processor.r15;
  }
  unreachable;
}

auto ARM7TDMI::cpsr() -> PSR& {
  return processor.cpsr;
}

auto ARM7TDMI::spsr() -> PSR& {
  switch(processor.cpsr.m) {
  case PSR::FIQ: return processor.fiq.spsr;
  case PSR::IRQ: return processor.irq.spsr;
  case PSR::SVC: return processor.svc.spsr;
  case PSR::ABT: return processor.abt.spsr;
  case PSR::UND: return processor.und.spsr;
  }
  throw;
}


In all honesty, this is probably a speed hit compared to the earlier model.

What I'd like to consider in the future is performing register swaps on processor mode changes. But that'll need more planning.
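
The idea, roughly (a concept sketch only -- none of this exists yet, and a real version also has to deal with the SPSR and user-mode register access): keep the live registers in a flat array that r() indexes directly, and copy banked registers in and out whenever the processor mode changes.

Code:
  //hypothetical helper: bankFor(mode, index) would return a reference to the
  //storage that `mode` uses for register `index` (the FIQ bank for r8-r14 in
  //FIQ mode, the per-mode bank for r13-r14 in IRQ/SVC/ABT/UND, and the shared
  //user registers otherwise).
  auto ARM7TDMI::swapBank(uint5 from, uint5 to) -> void {
    for(uint index : range(8, 15)) {
      bankFor(from, index) = r[index];  //save the outgoing mode's view of r8-r14
      r[index] = bankFor(to, index);    //load the incoming mode's view of r8-r14
    }
  }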




Final notes

Lastly, there are the overall style changes.

I've moved from under_score naming to camelCase naming, and this was pretty much the last class that hadn't been converted yet. This is purely a style choice. To be honest, I don't like either. If underscore did not require a shift modifier on standard keyboards (no, I'm not modifying my keyboard map; I want to be able to code on PCs that are not my own), I might've chosen under_score. But both require constantly hitting shift, which is a real pain for me.

I'm also using Natural<Size>::bit() and bits() more, as opposed to shift+mask combinations. They optimize at compile time to the exact same code, but make it much easier to spot mistakes, in my opinion.
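
For example (illustrative; both lines extract the same d field):

Code:
  d = opcode.bits(12,15);    //reads as "bits 12 through 15 of the opcode"
  d = (opcode >> 12) & 15;   //the equivalent shift+mask spelling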

Overall, the code ended up being about the same size (very slightly smaller) and about the same speed (very slightly slower). It wasn't the major win I was hoping for on both size and efficiency; however, I have a huge neurosis about consistency, and I do find the new design a lot cleaner and more in line with the 68K core.

And again, there's hope for the future. Register swapping could help a lot. And making templates out of some of the ARM/THUMB instructions could help as well.

As always, I'd very much welcome improvements to my new ARM CPU core -- especially bug-fixes!

The entire new ARM7TDMI core can be browsed online here:

https://gitlab.com/higan/higan/tree/mas ... r/arm7tdmi

Hope you all enjoyed this write-up. Thanks for reading!

_________________
What the hell's going on? Can someone tell me please?
Why I'm switching faster than the channels on TV.
I'm black, then I'm white. No, something isn't right.
My enemy's invisible, I don't know how to fight.


2017-08-10 13:58

Joined: 2014-09-27 09:30
Posts: 1103
 Re: ARM7TDMI emulation core rewrite overview
I actually understood a reasonable chunk of all this.

_________________
http://helmet.kafuka.org


2017-08-10 21:43

Joined: 2014-09-27 09:29
Posts: 719
 Re: ARM7TDMI emulation core rewrite overview
Good post! It's a nice "guided tour" :)


2017-08-11 00:09