What makes speculative execution in emulators impossible?

Duck Penis

Posted on 19-04-10, 14:18

Stirrer of Shit
Post: #184 of 717
Since: 01-26-19

Last post: 1553 days
Last view: 1551 days

My understanding is that accurate emulators are slow because they have to synchronize - at any cycle, theoretically a processor could receive an interrupt, and they have to know this before proceeding with the code.

But this is a very similar situation to what real CPUs face. Why can't the emulators simply emulate all of them at once, save each one's state every N cycles, and roll back when an interrupt is received?

Say you have three processors: p0, p1, and p2.
Each one can send interrupts to any other processor.
You start at t = 0, t goes up by one each cycle.
The instruction that sends interrupts causes some flag to be set, t to be stored somewhere, and emulation for that core to stop.
Go forward N cycles.
If the flag hasn't been set, save registers and RAM somewhere and repeat the process.
If the flag has been set, load the saved registers and RAM and emulate up to the stored t.
Send the interrupt, save registers and RAM somewhere, and repeat the process.
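
The steps above can be sketched roughly as follows. This is only a toy model under loud assumptions: the Core class, its one-accumulator "program", the irq_at/irq_target fields, and the make_cores helper are all made up for illustration and do not come from any real emulator. Each core runs ahead on its own for a window of N cycles, and if any core sends an interrupt, every core is rolled back and replayed in sync up to that cycle. A plain cycle-by-cycle loop is included so the two results can be compared.

```python
class Core:
    """Toy deterministic core (hypothetical): an accumulator and a tiny RAM."""
    def __init__(self, irq_at=None, irq_target=None):
        self.acc = 0
        self.ram = [0] * 4
        self.irq_at = irq_at          # cycle at which this core sends an interrupt
        self.irq_target = irq_target  # index of the core that receives it
        self.pending_irq = False      # set when another core interrupts this one

    def step(self, t):
        """Run one cycle; return True if this cycle sends an interrupt."""
        if self.pending_irq:
            self.acc += 100           # visible side effect of taking the interrupt
            self.pending_irq = False
        self.acc += 1
        self.ram[t % 4] = self.acc
        return self.irq_at == t

    def snapshot(self):
        return (self.acc, list(self.ram), self.pending_irq)

    def restore(self, snap):
        self.acc, ram, self.pending_irq = snap
        self.ram = list(ram)


def run_lockstep(cores, total):
    """Reference: step every core each cycle, deliver interrupts right away."""
    for t in range(total):
        senders = [c for c in cores if c.step(t)]
        for s in senders:
            cores[s.irq_target].pending_irq = True


def run_speculative(cores, total, n):
    """Run windows of n cycles per core; roll back when an interrupt is sent."""
    t0, rollbacks = 0, 0
    while t0 < total:
        end = min(t0 + n, total)
        snaps = [c.snapshot() for c in cores]
        first_irq = None
        # Each core runs alone (sequentially here; one thread each in principle)
        # and stops at the cycle where it sends an interrupt.
        for c in cores:
            for t in range(t0, end):
                if c.step(t):
                    first_irq = t if first_irq is None else min(first_irq, t)
                    break
        if first_irq is None:
            t0 = end                  # nothing happened, keep the speculative work
            continue
        rollbacks += 1
        for c, s in zip(cores, snaps):
            c.restore(s)              # throw the speculative work away
        for t in range(t0, first_irq + 1):
            senders = [c for c in cores if c.step(t)]
        for s in senders:             # senders of the last replayed cycle
            cores[s.irq_target].pending_irq = True
        t0 = first_irq + 1
    return rollbacks


def make_cores():
    # p0 interrupts p1 at cycle 5, p1 interrupts p2 at cycle 12, p2 never sends.
    return [Core(5, 1), Core(12, 2), Core()]


reference = make_cores()
run_lockstep(reference, 30)
speculative = make_cores()
rollbacks = run_speculative(speculative, 30, 8)
print(rollbacks, [c.snapshot() for c in speculative] == [c.snapshot() for c in reference])
```

Every rollback here redoes the whole window for every core, so the cost question from the post becomes how often that happens compared with how much a snapshot costs.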

I get that there's a good reason for it - if it would have been that easy, then why is nobody doing it? But I'm still curious. Because on paper it sounds very easy, at least to me, and a quick search of "speculative execution in emulators" doesn't turn up much of value.

This one guy on reddit suggested it, and nobody really had any answers other than "why don't you do it yourself if it's so easy"
byuu mentions "Algorithms to roll back timed events as endrift does," and appears to say the main issue is that they are difficult to implement/maintain. Is this about speculative execution, or something else?
Some discussion of similar ideas but on a frame-by-frame basis - not exactly what I'm talking about, but the idea is similar

So, I get that it's impossible, but why? Would the value of N be too low to be worth the extra save/load business?

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.

BearOso

Posted on 19-04-10, 15:10 (revision 1)

Post: #71 of 175
Since: 10-30-18

Last post: 1240 days
Last view: 1240 days

Posted by sureanem

So, I get that it's impossible, but why? Would the value of N be too low to be worth the extra save/load business?

Exactly. Processors get away with it by throwing extra silicon at the problem. No, let's rephrase that. Processors had spare silicon space from die shrinks, which were no longer helping to increase the clock speed, so the only method available to increase speed was speculative execution.

On emulators, this just consumes the same resource we're trying to conserve--CPU time.
*edit* I'm talking about the overhead, BTW. I realize the extra CPU cores are an unused resource.

creaothceann

Posted on 19-04-10, 15:27

Post: #111 of 449
Since: 10-29-18

Last post: 36 days
Last view: 5 hours

http://forums.nesdev.com/viewtopic.php?f=3&t=10495

- It's not impossible, but it's more complicated than a straight-forward emulator.

- Computers can be really fast when they can concentrate on a small part of a problem. If you can fit your problem into the L1 cache, you're golden. (If an important function in a game engine has to leave L3 and touch main memory, chances are its frame rate will drop. Most programmers don't care (enough) about data-oriented design.) An accurate emulator has to touch many pieces of its data per cycle, which encourages these cache misses.

- If you want to roll back the changes that a CPU (core) makes, you have to store the old state somewhere. This data now competes with all the other data in your caches. Chances are it's larger than the L2 cache (which on my system is 256 KB per core), so it'll be pushed to L3 which is shared with all cores, i.e. all programs that are running on the computer.

- If you're emulating on multiple CPU cores, synchronization delays may force you to synchronize rarely. By the time you actually do synchronize, you might find that you'll have to roll back most of the time anyway.

My current setup: Super Famicom ("2/1/3" SNS-CPU-1CHIP-02) → SCART → OSSC → StarTech USB3HDCAP → AmaRecTV 3.10

Duck Penis

Posted on 19-04-10, 18:21

Stirrer of Shit
Post: #185 of 717
Since: 01-26-19

Last post: 1553 days
Last view: 1551 days

Posted by BearOso

Posted by sureanem

So, I get that it's impossible, but why? Would the value of N be too low to be worth the extra save/load business?

Exactly. Processors get away with it by throwing extra silicon at the problem. No, let's rephrase that. Processors had spare silicon space from die shrinks, which were no longer helping to increase the clock speed, so the only method available to increase speed was speculative execution.

On emulators, this just consumes the same resource we're trying to conserve--CPU time.
*edit* I'm talking about the overhead, BTW. I realize the extra CPU cores are an unused resource.

But the CPU is used pretty inefficiently. It kills the pipelining and everything when you synchronize that often, according to the ArsTechnica article.

Posted by creaothceann
http://forums.nesdev.com/viewtopic.php?f=3&t=10495

- It's not impossible, but it's more complicated than a straight-forward emulator.

- Computers can be really fast when they can concentrate on a small part of a problem. If you can fit your problem into the L1 cache, you're golden. (If an important function in a game engine has to leave L3 and touch main memory, chances are its frame rate will drop. Most programmers don't care (enough) about data-oriented design.) An accurate emulator has to touch many pieces of its data per cycle, which encourages these cache misses.

- If you want to roll back the changes that a CPU (core) makes, you have to store the old state somewhere. This data now competes with all the other data in your caches. Chances are it's larger than the L2 cache (which on my system is 256 KB per core), so it'll be pushed to L3 which is shared with all cores, i.e. all programs that are running on the computer.

- If you're emulating on multiple CPU cores, synchronization delays may force you to synchronize rarely. By the time you actually do synchronize, you might find that you'll have to roll back most of the time anyway.

I read your post and thought, bah, the SNES' total RAM is just 256KB, that can't take that long to copy on a modern CPU. So I ran some benchmarks, and it turns out it actually takes about 0.05 ms on my CPU. Apparently, CPUs are much slower at copying memory than I thought.
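
A measurement along these lines can be reproduced with a few lines of Python. This is not the benchmark used for the figure above, just a rough sketch: the function name time_full_copy and the repeat count are made up, and the result will vary a lot by machine and language.

```python
import time

RAM_SIZE = 256 * 1024  # the 256 KB figure from the post


def time_full_copy(ram, repeats=1000):
    """Return the average number of seconds one full copy of `ram` takes."""
    start = time.perf_counter()
    for _ in range(repeats):
        snapshot = bytes(ram)  # one full copy, like a naive checkpoint
    elapsed = time.perf_counter() - start
    return elapsed / repeats


ram = bytearray(RAM_SIZE)
avg = time_full_copy(ram)
print(f"average full copy: {avg * 1e3:.4f} ms")
```

If a checkpoint is taken every N emulated cycles, the copy cost per emulated cycle is this average divided by N.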

However, that assumes the whole memory gets mutated. It's enough to copy the parts that have changed. You can do this either by using mmap (which is transparent to most of the underlying application), or by just replicating your writes. That would amount to maybe a few movs per instruction, but at 0.5 CPI that's negligible.
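
As a rough illustration of the "replicating your writes" option, here is a minimal sketch of a write journal (an undo log). The class name JournaledRAM and its methods are hypothetical and not taken from any emulator. Every write records the old byte first, so a rollback only touches the addresses that changed since the last checkpoint.

```python
class JournaledRAM:
    """RAM that remembers old values so writes can be undone (hypothetical sketch)."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.journal = []  # (address, old_value) pairs since the last checkpoint

    def read(self, addr):
        return self.data[addr]

    def write(self, addr, value):
        # Save the byte being overwritten before changing it.
        self.journal.append((addr, self.data[addr]))
        self.data[addr] = value

    def checkpoint(self):
        """Accept everything written so far; nothing before this can be undone."""
        self.journal.clear()

    def rollback(self):
        """Undo writes in reverse order, back to the last checkpoint."""
        while self.journal:
            addr, old = self.journal.pop()
            self.data[addr] = old


ram = JournaledRAM(256 * 1024)
ram.write(0x10, 1)
ram.checkpoint()
ram.write(0x10, 2)     # same address written twice after the checkpoint
ram.write(0x10, 3)
ram.write(0x2000, 9)
ram.rollback()
print(ram.read(0x10), ram.read(0x2000))
```

The journal grows with the number of writes per window rather than with the size of RAM, which is the point of the suggestion, but it adds work to every single write.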

As for the cache, what do you mean? That you'd end up polluting the cache with the backup registers? You could use non-temporal stores to completely mitigate the issue, but is it that bad? Assuming you write as much as you read and you write twice, you're "only" wasting 33% of cache in the worst case scenario. Even assuming that would translate linearly to a performance hit, I'd reckon the gain from multithreading and proper pipelining would be greater.

What do you mean by the third point? Each CPU chooses how often to synchronize, by seeing if the IO flag was set. Of course, you would need some primitives to avoid race conditions, but it's still feasible to implement. According to the ArsTechnica article, most emulators do fine synchronizing once per audio sample, which is maybe once per 50 SNES cycles.

Are there any actual numbers on this, like how often any actual IO is done, how often a console writes to RAM, etc?

Well, I guess I understand now. While it's feasible to implement, it sounds rather unpleasant and wouldn't exactly make for clean code. It could probably introduce "fun" exotic bugs too (Are you excited to get Meltdown inside your SNES emulator?) that would be hard to find without automated testing.

There was a certain photograph about which you had a hallucination. You believed that you had actually held it in your hands. It was a photograph something like this.

Near

Posted on 19-04-10, 18:26

Burned-out Genius Developer
Post: #25 of 51
Since: 10-30-18

Last post: 1209 days
Last view: 1132 days

Speculative emulation with rewind for misses is how Nemesis' Exodus emulator for the Sega Genesis works. It still needs equivalent processing power as higan does for the SNES, only now there's also an extremely complicated layer to rewind the emulation core present for no real gain.

A better implementation of speculative emulation may be possible, one that would result in more impressive speeds. But even if there is, I don't think there's a huge market for it.

Sour will probably produce something rivaling bsnes' accuracy at more reasonable speeds already. And even if that doesn't pan out for some reason, I really think we have our bases covered well with Snes9X < bsnes/fast < higan.

CaptainJistuce

Posted on 19-04-10, 20:43

Custom title here

Post: #396 of 1151
Since: 10-30-18

Last post: 21 days
Last view: 1 day

Posted by sureanem
My understanding is that accurate emulators are slow because they have to synchronize - at any cycle, theoretically a processor could receive an interrupt, and they have to know this before proceeding with the code.

But this is a very similar situation to what real CPUs face. Why can't the emulators simply emulate all of them at once, save each one's state every N cycles, and roll back when an interrupt is received?

Well, they aren't really interrupt-locked, at least not on the Super Nintendo. To my understanding, the VDP and CPU can both access RAM at any time without warning each other. It is more complex to keep track of the timing of every RAM read and write and whether it would've mattered, than to simply track interrupt signals.

--- In UTF-16, where available. ---

hunterk

Posted on 19-04-11, 02:16

Post: #21 of 60
Since: 10-29-18

Last post: 1432 days
Last view: 1353 days

Posted by byuu
It still needs equivalent processing power as higan does for the SNES

It's even worse than that, isn't it? That is, equivalent processing power per thread/core? Last time I tried it, at least, it was hitting like 4 cores each as hard as higan.

Near

Posted on 19-04-11, 04:42 (revision 1)

Burned-out Genius Developer
Post: #27 of 51
Since: 10-30-18

Last post: 1209 days
Last view: 1132 days

Yes, that's correct. Although to be fair, Nemesis does take abstraction a lot further than I do. For instance, each CPU instruction has its own class. It's definitely cool if you wanna really focus on just one CPU instruction, but I mean ... yeah. It's possible that the rollback implementation is not as fast as it could possibly be.
