[diagram] New Lagless VSYNC ON Algorithm for emulator devs 
Author Message
User avatar

Joined: 2018-03-15 23:09
Posts: 24
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
SilverShroud wrote:
tl;dr byuu always re-writes the code.

Ha. OK, I hear you there.

Regardless -- my reply is targeted at other readers who may be emu authors: this doesn't necessarily need an emu rewrite -- just a way to add a hook/callback at the location of the emulator's raster scan line plotting.

Not many people know what happens in the black box between Present() and the photons hitting eyeballs -- so hopefully my diagrams and concepts help educate other coders.

Cheers!


2018-06-25 17:27

Joined: 2014-09-27 09:59
Posts: 423
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
The problem with this approach is that it doesn't really get you anything that other lag-reduction methods don't already provide, as long as you're using one of them. That's why there's a lack of enthusiasm.

First, games aren't going to poll input and reflect it while the display is being output to. This means we've already hit "next frame" territory, so it doesn't matter how many slices the screen is divided into.

Then we have fencing, which waits for framebuffer completion. This guarantees a worst-case latency of 1 frame. It always situates the processing for the next frame as the ready frame starts being scanned out on the monitor. When the input is polled during that window is what determines the latency -- but it only affects the next frame, not the current one, so we can scan out however we want. If we can wait a period of time before starting processing (RetroArch's "frame delay" option), we can push this latency very close to 0.

So what this beam-following basically does is stretch the frame's processing over the whole frame. Before, the emulator could run its logic in a fraction of a frame, putting a potential input poll almost right after the last frame was output -- meaning all the time up until the next vsync became input lag. Now, the latency instead depends on where in the emulated machine's output timing the input poll occurs. But if the poll occurs at a point earlier into the emulated machine's frame than RetroArch's frame-delay milliseconds, for instance, then this method is actually worse. Hell, with a frame delay, as long as your host machine can emulate fast enough to finish processing before the vsync, you can push the input poll location further into the frame's duration than the point where the original hardware actually polls it -- meaning better latency than the hardware.
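The frame-delay trade-off described above can be sketched as a toy latency model (the function name and numbers are my own, purely illustrative):

```python
# A toy model (illustrative only) of the "frame delay" trade-off: with a
# vsync interval of T ms and a frame delay of d ms, an input poll made at
# the start of emulation happens d ms into the interval, and its result is
# scanned out at the next vsync -- so poll-to-scanout latency is T - d.

def input_latency_ms(refresh_hz, frame_delay_ms):
    frame_ms = 1000.0 / refresh_hz
    return frame_ms - frame_delay_ms

print(round(input_latency_ms(60, 0), 1))   # 16.7 -> no frame delay
print(round(input_latency_ms(60, 12), 1))  # 4.7  -> 12 ms frame delay
```

This is why, when the host can finish emulating a frame quickly, a large frame delay alone already pushes the poll-to-photon latency close to zero.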

So basically: on older machines where scanout timing is relevant, we're fast enough to add some delay to the beginning of processing to reduce the lag; and when we can't add much delay, the processing is spread out enough that it's close to the original hardware.


2018-06-25 17:43
User avatar

Joined: 2014-09-27 09:39
Posts: 2949
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
The purpose is to make it so that, for methods that use vsync for timing but delay until a certain time after vsync to reduce latency, you can start rendering later without spilling over to the next vsync interval. You can also use it to render incomplete frames with vsync-disabled adaptive sync, but doing so is extremely touchy.


2018-06-25 18:30
User avatar

Joined: 2017-11-25 16:43
Posts: 829
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
mdrejhon wrote:
Regardless -- my reply is targeted at other readers who may be emu authors

Confirmed. Promotion.


2018-06-25 22:13
User avatar

Joined: 2018-03-15 23:09
Posts: 24
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
BearOso wrote:
The problem with this approach is that it doesn't really get you anything other lag reduction methods don't already provide, as long as you're using some method.

This has already been discussed ad infinitum on other boards, so nothing new is being said here --

This is simply another lag-reducing tool in the latency reduction sandbox.

1. It uses a lot less processing power to reduce input lag via this method than via the RunAhead method. 4-frameslice beamracing works fine on Androids and Raspberry Pis -- it's simply a memory-bandwidth issue (the repeated early Presents of the progressively-more-complete emulator frame), so you adjust the frameslice count to accommodate the GPU's memory-bandwidth limitations, or use single-buffer (front buffer) rendering.

2. It's more preservationist-faithful. It recreates original-machine latency when beamracing 60Hz at 60Hz, preserving the time differential between input read and raster. No other technique can correctly duplicate the original latency for all possible input reads (including mid-screen input reads) -- every input-read-to-photons pair -- the way beam racing can.

The more frameslices, the finer the lag granularity, and the more closely the input lag matches the original machine.

Today we often have full-refresh-cycle lag granularity because surge execution compresses all input reads into nearly one instant. This can produce a different lagfeel in some games that do mid-screen input reads, or games whose fluctuating processing flips input reads between just-before-VBI and just-after-VBI -- creating 16ms latency-sawtoothing artifacts. With frameslice beamracing, you eliminate these lag-granularity effects.
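The difference in "lagfeel" for mid-screen input reads can be illustrated with a sketch (all numbers assumed -- NTSC-like 262-line timing at 60Hz -- and the function names are mine):

```python
# Illustrative comparison: surge execution emulates the whole frame in one
# burst at vsync, so every input read effectively happens at the top of the
# frame; beamracing preserves the original poll-to-photon scanout distance.

LINES = 262                           # total scanlines per refresh, incl. VBI
MS_PER_LINE = (1000.0 / 60.0) / LINES

def poll_to_photon_beamraced(poll_line, photon_line):
    """Beamraced: the emu raster tracks the real raster, so latency is just
    the scanout distance from poll to photons (original-machine behaviour)."""
    return ((photon_line - poll_line) % LINES) * MS_PER_LINE

def poll_to_photon_surged(poll_line, photon_line):
    """Surged: the poll is compressed to the start of the refresh cycle,
    then the result waits for scanout to reach photon_line."""
    return photon_line * MS_PER_LINE

# A mid-screen poll (line 120) affecting pixels a few lines below it:
print(round(poll_to_photon_beamraced(120, 130), 2))  # 0.64 ms, like hardware
print(round(poll_to_photon_surged(120, 130), 2))     # 8.27 ms
```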

So, yep, you're right, frameslice beam racing is a double-edged sword -- but look at the good edge too.

There is flexibility to deviate away from this, obviously (e.g. fast-beamraces of fast refresh cycles). Also, beamrace margin is tweakable if you need to delay or bring forward input reads. And the methods can be combined, if desired. Beam racing can also reduce RunAhead processing overhead by 1 frame by beam racing the visibly-displayed frames. (I had a discussion where RunAhead and beam racing can run in conjunction to reduce surge-processing overhead very slightly).

SilverShroud wrote:
mdrejhon wrote:
Regardless -- my reply is targeted at other readers who may be emu authors

Confirmed. Promotion.

Yes. And there's nothing wrong with promoting open source with good old-fashioned personal enthusiasm -- I'm no longer the only one promoting it.

BearOso wrote:
That's why there's a lack of enthusiasm.

Understandably, not all emu authors are enthusiastic. But there's quite a bit of excitement among those who are interested.

It's still a worthy additional lag-reducing tool. It takes only 2-3 days to add the first basic frameslice beam racing (as Toni of WinUAE did) -- the big leap is simply understanding the concept correctly. Perfecting it (adding features and whatnot) does take a lot more time, but adding experimental 60Hz frameslice beamracing to an existing cycle-exact/raster-exact emulator tends to be easier than the emu author expected, since it just piggybacks onto the existing offscreen progressively-rasterplotted framebuffer.

Quotes like "holy grail" have already been given by some emu authors (e.g. Toni of WinUAE) in other forums, and Tommy (author of CLK, with its CRT emulation) is excited about adding this feature. And 4 emulator authors are collaborating -- coming up with a list of best practices that saves new authors time.

There is nothing more preservationist-friendly than reproducing exact original input-lag mechanics via precision beam racing. It catches all the input-read-to-photons behaviour of the original machine, if you're beamracing 60Hz for 60Hz (or 50Hz for 50Hz). That's why some preservationists are more excited than others. It all depends on an emulator's priorities.


2018-06-26 18:41
User avatar

Joined: 2018-03-15 23:09
Posts: 24
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
wareya wrote:
The purpose is to make it so that, for methods that use vsync for timing but delay until a certain time after vsync to reduce latency, you can start rendering later without spilling over to the next vsync interval. You can also use it to render incomplete frames with vsync-disabled adaptive sync, but doing so is extremely touchy.

The great news with the jittermargin technique we've come up with is that frameslice beamracing is actually less touchy than the VBI-synchronized Presents() of software-based "VSYNC ON" running on top of VSYNC OFF.

The jitter margins we use are much bigger than the VBI, so we can hide tearing more easily with fully-debugged frameslice beamracing.

One best practice we've come up with (Calamity, Toni, myself, etc.) is, instead of keeping the bottom part blank, to rasterplot new emu scanlines onto the existing offscreen buffer holding the previous emulator refresh cycle. New emulator refresh cycles overwrite the old ones, giving a margin of one refresh cycle minus one frameslice. For 10 frameslices at 1080p, each frameslice is 108 pixels tall, so the jitter margin is 1080 minus 108 = 972 scanlines, plus the VBI scanlines (usually ~45).

This gives us a generous >1000-scanline jitter margin (about 15ms of beam-racing error with zero artifacts before tearlines show up). Present too soon (into the same frameslice being scanned out) and you get tearing. Present too late (more than one refresh cycle of delay) and you get tearing. Anywhere in between, it looks like fully tearingless VSYNC ON.
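The arithmetic above works out as follows (a hedged sketch -- the function name and defaults are mine, using the post's figures of 1080 visible lines, 10 frameslices, ~45 VBI lines, 60Hz):

```python
# Jitter-margin arithmetic for wraparound frameslice beamracing:
# margin = one refresh cycle minus one frameslice, plus the VBI lines.

def jitter_margin(visible=1080, frameslices=10, vbi=45, hz=60):
    slice_height = visible // frameslices        # 108-pixel-tall slices
    margin_lines = visible - slice_height + vbi  # 972 + 45 scanlines
    ms_per_line = (1000.0 / hz) / (visible + vbi)
    return margin_lines, margin_lines * ms_per_line

lines, ms = jitter_margin()
print(lines)         # 1017 scanlines of margin
print(round(ms, 1))  # 15.1 -> ~15 ms of race error before tearing appears
```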

Yes, it's even fully circular: the margin maintains itself regardless of position within the refresh cycle. You can even present the top half of a new refresh cycle while the real raster is still scanning out the bottom half (of the previous refresh cycle), because the emu buffer is a wraparound buffer (new scanlines overwriting the old). All presents within the now-generously-tall jitter margin contain duplicate data at the position the real raster is currently scanning, so no tearline.

This jittermargin soaks up computer performance issues like there's no tomorrow, so tearing artifacts are less likely to show up in properly-implemented fully-debugged frameslice beamracing. It's the most tearline-resistant VSYNC OFF mode ever made for a fixed-Hz monitor, if you use a forgiving beamracing margin.

Ideally you want consistent, minimized input lag, as close to the original machine as possible (aka faithful latency reproduction), so you want to present as late as possible. A common recommendation is a 2-frameslice beamracing margin, which is still extremely forgiving -- any raster mistimings can safely jitter ahead 1 frameslice, or backwards 8 frameslices (with 10 frameslices), and it still looks like tearingless VSYNC ON. That jitter margin is much bigger than the VBI, and thus more glitch-free than the fullscreen method of emulating VSYNC ON via VBI-synchronized presents during VSYNC OFF (to hide tearlines between refresh cycles).

Often the error is sub-millisecond on modern systems. But the forgivingness means it can scale to rather old systems now (whereas previous techniques were not as forgiving).

In practice, on a modern system, jitter can be controlled to under 2 scanlines or so, permitting sub-millisecond difference between the emulated system and the real system in an "original latency reproduction" preservationist scenario. And it still scales all the way back to slower systems, where simpler, less-demanding 3-to-6-frameslice beamracing produces only a few milliseconds of lag on Android/Pi-type systems.

Performance fluctuations (e.g. computer freezes) do show brief instantaneous reappearances of tearlines, but they disappear instantly as beamracing falls back into sync. There have also been stretches of sustained one-hour emulator operation without a single tearline ever appearing. Certainly, debugging this can be rather glitchy. But we've learned a lot collectively.

As you can see, we are collaborating and coming up with clever/creative ideas together! (If you've been reading the approximately 6 or 7 forum threads elsewhere on the Net, especially the more enthusiastic threads.)

Shoutout to: Calamity, Unwinder, Toni, Tommy for a few of the brainstorms being merged into improving some now-extant implementations of frameslice beamracing.


2018-06-26 18:54
User avatar

Joined: 2014-09-27 09:39
Posts: 2949
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
"You can also use it to render incomplete frames with vsync-disabled adaptive sync" means you get screen tearing like normal double buffering. You just don't have to render the entire frame each time. Only the part that's going to be displayed before the next time you finish redrawing the buffer. This is desirable for 3d games with no fullscreen postprocessing like Quake 1/2/3.

For 3d games with framerates higher than the monitor's refresh rate, tearlines are simply proof of (room for) low latency. If a method doesn't have them, then you're losing scantime/2 latency to the middle of the screen updating at least scantime/2 later than the frame was generated.
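The scantime/2 figure works out like this (a quick illustrative check, assuming a full-height 60Hz scanout):

```python
# "scantime/2" penalty: with tearing-free presentation, the middle of the
# screen is displayed half a scanout later than the frame was generated.
scantime_ms = 1000.0 / 60.0             # full-height scanout at 60 Hz
mid_screen_penalty_ms = scantime_ms / 2
print(round(mid_screen_penalty_ms, 1))  # 8.3 ms
```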


2018-06-26 23:31
User avatar

Joined: 2017-11-25 16:43
Posts: 829
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
Man, this guy would make an awesome door-to-door salesman.
"Yes, folks. It's the renderer that outraces the rest!"


2018-06-27 07:37
User avatar

Joined: 2018-03-15 23:09
Posts: 24
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
SilverShroud wrote:
Man, this guy would make an awesome door-to-door salesman.
"Yes, folks. It's the renderer that outraces the rest!"

Ha. I would not be, because I've been nearly totally deaf since birth -- my disability rules out being a door-to-door or telephone salesman. Besides, it's good research/science for the public good, too.


2018-06-27 17:49
User avatar

Joined: 2018-03-15 23:09
Posts: 24
 Re: [diagram] New Lagless VSYNC ON Algorithm for emulator devs
wareya wrote:
"You can also use it to render incomplete frames with vsync-disabled adaptive sync" means you get screen tearing like normal double buffering.

We (me and the willing emulator authors elsewhere) have made tearing disappear because the buffer is unchanged at the real raster (front buffer).

I'll cross-post a good explanation I made in another galaxy far away.
Quote:
**Remember, cable scanout is sometimes totally different from panel scanout.** There's no concept such as "clearing the screen" on the monitor side, so forget about it -- the emulator can't do anything about it. (In reality, an impulse display will automatically clear, and a sample-and-hold LCD will hold until the next refresh cycle -- but the considerations are exactly identical whether you connect an original machine or an emulator to it. So this discussion is irrelevant here. Stop guessing display-side mechanics for now; we're only comparing emulator-vs-original connected to the SAME display, whether the same CRT or the same LCD.)

**We just want the cable to behave identically where possible.** (internal builtin displays also serialize "ala cable-like" too, phones, tablets, and laptops sequential scan too).

Focus on the cable-scan POV and ignore the display-scan POV. So let's focus on cable scanout -- the GPU's act of reading one pixel row at a time from its front buffer into the output, at exact horizontal-scanrate intervals.

Also, on the GPU side, Best Practice #9 recommends against clearing the front buffer between emulator refresh cycles, in order to keep the jitter margin huge (wraparound style).

If you're an oldtimer, another metaphor that may make frameslice beamracing easier to understand is an old reel-to-reel video tape running through a record head and a playback head simultaneously.

**The Tape Delay Loop Metaphor Might Help**

Technically, nothing stops an engineer from putting two heads side by side feeding a tape through both – to record and then playback simultaneously – that’s what an old “analog tape delay loop” is – a record head and a playback head running simultaneously on a tape loop.

Metaphorically, the tape delay loop represents one refresh cycle in our situation. In our beamracing case, the metaphorical “record head” is the delivery of new scanlines (even if it’s surged frameslicefuls at a time) to the front buffer, ahead of the “playback head”, the one-scanline-at-a-time readout of the front buffer into the graphics output (at exact horizontal scanrate intervals).

The front buffer isn't onscreen instantly; it's still being read out one pixel row at a time into the graphics output at an exact constant rate (the horizontal scanrate). So you can keep changing the undelivered portions of the front buffer (including undelivered portions of a frameslice), ad infinitum, as long as your emu raster (the new framebuffer data being put into the front buffer one way or another) stays ahead of the real raster (the pixel-row readout to the output). This is a great way to understand why we have a full loop of wraparound jitter margin (a full refresh cycle minus one frameslice worth).

Decreasing input lag means putting the playback head as close as possible behind the record head. That's tightening the metaphorical beam race margin.

* The race margin is the tape from the record head to the playback head (the delay before recorded content is played back).
* The jitter margin is the rest of the loop -- the tape from the playback head around back to the record head.

In other words: a looped safety jitter margin of one full refresh cycle minus one frameslice.

The entire tape loop represents one refresh cycle, looping around. So for 1080p, you can have a >900-scanline jitter margin with zero tearing, if you use the wraparound-refresh-cycle technique described in Best Practice #9 above. Ideally you want to race with tight latency, though: with 10-frameslice beamracing, that gives you a 1-frameslice verboten region (tearing risk), an 8-frameslice race-too-fast safety margin, and a 1-frameslice race-too-slow safety margin before tearing appears. That's 15ms of random beam-race error you can absorb with zero tearing!!
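The wraparound rule can be sketched as a simple modular check (a hedged example -- the names are my own, not from any emulator; falling more than a whole refresh cycle behind also tears, but that needs absolute time rather than slice indices, so it isn't modelled here):

```python
# With the previous refresh cycle still in the front buffer (wraparound),
# a Present() is tearing-free as long as the frameslice being replaced is
# not the one the real raster is currently scanning out.

FRAMESLICES = 10

def present_is_safe(emu_slice, real_slice, frameslices=FRAMESLICES):
    """emu_slice: index of the slice just delivered to the front buffer.
    real_slice: index of the slice the GPU readout is currently inside.
    Safe (no tearline) iff the two differ, wrapping around the cycle."""
    return (emu_slice - real_slice) % frameslices != 0

# Wraparound: presenting the top of a new refresh cycle while the real
# raster is still scanning the bottom of the previous one is fine:
print(present_is_safe(emu_slice=0, real_slice=9))   # True
# Presenting into the very slice being scanned out risks a tearline:
print(present_is_safe(emu_slice=4, real_slice=4))   # False
```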

In our case, metaphorically, frameslice beam racing is simply the record head surging batches of multiple scanlines onto the metaphorical tape loop (e.g. a movable record head that intermittently records faster than the playback head). The playback head's playback speed is totally, merrily unchanged!! (i.e. the pixel-row readout from front buffer to the GPU's output jack). As long as the record head never falls behind and collides with the playback head (aka a tearing artifact) -- thankfully this is just a metaphor, and tearlines won't wreck the metaphorical tape mechanicals and tape loop permanently (ha!) -- beam racing can recover during the next refresh cycle (aka only a 1-refresh-cycle appearance of the tearing artifact). Metaphorically, front-buffer rendering (adding one scanline at a time) means the record head doesn't have to surge ahead; it can record at the same velocity as the playback head.

You can adjust the race margin to somewhere far enough back that your margin is never breached. That is the metaphorical equivalent of the distance between the tape record head (adding new emu lines to the front buffer) and the tape playback head (the GPU output jack beginning transmission of one pixel row at a time).

That's why it's so forgiving when properly programmed, and thus feasible on 8-year-old GPUs, Android GPUs, and Raspberry Pi GPUs, especially at lower frameslice counts on lower-resolution framebuffers (which emulators often use). We've found innovative techniques and are surprised they haven't been used before now -- the concept is simply hard to grasp until the "Aha!" moment arrives (e.g. via the user-friendly Blur Busters diagrams).

I could conceptualize this visually in a totally different way if you weren't born in the era of the analog tape loop, but this should help you see that we've successfully achieved a 900+ scanline safety jitter margin for 1080p beam racing, even with wraparound (e.g. Present()ing the bottom half while already scanning the top half, and Present()ing the top half while already scanning the bottom half -- both situations with NO tearing, because of the way we've cleverly done this, per Best Practice #9 two posts ago) -- making it super-forgiving and much more usable on slower-performing systems. Smartphone GPUs can easily do 240 duplicate frames a second; appending new frameslices only costs extra memory bandwidth anyway.

**So that is how a 900+ scanline, fully-looped, across-refresh-cycle wraparound jitter margin is achieved with 1080p frameslice beamracing. At 60Hz, this means up to a ~15ms range of beam-race synchronization error before tearing appears! This soaks up performance imperfections very well during transient beamrace out-of-sync moments, e.g. background software. And the beamrace margin can also be a configurable value, as a tradeoff between latency and tearline appearance during duress situations.**

Modern systems can easily do submillisecond race margins flawlessly, while Android/Pi might need a 4ms race margin -- still subframe latency!
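For scale, those race margins convert into scanlines like so (my own arithmetic, not from the post; assumes 1080p at 60Hz with 1125 total scanlines per refresh):

```python
# Converting race margins (ms) into scanlines of beam-race headroom.
MS_PER_LINE = (1000.0 / 60.0) / 1125   # 1125 total scanlines at 1080p60
print(round(0.5 / MS_PER_LINE))  # 34  -> sub-millisecond margin
print(round(4.0 / MS_PER_LINE))  # 270 -> 4 ms Android/Pi margin
```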

Yes, in the extreme case frameslices can become one pixel row with no jitter margin (like how my Kefrens Bars demo turns a GeForce 1080 into a lowly Atari TIA with raster-realtime big pixels at nearly 10,000 tearlines per second), but emulators prefer the jittermargin technique, which hides tearing by simply keeping graphics unchanged at the real raster and keeping the emu raster ahead of the real raster (like the tape delay loop metaphor explained above).

This is my partial answer. I have to go back to work, but hopefully this helps you understand better...


2018-06-27 17:54