[Feature Request] Futureproof RetroArch with precision frame pacing presenter thread (for VRR, for BFI, for beamracing, for 120Hz/180Hz/240Hz/VRR BFI, rolling-scan CRT emulators, etc) #11390

Open
mdrejhon opened this issue Sep 29, 2020 · 3 comments
Labels
feature request New enhancement to RetroArch.

mdrejhon commented Sep 29, 2020

EDIT: Easy TL;DR Version Of This Feature Request

  1. RetroArch frame-presenter (swapchain) thread made separate from emulator/renderer thread
  2. Frame presenter thread (thread responsible for swapchain) is higher priority than renderer thread.
  3. There is never a draw call (not even to draw a single pixel or erase framebuffer) in the frame presenter thread.
    (Note: For BFI, any use of black frame buffers is a pre-rendered black buffer, to prevent GPU/CPU stalls to framepacing)

This solves a hell of a lot of problems with framepacing -- and improves many algorithms. It will make perfect BFI flicker possible during CPU-stutter situations (it'd just look like a CRT stuttering -- with no modification to flicker rate).

Even those hard cores like bsnes would still BFI perfectly at 240Hz, even if bsnes runs at 57fps and stutters a bit. It may temporarily cease to be beamraceable (Lagless VSYNC #6984 that optionally might want to run in sync with beam simulators like BFIv3 #10757) but the BFI would still flicker at a constant rate like a CRT, despite the underlying stutters.
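
To illustrate the constant-flicker claim, here is a minimal sketch of what a presenter might show on each refresh when the emulator misses a frame. The names `bfi_schedule` and `BLACK` are hypothetical and not part of RetroArch; the sketch assumes the presenter simply repeats the newest completed frame on the visible refresh and uses a pre-rendered black buffer for the rest.

```python
BLACK = "black"  # stands in for a pre-rendered black buffer, never drawn on demand


def bfi_schedule(frames_ready, subframes_per_emu_frame):
    """Return what a presenter would show on each display refresh.

    frames_ready[i] is the newest emulator frame id available at the start of
    emulated-frame period i (it repeats when the emulator missed a deadline).
    """
    out = []
    for frame_id in frames_ready:
        out.append(frame_id)  # visible refresh: newest completed frame
        out.extend([BLACK] * (subframes_per_emu_frame - 1))  # dark refreshes
    return out


# 60 Hz emulation on a 240 Hz display: 1 visible + 3 black refreshes per period.
# The emulator stalls during the third period, so frame 1 is shown again.
schedule = bfi_schedule([0, 1, 1, 2], 4)
```

The stalled period still gets one visible refresh and three black ones, so the flicker cadence is unchanged; only the picture content repeats.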

It would allow a lot of feature additions, including zero-latency emulation (lagless VSYNC like WinUAE) to make RetroArch match the latency of an FPGA (to within one frameslice, ala #6984)

Long Version:

Note: Crossposted feature suggestion from semi-related 240Hz BFI pull request at #11342
This was too major an ask to be inside a github pull request, so I am creating a new thread to pave the groundwork for improving RetroArch's display emulation (improving variable refresh rate to reduce lag/stutters, improving BFI with 240Hz BFI to better emulate a CRT, etc).

This is a universal generic algorithm that should eventually become a best-practice for emulators in the next ten years.

Goals

  • Reduce existing stutters (VRR, triple buffer, VSYNC OFF, DWM, non-60Hz displays)
  • Reduce existing flicker and eliminate artifacts (improved software-based BFI)
  • Improve user-friendliness (make things work more automatically on non-60Hz displays with fewer settings-fiddling)
  • Futureproofing to future display refreshing algorithms

Problem: Existing frame pacing algorithm not future proof enough

Currently, G-SYNC uses a software-based frame pacing algorithm in the emulator, but it is not yet optimized in a future-proof way:

The upgrade to existing frame pacing algorithm

I propose a separate thread responsible for frame presenting (e.g. Present() or glxxSwapBuffers() or whichever API) that does the following:

  • Allows frame presents to continue independently of rendering
    Presents will have no jitter even if emulator modules use darn near 100% CPU
  • Allows frame presents to be optionally ultra precise
    This can be done via busywait instead of (or in addition to) timer events -- because some algorithms (beamrace or BFI+VRR) have visible artifacts with sub-millisecond errors. Alternatively, one can set a timer event for 0.5ms before the target time, then busywait on a high-precision clock the rest of the way.
  • Allows same frame pacing algorithms to work with all sync technologies
    (VSYNC ON, VSYNC OFF, DWM, triple buffer, AMD Enhanced Sync, NVIDIA Fast Sync, FreeSync, G-SYNC, VESA Adaptive-Sync, BFI, etc), making it much easier to combine them (e.g. BFI during VRR)
  • Allows easier future addition of new algorithms
    (e.g. rolling BFI CRT emulators, or lagless VSYNC beam racing), with little or no modification to emulator rendering
  • More future proof
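
The "timer event to 0.5ms prior, then busywait" idea above can be sketched as follows. This is only an illustration: `wait_until` and `spin_margin_ns` are hypothetical names, and a real implementation would use the platform's own high-resolution timer and clock rather than Python's `time` module.

```python
import time


def wait_until(deadline_ns, spin_margin_ns=500_000):
    """Sleep until shortly before deadline_ns, then busy-wait the rest.

    deadline_ns is a time.perf_counter_ns() timestamp.
    Returns how far past the deadline we woke up, in nanoseconds.
    """
    remaining = deadline_ns - time.perf_counter_ns()
    if remaining > spin_margin_ns:
        # Coarse wait: give the CPU back until ~0.5 ms before the deadline.
        time.sleep((remaining - spin_margin_ns) / 1e9)
    # Fine wait: spin on the high-precision clock for the final stretch.
    while time.perf_counter_ns() < deadline_ns:
        pass
    return time.perf_counter_ns() - deadline_ns


start = time.perf_counter_ns()
overshoot_ns = wait_until(start + 5_000_000)  # target: 5 ms from now
```

The spin margin trades CPU time for precision: a larger margin tolerates a sloppier OS timer, at the cost of more busy-waiting per present.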

Largely a Streamlining of Existing Workflow

The existing present call would be replaced by a wrapper that passes the frame to a frame presentation thread. The presentation thread will time the presentation itself.

Some of the workflow already exists, it just needs to be re-jigged into an official unified workflow with capability of improved precision.

  • "Rendering Thread" = Thread that runs the emulator and generates the emulator frames;
  • "Presenting Thread" = Thread that is now permanently responsible for frame presentation;
  • "Present Wrapper" = This replaces the existing frame present method (e.g. glxxSwapBuffers() or Present() or whatever platform API is used to pass the frame to the graphics drivers), so that the Rendering Thread can transfer (or copy) the frame to the Presenting Thread

Suggested Stage 1 Workflow

  1. Presenting Thread's only purpose is presenting frames (no rendering)
  2. Presenting Thread is always higher priority than rendering thread, for purpose of timing precision. Most of the time, presenting thread uses 0% CPU since it's just timing pre-rendered frames, so high precision becomes harmless to Rendering Thread
  3. Presenting Thread can optionally be forced to present immediately, so it is backwards compatible with the existing present workflow (this can ease iterative development) or usable on platforms not stable with separate-thread presenting (fortunately, I don't think there are any left).
  4. Rendering Thread should do all rendering, including CRT filters
  5. Present Wrapper inside the Rendering Thread can be a good way to hide/centralize all the final processing. Such as adding CRT filters, or rendering a whole sequence of BFI framebuffers (low emulator Hz on high real-display Hz). This hides the implementation details of many refreshing algorithms, and makes cross-platform easier.
  6. Presenting Thread will make sure that the time intervals between consecutive presents are as exact as possible. When this is achieved, the algorithm suddenly becomes universal (works with all sync technologies).
  7. Present Wrapper can still emulate the behavior of a 60Hz VSYNC ON waitable swapchain (regardless of whether underlying hardware is doing VSYNC ON or VSYNC OFF or VRR or BFI or whatever) by waiting for a heartbeat from the Presenting Thread
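
A rough sketch of the Stage 1 workflow above, to make the hand-off concrete. Everything here is hypothetical (the `Presenter` class, `submit`, `present_fn`); it is not RetroArch code. `present_fn` stands in for the platform present call, thread priority elevation is omitted because Python's `threading` has no portable way to set it, and the plain `time.sleep` would be replaced by a higher-precision wait in a real build.

```python
import queue
import threading
import time


class Presenter:
    """Presenting Thread: times and presents pre-rendered frames, never draws."""

    def __init__(self, interval_s, present_fn):
        self.interval_s = interval_s
        self.present_fn = present_fn          # stand-in for the platform present call
        self.frames = queue.Queue(maxsize=1)  # hand-off slot from the Rendering Thread
        self.heartbeat = threading.Semaphore(0)
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def submit(self, frame):
        """Present Wrapper: pass a finished frame over, then wait for the heartbeat."""
        self.frames.put(frame)
        self.heartbeat.acquire()  # behaves like a blocking VSYNC ON present

    def stop(self):
        self.frames.put(None)  # sentinel: no more frames
        self.thread.join()

    def _run(self):
        deadline = time.perf_counter()
        while True:
            frame = self.frames.get()
            if frame is None:
                return
            delay = deadline - time.perf_counter()
            if delay > 0:
                time.sleep(delay)
            self.present_fn(frame)
            self.heartbeat.release()
            # Schedule from the previous deadline, not from "now", so timing
            # errors do not accumulate; if we are already late, present at once.
            deadline = max(deadline + self.interval_s, time.perf_counter())


# Rendering Thread side: render, then hand each frame to the wrapper.
presented = []
presenter = Presenter(0.001, presented.append)
presenter.start()
for n in range(5):
    presenter.submit(f"frame {n}")
presenter.stop()
```

Setting `interval_s` to zero gives the "present immediately" passthrough mode from item 3, which keeps the old behavior available while the rest is developed.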

Metaphorically, this workflow is a software-based VSYNC ON emulator, hiding the quirks of GPU drivers or destination displays away from emulator rendering, while simultaneously improving user-friendliness (things just work automatically upon startup), making things less buggy (no VRR stutters, no BFI flicker) and future proofing (even BFI made VRR compatible, hardware-based beamrace, software-based beamrace, CRT beam emulators, not-yet-invented display algorithms).

In a 60Hz VSYNC ON scenario, this is just de facto passthrough behavior (Presenting Thread will immediately present), while allowing one framepacing algorithm to work with ALL sync technologies more reliably. And it adds no extra workflow lag.

Don't worry about BFI for now (#10754 and/or #10757), don't worry about beamracing for now (#6984 and/or #10757); those are solvable in future (e.g. wrappers for PresentScanLine() can be added later to pass one pixel row between Rendering Thread to the Present Thread, as an example). For now, just focus on generic crossplatform full-frame workflow.

Easy Debugging Tip for 60Hz-Only Developers: VSYNC OFF

Testing without VRR can be done via 60Hz VSYNC OFF while using CPU-heavy emulation/emulation settings. Use 60Hz VSYNC OFF, and use tearline jitter as a timing-precision debugger. If the tearline erratically moves or jitters/vibrates massively, your present timing is not "best-effort microsecond-accurate". If the tearline is stationary or rolls slowly up/down, your present timing is nearly microsecond-accurate.

1080p 60Hz has a horizontal scanrate of 67.5 kilohertz (approx 67500 pixel rows per second, including VBI). So a 1/67500th second delay moves a VSYNC OFF tearline downwards by 1 pixel. Modern displays still scan from top-to-bottom (as seen in high speed videos) and VSYNC OFF tearlines are a raster artifact.

So if your tearline is vibrating by 50 pixels up/down, that means you've got a 50/67500th second imprecision in your Present() or glxxSwapBuffers() timing. Thusly, VSYNC OFF 60Hz is an excellent timing debugger, since VSYNC OFF tearline is a real-display raster where the new real GPU framebuffer splices into the destination display's scanout position. Run a horizontal-panning videogame (such as a platformer) to find the tearline.

If you have a high-Hz display, you can also test 60fps at VSYNC OFF 120Hz or VSYNC OFF 240Hz for more sensitive timing-precision debugging (1/135000th second for a 120Hz tearline moving downwards by 1 pixel, for example).
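
The tearline arithmetic above can be worked through in a few lines. This is a sketch with hypothetical function names; it assumes the common 1080p timing of 1125 total rows per refresh (1080 visible plus blanking), which is what yields the 67.5 kHz and 135 kHz figures quoted above. Other video modes will have different totals.

```python
def scanrate_hz(total_rows, refresh_hz):
    """Pixel rows scanned out per second, including the blanking interval."""
    return total_rows * refresh_hz


def tearline_jitter_seconds(jitter_rows, total_rows, refresh_hz):
    """Present-timing error implied by a tearline that wanders jitter_rows rows."""
    return jitter_rows / scanrate_hz(total_rows, refresh_hz)


TOTAL_ROWS_1080P = 1125  # assumption: 1080 visible rows + 45 rows of blanking

rate_60 = scanrate_hz(TOTAL_ROWS_1080P, 60)    # rows per second at 60 Hz
rate_120 = scanrate_hz(TOTAL_ROWS_1080P, 120)  # rows per second at 120 Hz
jitter_s = tearline_jitter_seconds(50, TOTAL_ROWS_1080P, 60)  # 50/67500 s, about 0.74 ms
```

So a tearline vibrating by 50 rows at 60Hz means the present timing is off by roughly three quarters of a millisecond, and the same 50 rows at 120Hz would mean about half that.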

When you succeed in generating a stable VSYNC OFF tearline, it automatically translates to VRR users getting amazing framepacing, and BFI users getting artifactless flicker-free operation (even if you never test VRR or BFI). Thus, use 60 Hz VSYNC OFF as a clever easy debugger for frame-present timing precision if you don't have 144Hz or VRR or BFI!

mdrejhon commented Sep 29, 2020

For more advanced reading about the presenter thread idea, please read the comments section of the pull request

LINK: Pull Request Talk: "Variable BFI" on Presenter Thread Idea

There are MANY, MANY ideas there too -- including why the presenter thread is a big universal future-proofing move for RetroArch for all non-fixed-60Hz workflows (including VRR and BFI, plus future workflows such as beamracing).

However, please be warned, they are BIG WALLS OF TEXT there. This github item simplifies it into an easier-to-read algorithm.

Also, for those unfamiliar, this github item is a useful improver / pre-requisite for all the following:

However, you don't need to understand those fully in order to implement this github item; the Present Thread concept.

mdrejhon commented Oct 2, 2020

Napkin Exercise: Use Cases of a Universal Precision Frame Pacing Thread

There will be cases where the emulator may need to execute faster/slower than the display, so in theory, the present thread or another thread may need to provide synchronization services -- to govern the speed of the emulator at a ratio higher/lower than the actual display refresh cycle itself -- for different reasons.

(Note: For RunAhead workflows, the "x speed emu execute" applies only to the final frame of the RunAhead per emulator frame. All other rewound RunAhead frames can run at max speed in all workflows below)

  • Classical VSYNC 60 Hz Operation: fastest speed emu execute for 1/60sec; 1 hardware vsync per emu vsync
  • Hardware beamrace 60Hz emu onto hardware 60Hz: 1x speed emu execute for 1/60sec; 1 hardware vsync per emu vsync
  • Hardware beamrace 60Hz emu onto 120Hz: 2x speed emu execute for 1/60sec; 1 hardware vsync per emu vsync
  • Hardware beamrace 60Hz emu onto 240Hz: 4x speed emu execute for 1/60sec; 1 hardware vsync per emu vsync
  • Software beamrace 60Hz emu onto 240Hz: 1x speed emu execute for 1/60sec; 4 hardware vsync per emu vsync
  • Software beamrace 60Hz emu onto 360Hz: 1x speed emu execute for 1/60sec; 6 hardware vsync per emu vsync
  • Full screen global BFI 60Hz emu onto 120Hz: fastest emu execute for 1/60sec; 2 hardware vsync per emu vsync
  • Full screen global BFI 60Hz emu onto 180Hz: fastest emu execute for 1/60sec; 3 hardware vsync per emu vsync
  • Full screen global BFI 60Hz emu onto 240Hz: fastest emu execute for 1/60sec; 4 hardware vsync per emu vsync
  • (VRR) Classical 60 fps Operation on >60Hz+ VRR/triplebuffer/etc: fastest speed emu hsync for 1/60sec; 1 simulated "hardware" vsync per emu vsync
  • (VRR) Hardware beamrace 60Hz emu onto 180fps on >180Hz+ VRR: 1x speed emu hsync for 1/60sec; 1 simulated "hardware" vsync per emu vsync
  • (VRR) Hardware beamrace 60Hz emu onto 240fps on >240Hz+ VRR: 1x speed emu hsync for 1/60sec; 1 simulated "hardware" vsync per emu vsync
  • (VRR) Software beamrace 60Hz emu onto 180fps on >180Hz+ VRR: 1x speed emu hsync for 1/60sec; 3 simulated "hardware" vsync per emu vsync
  • (VRR) Software beamrace 60Hz emu onto 240fps on >240Hz+ VRR: 1x speed emu hsync for 1/60sec; 4 simulated "hardware" vsync per emu vsync
  • (VRR) Software beamrace 60Hz emu onto 300fps on >300Hz+ VRR: 1x speed emu hsync for 1/60sec; 5 simulated "hardware" vsync per emu vsync
  • (VRR) Full screen global BFI 60Hz emu onto 180Hz on >180Hz+ VRR: fastest emu hsync for 1/60sec; 3 simulated "hardware" vsync per emu vsync
  • (VRR) Full screen global BFI 60Hz emu onto 240Hz on >240Hz+ VRR: fastest emu hsync for 1/60sec; 4 simulated "hardware" vsync per emu vsync

Many workflows already exist (e.g. WinUAE can already hardware beamrace a VRR refresh cycle); this napkin exercise simply helps the developer think correctly about a universal futureproof workflow. It shows we need all of them: emuHz<realHz, emuHz=realHz, emuHz>realHz -- for the purposes of precision frame presenting -- for different use cases.
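
The fixed-Hz rows of the napkin exercise can be reduced to one ratio. The sketch below is only a reading of that list, not an agreed design: `pacing_plan` and its mode names are hypothetical, it assumes the display rate is an integer multiple of the emulator rate, and it does not model the VRR rows.

```python
def pacing_plan(mode, display_hz, emu_hz=60):
    """Return (emulator speed, hardware vsyncs per emulator vsync).

    Speed is a multiplier of real time, or the string "fastest".
    """
    ratio, remainder = divmod(display_hz, emu_hz)
    if ratio < 1 or remainder:
        raise ValueError("sketch only handles display_hz as an integer multiple of emu_hz")
    if mode == "hardware_beamrace":
        # Emulator raster chases the real scanout of a single refresh cycle.
        return (ratio, 1)
    if mode == "software_beamrace":
        # Emulator runs in real time; its raster spans several refresh cycles.
        return (1, ratio)
    if mode == "bfi":
        # Emulator finishes early; one visible refresh, the rest black.
        return ("fastest", ratio)
    raise ValueError(f"unknown mode: {mode}")


plan_hw_120 = pacing_plan("hardware_beamrace", 120)  # (2, 1)
plan_sw_360 = pacing_plan("software_beamrace", 360)  # (1, 6)
plan_bfi_240 = pacing_plan("bfi", 240)               # ("fastest", 4)
```

Whatever thread ends up providing the synchronization service, it would need both numbers: how fast to let the emulator run, and how many real refresh cycles make up one emulated one.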

Glossary

mdrejhon commented May 7, 2024

Now that #15299 (minor tweak) is solved...

TL;DR Version Of This Feature Request

  1. RetroArch frame-presenter (swapchain) thread made separate from emulator/renderer thread
  2. Frame presenter thread (thread responsible for swapchain) is higher priority than renderer thread.
  3. There is never a draw call (not even to draw a single pixel or erase framebuffer) in the frame presenter thread.
    (Note: For BFI, any use of black frame buffers is a pre-rendered black buffer, to prevent GPU/CPU stalls to framepacing)

This solves a hell of a lot of problems with framepacing -- and improves many algorithms. It will make perfect BFI flicker possible during CPU-stutter situations (it'd just look like a CRT stuttering -- with no modification to flicker rate).

Even those hard cores like bsnes would still BFI perfectly at 240Hz, even if bsnes runs at 57fps and stutters a bit. It may temporarily cease to be beamraceable (Lagless VSYNC #6984 that optionally might want to run in sync with beam simulators like BFIv3 #10757) but the BFI would still flicker at a constant rate like a CRT, despite the underlying stutters.

It would allow a lot of feature additions, including zero-latency emulation (lagless VSYNC like WinUAE) to make RetroArch match the latency of an FPGA (to within one frameslice, ala #6984)

This will improve reliability even further and add some magical powers.
