Emulator Latency (byuu.org)
114 points by panic on Aug 4, 2016 | 62 comments


The only real way round this is not to do the emulation on a commodity multi-user system (with all its tremendous advantages!), but in some kind of hardware or bare metal software, where you can access the original gamepads with the original sub-millisecond latency.

Still can't quite work around the latency imposed by the display, unless you go back to CRT or build your own LVDS driver board.

I had the opportunity to play a Tempest arcade machine recently, and the combination of low latency and the weighted spinning controller was really tangible.


Yes, the best thing that could happen to emulation latency reduction would be the creation of a skeleton OS framework. Something that piggybacks and can use existing Linux or BSD hardware drivers, but puts the emulator in kernel space and lets it manage the hardware as directly as is practical.

I talk about this briefly in my infeasible section. But I'm mostly referring to myself there: I just don't have the skill and the time to write my own real-time OS for this.

However, I do hope that one day we see a serious project that isn't just "lightweight Linux" (eg Lakka, etc) or something running on top of DOS.

I presume that 98% of emulator users aren't going to be willing to boot into an "emulator OS" just to play games; so it's going to be a lot of work for a little gain.

In that vein, the ultimate latency reduction would be an entire FPGA emulator, like kevtris' FPGA NES (http://kevtris.org/Projects/console/video/page1.html), but now you're approaching extreme effort for a userbase of maybe ten people :/


Massalin's Synthesis was a research OS designed for low-latency media processing. It's been decades since I read the thesis, so don't trust me to summarize it, but I think it was something like: set up layers in kernel space, but collapse them with runtime code generation (fusing loops), keep switching overhead low instead of buffering a lot, and schedule with a phase-locked loop in software.

http://valerieaurora.org/synthesis/SynthesisOS/ch1.html
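
Roughly, the loop-fusion part looks like this (a toy C++ sketch from memory, not Synthesis' actual runtime-generated code):

    #include <algorithm>

    // Layered version: each stage runs its own loop, with an intermediate
    // buffer (and an extra pass over memory) between them.
    void gain_stage(const float* in, float* tmp, int n, float g) {
        for(int i = 0; i < n; i++) tmp[i] = in[i] * g;
    }
    void clamp_stage(const float* tmp, float* out, int n) {
        for(int i = 0; i < n; i++) out[i] = std::clamp(tmp[i], -1.0f, 1.0f);
    }

    // Fused version: one loop, no intermediate buffer. Synthesis generated
    // code like this at runtime from the descriptions of the layers.
    void gain_clamp(const float* in, float* out, int n, float g) {
        for(int i = 0; i < n; i++) out[i] = std::clamp(in[i] * g, -1.0f, 1.0f);
    }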


That's a pretty amazing OS: binary fusion, etc. Something like that would be optimal for current-generation hardware; modern CPUs would benefit a lot more from it.


I think the C64 Direct-to-TV by Jeri Ellsworth was quite a success in terms of userbase. Of course the problem with making and selling hardware is increased IP scrutiny.

I think the Raspberry Pi or Beaglebone would make good candidates for baremetal emulation. The Pi has a formidable but barely documented vector processor, and the Beaglebone has the (also under-supported) PRU realtime subsystems which would be ideal for controllers and audio.

Both also have enough ecosystem to make it worth doing. Retropie is already popular, although it's a straight port of the normal userland emulators and doesn't attempt to do anything fancy with the hardware. I think there is a "market" (not the kind that pays money, but a decent userbase) which will happily dedicate a cheap SBC to booting straight into emulators.


Well, debatable on the C64 instance. Indeed, if you can manage to commercially market your design, then you could get a userbase in the thousands. But compare that to the millions of downloads something like Dolphin gets.

And the ARM-based mini systems could work for retro emulators in general, sure. Presuming someone actually writes a true bare-metal OS for them instead of just a lightweight Linux instance. But they're not fast enough for people like me that are obsessive about exacting emulation accuracy. I'm still hoping we'll get to a point where ARM-based computers are taken more seriously and we start getting a real market for products like the Jetson TX1 to compete against budget AMD systems.

All of this stuff involves sacrifices. The more you chase latency, accuracy, etc ... the more you alienate your potential market. Even when we're not looking for money, more users is such a potent motivator to validate that our efforts aren't being wasted.


Last time I used retropie it wouldn't correctly render vibrato effects in the music, so all of FFVI's wonderful score came out flat and awful. That was over a year ago, however; maybe it's been updated.


The Raspberry Pi just isn't powerful enough for cycle-accuracy with the SNES. Maybe if a wizard like blargg attempted it by writing the whole thing in pure ARM assembly, but for mere mortals like myself, as awesome as it'd be, it's just not going to happen :(


Retropie is Linux, right? That will have the same latency issues discussed. A more realtime OS is needed.


RetroPie is an emulator frontend for Linux. He was likely using Snes9x.

Also, Linux has realtime kernel patches, which help a lot. Distributions like KXStudio ship them by default.


I wonder if you can't do a jolly good job by having emulators run under Linux with high RT priority and bypassing X by using the framebuffer? You can still have your other cores doing regular multitasking stuff in the background.
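
For the curious, the "high RT priority" part boils down to something like this (the priority value is just an example, and it needs root or CAP_SYS_NICE):

    #include <sched.h>
    #include <cstdio>

    int main() {
        sched_param sp{};
        sp.sched_priority = 80;  // SCHED_FIFO priorities run 1..99
        if(sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            std::perror("sched_setscheduler");  // needs root or CAP_SYS_NICE
            return 1;
        }
        // ... emulator main loop here; it now preempts ordinary processes ...
    }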


It would help even more, certainly. With really good setups, we're battling probably around ~100ms of latency when you combine everything I talked about together. If you were to do as you said, we could maybe drop another 10-20ms of that off.


> I presume that 98% of emulator users aren't going to be willing to boot into an "emulator OS" just to play games

They could run it inside VirtualBox.


And thereby introduce more latency that they are trying to reduce in the first place?


It was a joke … I thought it was obvious.


You're not allowed to use humor on HN! This is a place for serious discussion!


> Still can't quite work around the latency imposed by the display, unless you go back to CRT or build your own LVDS driver board.

According to popular consensus, the answer was SED, which eventually lost to LCD.

https://en.wikipedia.org/wiki/Surface-conduction_electron-em...


OLED displays also have very low latency as long as there is no scaler in the way. The Oculus Rift CV1 uses them and has a motion-to-photons latency of 9 ms[0], which includes not just the display but the entire stack.

https://www.reddit.com/r/oculus/comments/43p3a6/does_anybody...


Yeah, scalers and OSDs are going to be the hardest things to give up.

I will admit ... it's painful using my ZR30w with no scaler. I can't connect my PSP's component out to it, I can't hook up my Wii U, I can't connect my XRGB Mini, my Raspberry Pi, etc ... because none of them output at 2560x1600.

Even worse, the DisplayPort connector stopped working on it (hooray, $1300 monitor quality!!), so all I have is one DVI port left, which can't even output at the 30-bit color depth which was a big part of why I wanted this monitor :(


The ironic thing about OSDs causing input lag is that the very systems we emulate prove that video overlays are trivial to implement without buffering.


There are well-known methods of significantly reducing latency (well, usually the methods are exclusive to certain emulators), but often the implication is that you're not sticking to faithful emulation of the hardware/software any more.

FWIW: I think byuu overreacted to the last few questions regarding latency-related issues.

http://board.byuu.org/phpbb3/viewtopic.php?f=8&t=1058&start=...

Apparently he won't even look at the test results...

Besides all that, while I completely agree that the easiest way to reduce latency is by getting better hardware etc, in terms of competitive gaming 1f of lag is quite a big deal at high levels of play for some games. For example, games often utilize the concept of "just frame" inputs, meaning that inputs are required on a very specific frame (the definition has been slightly relaxed in recent years). Now there are two ways to do these: muscle memory (the easiest: you just repeat the move non-stop in practice mode till it becomes second nature), and using an external cue (eg. audio/video). When it comes to muscle memory, input lag isn't that much of a deal breaker, because once the first move is inputted, everything else will follow correctly, even if the entire sequence is 1f or 2f off. However, when it comes to using external cues, even 1f of latency can cause the sequence to fail.


> I think byuu overreacted to the last few questions regarding latency-related issues.

You're right, I did. I'm sorry about that to all involved. As per my analogy, I got asked "are we there yet?" one too many times and snapped. But it wasn't the fault of the last person to ask, and they didn't deserve that. My apologies.

It wasn't just that forum post. I really have dealt with this all the time. Here's another recent one just in the past 30 days:

http://arstechnica.com/civis/viewtopic.php?p=31515637#p31515...

http://arstechnica.com/civis/viewtopic.php?p=31515881#p31515...

It's like someone with no science education watching Cosmos, and then deciding to tell Stephen Hawking how he's totally wrong about black holes. I'm not saying I am perfect; but give me some credit for my experience here, please.

> Apparently he won't even look at the test results...

You're right, I won't. My article explains why in extreme detail.


I understand your argument, but I don't think anybody is playing SNES games competitively on a non-negligible scale. Moreover, if such a game exists, it would probably fall into the pathological cases he mentioned.


The thing of course is that the so-called minority are in fact the people who raise the issue of input lag in the first place. For example, the entire Speed Running community considers this an important topic.

While they do play on "real" hardware, when runners prepare for runs they generally load up a particular scenario in an emulator and practice a particular section that may require just-frame inputs over and over, until they are comfortable enough that it can be considered a valid strategy.

So while it's true that the SR community only forms a minority of gamers, any sort of argument that relies on the experience of the majority effectively rules them out of a conversation that affects them the most.

But besides that, it could be argued that in today's day and age, SNES Speed Runners probably do form a non-negligible subgroup, considering that the majority of gamers don't play games like SMW, Metroid or Megaman any more.

For those who aren't aware, the Speed Running community has charity events that have in the past raised over $1m for cancer research etc. They may be "small", but they are far from insignificant in terms of their gaming presence.

https://gamesdonequick.com/


> The thing of course is that the so-called minority are in fact the people who raise the issue of input lag in the first place.

And I consider input lag the Achilles' heel of emulation replacing real hardware. And if it's some kind of qualifier, I've done speedrunning stuff on real hardware and emulation too (Ninja Gaiden especially.) Point being: I not only take it very seriously, I'm in the position to actually do something about it. And have. I've spent a ton of time with a lot of ideas like this. The one I wasted the most time on was probably this one: http://www.ouma.jp/ootake/delay.html

Before you consider me unreasonable on this topic, dig through all the posts about input latency on the "bsnes megathread" on ZSNES forums; the bsnes subforums on the ZSNES forums; the five years of posting history on my own forum that InMotion Hosting corrupted with a botched MySQL upgrade; the three years of new posting history on my new forum instance; and all of the discussions I've had on all of the other sites on the internet over the past twelve years; and read my article in full please.

See? Look, this is me, talking about this issue, in 2008: http://board.zsnes.com/phpBB3/viewtopic.php?p=168556&sid=f27...

See how polite I used to be on this topic? Now tell me you wouldn't be tired and agitated after twelve years of new people you've never heard of before popping up and telling you that you're doing it all wrong and they totally have a revolutionary new way to vastly improve latency.


I know about the ootake "fiasco", and to be honest, it, and everything you've been asked about input lag in the past, really doesn't matter. (If you've ever been in a support job, you'll know this is one of the first things you learn.) You implied to a poster in that thread that if he upset you, you would ban him. Now I'm not about to tell you how to run your forum (or write your emulator!), but I don't consider that very reasonable. If you don't want to answer questions about latency, why participate in the first place?

Secondly, I really don't mean any disrespect, but your article completely misses the point. Competitive gamers don't care about those types of reasons, and not because they are being obtuse or unreasonable - there exists a healthy overlap with software developers, after all. We too have been studying the topic for years.

For example, you care about faithful emulation more than anything - we don't. We don't mind playing without the sound, or turning off graphics layers, or even turning off a complete subsystem if it improves latency and maintains a faithful framerate (it's no good if it doesn't slow down when it's supposed to - see Cave Shmups - or if it runs too fast - see SF2:HF). We often have different emulators for different games simply because one handles a certain case better than another. We'll use emulators like Shmupmame that use tricks to make the input lag closer to what it's like on the arcade cab. And we certainly don't think in terms of milliseconds (and of course we understand the issues mentioned in your article...). We only think in terms of frames and everything around that. What is the FPS? How often does the game loop run? Once/multiple times every frame? How often is the game state rendered? Can the emulator process everything before the game's internal loop ends? What about if I use tricks like running @ 144Hz? What about if we alter the input handler so that it disregards everything but player 1's inputs, or "impossible" combinations, or buttons that don't exist in that game? Etc etc etc.

And of course we're no stranger to resistance from emulator developers. That's why so many forks exist. I myself have got my own forks of MAME. And, as with everything else, sometimes we are wrong. Sometimes the devs are wrong (recently: see Hunter K's discovery that shaders can in fact introduce latency, like players have been claiming for years). Often both parties are wrong and everyone learns something new. That's just the way it goes.

My point is that many of the people you interact with actually do know what they are talking about. They might not know the minutiae about the inner workings of Higan, but often this is exactly why they post on your board, as evidenced by the thread you locked. And in terms of your article, you skip the known methods already used by developers (and hackers), and you don't offer anything else besides something else we already know - getting as close to the bare metal as possible. Is there really anything that you added to the conversation, besides describing how Higan works? Don't get me wrong, it's useful for someone who doesn't know anything about the topic. But for everyone else? That's debatable.

It's completely understandable that you are tired of the subject. But in that case you should just not respond instead of alienating yourself from your users. People will figure it out eventually, or just do what they've been doing for years - use the method that works for them even though they don't fully understand it.


The article talks about the OS audio buffering (for mixing streams from multiple applications) and application audio buffering (for mixing streams from within the application), each adding 10–40 ms of latency.

Why is the application audio buffer necessary? Can't the application just send all its streams on to the OS separately to let the OS do all the mixing with a single-layer approach? Is there some unrealistically low bound on the number of possible streams or something that makes this impractical? Could that be fixed?

This seems like a simple way to save 10–40ms, without giving up mixing audio across applications (as is necessary with the "WASAPI exclusive mode" the author described).


It's much more efficient to work on an audio buffer than to do it one sample at a time, and the larger the buffer, the more efficient. An analogy would be sending 500 POST requests with a single JSON object as the body, versus a single POST request containing a JSON array of 500 objects.

Higan does allow you to reduce the application latency. But they make the point many times in the webpage that doing so can cause audio glitches, pops and distortion due to buffer underruns. My guess is they want to settle on the best default settings. Something that won't have a noticeable latency to most players, but also won't provide audio glitching when played on, say, an older laptop.


this is a really good question, and the more I think about it, the less I'm able to come up with an answer

assuming each in-application sound generator is pull-based (like the OS sound APIs typically are) surely each one could write to a single accumulator buffer which is then sent to the OS as-is, eliminating the need for any special mixing

I'd like to know more about the requirements that led to this extra in-application mixing latency

edit: obviously the presence of push-based sound generation in the application would add another layer of mixing and latency, requiring previously prepared buffers to be mixed into a signal for the OS milliseconds after the source buffers were initially filled. however, this would be a self-imposed limitation that I still don't see a reason for
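
to make this concrete, something like the following is what I have in mind (a sketch; the generator interface is made up):

    #include <vector>
    #include <functional>

    // each pull-based generator adds (+=) its samples into the shared
    // accumulator when asked; no per-generator buffers, no separate mix pass
    using Generator = std::function<void(float* accum, int frames)>;

    void fill_os_buffer(std::vector<Generator>& gens, float* out, int frames) {
        for(int i = 0; i < frames; i++) out[i] = 0.0f;  // clear accumulator
        for(auto& g : gens) g(out, frames);             // everyone adds in place
        // 'out' now holds the mixed signal, ready to hand to the OS as-is
    }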


> Why is the application audio buffer necessary?

It's more about the APIs used. With something like OSS, you just write samples to a (virtualized to many devices) /dev/dsp handle. So you have to take your background music, and each sound effect channel, and mix the values together.

But if you're writing a PC game on Windows, you'd open separate IXAudio2SourceVoice handles for each, and I'm presuming that XAudio2 will handle all the mixing and when to optimally push samples to the hardware for you. You're working at a higher level here when you can provide the hardware/OS audio API with more information.

But an emulator just gets a solid stream of pre-mixed samples. Sometimes the game software has its own mixer (like Tales of Phantasia on the SNES), and these feed into the game console's hardware mixer, which then passes a rendered stream of audio out to the speakers (hence, indirection.)

With audio, how you handle it depends on the backend API. With DirectSound and WASAPI exclusive, it acts like a ring buffer in RAM that loops around and you poll the cursor (sample) position to know when and where to write. With XAudio2 and WASAPI shared, you build up tiny blocks of samples and throw them at the sound API. In both cases, even at 44kHz, it's not a smart idea to do this for every single sample output. So you're going to have a software queue that accumulates these samples. And the size of these queues isn't just about minimizing calls to the sound API; it's also about the API's own precision and at what resolution it can report when a pushed buffer is consumed or where the cursor is at. This varies wildly between PCs. So the best you can really do is offer the user a latency selection and let them see how low they can set the value without audio distortion from misses.
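
A rough sketch of that software queue (api_submit is a stand-in for something like IXAudio2SourceVoice::SubmitSourceBuffer or a WASAPI write; the names are made up):

    #include <vector>

    struct SampleQueue {
        std::vector<float> block;
        size_t blockSize;                          // the user's latency setting
        void (*api_submit)(const float*, size_t);  // XAudio2/WASAPI push, etc

        void push(float sample) {
            block.push_back(sample);
            if(block.size() >= blockSize) {
                api_submit(block.data(), block.size());  // one call per block
                block.clear();
            }
        }
    };
    // bigger blocks mean fewer API calls and fewer underruns, but more
    // latency: at 44.1kHz, a 512-sample block alone is ~11.6ms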

A final complexity with emulation is that we can still have multiple audio streams. When you plug a Super Game Boy cart into an SNES, or a CD attachment onto a Genesis; the emulation of these usually results in two streams.

As mentioned in the article, generally you can merge two streams in software in near-realtime. Presuming both sound sources run at the same frequency, you only need one sample queued from both sources before you can mix them and send them to your sound API output buffer. Though keep in mind we are making a parallel process serial, and the context switches between components are the biggest source of slowdown in all of emulation, so sometimes you'll have quite a few samples from one component queuing up before we get any from another. It gets a bit fancier when their frequencies vary (and they do in emulation) ... and then you get to pull in anti-aliasing, resampling audio filters (I use Butterworth biquad type-2 IIR with cubic resampling in higan), so the queues have to get a tiny bit larger.
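
The cubic step, for reference, is just four-point interpolation between known samples; one common form (a generic sketch, not higan's exact code):

    // four-point cubic (Catmull-Rom) interpolation: estimate the signal
    // between samples y1 and y2 at fractional position t in [0,1)
    float cubic(float y0, float y1, float y2, float y3, float t) {
        return y1 + 0.5f * t * (y2 - y0
             + t * (2.0f*y0 - 5.0f*y1 + 4.0f*y2 - y3
             + t * (3.0f*(y1 - y2) + y3 - y0)));
    }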

On anti-aliasing, that's yet more fun ... a lot of these old game consoles produced sound in the megahertz range, so you have to strip out the frequencies above the Nyquist limit to prevent aliasing (annoying buzzing) when you resample it to rates PCs can handle (in the kilohertz range.)
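
For the curious, a single second-order section of that kind of filter looks roughly like this (standard RBJ cookbook low-pass, not higan's exact code; Q = 1/sqrt(2) gives a Butterworth response, and you cascade sections for a steeper rolloff):

    #include <cmath>

    struct BiquadLP {
        double b0, b1, b2, a1, a2;              // normalized coefficients
        double x1 = 0, x2 = 0, y1 = 0, y2 = 0;  // filter state

        BiquadLP(double cutoff, double sampleRate, double q = 0.7071) {
            double w0 = 2.0 * M_PI * cutoff / sampleRate;
            double alpha = std::sin(w0) / (2.0 * q);
            double cw = std::cos(w0);
            double a0 = 1.0 + alpha;
            b0 = (1.0 - cw) / 2.0 / a0;
            b1 = (1.0 - cw) / a0;
            b2 = b0;
            a1 = -2.0 * cw / a0;
            a2 = (1.0 - alpha) / a0;
        }
        double process(double x) {  // direct form 1
            double y = b0*x + b1*x1 + b2*x2 - a1*y1 - a2*y2;
            x2 = x1; x1 = x; y2 = y1; y1 = y;
            return y;               // input above the cutoff is attenuated
        }
    };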


What kind of "stream" did you have in mind?

For the OS to do software mixing, it is basically going to run a "for" loop over the audio buffers and add them up to produce the audio buffer that goes to the hardware. When is it going to run that loop?


Good catch ... I didn't account for that in my article.

In emulation, an audio stream is an infinitely long queue of audio samples, of which only a certain amount is known ahead of time (and that amount is itself a source of audio latency.) You usually have one, but can sometimes have more. And they can vary in their frequencies, including being pathological (the Game Boy outputs sound at 2MHz.)

In PC gaming, you probably have some kind of looping for background music; but sound effects are definitely things you can push in the smarter audio APIs (like XAudio2) all at once and have all the data available to consume whenever the audio API->OS mixer->sound card needs the data.


> What kind of "stream" did you have in mind?

The same kind the author means. I didn't coin this term or introduce it to the discussion.

> For the OS to do software mixing, it is basically going to run a "for" loop over the audio buffers and add them up to produce the audio buffer that goes to the hardware. When is it going to run that loop?

Right, the author is saying that the OS is already doing this, and additionally the application does the same thing before sending a combined stream to the OS. So to answer your question of "when is [the OS] going to run that loop?": the same time it does now. I'm not proposing any changes to how the OS works.

My question is: why does the application need to do that work? Why can't it send each stream to the OS to have them be combined in only one place?


Ah, I see what you are asking. Sorry I misunderstood.

I doubt that the application mixing is actually adding any latency compared to if the application only had a single logical stream internally.

Regardless of the application's internal structure, you will at every moment have a set of audio buffers:

1. the one the sound card is currently playing

2. the one the OS is preparing for the sound card (it gets one buffer's worth of time to synthesize this).

3. the one the application is preparing for the OS (it gets one buffer's worth of time to synthesize this).

If the OS isn't doing any mixing, then you could make buffer 2 and 3 the same and save a buffer's worth of latency. But if the OS is doing mixing, then it needs a chance to add up all the application buffers before they go to the sound card (that is its "synthesize" step). So the application can't be writing directly into the OS's buffer.
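
To put hypothetical numbers on it: with 1024-sample buffers at 48kHz, each buffer holds 1024 / 48000 ≈ 21.3ms of audio, so the three buffers in flight account for roughly 64ms before a sample ever reaches the speaker.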

You might ask "why can't the application and OS both do their work within a single buffer's worth of time?"

Hmm, I guess it is an interesting question whether the OS could, inside the write() call, do the mixing immediately. I'm reaching the point where I'd have to speculate: I'm not exactly sure how existing OSs design their mixing and whether this would be feasible or not.


> My question is: why does the application need to do that work? Why can't it send each stream to the OS to have them be combined in only one place?

Two reasons.

One, the audio API used is a simple push buffer design (eg OSS) and so you have to do your own mixing first. XAudio2 and friends will spare you this detail.

Two, you're emulating a piece of gaming hardware that has its own internal mixer. And when that internal mixer doesn't have enough sound channels, the video games running on said emulated system can include their own software mixers to make it even more fun.


An alternative is to do mixing on the sound card, but in that case you still need enough audio to fill up the hardware buffers.


Great article.

>This process can incur quite a bit of CPU time as well. Attempting to poll the keyboard state, mouse state, and all attached gamepads can easily eat several milliseconds per call. So it's just not possible to poll every millisecond.

If input software layers mean it's not possible to poll every millisecond, why bother polling the hardware at 1kHz? Is it a just-in-case solution to increase hardware polling to maximum?

I'm also curious if there are any harder numbers available, maybe by triggering a USB key input and measuring time for a test program to register the change. I know this sort of thing is done to compare CRTs and LCDs, I've never seen it done for a whole PC.


Thank you! I was really worried about the tone being too harsh or know-it-all. Wasn't really meant for audiences that weren't aware of the context. But as an emulator author, you get people presenting new zany latency reduction ideas that defy the laws of physics all the time, and they completely dismiss your own experience in the field, and it's like kids constantly saying, "are we there yet?" on a long car ride. Eventually you lose your cool, and then well ... you sound like me in that article >_>

> why bother polling the hardware at 1kHz? Is it a just-in-case solution to increase hardware polling to maximum?

Yes, pretty much. More of a because-we-can and to combat the cumulative effects of latency (death by a thousand papercuts) however possible. If you were to push it to 200Hz (5ms), then it becomes possible that your OS API returns states immediately after and it stacks with your emulator latency of 5ms to form a 10ms latency. Push it to 1000Hz and that drops to a 6ms maximum latency.

It is indeed silly. No one is going to perceive a worst-case 4ms difference. (And I say worst-case because these sorts of misses tend to average between best-case and worst-case, so in practice it's probably half that bad.)

We're trying to chase the latency of emulating a gamepad that you literally tell exactly when to poll its inputs, and which, within mere cycles on a 21MHz clock, starts reading out the results from its shift register.

> I'm also curious if there are any harder numbers available

That would be fun. I'll admit that many of the numbers are estimated. In the end, we can only observe the net total of all latency by pressing a button and seeing how quickly the sprites respond visually and aurally. But it's probably possible to isolate similar test cases for each source of latency. In a lot of cases (kernel audio mixing, keyboard responses), we're probably talking much smaller latencies than CRT vs LCD monitors, so you'd need a huge amount of precision.


> as an emulator author, you get people presenting new zany latency reduction ideas that defy the laws of physics all the time, and they completely dismiss your own experience in the field, and it's like kids constantly saying, "are we there yet?" on a long car ride.

Yep, that seems to be the usual for nearly every emulator scene.

> But it's probably possible to isolate similar test cases for each source of latency. In a lot of cases (kernel audio mixing, keyboard responses), we're probably talking much smaller latencies than CRT vs LCD monitors, so you'd need a huge amount of precision.

I was thinking of a test program that inverts/cycles the screen colour on registering a keypress. Put a photodiode in front of the monitor and use an electrical switch across the controller button contacts. Cheap parts and an oscilloscope will give microsecond resolution for the end-to-end net case. Unfortunately not everyone has this kind of setup lying around.
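
The software half is only a few lines; something like this SDL2 sketch would do (a hypothetical test program, vsync'd so the display path is part of the measurement):

    #include <SDL.h>

    // flip the whole screen between black and white on each keypress; a
    // photodiode on the panel plus a switch across the button contacts
    // gives the end-to-end time on a scope
    int main(int, char**) {
        SDL_Init(SDL_INIT_VIDEO);
        SDL_Window* w = SDL_CreateWindow("latency", 0, 0, 640, 480, SDL_WINDOW_SHOWN);
        SDL_Renderer* r = SDL_CreateRenderer(w, -1, SDL_RENDERER_PRESENTVSYNC);
        bool white = false;
        for(SDL_Event e;;) {
            while(SDL_PollEvent(&e)) {
                if(e.type == SDL_QUIT) return 0;
                if(e.type == SDL_KEYDOWN) white = !white;
            }
            Uint8 v = white ? 255 : 0;
            SDL_SetRenderDrawColor(r, v, v, v, 255);
            SDL_RenderClear(r);
            SDL_RenderPresent(r);
        }
    }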

I seem to remember someone marketing a commercial device that did something like this for timing CRTs vs LCDs but can't remember the name of it.


> I was thinking of a test program that inverts/cycles the screen colour on registering a keypress. Put a photodiode in front of the monitor and use an electrical switch across the controller button contacts. Cheap parts and an oscilloscope will give microsecond resolution for the end-to-end net case.

This, essentially, is one of the ways we measure end-to-end touch latency on modern phones.



Thanks :)


If the OS polls at 1kHz, then when the process reads the state from the OS, the OS will give the process data that is at most 1ms old. If the OS polls at 125Hz, the data that the process reads from the OS can be up to 8ms old.


In terms of competitive gaming, most serious players only care about the input lag that they can't control (they typically try to use hardware similar to the tournament standard).

So for example in the Street Fighter community, players are really upset that the inherent input lag in SFV is much higher compared to previous versions.

http://www.eventhubs.com/news/2016/may/08/community-begins-r...

This has an impact on how the game is played.


Been following byuu since my rom hacking days. I always stop by to read his articles and this is another good one.


If the input is steady, could you not do some sort of predictive rendering where a frame or more is prerendered?


Well, I never felt that ZSNES on DOS (on a 486) was unresponsive compared to my NES clone, when I was playing Super Mario 3 on the NES and on ZSNES.


In case you aren't trolling, the reason you are getting downvoted is probably because you are comparing apples to oranges. ZSNES is not an accurate emulator. byuu's goal is to emulate the systems perfectly. Here is an article he wrote about it:

http://arstechnica.com/gaming/2011/08/accuracy-takes-power-o...

So yes, ZSNES feels a lot more responsive than higan. But that's because it cuts a lot of corners with regards to accuracy of emulation. It is probably the least accurate emulator for this reason. It prioritizes speed. In light of that, your comment doesn't really add anything to the discussion.


If you can run higan at 100% speed, then ZSNES has more lag than higan does (due to double buffering, polling only once per frame, etc.)

For me, the issue with the OP is a) comparing to clone NES hardware, and b) it's subjective. Different people react differently to latency. The numbers I talk about may be largely estimated, but they're objectively real latencies that really do exist, even if you can't observe them personally.

It's a very good thing if you can't. Makes gaming a lot more fun. Testing myself, I don't really detect a change in latency under emulation alone until I simulate adding about 75ms more than is already there. However, I did notice a problem when I moved from playing Ninja Gaiden Trilogy (I know, bad port) on my SNES to higan: all of my timed moves were failing (it's a game that requires pixel-perfect movements); but I adapted pretty quickly and was able to beat the game. But again, this is all subjective stuff, so it's not really adding to the technical discussion any.


> If you can run higan at 100% speed, then ZSNES has more lag than higan does

Yes, I should have said, "ZSNES feels a lot more responsive than higan on less powerful systems."


Perhaps the delay from user input to actual action on the screen should be considered when we talk about how accurate emulators are. After all, it's something that can noticeably affect how games actually play.


Neither were ESNES or SNES96; they also weren't very accurate.

Were you playing SMB3 on NESticle as well? Because accurate NES emulation didn't happen until Nestopia.


I've just thought up a conceptually simple but tremendously difficult to implement and CPU-consuming way of working round this: avoid latency by seeing into the future.

Modern GPUs give you a ton of parallel processing. Old gamepads are binary with a fairly limited number of buttons not all of which can be pressed at once by someone with two thumbs. The emulated RAM state is also fairly small - kilobytes.

So, run lots of copies of the emulator: one for each possible change in button press status from the current state. At each frame (50/60Hz), look at the current actual inputs and pick the matching precomputed (frame, 20ms of audio samples) pair from the available choices. Start calculating the next frames based on the winning version of the emulator state, and discard the rest.

(This is effectively branch prediction at the macrostate level).
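
In sketch form (the Emulator interface here is entirely hypothetical):

    #include <cstdint>
    #include <vector>

    struct Emulator {                  // hypothetical core interface
        Emulator clone() const;        // deep-copies the full machine state
        void run_frame(uint16_t pad);  // advance one frame with this input
    };

    // before input arrives: fork one copy per possible pad state and run it
    std::vector<Emulator> speculate(const Emulator& now,
                                    const std::vector<uint16_t>& pads) {
        std::vector<Emulator> futures;
        for(uint16_t pad : pads) {
            futures.push_back(now.clone());
            futures.back().run_frame(pad);  // ideally in parallel across cores
        }
        return futures;
    }

    // when real input arrives: keep the copy that guessed right, drop the rest
    Emulator& select(std::vector<Emulator>& futures,
                     const std::vector<uint16_t>& pads, uint16_t actual) {
        for(size_t i = 0; i < pads.size(); i++)
            if(pads[i] == actual) return futures[i];
        return futures.front();  // unreachable if 'pads' covers every state
    }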


Mentioned in the article:

> There are magic tricks beyond this, such as emulating every possible input one frame into the future, to cut out a single frame of latency. But with only one controller, this would require higan to emulate up to 4096 simultaneous SNES systems and well ... higan just isn't that fast, sorry.


Why does it need to emulate every possible input, rather than emulating a single path assuming that the controller state remains unchanged? You can only display one future frame to the user, after all, and that seems like the best one to display.


I realize a lot of the 4096 states are extremely unlikely (especially Up+Down or Left+Right ... not physically possible on an unmodified controller); but if the inputs you predict end up wrong, then you have no choice but to run the frame again normally. This is going to cause an extreme jittering where the input lag doubles for some frames.

Imagine audio stuttering from a scratched CD, or a framerate that suddenly dips from 60fps to 30fps for a moment and then resumes. With input, the effect is going to be even more jarring.

If you're going to do this, you absolutely can't have a miss. Ever.


Yep, makes total sense to me now. I was thinking of optimistic prediction, not deterministic precomputation.


The point would be to render the frame(s) before the input arrives, to reduce latency. You can only display one frame, but you don't know which frame until the user actually inputs something.


Oooh, I see! I was thinking of running the simulation forward by N frames optimistically, and then fixing it up when something else comes in, but the idea here is just to precompute every possible result. Nifty!


Missed that, it's near the bottom. :(


I do wonder how many states simpler systems like the GB or NES would have, since they have fewer buttons, simpler audio, and less RAM. Still probably a tough call -- for an accuracy-focused emulator that requires a lot of processing power, this sort of solution might be out of reach for some years. You could probably do it on less accurate emulators, but then you're already relying on hacks & cheats, so you might as well use other solutions for lowering input lag.


I think it's just barely possible for the fastest NES and GB emulators to use this trick for one single frame of lag reduction.

Here, you have eight inputs (assuming one player on the NES.) But there's a trick: up+down and left+right aren't physically possible due to a rocker in the D-pad. So as a whole, there are nine possible D-pad states instead of sixteen. With the four buttons, that's sixteen possible states. So the total is 144 states.
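
A quick sanity check on that count (toy code, not from any emulator):

    #include <cstdio>

    // enumerate all 2^8 raw pad bitmasks, dropping the combinations a stock
    // D-pad rocker can't produce (up+down, left+right); prints 144
    int main() {
        enum { Up = 16, Down = 32, Left = 64, Right = 128 };
        int valid = 0;
        for(int s = 0; s < 256; s++) {
            if((s & Up) && (s & Down)) continue;
            if((s & Left) && (s & Right)) continue;
            valid++;
        }
        std::printf("%d\n", valid);
    }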

Now look at QuickNES, blargg's masterpiece NES emulator that runs fullspeed on a 66MHz PowerPC system. If you had the fastest, overclocked octacore CPU Intel makes, you may be able to pull off 144 instances of the emulation. Especially since you don't have to actually output the audio and video for all the missed predictions to the real hardware.

Or perhaps more sanely ... most games don't have you using start+select while playing. I think people would be willing to accept a large jitter when pressing those two buttons. So now you're talking 36 instances, which starts to look very reasonable. At least, for emulators that are okay with sacrificing accuracy.

The most popular NES emulators today require 800MHz (Nestopia), 1.6GHz (Nintendulator), and I'm not sure on puNES, but presumably just as high. 36 instances of those are unlikely.

Still, this would be a fun experiment if anyone were willing! :D

Of course, the end result is a 16ms lag reduction. And again, I need to stress ... I get that latency stacks, but ... you're really going to have a hard time even telling the difference between the two. Even if you're one of those people who can beat Dodonpachi or Ninja Gaiden or Battletoads on one life.



