kierank's comments

FFmpeg is developed almost entirely by volunteers. We have no "customers".


There are people who use and depend on ffmpeg. Maintainers seem to go out of their way to solve issues these folks face. If you don't care, then ignore the bug reports and force them to solve their own problems by contributing.


These people are not customers though. The maintainers do their best, but overall the project seems to be understaffed, so customers (for example Google, as it seems they occasionally chip in) get priority.


Then you and your security friends will create lots of FUD about FFmpeg being "insecure" with lots of red and the word "critical" everywhere.


Why complain about pressure from customers then?


We in FFmpeg need all the instructions, and we often need to do register allocation by hand.


Absolutely fair! FFmpeg does fall into the category of scenarios where skipping to the very last mile optimizations is reasonable. And thank you for your work on FFmpeg!

Most code paths out there aren't like that, however, and compilers are not too bad at instruction selection nowadays (you'd be right to mention that they sometimes have odd regressions; I've definitely seen that be a problem in LLVM, GCC and RyuJIT).


I am the author of these lessons.

Ask me anything.


As a user of an ARM Mac, I wonder: how much effort does it take to get such optimized code to work the same on all platforms? I guess you must have very thorough tests and fallback algorithms?

If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?


> If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?

Not ported, but rather re-implemented. So: yes.

A bit more detail: during build, on x86, the FFmpeg binary would include hand-written AVX2 (and SSSE3, and AVX512, etc.) implementations of CPU-intensive functions, and on Arm, the FFmpeg binary would include hand-written Neon implementations (and a bunch of extensions; e.g. dotprod) instead.

At runtime (when you start the FFmpeg binary), FFmpeg "asks" the CPU what instruction sets it supports. Each component (decoder, encoder, etc.) - when used - will then set function pointers (for CPU-intensive tasks) which are initialized to a C version, and these are updated to the Neon or AVX2 version depending on what's included in the build and supported by this specific device.

So in practice, all CPU-intensive tasks for components in use will run hand-written Neon code for you, and hand-written AVX2 for me. For people on obscure devices, it will run the regular C fallback.
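
A minimal sketch of that dispatch pattern in C. The names here (dsp_context, add_bytes, CPU_FLAG_AVX2) are illustrative, not FFmpeg's actual API, and the "AVX2" function is a plain-C stand-in for what would really be hand-written assembly:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define CPU_FLAG_AVX2 (1 << 0)  /* hypothetical feature bit */

    /* Plain C fallback: always present, always correct. */
    static void add_bytes_c(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    }

    /* In FFmpeg this would be hand-written assembly in a separate file;
       here it is a C stand-in so the example stays self-contained. */
    static void add_bytes_avx2(uint8_t *dst, const uint8_t *src, size_t n)
    {
        add_bytes_c(dst, src, n);
    }

    struct dsp_context {
        void (*add_bytes)(uint8_t *dst, const uint8_t *src, size_t n);
    };

    /* Called once when the component is set up. */
    static void dsp_init(struct dsp_context *dsp, int cpu_flags)
    {
        dsp->add_bytes = add_bytes_c;         /* default: C version */
        if (cpu_flags & CPU_FLAG_AVX2)
            dsp->add_bytes = add_bytes_avx2;  /* upgrade if the CPU allows */
    }

    int main(void)
    {
        uint8_t dst[4] = {1, 2, 3, 4}, src[4] = {10, 20, 30, 40};
        struct dsp_context dsp;
        dsp_init(&dsp, CPU_FLAG_AVX2);        /* flags would come from a CPUID-style query */
        dsp.add_bytes(dst, src, 4);           /* runs whichever version was selected */
        printf("%u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);
        return 0;
    }

The real initialization in FFmpeg works broadly along these lines, with av_get_cpu_flags() supplying the flags and per-component init functions filling in the pointers.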


While the instructions are different, every platform will have some implementation of the basic operations (load, store, broadcast, etc.), perhaps with a different bit width. With those you can typically write an accelerated baseline implementation (sometimes these are autogenerated or use some sort of portable intrinsics, but usually they don't). If you want to go past that, things get more complicated and you will have specialized algorithms for what is available.
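
For a concrete flavour of such a baseline (my own sketch, not anything from the comment above), here is byte-wise addition written with GCC/Clang vector extensions, one form of portable SIMD; the 16-byte width and the function name are assumptions:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    typedef uint8_t v16u8 __attribute__((vector_size(16)));

    void add_bytes_portable(uint8_t *dst, const uint8_t *src, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            v16u8 a, b;
            memcpy(&a, dst + i, 16);   /* load */
            memcpy(&b, src + i, 16);   /* load */
            a += b;                    /* element-wise add (wraps on overflow) */
            memcpy(dst + i, &a, 16);   /* store */
        }
        for (; i < n; i++)             /* scalar tail for leftover bytes */
            dst[i] += src[i];
    }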


Hi, thanks for your work!

I have a question: as someone who can just about read assembly but still doesn't intuitively understand how to write it, or how to decompose ideas in order to utilise assembly, do you have any suggestions for learning / improving this?

As in, at what point would someone realise that a thing can be sped up by using assembly? If one found a function that would be really performant in assembly, how do you go about writing it? Would you take the output from a compiler that's been converted to assembly, or would you start from scratch? Does it even matter?


You're looking for the tiniest blocks of code that are run an exceptional number of times.

For instance, I used to work on graphics renderers. You'd find the bit that was called the most (writing lines of pixels to the screen) and try to jiggle the order of the instructions to decrease the number of cycles used to move X bits from system RAM to graphics RAM.

When I was doing it, branching (usually checking an exit condition on a loop) was the biggest performance killer. The CPU couldn't queue up instructions past the check because it didn't know whether it was going to go true or false until it got there.
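
As an illustrative sketch (my addition, not the original renderer code): unrolling a pixel-copy loop so the exit-condition branch is taken once per four pixels instead of once per pixel is the kind of instruction-jiggling described above.

    #include <stdint.h>
    #include <stddef.h>

    void copy_pixels(uint32_t *dst, const uint32_t *src, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {   /* one loop branch per four stores */
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }
        for (; i < n; i++)             /* tail: the remaining 0-3 pixels */
            dst[i] = src[i];
    }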


Don't modern (or even just not-ancient) CPUs use branch prediction to work past a check, knowing that the vast majority of the time the check yields the same result?


All the little tricks that the CPU has to speed things up, like branch prediction, out-of-order execution, parallel branch execution, etc., mostly cost more than simply not having to rely on them in the first place. Branch prediction in particular is not something that should be relied on too heavily, since it is actually quite a fragile optimization that can cause relatively large performance swings from seemingly meaningless changes to the code.
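
A small hypothetical example of "not relying on them in the first place": replacing a hard-to-predict branch with branch-free arithmetic, so there is nothing to predict (the function names are mine):

    #include <stdint.h>

    /* Branchy: pays a misprediction whenever the comparison is unpredictable. */
    int32_t clamp_branchy(int32_t x, int32_t hi)
    {
        if (x > hi)
            return hi;
        return x;
    }

    /* Branch-free: build an all-ones/all-zeros mask from the comparison and
       blend the two values with bitwise ops, avoiding the branch entirely. */
    int32_t clamp_branchless(int32_t x, int32_t hi)
    {
        int32_t mask = -(int32_t)(x > hi);     /* 0 or -1 (all ones) */
        return (hi & mask) | (x & ~mask);
    }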


Branch prediction is great for predictable branches, which is often what you have, or a good approximation to it. I forget the exact criteria, but even quite old chips could learn, e.g., all repeating patterns of length up to 4, most repeating patterns of length up to 8 and fixed-length loop patterns (n YESes followed by 1 NO) of any length.

Quite often, though, you don't have predictable branches, and then you'll pay half the misprediction cost each time on average. If you're really unlucky, you could hit inputs where the branch predictor gets it wrong more than 50% of the time.


The best answer to your question is some variant of "write more assembly".

When someone indicates to me that they want to learn programming, for example, I ask them how many programs they've written. The answer is usually zero, and in fact I've never even heard a number greater than 10. No one will answer a larger number, because that selects out the people who would even ask the question. If you write 1000 programs that solve real problems, you'll be at least okay. 10k and you'll be pretty damn good. 100k and you might be better than the guy who wrote the assembly manual.

For a fun answer, this is a $20 nand2tetris-esque game that holds your hand through creating multiple CPU architectures from scratch, with verification (similar to Prolog/VHDL), plus your own assembly language. I admittedly always end up writing an assembler outside of the game that copies to my clipboard, but I'm pretty fussy about UX and prefer my normal tools.

https://store.steampowered.com/app/1444480/Turing_Complete/


This is one heck of a question.

I don't know assembly, but my advice would be to take the rote route by rewriting stuff in assembly.

Just like anything else, there's no quick path to the finish line (unless you're exceptionally gifted), so putting in time is always the best action to take.


What's your perspective on variable-width SIMD instruction sets (like ARM SVE or the RISC-V V extension)? How does developer ergonomics and code performance compare to traditional SIMD? Are we approaching a world with fewer different SIMD instruction sets to program for?


Variable-width SIMD can mostly be written using the exact same Highway code; we just have to be careful to avoid things like arrays of vectors and sizeof(vector).

It can be more complicated to write things which are vector-length dependent, such as sorting networks or transposes, but we have always found a way so far.

On the contrary, there is an increasing number of ISAs, including the two LoongArch extensions LSX/LASX, AVX-512 (which is really, really good on Zen 5), and three versions of Arm SVE. RISC-V V also has lots of variants and extensions. In such a world, I would not want to have to maintain per-platform implementations.


How does FFmpeg generate SEH tables for assembly functions on Windows? Is this something that x86asm.inc handles, or do you guys just not worry about it?


As someone who wrote x86 optimization code professionally in the '90s, I have to ask: do we still need to do this manually in 2025?

Can we not just write tests and have some LLM try 10,000 different algorithms and profile the results?

Or is an LLM unlikely to find the optimal solution even with 10,000 random seeds?

Just asking. Optimizing x86 by hand isn't the easiest, because to think it through you have to try to fit all the registers in your mind and work through the combinations. You also need to know how long each instruction combination will take, and some of these instructions have weird edge cases that run vastly slower or faster, which is hard for a human to take into account.


I guess your question could be rephrased as "couldn't we come up with better compilers?" (LLM-based or not, brute force based or not).

I don't have an answer but I believe that a lot of effort has been put in making (very smart) compilers already, so if it's even possible I doubt it's easy.

I also believe there are some cases where it's simply not possible for a compiler to beat handwritten assembly: indeed there is only so much info you can convey in a C program, and a developer who's aware of the whole program's behavior might be able to make extra assumptions (not written in the C code) and therefore beat a compiler. I'm sure people here would be able to come up with great practical examples of this.


While using a LLM might not be the best approach, it would be interesting to know if there are some tools these days that can automate this.

Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.

One potential pitfall I can see is that there are a lot of non-local interactions in modern systems. We have large out-of-order buffers, many caching layers, complex branch predictors, an OS running other tasks at the same time, and a dozen other things.

What is optimal on paper might not be optimal in the real world.


> Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.

There are optimization libraries which can find the optimum combination of parameters for an objective, like Optuna.

It would be enough to expose all the optimization knobs that LLVM has, and Optuna will find the optimum for a particular piece of code on a particular test payload.



Collaborators have actually superoptimized some of the more complicated Highway ops on RISC-V, with interesting gains, but I think the approach would struggle with largish tasks/algorithms?


I have tried with Grok 3 and Claude. They both seem to have an understanding of the algorithms and data patterns, which is more than I expected, but then they just guess a solution that's often nonsensical.


You would need to be very careful about verifying the output. Having an LLM generate patterns and then running them through a SAT solver might work, but usually it's only really feasible for short sequences of code.


Hacker News is such a cool website.

Hi, thank you for writing this!


It's crazy that you have to use this custom "embedded" tooling when the vendor should be implementing support in vanilla Linux distros.


It is not "custom embedded tooling"! It is tooling you run on your main machine to build a custom distro. Imagine you follow the "Linux from Scratch" tutorial, then start writing scripts to automate parts of that, and eventually create a framework that allows you to create a custom Linux from Scratch. After years of work you may end up with something that looks like Yocto.

The whole point of using Yocto is that you want a custom distro. You could build a totally "standard" distro with Yocto but... at this point you can also just use Gentoo or Debian or whatever works.


I agree that vendors should upstream, but into packages that are upstream of even the distros. It's not like x64 where the same binaries can support multiple systems. My product OS can't even boot on the chip maker's devkit and it's impossible to make a binary that works on both.

A vanilla distro doesn't want to support a platform with a few hundred thousand units and maybe a few dozen people on the planet that ever log into anything but the product's GUI. That's the realm of things like OpenWRT, and even they are targeting more popular devices.

I understand the hobbyist angle, and we don't stand in their way. But it's much cheaper to buy an SBC with a better processor. For the truly dedicated, I don't think expecting the skill level of someone who can take our Yocto layer and apply it on top of the reference design is asking too much.


This is way harder to do with SBCs than you would think. You don't have a BIOS.


There's a lot more to Yocto than just building the kernel. It's still useful when kernel support is upstreamed, such as including vendor tooling and test programs.

Upstreaming also takes a very long time and is usually incomplete. Even when some upstream support is available you will often have to use the vendor specific kernel if you want to use certain features of the chip.

Nobody can wait around for upstream support for everything. It takes far too long and likely won't ever cover every feature of a modern chip.


This comment makes zero sense. It's a meta-distribution: it builds a custom one for you. Professional custom embedded distros are a different beast altogether from the vanilla distros.


This is premature optimisation. The bus bandwidth and latency needed to get a few Mbps of compressed video to the PC is microscopic. It's completely unnecessary to lock yourself into NVIDIA just to create some UDP packets.


I was talking about Nvidia's Cloud gaming offer (GeForce Now). For them it's certainly not a premature optimization.


Because latency is a distribution, and these photos are often selected at the best-case P0 end of all the encode/decode processes, whereas what actually matters is the worst-case P99.

A proper implementation will make sure the worst-case latency is accounted for and not cherry-pick the best case.
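
For instance, a tiny sketch (my own, not from the post) of reporting the worst-case tail rather than the best case: measure per-frame latencies and quote a high percentile instead of the minimum.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Nearest-rank percentile of n latency samples; sorts in place. */
    static double percentile(double *samples, size_t n, double p)
    {
        qsort(samples, n, sizeof(*samples), cmp_double);
        size_t idx = (size_t)((p / 100.0) * (n - 1) + 0.5);
        return samples[idx];
    }

    int main(void)
    {
        double lat_ms[] = {28.7, 30.1, 29.5, 31.0, 95.2, 30.4, 29.9, 30.2};
        size_t n = sizeof(lat_ms) / sizeof(lat_ms[0]);
        printf("best case: %.1f ms, P99: %.1f ms\n",
               percentile(lat_ms, n, 0.0), percentile(lat_ms, n, 99.0));
        return 0;
    }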


Backstage feeds go in uncompressed via HD-SDI and have a latency measured in lines (microseconds to single-digit milliseconds). Totally different ballgame.


I wrote a blog post about how numbers like "30ms latency" are thrown around called "How to lie about latency": https://www.obe.tv/how-to-lie-about-latency/

It's left as an exercise to the reader which methods of lying are being used in this case.


I actually saw numerous issues with segments being one frame out of sync with the rest.


Imagine Ethernet was designed like this and you had to implement mandatory congestion control and other cruft. The layer of the stack that has knowledge of the content should be implementing the congestion control.


Ethernet was designed like this :-)

https://en.m.wikipedia.org/wiki/IEEE_802.2

Back in the olden times, Windows 3.11 and Windows 95 could even run services atop it using another protocol: https://en.m.wikipedia.org/wiki/NetBIOS_Frames

In fact, the link-layer protocols that dealt with congestion and media access more meticulously, e.g. Token Ring or FDDI, were generally much more efficient in their use of the medium, reaching near 100% of the potential bandwidth, whereas Ethernet on a shared medium already gets quite inefficient at 30-40% utilization due to retries caused by collisions.

However, the additional complexity (and thus higher cost and harder troubleshooting) was such that the much simpler and dumber Ethernet won.

That said, a lot of the same principles live on in the Fibre Channel family of protocols, which are still used in some data-center-specific applications.

