I chose the filesystem paths organically as I was iterating on it: starting from the idea of having a root dir and a single, unique "home" dir.
I also wanted to allow the visitor to hack around stuff like the system config through the filesystem (as if updating the `~/.config/` or `/etc/` directories in the linux world), so I also planned to have the desktop's own config files reachable in there and make it user-editable (though for now that's only the case for the list of desktop icons).
I found out that having a few root directories that were fully virtual[1] (actually in-memory) and marked as "read-only", while the others were real IndexedDB-backed storage[2], was very simple to manage.
In the end it seemed to me like a much simplified unix-like way of doing things, which is what I'm most used to.
I'm not that familiar with windows so I'm not sure why it also looks windows-like though :D, is it about the naming of paths?
[1] `/apps` for apps and `/system32` for default configs - the name was `/system` at first, then I thought renaming it `system32` instead was "funny" (as the directory name became kind of a meme), so I did that! Both store "files" that are actually JS objects that can be stringified to JSON when reading them.
[2] For now `/userconfig` for the desktop's own updatable config and `/userdata` for the "home" directory. Something like `/appconfig` is also planned, though I'm not sure yet. In contrast, those contain data really stored locally, in ArrayBuffer form (a blob of bytes).
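To give a rough idea of how simple the resulting dispatch can be, here is a minimal sketch of that mount logic; it is NOT the project's actual code, and names like `READ_ONLY_ROOTS`, `readFile` and `readFromIndexedDB` are purely illustrative:

```ts
// Hypothetical sketch of the mount logic described above.
const READ_ONLY_ROOTS: Record<string, Record<string, object>> = {
  "/apps": {},      // app descriptors kept as plain JS objects
  "/system32": {},  // default config objects
};

async function readFile(path: string): Promise<string | ArrayBuffer> {
  const root = "/" + path.split("/")[1];
  const virtualDir = READ_ONLY_ROOTS[root];
  if (virtualDir !== undefined) {
    // Fully virtual, read-only root: "files" are JS objects stringified on read.
    const entry = virtualDir[path];
    if (entry === undefined) {
      throw new Error(`No such file: ${path}`);
    }
    return JSON.stringify(entry);
  }
  // `/userconfig`, `/userdata`...: real bytes persisted in IndexedDB.
  return readFromIndexedDB(path);
}

// Stand-in for the real IndexedDB-backed read returning raw bytes.
declare function readFromIndexedDB(path: string): Promise<ArrayBuffer>;
```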
I liked this article very much (and I do share many points it's making).
However there are some claims in that article that bothered me:
> if you’re watching 1080p video and your network takes a dump, well you still need to download seconds of unsustainable 1080p video before you can switch down to a reasonable 360p.
A client can theoretically detect a bandwidth drop (or even guess it) while loading a segment, abort its request (which may close the TCP socket, an event that may then be processed server-side, or not), and directly switch to a 360p segment instead (or even a lower quality).
In any case, you don't "need to" wait for a request to finish before starting another.
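As a rough illustration of what I mean (a sketch only, assuming a fetch-based segment loader and a hypothetical `estimateBandwidth()` helper fed by download progress measurements):

```ts
// Load a segment, but give up mid-download if bandwidth falls too low.
async function loadSegment(
  url: string,
  minSustainableBitrate: number,
): Promise<ArrayBuffer | null> {
  const controller = new AbortController();
  const checker = setInterval(() => {
    if (estimateBandwidth() < minSustainableBitrate) {
      // Bandwidth fell below what this quality needs: abort right away so a
      // lower-quality segment can be requested instead.
      controller.abort();
    }
  }, 200);
  try {
    const response = await fetch(url, { signal: controller.signal });
    return await response.arrayBuffer();
  } catch (err) {
    if ((err as Error).name === "AbortError") {
      return null; // caller should retry with a lower quality
    }
    throw err;
  } finally {
    clearInterval(checker);
  }
}

declare function estimateBandwidth(): number; // bits per second, assumed helper
```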
> For live media, you want to prioritize new media over old media in order to skip old content
From this, I'm under the impression that this article only represents the point of view of applications where latency is by far the most important aspect, like twitch I suppose, but I found that this is not the general case for companies relying on live media.
I guess the tl;dr properly acknowledges that, but I still want to make my point, as I found that sentence not precise enough.
On my side, and in the majority of cases I've seen professionally, latency may be an important aspect for some specific contents (mainly sports - just for the "neighbor shouting before you" effect - and some very specific events), but in the great majority of cases there were much more important features for live contents: timeshifting (e.g. being able to seek back to the beginning of the program, or the one before it, even if you "zapped" to it afterwards), ad-switching (basically in-stream targeted ads), different encryption keys depending on the quality, type of media AND the program in question, different tracks, codecs and qualities also depending on the program, and surely many other things I'm forgetting... All of those are, in my case, much more important aspects of live contents than seeing broadcast content a few seconds sooner.
Not to say that a new way of broadcasting live contents with much less latency wouldn't be appreciated there, but to me, that part of the article criticized DASH/HLS by only considering the "simpler" (in terms of features, not in terms of complexity) live streaming cases where they are used.
> You also want to prioritize audio over video
Likewise, in the cases I've encountered, we largely prefer re-buffering over losing video for even less than a second, even for contents where latency is important (e.g. football games), but I understand that twitch may not have the same need and would prefer a more direct interaction (like other more socially-oriented media apps).
> LL-DASH can be configured down to +0ms added latency, delivering frame-by-frame with chunked-transfer. However it absolutely wrecks client-side ABR algorithms.
For live contents where low-latency is important, I do agree that it's the main pain point I've seen.
But perhaps another solution here may be to update DASH/HLS or exploit some of their features to reduce that issue. Regarding what you wrote about giving more control to the server, both standards do not seem totally opposed to making the server side more in control in some specific cases, especially lately with features like content-steering.
---
Though this is just me being grumpy over unimportant bits, we're on HN after all!
In reality it does seem very interesting and I thank you for sharing, I'll probably dive a little more into it, be humbled, and then be grumpy about something else I think I know :p
> A client can theoretically detect a bandwidth fall (or even guess it) while loading a segment, abort its request (which may close the TCP socket, event that then may be processed server-side, or not), and directly switch to a 360p segment instead (or even a lower quality). In any case, you don't "need to" wait for a request to finish before starting another.
HESP works like that as far as I understand. The problem is that dialing a new TCP/TLS connection is expensive and has an initial congestion control window (slow-start). You would need to have a second connection warmed and ready to go, which is something you can't do in the browser as HTTP abstracts away connections.
HTTP/3 gives you the ability to cancel requests without this penalty though, so you could utilize it if you can detect the HTTP version. Canceling HTTP/1 requests, especially during congestion, will never work though.
Oh and predicting congestion is virtually impossible, ESPECIALLY on the receiver and in application space. The server also has incentive to keep the TCP socket full to maximize throughput and minimize context switching.
> From this, I'm under the impression that this article only represents the point of view of applications where latency is the most important aspect by far, like twitch I suppose, but I found that this is not a generality for companies relying on live media.
Yeah, I probably should have gone into more detail, but MoQ also uses a configurable buffer size. Basically media is delivered based on importance, and if a frame is not delivered in X seconds then the player skips over it. You can make X quite large or quite small depending on your preferences, without altering the server behavior.
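A very rough illustration of that player-side rule (not actual MoQ code, just the gist; in practice you would skip at group-of-pictures boundaries rather than per frame):

```ts
// Hypothetical sketch: drop anything that missed its latency budget.
interface Frame {
  timestampMs: number; // presentation timestamp
  data: Uint8Array;
}

function framesToRender(queue: Frame[], nowMs: number, maxLatencyMs: number): Frame[] {
  // A large `maxLatencyMs` behaves like a classic buffer, a small one behaves
  // like low-latency playback that skips content arriving too late.
  return queue.filter((frame) => nowMs - frame.timestampMs <= maxLatencyMs);
}
```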
> But perhaps another solution here may be to update DASH/HLS or exploit some of its features in some ways to reduce that issue. As you wrote about giving more control to the server, both standards do not seem totally against making the server-side more in-control in some specific cases, especially lately with features like content-steering.
A server side bandwidth estimate absolutely helps. My implementation at Twitch went a step further and used server-side ABR to great effect.
Ultimately, the sender sets the maximum number of bytes allowed in flight (ex. BBR). By also making the receiver independently determine that limit, you can only end up with a sub-optimal split brain decision. The tricky part is finding the right balance between smart client and smart server.
Interesting points and insight, thanks. I should probably look more into it.
My point was more about the fact that I found the article unfairly critical towards live streaming through DASH/HLS by mainly focusing on latency - which I understand may be one of the (or even the) most important points for some use cases, but wasn't as important in the cases I've worked on, where replacing DASH would be very complex.
This was kind of acknowledged in the tl;dr, but I still found the article unfair at this level.
From what I understand, they seem to work at a lower, more research-oriented level. It seems that their main goal is to find out whether machine learning can help improve some central issues in media streaming: ABR (finding out which quality to serve to the user), network-level matters such as congestion control, etc.
Those are very central matters in the industry, but I, like I would say most player developers, leave that area to researchers, and mainly implement their ideas only after they make enough noise and seem proven enough haha. This is kind of acknowledged on their FAQ page and it matches my personal experience.
What we are generally focused on is much more software-development-oriented: maintainable and readable software architecture, performance in terms of CPU and memory usage, compliance with the various specifications in place, and yes, some usage-related tricks (e.g. in the same vein as the mouse-over trick you're speaking of).
---
Pre-loading the next content on mouse-over is a nice trick, though depending on the company and the type of content (e.g. long VoD contents, live contents etc.), the milliseconds you might save at content load might not be worth the cost (mainly server load).
Here it could be performed by e.g. creating another wasp-hls player in the application and pre-loading that content into it, then switching to it if the corresponding content is actually clicked. If done like that, this would be an optimization at the application level, not at the player library level.
Specifically for this issue, the application-side approach seems to be more common from what I've seen.
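To make that application-level idea more concrete, here is a rough sketch; the `Player` interface and `createPlayer` factory are generic placeholders, not wasp-hls's actual API:

```ts
// Generic, illustrative player interface (not a real library API).
interface Player {
  load(url: string): void;
  attachVideoElement(video: HTMLVideoElement): void;
  dispose(): void;
}

declare function createPlayer(): Player; // assumed factory from the player library

let preloaded: { url: string; player: Player } | null = null;

// On mouse-over of a content thumbnail: start loading into a hidden player.
function onThumbnailHover(url: string): void {
  if (preloaded !== null) {
    preloaded.player.dispose();
  }
  const player = createPlayer();
  player.load(url);
  preloaded = { url, player };
}

// On click: reuse the pre-loading player if it matches, otherwise start fresh.
function onThumbnailClick(url: string, video: HTMLVideoElement): Player {
  if (preloaded !== null && preloaded.url === url) {
    // The content was already loading in the hidden player: just attach it.
    const { player } = preloaded;
    preloaded = null;
    player.attachVideoElement(video);
    return player;
  }
  // No matching pre-loaded player: start from scratch.
  const player = createPlayer();
  player.attachVideoElement(video);
  player.load(url);
  return player;
}
```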
Sometimes smarter, lighter tricks are implemented in a player library to make a content that has a high chance of being played next load faster (like preparing DRM matters, pre-loading a future Manifest/Playlist, grouping some requests together, loading media data of the next program when the current one is completely loaded and so on), but those are often ad-hoc to a specific application's needs. It is still something that may be implemented through an optional API if one of them benefits an application, though.
---
But here for the wasp-hls player, I mainly meant "efficient" in a different manner: I'm not talking here about optimal ABR, better video quality, or even shorter loading times, but more about ensuring that even in worst-case scenarios (very interactive UI, devices with poor performance and memory constraints, low-latency streaming - meaning a very small data buffer), the player will still be able to perform its main job of pushing new media segments, smoothly.
The main goal here is thus to avoid the possibility of rebuffering (the situation of being stalled due to not having enough data in the buffer(s)) due to performance-related issues.
The goal is also to prevent the reverse situation, something I've observed at my current job, where a long blocking task performed in the player (often parsing code, for example) blocks the UI long enough for the user to perceive it. Both of those scenarios leave a very bad experience for users.
That's where the worker part (which means here that the main streaming loop of loading and pushing segments is done concurrently to the UI/application) and the WebAssembly part (where CPU and memory-related optimizations are much easier to develop) come into play.
Here, for example, parsing playlists, deciding which media segments to load next, processing media-related information and ensuring that playback happens smoothly are all implemented in WebAssembly through Rust. There's still the costly part of transmuxing segments, which is for now still in TypeScript, but a rewrite of it in Rust is pending.
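To illustrate the split, a very simplified sketch of what it can look like; message names and file names here are made up for the example, not wasp-hls's real protocol:

```ts
// main.ts - the UI thread only forwards commands and renders state updates;
// no playlist parsing or segment handling ever happens here, so a heavy
// application cannot starve the streaming loop (and vice versa).
const worker = new Worker("player-worker.js");
worker.postMessage({ type: "load", url: "https://example.com/live.m3u8" });
worker.onmessage = (evt: MessageEvent) => {
  if (evt.data.type === "state-update") {
    // e.g. update the UI's buffering indicator, current quality, etc.
  }
};

// player-worker.ts - the streaming loop and the WebAssembly module live here,
// isolated from the page's event loop.
self.onmessage = (evt: MessageEvent) => {
  if (evt.data.type === "load") {
    // Parse the playlist, schedule segment requests, push segments to the
    // media buffers... (this is where the Rust-compiled WebAssembly is used).
  }
};
```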
---
On that matter, the player used by Twitch's website is also WebAssembly-based and in-worker, and it may not be a coincidence, since it's a platform with a relatively rich UI that mainly plays contents with a very short latency (which means less preloaded data in buffers, and thus less room for stall-avoidance).
This is typically one of the cases where this kind of approach would be the most useful.
One other case I've seen a lot at my current job would be some low-end embedded devices (smart TV, set-top boxes) where a rich JS application has to live alongside the player. Here the in-worker (mainly) and WebAssembly (a plus) approach would probably mean a better experience.
It does transmux mpeg2-ts files into fmp4 to be able to play them on Chrome/Android, though it only does so when those ts files are referenced in HLS (https://en.wikipedia.org/wiki/HTTP_Live_Streaming) playlists, as this player only loads HLS contents, not fmp4 nor mpeg2-ts files directly.
The transmuxer is declared in the src/ts-transmux directory if you're curious. It's based on a fork of mux.js's one (https://github.com/videojs/mux.js/). To just transmux ts files to fmp4, you may prefer relying directly on mux.js instead.
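If that's all you need, the usual mux.js pattern is roughly the following (a sketch based on its documented API, error handling omitted):

```ts
import muxjs from "mux.js";

// Transmux one mpeg2-ts segment into fmp4 bytes usable with Media Source
// Extensions.
function transmuxTsToFmp4(tsBytes: Uint8Array, onFmp4: (bytes: Uint8Array) => void): void {
  const transmuxer = new muxjs.mp4.Transmuxer();
  transmuxer.on("data", (segment: any) => {
    // Concatenate the generated init segment and media segment into one fmp4
    // buffer that can be appended to a MediaSource SourceBuffer.
    const result = new Uint8Array(segment.initSegment.byteLength + segment.data.byteLength);
    result.set(segment.initSegment, 0);
    result.set(segment.data, segment.initSegment.byteLength);
    onFmp4(result);
  });
  transmuxer.push(tsBytes);
  transmuxer.flush();
}
```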
Only one of the three playlists is linked to ts segments, the others rely on fmp4 segments (another segment format authorized in HLS).
Note that this is the same HLS format you'll usually find on streaming websites. I've been playing for example a lot of twitch contents through that player.
I don't know about twitch but I work for a streaming company and we're using the page visibility API mainly to lower video quality when the user is not actively watching the corresponding tab for more than a minute.
This allows keeping audio while lowering bandwidth, CPU and memory usage.
So yes it may be used for nefarious purposes but it also provides very nice features in a media streaming case.
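Roughly, such a feature can look like this (simplified sketch; `setMaxVideoQuality` and `restoreVideoQuality` stand in for whatever hooks the player actually exposes):

```ts
let hiddenTimer: ReturnType<typeof setTimeout> | null = null;

document.addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden") {
    // Only degrade after a full minute in the background, so quick tab
    // switches are not penalized.
    hiddenTimer = setTimeout(() => setMaxVideoQuality("lowest"), 60_000);
  } else {
    if (hiddenTimer !== null) {
      clearTimeout(hiddenTimer);
      hiddenTimer = null;
    }
    restoreVideoQuality();
  }
});

declare function setMaxVideoQuality(quality: "lowest"): void; // assumed player hook
declare function restoreVideoQuality(): void;                 // assumed player hook
```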
I have no doubt there are good uses. Of course, that kind of screws people who have say, two monitors, with the video open in one while they are in another window. You could certainly argue they are not actively watching the tab, but they could be watching a show while they work or chat with family.
Also, how long before the video streaming services use the api to adjust streamer compensation rates based on active/not active watchers?
As far as I know, it shouldn't be triggered when the window merely loses focus because it's on another monitor, though I cannot really test it right now as I only have one monitor at home.
> Also, how long before the video streaming services use the api to adjust streamer compensation rates based on active/not active watchers?
Good remark, it wouldn't surprise me if such services already do this by the way, though I'm not at all familiar with streamer compensation rules.
You made me curious enough to check whether twitch collects it server-side, and they do seem to regularly send some engagement-related metrics (beyond what I would assume is useful to monitor whether the player is doing a good job, such as bitrate and frame-drop-related matters). For example, a base64-encoded JSON with properties such as "minutes_logged" (which I guess is the number of minutes since I logged in), a "chat_visible" boolean and, more interesting here, "time_spent_hidden" seems to be POSTed at intervals. That whole object is also conveniently associated with an "event" name called "minute-watched".
What's strange though is that the URL makes it look like they're requesting a usual ".ts" media segment, yet it is a POST request, it returns an HTTP 204 No Content and, more importantly, the request is not performed by their usual media player script but by another mysterious p.js script, which makes it seem that this is not actually loading a media segment at all.
Maybe this camouflage is there to prevent people from messing with it, as wrong data would probably mess with their internal logic, but I guess it indicates that they do collect that data on their servers, though I don't know what they do with it.
Moreover, they already have features to influence users into staying active on the tab (for example the bonus points you win by regularly clicking on the treasure chest at the bottom of the chat), so it's not that surprising that they monitor this.
I'm a developer on a web media player and remember that we had at some point an issue with picture in picture mode: We had a feature that lowered the video bitrate to the minimum when the page that contained the video element was not showing for some amount of time.
That made total sense... until Picture-in-Picture mode was added to browsers: you would end up with a very low quality after watching your content in that mode from another page long enough (~1 minute).
The sad thing is that because I was (and still am) developing an open-source player, and because the API documentation clearly described the aforementioned implementation, I had to deprecate that option and create a new one instead (with better API documentation, talking more about the intent than the implementation!), which would have the right exceptions for Picture-in-Picture mode.
Seeing that part made me remember this anecdote, we should just have asked for quirks :p
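For the curious, the kind of exception we ended up needing looks roughly like this (a simplified sketch, not the player's actual code): before degrading a "hidden" page, check whether the video is actually still visible through Picture-in-Picture.

```ts
// Should we lower the video bitrate for a hidden page?
function shouldLowerBitrateWhenHidden(video: HTMLVideoElement): boolean {
  const inPictureInPicture =
    document.pictureInPictureElement === video ||
    // Safari exposes Picture-in-Picture through a prefixed API.
    (video as any).webkitPresentationMode === "picture-in-picture";
  return document.visibilityState === "hidden" && !inPictureInPicture;
}
```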
I get what you're saying about the implementation of the CDM internals not being documented, but to me - an MSE/EME player developer - it seems that the main chunk of the Encrypted Media Extensions API also applies when Widevine, PlayReady and so on are used.
The user-agent is still relying on the same interfaces and concepts (MediaKeySession, key id, same events, same methods, same way to persist licenses) and the application can generally work transparently without special-casing each key system (well, in reality, there are a lot of compatibility bugs in a lot of EME implementations, but those are not on purpose).
For example the server certificate concept generally does not apply to clear key implementations at all but is frequently used when using Widevine.
Clearkey is thoroughly documented because it is the only key system that SHOULD be implemented in all user agents with an EME implementation, though nobody, not even the writers of that specification, thinks that it is a good idea for real production usage.
In my work we rely a lot on the EME specification, and have referred to it many times in exchanges with our partners when we thought that their CDM implementations (e.g. on specific set-top boxes, smart TVs, game consoles etc.) led to non-compliance with that spec (e.g. unusable keys not being properly announced etc.).
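To illustrate what "same interfaces and concepts" means here, a bare-bones version of the common EME flow: the exact same calls are used whether the key system is "org.w3.clearkey", "com.widevine.alpha" or another CDM, only the license exchange itself differs (the `getLicense` callback below is a placeholder for it):

```ts
async function setupDrm(
  video: HTMLVideoElement,
  keySystem: string,
  getLicense: (challenge: ArrayBuffer) => Promise<BufferSource>,
): Promise<void> {
  // 1. Ask the user agent for access to the wanted key system.
  const access = await navigator.requestMediaKeySystemAccess(keySystem, [{
    initDataTypes: ["cenc"],
    videoCapabilities: [{ contentType: 'video/mp4; codecs="avc1.42E01E"' }],
  }]);
  const mediaKeys = await access.createMediaKeys();
  await video.setMediaKeys(mediaKeys);

  // 2. When encrypted media is encountered, create a session and exchange
  // the license challenge/response through the application.
  video.addEventListener("encrypted", async (evt: MediaEncryptedEvent) => {
    const session = mediaKeys.createSession();
    session.addEventListener("message", async (msgEvt: MediaKeyMessageEvent) => {
      // The challenge goes to the key system's license server; the response
      // is fed back through the same `update` call for every CDM.
      const license = await getLicense(msgEvt.message);
      await session.update(license);
    });
    await session.generateRequest(evt.initDataType, evt.initData!);
  });
}
```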
_PS: off-topic but I came across your projects and comments online many times, IIRC mostly about image encoders / decoders and on Rust (both subjects on which I love to read). That was always very interesting, thanks!_
Yeah good point, that actually brings a lot of complexity in our player library.
There are already bugs linked to specific DRM systems, but it only gets worse when we talk about specific devices, like STBs or connected TVs: some have completely different key systems than those three, some rely on more robust implementations of those three. I ran into multiple kinds of bugs, sometimes CDM-related, sometimes browser-related (CEF), and sometimes a subtly different EME workflow was needed on the JS side (we could argue that in that case the bug is also more on the browser/CDM side if it does not simply follow the EME specification).
I hesitated to talk about EME-related matters in that article, but in an introduction it would have become too technical and less interesting for most people.
But it would definitely be a good subject to talk about for another article.