
It's not explicit, but the descriptions seem to suggest it's performing recognition of the audio streams of what's being sung, and even supports dual streams for duets.

So I wouldn't be surprised if it relies on a particular hardware chip that the older Apple TV simply doesn't have. That has definitely been the case for everything Apple has launched with regard to Spatial Audio.



They may also use ML to sync the lyrics to the vocals, meaning they don’t need prior timestamp metadata for the syllables.


I hope they don't do this. It would be so wasteful to have every end user device running an ML model every time a song is played. Just run the model once in the datacenter and then distribute the time stamp metadata.
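For illustration only (this is a hypothetical format, not anything Apple has published), the per-song payload such a once-in-the-datacenter pass would distribute is tiny: word-level timestamps that the client just highlights during playback. A minimal Python sketch:

    # Hypothetical lyric-timing metadata, computed once server-side and shipped
    # to every client alongside the audio stream. Field names are made up.
    synced_lyrics = {
        "song_id": "example-track-001",
        "lines": [
            {
                "start": 12.40, "end": 15.10,
                "words": [
                    {"text": "Hello", "start": 12.40, "end": 12.95},
                    {"text": "world", "start": 13.00, "end": 13.60},
                ],
            },
        ],
    }

    def active_words(lyrics, t):
        """Words to highlight at playback time t (seconds)."""
        return [w["text"]
                for line in lyrics["lines"]
                for w in line["words"]
                if w["start"] <= t <= w["end"]]

    print(active_words(synced_lyrics, 13.2))  # ['world']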


> It's not explicit, but the descriptions seem to suggest it's performing recognition of the audio streams of what's being sung, and even supports dual streams for duets.

Just curious: what in the article makes you think that?


Not OP, but...

Look at the list of what Apple Sing includes:

Adjustable vocals: Users now have control over a song’s vocal levels. They can sing with the original artist vocals, take the lead, or mix it up on millions of songs in the Apple Music catalog.

Real-time lyrics: Users can sing along to their favorite songs with animated lyrics that dance to the rhythm of the vocals.

Background vocals: Vocal lines sung simultaneously can animate independently from the main vocals to make it easier for users to follow.

Duet view: Multiple vocalists show on opposite sides of the screen to make duets or multi-singer tracks easy to sing along to.

------

That's the part of the article where they state these things explicitly...


> Adjustable vocals: Users now have control over a song’s vocal levels. They can sing with the original artist vocals, take the lead, or mix it up on millions of songs in the Apple Music catalog.

I think this only requires pre-making two audio files per track, and simultaneously streaming these.

Real-time lyrics, Background vocals and Duet view are all nice features too, but the hardest part processing-wise is analysing how loud you sing into the microphone. It's just karaoke with a good UI.
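To make the two-file idea concrete: assuming pre-separated instrumental and vocal stems (hypothetical, not Apple's actual pipeline), the client-side "adjustable vocals" mix is just a weighted sum of the two streams. A rough numpy sketch:

    import numpy as np

    def mix_stems(instrumental, vocals, vocal_level):
        """Mix a pre-made instrumental stem with a vocal stem.

        vocal_level: 0.0 = karaoke (vocals muted), 1.0 = original mix.
        Both stems are float sample arrays in [-1, 1] with the same shape.
        """
        mixed = instrumental + vocal_level * vocals
        return np.clip(mixed, -1.0, 1.0)  # guard against clipping

    # Example: one second of silent dummy stereo audio at 44.1 kHz
    sr = 44100
    instrumental = np.zeros((sr, 2), dtype=np.float32)
    vocals = np.zeros((sr, 2), dtype=np.float32)
    karaoke_mix = mix_stems(instrumental, vocals, vocal_level=0.2)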


> I think this only requires pre-making two audio files per track, and simultaneously streaming these.

That’s the understatement of the century. “this only requires […] simultaneously streaming these”


Don't movies already have stereo sound? Is going from two audio channels to four really that difficult?


Well, it is indeed tricky but not requires-dedicated-chip tricky :-)


They say it supports millions of songs, so I doubt that's how they are doing this.

They are likely using a sophisticated ML version of what old karaoke machines did, and removing the vocals in real-time.
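For context, the trick old karaoke machines used is simple out-of-phase cancellation: the lead vocal is usually mixed identically into both channels, so subtracting one channel from the other removes it (along with everything else panned dead center). A toy numpy version, just to show how crude the non-ML baseline is compared to learned source separation:

    import numpy as np

    def remove_center_channel(stereo):
        """Classic karaoke-box vocal removal: left minus right.

        stereo: array of shape (samples, 2), float samples in [-1, 1].
        Anything mixed identically to both channels (usually the lead vocal)
        cancels out; unfortunately so does bass, kick, and anything else
        panned dead center.
        """
        side = (stereo[:, 0] - stereo[:, 1]) * 0.5
        return np.column_stack([side, side])  # mono "instrumental" on both channels

    # Example on one second of dummy stereo noise
    stereo = np.random.uniform(-0.1, 0.1, size=(44100, 2)).astype(np.float32)
    karaoke = remove_center_channel(stereo)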


Even if they're using ML, I don't imagine they'd do it on-device. They don't even do voice recognition on-device by default.


> [Apple says it is] relying on an on-device machine learning algorithm that processes the music in real-time. The tech builds on Apple’s noise-cancellation expertise and other developments it’s made for FaceTime, the company said.

Source: https://techcrunch.com/2022/12/06/apple-music-is-getting-a-n...


I stand corrected as well!

Wonder why they take this approach though, as it is clearly over-engineering (if I correctly understand that the goal is just to make the vocal volume adjustable).


> Wonder why they take this approach though, as it is clearly over-engineering (if I correctly understand that the goal is just to make the vocal volume adjustable).

Depends on what the other non-functional requirements (NFRs) were, i.e. if they were as follows:

* Cannot increase bandwidth / mobile data usage.

* Cannot impact music quality / bitrate.

* Has to work offline.

* Cannot increase on-device storage.

* Has to be responsive.

Then two audio streams might not work.

Another advantage of doing it on-device is that it doesn't actually change any of the backend architecture either. It might be a lot of change to a lot of systems for a feature which only adds a small amount of functionality - i.e. architecting your entire backend and streaming around separating audio tracks might not be the right focus.


Maybe it's licensing? I can imagine copyright holders being squeamish about Apple processing, permanently storing, and serving heavily altered versions of their music. The difference is silly and pedantic, but by processing it in real-time during playback, one might argue it's just a filter effect like EQ.


Ah, I stand corrected! I wonder why they took that approach...


Not sure - although I would imagine that it would effectively double the storage and bandwidth/data requirement for Apple Music in general if they had to send two files with equal bitrate.
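Rough numbers, assuming a ~256 kbps stream (roughly Apple Music's standard AAC tier; the exact figure doesn't change the conclusion):

    bitrate_kbps = 256   # assumed per-stream bitrate
    song_minutes = 4

    def song_megabytes(kbps, minutes):
        return kbps * 1000 / 8 * minutes * 60 / 1e6

    one_file = song_megabytes(bitrate_kbps, song_minutes)   # ~7.7 MB
    two_stems = 2 * one_file                                # ~15.4 MB
    print(f"{one_file:.1f} MB vs {two_stems:.1f} MB per song")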


They don't state anything explicitly referring to real-time processing of the songs.

In fact, calling out "millions of songs in the Apple Music catalog" makes it seem like the adjustable vocals will only be available on certain songs that they've added support for.


It's hard for me to imagine they'd do something special to support "millions of songs" while excluding others, since the entire catalog is ~100m songs.

My guess is that it's entirely dynamic. It's hard to imagine the complexity of doing batch processing to render each song in the library, maintaining that as new songs are uploaded, and updating the renders for software improvements. Better to just do it in real time.

And since classical, instrumentals, esoteric ambient stuff, death metal, etc., will probably not be supported by the algorithms, I think the "millions" refers to those that can be processed in real time.


They do something special: get the necessary sign-offs from legal. Apple's license for some content may bar it from being used that way (that's another reason for doing it on-device: it could be a different situation as far as legal & licensing are concerned).


Sure, and a good point. I should have said they wouldn't do anything special on a song-by-song basis for millions of songs. It's not like someone's pushing the button for each song, or building a list of songs. Those that meet the criteria will be included, whether that is 10 or 10 million of them.


I'd guess it's more probable that lots of songs simply have no lyrics at all, so the claim "all songs in the Apple Music catalog" would be factually wrong.


Why would you need a dedicated chip for any of that?


Seems like it is powered by:

> [...] an on-device machine learning algorithm that can process music in real time

Their latest Apple TV includes the A15, which has a 'neural engine' for ML and is also in their latest iPhone / iPad, so that might be part of it.


The divide in expectations is funny. Non-Apple-user: "ML stuff? Must be 'in the cloud'." Apple user: "ML stuff? Must use a special chip in the device."


And then there's me, asking why do we even need ML for that. :D


iOS > Settings > Accessibility > Live Captions (beta) > Toggle Live Captions


That makes sense when there's audio that Apple hasn't seen before. With Apple Music Sing, it makes more sense to do that processing once in the datacenter.


> it makes more sense to do that processing once in the datacenter.

Since Apple is all about on-device processing with so many of its features, going back-and-forth to the data center doesn't seem to be its style these days.

That's more of a Google thing.

And no one can accuse Apple of telling its advertisers that you start your day with Funky Cold Medina.


Android has had on-device captions long before Apple did.


There's a reason Apple's preferring on-device processing: user privacy. This doesn't make sense for music (stems, lyrics) since it's not the listener's data.


> There's a reason Apple's preferring on-device processing: user privacy.

Is that the actual reason though? My personal impression has been that it's a combination of reasons that benefit Apple. The increased user privacy being a nice bonus for users, but not the primary reason:

1. Producing phones powerful enough for on-device ML both justifies the high price point to the general public and is a good marketing point (along with increased user privacy)

2. Avoid backend infrastructure costs. Why spend extra money on servers, maintenance, and compliance when they can just offload the work to the devices themselves since they're capable?

3. Bonus: The unplanned obsolescence for new features like the one announced is also a side effect that benefits Apple.

I do not get the impression that Apple's primary focus is to benefit users and their privacy.


> There's a reason Apple's preferring on-device processing:

It makes it so you have to buy new devices sooner.

The privacy thing is a nice side-benefit and PR thing, but let's be realistic here.

EDIT: Just to remind everyone, we are literally in a thread about a new device feature that is trivial to do in the cloud, which Apple chooses to do on-device, which makes it only a feature for its newest generation of products...


There's no extra back and forth. You have to fetch the songs from the datacenter in the first place, right? So you fetch the additional data at the same time.

If Apple wanted to support this for user-provided mp3s then on-device would make sense. It doesn't sound like they support that though.


A lot of Android processing is done on the device now.


As usual, orange site users automatically assume the worst of other people.



