
But they behave just like models which use text tokens internally, which is also pointed out at the end of the above article.


We don't know whether that's due to inherent limitations of audio tokenisation or a byproduct of reinforcement learning. In my own usage, I've noticed a significant degradation in capabilities since they initially released advanced voice mode. The model used to be able to sing, whisper, and imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.

I don't find the article's argument that this is due to tokenisation convincing.


They didn't say it's due to tokenization.

> This is likely because they’re trained on a lot of data generated synthetically with text-to-speech and/or because understanding the tone of the voice (apparently) doesn’t help the models make more accurate predictions.



