I think the discussion around "exponentials" in scaling of top-end LLMs (think 3.5 Sonnet or GPT-4, not the smaller models) is really pointless. The heuristic we have for what to expect from performance is just scaling, which has worked pretty well. These benchmarks are imperfect in lots of ways, aren't necessarily sensitive enough to show exponential progress, and it is difficult to predict step changes in capability in advance.
If you zoom out on the first graphic from December 2023 back to 2020, the capabilities of models released at that time on these benchmarks would be much much lower. The best lens for future performance of large models is uncertainty.
> The best lens for future performance of large models is uncertainty.
100% agree. I think a better way to phrase my argument would be to reject the notion that LLMs are destined to get exponentially smarter (a Twitter fallacy). This is not to say I believe they won't get any smarter in the future. We simply don't know, and building a company/product on the expectation of another Moore's Law is dangerous.
In the specific case of the Drake track, hateful is an appropriate word.
He was using Tupac's voice on the TaylorMade freestyle in a really disrespectful way that borders on hateful toward his artistic legacy. Just read these lyrics...
> [Verse 1: 2Pac (AI)]
> Kendrick, we need ya, the West Coast savior
> Engraving your name in some hip-hop history
> If you deal with this viciously
> You seem a little nervous about all the publicity
> Fuck this Canadian lightskin, Dot
> We need a no-debated West Coast victory, man
> Call him a bitch for me
> Talk about him likin' young girls, that's a gift from me
I think your post is taking these AI voices out of their original context.
First let's consider CD -> streaming as a media change. Streaming didn't really exist when Tupac was around. But nobody would say putting Tupac's catalog on streaming is inherently disrespectful to his artistic legacy because it's preserving (more or less) the same artistic product.
Here are a couple other examples that I do think are more analogous than improvements in recording technologies:
Posthumous releases with material not created by the artist. Sometimes record labels will try to capitalize on the brand of an artist and release material that really only has snippets of random recordings that an artist made. In my view, this is disrespectful to the artist because it's not a piece of artistic material they wanted to release.
Another example is colorizing black and white movies. Similarly, this action changes the actual artistic product in a way that's disrespectful to the creators of those films.
Creating AI voices of artists is similar to these examples because it's changing the artistic output of an artist and disrespecting their artistic legacy. It's creating content under their name without the ability for them to say no or have any input into the output.
I really like the point about getting AI to ask you questions.
The focus in the AI tutor world is basically a chatbot to ask questions of. But if you're trying to learn something, it's really helpful to have targeted questions asked of you!
That's a really surprising thing to hear, where did you see that? The only quote I've seen is this one:
>“One hypothesis was that coding isn’t that important because it’s not like a lot of people are going to ask coding questions in WhatsApp,” he says. “It turns out that coding is actually really important structurally for having the LLMs be able to understand the rigor and hierarchical structure of knowledge, and just generally have more of an intuitive sense of logic.”
Makes sense. They want better interaction with users for WhatsApp, Instagram, and Facebook marketers, content creation and moderation, and their (AI/AR) glasses. In that context I don't see why they should push more effort into LLM coding. It's sad anyway.
I actually really like Anki (and think it's a great tool!) but this is one of the biggest problems I see for spaced repetition to get in the hands of more people.
You can change the number of review items, but that doesn't change the fact that you have an impossible backlog to get through. Then people just get bored and churn.
I think a different approach to review UX could create opportunities to mix up the reviews in a way that doesn't feel like you have an impossible backlog to get through.
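To make the idea concrete, here's a minimal sketch of one way that "mix up the reviews" UX could work: cap each session at a fixed size and blend the most-overdue cards with a random slice of the rest, so the queue always looks finishable. The data model, function names, and parameters are all hypothetical, not anything Anki actually implements.

```python
import random

def daily_session(backlog: list[dict], cap: int = 50, overdue_frac: float = 0.6) -> list[dict]:
    """Build a bounded review session from an arbitrarily large backlog.

    Hypothetical sketch: `overdue_frac` of the session is the most-overdue
    cards, the rest is a random sample from everything else.
    """
    by_overdue = sorted(backlog, key=lambda c: c["days_overdue"], reverse=True)
    n_overdue = min(int(cap * overdue_frac), len(by_overdue))
    must_see = by_overdue[:n_overdue]          # cards that most need review
    rest = by_overdue[n_overdue:]
    filler = random.sample(rest, min(cap - n_overdue, len(rest)))
    session = must_see + filler
    random.shuffle(session)                    # interleave so it doesn't feel like a wall
    return session

# A 2,000-card backlog still yields a 50-card session.
backlog = [{"id": i, "days_overdue": i % 30} for i in range(2000)]
print(len(daily_session(backlog)))
```

The point is that the user never sees "2,000 due"; they see a session they can actually finish today.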
aha good question, well the neural scoring function doesn't "know" that it is making a routing decision; we just predict which LLM would give the highest-performing output on the given prompt, based on LLM-as-a-judge scores at training time. However, my guess is that this specification in the input prompt miiight mean that the cheaper models are deemed to be worse performing than GPT-4 (for example), and so maybe it would route to the best models. Feel free to give it a try and see!
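The routing setup described above can be sketched in a few lines: a scoring function predicts, per candidate model, how well it would answer a prompt, and the router simply takes the argmax. The model names, the toy heuristic standing in for the trained scorer, and the numbers are all made up for illustration; in the real system the scorer would be a neural net trained on LLM-as-a-judge quality labels.

```python
def score(prompt: str, model: str) -> float:
    """Stand-in for a trained scoring model (toy heuristic, not the real thing)."""
    # Pretend long or code-like prompts favor the big model,
    # while easy prompts make the big model not worth its cost.
    looks_hard = len(prompt) > 80 or "```" in prompt
    base = {"gpt-4": 0.9, "small-model": 0.6}[model]
    penalty = 0.35 if (model == "gpt-4" and not looks_hard) else 0.0
    return base - penalty

def route(prompt: str, candidates: list[str]) -> str:
    # The scorer never "knows" it is routing: it only predicts per-model
    # quality, and the router takes the best-scoring candidate.
    return max(candidates, key=lambda m: score(prompt, m))

print(route("hi", ["gpt-4", "small-model"]))          # easy prompt -> cheap model
print(route("x" * 100, ["gpt-4", "small-model"]))     # hard prompt -> big model
```

Note how a prompt that explicitly demands GPT-4-level rigor would (presumably) push the predicted scores of cheaper models down, which is why the router might end up picking the expensive model anyway.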