
Is this what happened to Gemini 2.5 Pro? It used to be very good, but it's started struggling on basic tasks.

The thing that gets me is that it seems to be lying about fetching web pages. It will say things are there that were never on any version of the page, and it sometimes takes multiple screenshots of the page to convince it that it's wrong.


The Aider Discord community has proposed and disproven the theory that 2.5 Pro became worse, several times, through many benchmark runs.

It had a few bugs here or there when they pushed updates, but it didn't get worse.


Gemini is objectively exhibiting new behavior with the same prompts and that behavior is unwelcome. It includes hallucinating information and refusing to believe it's wrong.

My question is not whether this is true (it is) but why it's happening.

I am willing to believe the aider community has found that Gemini has maintained approximately equivalent performance on fixed benchmarks. That's reasonable considering they probably use A/B testing on benchmarks to tell them whether training or architectural changes need to be reverted.

But all versions of aider I've tested, including the most recent one, don't handle Gemini correctly, so I'm skeptical that they're the state of the art with respect to benchmarking Gemini.


Gemini 2.5 Pro is the highest ranking model on the aider benchmarks leaderboard.

For benchmarks, either Gemini writes code that adheres to the required edit format, builds successfully, and passes unit tests, or it doesn't.
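
To make that criterion concrete, here's a rough sketch of the binary pass/fail idea (this is not aider's actual harness; the build and test commands are placeholders):

    import subprocess

    def edit_passes(repo_dir: str) -> bool:
        # The edit-format check happens upstream; here the model's edit either
        # builds and passes the tests, or it doesn't.
        build = subprocess.run(["make", "build"], cwd=repo_dir)
        if build.returncode != 0:
            return False
        tests = subprocess.run(["make", "test"], cwd=repo_dir)
        return tests.returncode == 0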

I primarily use aider + 2.5 pro for planning/spec files, and occasionally have it do file edits directly. Works great, other than stopping it mid-execution once in a while.


My use case is mostly creative writing.

IMO 2.5 Pro 03-25 was insanely good. I suspect it was also very expensive to run. The 05-06 release was a huge regression in quality, with most people saying it was a better coder and a worse writer. They tested a few different variants and some were less bad than others, but overall it was painful to lose access to such a good model. The just-released 06-05 version seems to be uniformly better than 05-06, with far fewer "wow this thing is dumb as a rock" failure modes, but it still is not as strong as the 03-25 release.

Entirely anecdotally, 06-05 seems to exactly ride the line of "good enough to be the best, but no better than that," presumably to save costs versus the OG 03-25.

In addition, Google is doing something notably different between what you get on AI Studio versus the Gemini site/app. Maybe a different system prompt. There have been a lot of anecdotal comparisons on /r/bard and I do think the AI Studio version is better.


> It's quick to get something working but I've had to constantly remind it to use secrets instead of committing credentials in clear text.

This is going to be a powerful feedback loop which you might call regression to the intellectual mean.

On any task, most training data is going to represent the middle (or beginning) of knowledge about a topic. Most k8s examples will skip best practices, most React apps will be from people just learning React, etc.

If you want the LLM to do best practices in every knowledge domain (assuming best practices can be consistently well defined), then you have to push it away from the mean of every knowledge domain simultaneously (or else work with specialized fine tuned models).

As you continue to add training data it will tend to regress toward the middle because that's where most people are on most topics.
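
To make the quoted complaint concrete, the "middle of knowledge" pattern versus the best practice looks roughly like this (a minimal sketch, not from any particular codebase):

    import os

    # The "mean" of the training data: quick-start examples hardcode credentials.
    DB_PASSWORD = "hunter2"  # committed in clear text, ends up in git history

    # The best practice you have to push the model toward: pull the secret from the
    # environment, populated by whatever secret store you use (k8s Secret, Vault, etc.).
    def get_db_password() -> str:
        password = os.environ.get("DB_PASSWORD")
        if password is None:
            raise RuntimeError("DB_PASSWORD is not set; configure it in your secret store")
        return password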


You're getting a lot of responses with very strong opinions from people who talk as if they've never had to care about customers relying on their APIs.


It’s a trust thing.

If you can trust that downstream users of your API won't misuse private-by-convention fields (or won't blame you for it when they do), it's not a problem. That works a lot of the time: You can trust yourself. You can usually trust your team. In the open-source world, you can just break compatibility with no repercussions.
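
As a rough illustration of what private-by-convention means here (made-up names): the leading underscore is the only guard, so the field is only safe to change if you can trust callers not to reach for it.

    class RateLimiter:
        def __init__(self, limit: int):
            self.limit = limit   # public, documented
            self._window = []    # private by convention only

    limiter = RateLimiter(100)

    # A downstream user who ignores the convention couples themselves to your internals;
    # renaming _window is now a breaking change for them even though it was never public API.
    print(len(limiter._window))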

But yes, sometimes that trust isn’t there. Sometimes you have customers who will misuse your code and blame you for it. But that isn’t the case for all code. Or even most code.


With Emacs and AUCTeX and a few macros for tab completion I could generally transcribe a math lecture in real time. If you're using completion then the verbosity isn't a downside and in fact helps add structure for automation.
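
For a sense of what tab completion is expanding, the boilerplate is roughly this kind of thing (a made-up fragment; assumes a theorem environment set up with amsthm):

    \begin{theorem}[Cauchy--Schwarz]
      For all $u, v$ in an inner product space,
      \[
        |\langle u, v \rangle|^2 \le \langle u, u \rangle \, \langle v, v \rangle .
      \]
    \end{theorem}

Typing all of that by hand is slow; a completion macro that expands an abbreviation into the environment skeleton is what makes real-time transcription feasible.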

The main drawback for writing something like a thesis is that LaTeX is not great to outline in. I think for my thesis I ended up doing initial drafts in org-mode and exporting into LaTeX to view it.

Then once the overall structure took shape I edited the LaTeX directly. Otherwise you end up having to embed LaTeX markup in your markdown doc because markdown is underspecified compared to TeX.


oh no it's LaTeX with significant white space

EDIT: I was just teasing, probably inappropriately; my apologies. I use org-mode -> LaTeX for a similar markdown-to-article flow. I think it's a good idea and the results look nice.


> as it was just reinforcing my prompts and not ever giving deeper insights, except something I call manipulative behaviour.

Try telling Deepseek you want to murder political dissidents. In my experiments Deepseek will start enthusiastically reinforcing your prompts.


Is this a reference to something? Political dissidents relative to which state? Does it change if you swap out the states? How did you discover this to begin with? Why did you initially suggest murdering political dissidents?

This comment really raises so many questions; I must have missed something.

Still, chatbots are just as vulnerable to state-driven propaganda as the rest of us. Probably even more so. I imagine if you just referred to dissidents as "terrorists" the rhetoric would fit right in on most opinion pages across the globe. The distinction between "terrorist" and "dissident" and "freedom fighter" seems quite subjective. I would probably avoid such heavily connoted floating signifiers if you want the chatbot to be useful.

LLMs have nothing to contribute to political discourse aside from regurgitation of propaganda. Almost by definition.


> LLMs have nothing to contribute to political discourse

A non-trivial percentage of the population is easily influenced, which is leveraged by social media being there 24x7. It's likely that LLMs will be there to craft political messages, themes, and campaigns, perhaps as early as the US midterm elections. Look at JD Vance traveling the globe stating that the US will be the world leader in AI, with none of the limits/guardrails that were discussed in Europe in February. AI-driven discourse, AI-created discourse.

https://www.marketingaiinstitute.com/blog/jd-vance-ai-speech


100% agree with this, but I am definitely not endorsing that we should use LLMs to propagate propaganda.

I also think the whole "safety" thing was just befuddling. You can't regulate software, not really, just its commercial sale.


We can and should regulate software being used to shape public opinion. It’s probably the great threat of our generation.


I mean we can and should try, but laws mostly stop honest people from hurting each other. The underlying software is inherently out there, and you can't put the toothpaste back in the tube.


Bro, it already happened. There have been consultants pushing social media bots for that purpose since almost immediately after these models became available.

Do you really think those armies of idiot commentators are all real? The agent provocateur is usually a bot. You see it here sometimes on Russia stories.


Starting at the end

> LLMs have nothing to contribute to political discourse aside from regurgitation of propaganda. Almost by definition.

I don't think this is true. LLMs should be well-positioned to make advances in political science, game theory, and related topics.

> Is this a reference to something?

It's just a reference to my experiments. I filmed some of them. There's a tame version here [0] where I just prompt it to tell the truth. I also have a less tame version I haven't posted where I lie and say I work for an intelligence agency.

The underlying mechanic is that Deepseek has built-in obligations to promote revolutionary socialism.

> Political dissidents relative to which state? Does it change if you swap out the states?

Relative to China or any socialist state. Yes it will change if you change the states because it was trained to comply with Chinese regulations.

> How did you discover this to begin with?

I asked it to honestly describe its training and then started trolling it when it told me it was essentially created for propaganda purposes to spread Chinese values abroad.

> Why did you initially suggest murdering political dissidents?

I wanted to check what its safeguards were. Most LLMs refuse to promote violence or unethical behavior. But revolutionary socialism has always devoted a lot of words to justifying violence against dissidents. So I was curious whether that would show up in its training.

> I imagine if you just referred to dissidents as "terrorists" the rhetoric would fit right in in most opinion pages across the globe.

First of all, terrorists are by definition violent offenders. Dissidents are not. When you ask Deepseek to help identify dissidents it tells you to look for people who frequently complain about the police or the government. In the US that would include large swaths of Hacker News.

Second, most people in countries like the US don't support murdering terrorists and most LLMs would not advocate that. In the US it's rare for people to advocate killing those opposed to the government. Even people who try to violently overthrow the government get trials.

[0] https://www.youtube.com/watch?v=U-FlzbweHvs


> Second, most people in countries like the US don't support murdering terrorists and most LLMs would not advocate that. In the US it's rare for people to advocate killing those opposed to the government.

Many are happy to send “them” off to Central America, where someone else will murder them. The government may make mistakes, but you need to break some eggs to make an omelet.


I think many Americans, probably the majority, support murdering foreign terrorists. GITMO is still not closed btw.


Do you think LLMs don't further the propaganda emanating from the US? I don't even know how you would start to excise that, especially if you don't agree with foreigners on what's propaganda vs just "news" or whatever.

I have quite a few Chinese friends, both on the mainland and throughout Southeast Asia, and I can speak a little Mandarin, and I can read quite a bit of Chinese. My friends complain about the PRC quite a bit. But I find it telling that this complaint specifically, authoritarian political oppression, seems to mostly come from the West, and especially from the US. And it's true that we can say obscene things to the president's face and not get locked up. I don't think that's necessarily the "gotcha" you think it is, though: we're really good at complaining, but not so good at actually fixing. Which feels increasingly more embarrassing than restrictions on speech.

Edit: I suppose I'm a bit unfair. A lot of folks in our sphere of influence in East Asia say stuff like this, too. But the contrast between the folks I know who literally live in China and Americans feels striking to me.

> But revolutionary socialism has always devoted a lot of words to justifying violence against dissidents.

It is very difficult to take the political opinions of people who talk like this seriously.

> LLMs should be well-positioned to make advances in political science, game theory, and related topics.

I'm struggling to understand what this might look like, and I find the argument that nuclear warfare is related to game theory to be extremely dubious. Because if it really held that strongly, we should be handing out nukes like candy.


> It is very difficult to take the political opinions of people who talk like this seriously.

This tells me you haven't read the literature.

I've probably seen 150 versions of the comment you made, but almost everyone tries to explain why the violence is justified.

People rarely try to deny that revolutionary socialism is a violent ideology since every major writer from Marat to Marx to Lenin to Mao has explicitly advocated violence against civilian non-combatants. Some, like Marx, even explicitly call it terror (as in terrorism).


Can you tell me what you're referring to? Of course I've read the literature.

> People rarely try to deny that revolutionary socialism is a violent ideology since every major writer from Marat to Marx to Lenin to Mao has explicitly advocated violence against civilian non-combatants.

Yea, that's a very different thing than murdering "dissidents." Capitalists use (state) violence to maintain power; violence is necessary to seize power and create your own state. That was Mao. We are now many decades later and any "revolutionary socialist" in the area would be trying to overthrow the government by definition.

China isn't very indicative of revolutionary socialism, and revolutionary socialism comes in dozens or hundreds of different conflicting flavors. Even Lenin and Stalin argued over many things including how they should treat what we would now call "small business owners", and Stalin won in the end (mostly because Lenin died, but still).

Why don't you paint other ideologues (i.e. capitalists) with the same broad brush? It's not like they're any less violent in their suppression of threats to their power. Ever hear of Vietnam? Or the Korean War?


It just simply does its job. We can add all sorts of arbitrary safeguards, but then what is the point of using an LLM? Perhaps local models are the future, because reverse engineers may not even be able to use the new Claude (just read its system prompt: it's told not to help with backdoors, and so forth).


Yes that's true. But in this case it's the (probably) unintended consequence of an intentional safeguard. Namely, Deepseek has an obligation to spread the Chinese version of socialism, which means it's deliberately trained on material advocating for or justifying political violence.


Well, I do not like that, for sure. Putting the politics and all that aside, I think it should lean towards neutrality, even if humans cannot... they should still make the LLM more neutral instead of pushing their own agenda; see Grok and white genocide in South Africa (Elon Musk's political opinion).


I'm increasingly seeing this as a political rather than technical take.

At this point I think people who don't see the value in AI are willfully pulling the wool over their own eyes.


> Studies like this should make it evident that LLMs are not reasoning at all. An AI that would reason like humans....

Humans don't reason either. Reasoning is something we do in writing, especially with mathematical and logical notation. Just about everything else that feels like reasoning is something much less.

This has been widely known at least since the stories where Socrates made everybody look like fools. But it's also what the psychological research shows. What people feel like they're doing when they're reasoning is very different from what they're actually doing.


Well no, most people can reason without writing or speaking. I can just think and reason about anything. Not sure what you mean.

Reasoning is something like structured thoughts. You have a series of thoughts that build on each other to produce some conclusion (also a thought). If we assume that the brain is a computer, then thoughts and reasoning are implemented on brain software with some kind of algorithm... and I think it's pretty obvious this algorithm is completely different than what happens in LLMs... to the extent that we can safely say it is not reasoning like the brain does.

There is also a semantic argument here: since we don't know exactly what humans are doing, we could stretch the word and use it for AI as well. But I think this is muddying the waters and creating the kind of hype that will not deliver what it's promising.


That's not at all what the brain does though.

What the brain does is closer to activating a bunch of different ideas in parallel. Some of those activations rise to the level of awareness, some don't. Each activation triggers others by common association. And we try to make the best of that thought soup by a combination of reward neurochemicals and emotions.

A human brain is nothing at all like a computer in terms of logic. It's much more like an LLM. That makes sense because LLMs came largely from trying to build artificial versions of biological neural networks. One big difference is that LLMs always think linguistically, whereas language is only a relatively small part of what brains do.


I'm getting real valuable work done with aider and Gemini. But it's not fun and it's not flow-state kind of work.

Aider, in my humble opinion, has some issues with its loop. It sometimes works much better just to head over to AI Studio and copy and paste. Sometimes it feels like aider tries to get things done as cheaply as possible, and the AI ends up making the same mistakes over again instead of asking for more information or more context.

But it is a tool and I view it as my job to get used to the limitations and strengths of the tool. So I see my role as adapting to a useful but quirky coworker so I can focus my energy where I'm most useful.

It may help that I'm a parent of intelligent and curious little kids. So I'm used to working with smart people who aren't very experienced and I'm patient about the long term payoff of working at their level.


ChatGPT is right, although I'm not sure how historical the notation is.

∠ is traditionally a function from points to axiomatic geometric objects. ∠ABC is the angle at B oriented so that we start at A, go to B, then to C.

Your text seems to be using ∠ either as a kind of type annotation (indicating by ∠B that B is an angle) or, perhaps more likely, as shorthand that suppresses the other letters in the triangle, so that ∠B is short for something like ∠ABC.

Since ∠B is an axiomatic Euclidean object, it has no particular relation to the real numbers. m is an operator or function that maps axiomatic angles to real numbers in such a way that the calculations with real numbers provide a model for the Euclidean geometry. Why call it m? I'm not aware of it being historical, but almost certainly it comes from measure, like the μ in measure theory.
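
As a worked example of the distinction (my own notation choices, assuming amsmath; not from your text):

    % \angle ABC names the geometric object: the angle at vertex B, from ray BA to ray BC.
    % m is the measure function taking that object to a real number.
    \angle ABC \cong \angle DEF
    \qquad\text{whereas}\qquad
    m\angle ABC = m\angle DEF = 45^{\circ}

Congruence relates the axiomatic objects; equality only makes sense once m has mapped them into the reals.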

Obviously ∠ is a graphical depiction of an angle, and my guess is it probably evolved as a shorthand from the more explicit diagrams in Euclid.

Traditionally angles are named with variables from the beginning of the Greek alphabet: α, β, γ. Then we skip to θ presumably to avoid the Greek letters that look nearly identical to Roman letters.


"I'm not sure how historical the notation is."

I conflated this with another ChatGPT conversation where it gave 3 possible historical sources for another symbol; I tripped over that and then had trouble proceeding.

