Hacker News | COAGULOPATH's comments

I've heard rumors that GPT4's training data included "a custom dataset of college textbooks", curated by hand. Nothing beyond that.

https://www.reddit.com/r/mlscaling/comments/14wcy7m/comment/...


Yes, this only helps multi-step reasoning. The model still has problems with general knowledge and deep facts.

There's no way you can "reason" a correct answer to "list the tracklisting of some obscure 1991 demo by a band not on Wikipedia." You either know or you don't.

I usually test new models with questions like "what are the levels in [semi-famous PC game from the 90s]?" The release version of GPT-4 could get about 75% correct. o1-preview gets about half correct. o1-mini gets 0% correct.

Fair enough. The GPT-4 line isn't meant to be a search engine or an encyclopedia. This is still a useful update, though.


o1-mini is a small model (knows a lot less about the world) and is tuned for reasoning through symbolic problems (maths, programming, chemistry etc.).

You're using a calculator as a search engine.


It's actually much worse than that, and you're inadvertently downplaying how bad it is.

It doesn't even know mildly obscure facts that are on the internet.

For example, last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing-looking code.

You can't; it's impossible. It was completely wrong. When I told it this, it told me I was right, and proceeded to give me code that was even more wrong.

This is an obscure, but well documented, part of the spec.
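Roughly the shape of the limitation, as I understand it (a simplified sketch and my guess at what tripped it up, not the exact code from that session): case labels have to be compile-time constants, so you can't switch on the type itself; the compiler only accepts a type pattern on a value.

  // Sketch only: my guess at the shape of the problem, not the code ChatGPT produced.
  // Case labels must be compile-time constants, and typeof(int) isn't one,
  // so switching on the type itself doesn't compile:
  //
  //   switch (typeof(T))
  //   {
  //       case typeof(int): ...   // compile error: a constant value is expected
  //   }
  //
  // What the compiler does accept is a type pattern on a value of type T:
  static string Describe<T>(T value) => value switch
  {
      int n    => $"an int: {n}",
      string s => $"a string: {s}",
      _        => "something else"
  };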

So it's not about facts that aren't on the internet; it's just bad at facts, full stop.

What it's good at is facts the internet agrees on. Unless the internet is wrong. Which is a problem, given how confident its language sounds.

If you want to fuck with AI models, ask a bunch of code questions on Reddit, GitHub and SO with example code saying 'can I do X'. The answer is no, but ChatGPT/Copilot/etc. will start spewing out that nonsense as if it's fact.

As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models into believing your 'facts'.


I wonder, though: is the documentation only referenced in a few places on the Internet, and are there also many forums with people pasting "Why isn't this working?" problems?

If there are a lot of people pasting broken code, the LLM now has all these examples of broken code, which it doesn't know are broken, and only a couple of references to documentation. Worse, a well-trained LLM may realise that specs change, and that even documentation may not be 100% accurate (because it's older, or out of date).

After all, how many times have you had something (an API, a language, a piece of software) updated, but the docs weren't? Happens all the time, sadly.

So it may treat newer examples of code, such as the aforementioned pasted code, as more correct than the docs.

Also, if people keep trying to solve the same issue again, and keep pasting those examples again, well...

I guess my point here is that hallucinations come from multi-faceted issues, one being "wrong examples are more plentiful than correct ones". Or even just "there are a lot of wrong examples".


It's not always the right tool, depending on the task. IMO using LLMs is a skill in itself, much like learning how to Google stuff.

E.g. apparently C# generics isn't something it's good at. Interesting; so don't use it for that, apparently it's the wrong tool. In contrast, it's amazing at C++ generics, and thus speeds up my productivity. So do use it for that!


> For example, last night I was trying to do something with C# generics and it confidently told me I could use pattern matching on the type in a switch statement, and threw out some convincing-looking code.

Just use it on an instance instead:

  // A switch expression with type patterns on a value works fine:
  var res = thing switch {
    OtherThing ot => …,
    int num => …,
    string s => …,
    _ => …
  };


> As for non-programming, we're about to see the birth of a new SEO movement of tricking AI models into believing your 'facts'.

This is kinda crazy to think about.


If you ask Google Gemini right now for the name of the whale in Half Moon Bay harbor, it will tell you it's called Teresa T.

That was thanks to my experiment in influencing AI search: https://simonwillison.net/2024/Sep/8/teresa-t-whale-pillar-p...


My experience is the opposite: laypeople are excessively pessimistic about LLM progress ("AI is so dumb. It tells you to put glue on pizza and eat rocks"), usually due to a remembered anecdote that's either years old or reflects worst-case performance (only egregiously bad AI mistakes make the news).

Frontier models are better than they were and "feel" fairly reliable, although all the AI problems of 2021-2022 conceptually do still exist.


What he means is that if you search for "Diminished by its artsiness" + "Pauline Kael" you won't find any results (except for ones related to this news story).

Google is polluted with AI generated content but not this specific bit of AI generated content.


Google is also polluted with sponsored content and ads framed as search results, which are often not at all what you're actually searching for. It's totally unreliable for finding things these days.


Sanitarium is one of those Bad Mojo-esque games that's worth playing for how unique it is.

The isometric viewpoint never really worked for me, and undercuts the immediacy of the horror. Things aren't happening to you, but to a little human figure the size of a game piece. Maybe this is a hot take but horror games generally need to be first person.


I'd say it depends. Horror movies work without being in first person. The idea that you need to relate to and worry about the character is at least mostly true, but if the game can make you want to keep the character safe as a separate entity, you can do it w/o the first-person stuff.


Most guides on detecting AI images are from 2022 and have aged like dinosaur milk. “AI can’t draw hands.” “AI can’t draw straight lines.” “AI can’t spell words.”

We now live in an age of photorealistic fake media. It is no longer true that AI images have such obvious mistakes.

However, there are still some signs of AI imagery—many are strangely getting worse as the technology advances—but often they're not errors so much as they're "conceptual tension." At a high level, an AI image has several different goals (fulfill the user's prompt, look coherent/attractive, satisfy a moderation policy, etc.) and if the goals clash (i.e., the user prompts for something ugly or incoherent), the image can get subtly pulled in different directions. I show many examples of what, exactly, to look for.

These are my personal heuristics only. There is currently no foolproof way to identify an AI image. Be careful out there.


Gemini Ultra gets this right. (Usually it's worse than GPT-4 at these sorts of questions.)


Yeah, it's funny. I used to think "Demis Hassabis...where have I heard that name before?" And then I realized I saw him in the manuals for old Bullfrog games.


(They patched Gemini an hour after I finished writing this. My complaints of excessive refusals and model deception may no longer apply.)

Witness:

- A chess match between Ultra and GPT4 (the first one ever, as far as I'm aware)

- A Gemini vs GPT4 rap battle

- Tests of general knowledge, recall, abstract reasoning, and code generation

- Head-to-head contests of poetry and prose, plus style imitations of famous authors/bloggers

I also investigate VERY IMPORTANT things such as:

- which model can create a more realistic ASCII cat?

- which model is better at stacking eggs?

- which model plays Wordle better?

- which model SIMULATES Wordle better (with me playing)?

Obviously, a lot of my tests are a bit silly. We already know Ultra's benchmarks; I'm trying to probe the gaps BETWEEN benchmarks, and figure out what the models are like "on the ground".

Conventional wisdom holds that Ultra is another GPT4: this was not my experience. Switching from GPT4 to Ultra feels like switching character classes in an RPG; they are quite different, with distinct strengths and weaknesses.


This reminds me of Matt Lakeman's "An Attempt at Explaining, Blaming, and Being Very Slightly Sympathetic Toward Enron"

https://mattlakeman.org/2020/04/27/explaining-blaming-and-be...

He makes the point that although Enron was clearly doing shady things, it's possible for a legitimate business to do much of the same stuff ("mark-to-market" accounting, tricky SPEs, and so on). Try to categorically eliminate Enron-style fraud and you might take down the next Google in the crossfire.


You say that like everyone ubiquitously believes Google is a net-positive.

