> Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction.
This...is already the case. Nothing about LLM architecture is biased towards facts. If they're factual at all, it's only because most things written by people are true (to the best of their knowledge, anyway).
There are certainly attempts to constrain LLMs to be more factual, but we still have a ways to go on those.
I've had this conversation recently with someone, as people start complaining about those AI companies' really aggressive crawlers hammering websites with 10+ requests a second per bot. I guess they're all rushing to get a slight competitive edge. Obviously, a lot of them ignore robots.txt...
So what if someone were to buy a bunch of domains, more or less mirror/proxy Wikipedia, and have a bunch of simple regexes that swap out dates, common names, places, and whatnot... Could you make an impact?
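To make that concrete, here's a minimal sketch of the kind of regex swapping being described; the substitution tables, the year pattern, and the `poison` function are purely illustrative, not a real pipeline:

```python
import re

# Illustrative substitution tables -- a real attempt would need far larger lists.
NAME_SWAPS = {"Napoleon": "Wellington", "Einstein": "Bohr"}
PLACE_SWAPS = {"Paris": "Vienna", "Berlin": "Lisbon"}

# Match four-digit years so they can be nudged while still looking like dates.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20[0-2]\d)\b")

def poison(text: str, year_offset: int = 3) -> str:
    """Return a subtly altered copy of the mirrored page text."""
    for table in (NAME_SWAPS, PLACE_SWAPS):
        for original, replacement in table.items():
            text = re.sub(rf"\b{original}\b", replacement, text)
    # Shift years so the output is wrong but still plausible.
    return YEAR_PATTERN.sub(lambda m: str(int(m.group(1)) + year_offset), text)

if __name__ == "__main__":
    print(poison("Napoleon entered Paris in 1815."))
    # -> "Wellington entered Vienna in 1818."
```

Whether a handful of poisoned mirrors would actually move the needle against the sheer volume of clean text out there is a separate question.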
> We demonstrate that when using Gemma-based verifiers on algorithmic and grade-school math reasoning tasks, GenRM outperforms discriminative verifiers and LLM-as-a-Judge, showing a 16-64% improvement in the percentage of problems solved with Best-of-N. Furthermore, we show that GenRM scales favorably across dataset size, model capacity, and inference-time compute.
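For context, Best-of-N here just means sampling N candidate solutions and letting the verifier pick one. A rough sketch of that selection loop, with hypothetical `generate_candidate` and `verifier_score` stand-ins rather than anything from the paper:

```python
from typing import Callable

def best_of_n(
    problem: str,
    generate_candidate: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    n: int = 16,
) -> str:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates = [generate_candidate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(problem, c))
```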
If the person using the LLM isn't asking it questions but getting help with certain tasks, like refining their own ideas that they're putting in, it can be a different experience.
Maybe LLMs aren't meant to be factual to someone who doesn't understand the facts, and that's ok.
A good example of people not understanding what they're asking is when people showcase the whole "which is bigger, 9.11 or 9.9?" question. Of course an LLM will return 9.11, because in most contexts (chapters, versions, etc.) it is the bigger number, but if you specify that the context is fractional numbers, most models will correctly say 9.9 is bigger. A while ago I kept seeing people tout the conventional "9.11 is bigger" answer as proof that LLMs don't understand numbers and are bad at math. While that is still empirically true, I don't think most people understand these things as well as they believe they do either. Specifically, people don't understand how broad and ill-defined their questions often are, and how much of their preconceptions need to be communicated for their questions to mean what they think they mean.
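A tiny way to see the two readings side by side (the version-style comparison is just a sketch):

```python
# The same tokens compared under two readings.
print(max((9.11, 9.9)))                      # 9.9  -- as decimal numbers, 9.9 is bigger

# Read as chapter/version numbers and compare component-wise instead.
def as_version(s: str) -> tuple[int, ...]:
    return tuple(int(part) for part in s.split("."))

print(max(("9.11", "9.9"), key=as_version))  # 9.11 -- as versions, 9.11 comes later
```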
Numbering-wise, in a text document, 9.11 can be higher.
There's a chance that LLMs are trained on more general documents than mathematical material, despite the clear understanding that 9.9 is the higher mathematical value.
When I've come across this, I ask which is the higher mathematical value vs. the higher numbering value in a document, and the results seem to be better. Happy to learn from your and others' experiences.
To most people in most life situations, this series of symbols represents not chapters or versions but numbers. If you always have to give this context to an AI, then the I is not there yet.
Same for me; it has also made me a much better communicator, because I've learned that while I often assume everyone has the context I have, that is too often not the case, and it becomes the basis of most of the miscommunication issues and conflicts I encounter. Another issue is a mismatch between the precise definitions of words I have learned over time and what others have learned. It is honestly a marvel that LLMs could learn to plausibly reproduce human communication as well as they have, given how utterly insane it is that we even manage to communicate effectively with one another in the first place.
Clarifying context before communicating or reviewing something is critical.
Especially since GPT-4o has become an over-exuberant and eager intern with a bias toward jumping straight to solutions.
It needs reminders every 2-3 messages not to jump to solutions until the problem has been defined and approaches beyond the one it wants to spin its propeller on have been considered. Statistically this approach must work for a good chunk of prompts and queries, and I assume my needing otherwise may not be the common case.
Well, I for one wouldn't assume, unless the conversation up to that point indicated as much. If a random person on the street asked me the same question, I would ask clarifying questions first. An LLM I fine-tuned on my own conversations/dialogues has the same habit, and when I asked it this question it did exactly that, and then, since the answer could be short, gave a brief one (paraphrasing): "if the context is chapters/versions etc. then 9.11, but if it is decimal numbers then 9.9". That made me realise most LLMs are probably having similar ambiguity issues but are trained to just give an answer without asking for clarification when equally valid directions for answering are possible. So I tested it, and found that was exactly the case.
That's less a problem of the AI missing the I and more a problem of ill-defined expectations.
> Nothing about LLM architecture is biased towards facts.... If they're factual at all, it's only because most things written by people are true
That's like saying "if you observe an LLM perform well, it's only because the idea as a whole is good, rather than any one aspect in isolation." Yeah, duh.
I don't know how technology people or the press can on the one hand say that these AI systems have biases, but on the other hand, that they are incapable of ever having biases towards e.g. facts. Of course you can make them do anything, all these problems are surmountable. Tough cookie NYTimes! But I think they will have the last laugh, because one thing that is insurmountable is getting an LLM a Columbia Journalism Masters or an important dad from Manhattan, which is the only thing that really matters to the NYTimes.
I guess the idea is that facts corroborate, and therefore, the most efficient way to learn to reproduce all things is to preferentially learn things which fit in with the rest of the things you are learning. This of course assumes that the majority of the training corpus represents truthful behavior, otherwise it becomes more efficient to become really good at making things up.
I had an idea a couple of weeks ago -- vintage data.
Data that you could guarantee was not generated by AI. Stuff like scans of old microfiche archives, scans of rare but unpopular books found in old boxes and attics. Things like that.
Free startup idea! Go to garage sales and look for boxes of books that maybe have never been digitized.
Just be careful that the entire training set doesn't have the moral values of 100 years ago!
I mean, I'm sure curated datasets that are guaranteed (within some margin of error, probably) to not have been AI-generated will be a valuable commodity, or rather already are. Maybe that would include old reddit posts/comments, not sure how far back you'd need to go. (You would know better than I).
Either Daniel Dennett or one of his friends wrote a thought-experiment story about a worldwide brain-science drive where every participating scientist gets one neuron from the dissected brain to keep alive in a tank, observing its firing patterns. One day one of them notices his neuron has died, so he quietly wires the input to the output through a circuit which fires the same way...
This is a nice gesture, but true to form, I'm going to springboard off of it with negativity:
This is the absolute stupidity of paywalls meant to force a small segment of users into third-party dynamic ad insertion. It's greedy and lazy, while somewhat giving away editorial control of articles to whatever JavaScript gets shoved down the pipe at you.
That newspapers haven't built their own ad-serving network or their own analytics network in 2024 is something I seriously didn't expect. It's so mind-boggling, I refuse to even take their freebies at this point; I'd much rather get the article through a third party that strips everything out.
It sends the correct message: you're greedy, and now you get _nothing_.
I don't quite follow your thought process but admire your vociferousness.
In this case I got this link from a newsletter, who presumably have a deal with NYT in that from time to time they are able to offer a special "back door" through the NYT paywall. NYT expects this will increase their paid audience, and the newsletter gets valuable content to increase their own value to their readers so a win-win for them both.
This particular article doesn't render properly through the parent archival site's link, but it does through the link I provided.
Wishing you luck in your crusade against paywalls! Personally I find them annoying but don't see how I can legitimately have an issue with them, it seems like simple capitalism at work.
So, you don't think quality journalism is worth paying for, upfront, direct to the creators, but rather through a proxy system based around advertising?
Color me unsurprised that the New York Times is writing an article drumming up fear around synthetic data and training on the internet, and concludes that the solution is for "A.I. companies to pay for this data instead of scooping it up from the internet, ensuring both human origin and high quality . . . there’s no replacement for the real thing."
This reads like an extended sales pitch rather than an article. It's true that if you train the same model on nothing but a limited set of its own output for 30 epochs, you're gonna get a shitty model (although 30 epochs of finetuning will result in a pretty bad model on most datasets smaller than the original one). But paying the New York Times isn't going to change that: you can already mark articles pulled from its domain as being high-quality, human-sourced data, even if you don't pay them. If they win their lawsuits, they might be able to force model trainers to pay them for it, but that doesn't have anything to do with model collapse.
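A toy illustration of the collapse dynamic (not the article's experiment, just a sketch assuming a long-tailed "token" distribution): refit a simple frequency model on its own samples each generation and watch the rare outcomes vanish for good.

```python
import random
from collections import Counter

# Each generation is "trained" (here: an empirical frequency estimate) only on
# samples from the previous generation's model. A rare token that misses one
# generation's sample gets weight zero and can never come back, so diversity
# can only shrink.
random.seed(0)
vocabulary = list(range(50))
weights = [1.0 / (i + 1) for i in vocabulary]   # long-tailed "human" distribution

for generation in range(10):
    sample = random.choices(vocabulary, weights=weights, k=200)
    counts = Counter(sample)
    # Refit on the model's own output: empirical counts become the new weights.
    weights = [counts.get(tok, 0) for tok in vocabulary]
    print(f"gen {generation}: {sum(w > 0 for w in weights)} of {len(vocabulary)} tokens survive")
```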
OMG, this is the startup idea I have been pitching for 10 years, and everybody laughs at me when I pitch it.
Newspapers need to pivot away from "telling stories about things that happen" in articles and headlines, to organizations that gather information into some large interconnected database. They should collect and record testimony from witnesses, press releases and statements. Court recordings and findings. They should be able to take any "fact" and trace back to who claimed it, where, and when.
Then, when you have a solid foundation, you can put a front end on it with commentators musing on various topics, or digging deeper looking for meaning, or cause and effects.
Newspapers could be the organizations we go to when we want to know who said what and when, and what happened as a result. They should be neutral, and trustworthy.
The role of newspapers should be to record history as it happens.
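A rough sketch of what one record in such a provenance database might look like; the field names and example values are purely illustrative, not a proposed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Claim:
    """One traceable assertion: what was claimed, by whom, where, and when."""
    statement: str
    claimant: str                 # witness, official, document, etc.
    source: str                   # press release, court transcript, interview...
    location: str
    claimed_on: date
    supports: list[str] = field(default_factory=list)   # ids of corroborating claims
    disputes: list[str] = field(default_factory=list)   # ids of conflicting claims

# Example record an article could later cite and trace back to its origin.
example = Claim(
    statement="The bridge closure began at 6 a.m.",
    claimant="City transport department",
    source="Press release, 2024-03-14",
    location="Springfield",
    claimed_on=date(2024, 3, 14),
)
```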
“‘Early in the Reticulum—thousands of years ago—it became almost useless because it was cluttered with faulty, obsolete, or downright misleading information,’ Sammann said.
“‘Crap, you once called it,’ I reminded him.
“‘Yes—a technical term. So crap filtering became important. Businesses were built around it. Some of those businesses came up with a clever plan to make more money: they poisoned the well. They began to put crap on the Reticulum deliberately, forcing people to use their products to filter that crap back out. They created syndevs whose sole purpose was to spew crap into the Reticulum. But it had to be good crap.’
“‘What is good crap?’ Arsibalt asked in a politely incredulous tone.
“‘Well, bad crap would be an unformatted document consisting of random letters. Good crap would be a beautifully typeset, well-written document that contained a hundred correct, verifiable sentences and one that was subtly false. It’s a lot harder to generate good crap. At first they had to hire humans to churn it out. They mostly did it by taking legitimate documents and inserting errors—swapping one name for another, say. But it didn’t really take off until the military got interested.’
“‘As a tactic for planting misinformation in the enemy’s reticules, you mean,’ Osa said. ‘This I know about. You are referring to the Artificial Inanity programs of the mid-First Millennium…’”
There's a similar theme running through his "Fall; or, Dodge in Hell", where a fake nuclear attack and the subsequent social media storm lead to a collapse of the internet and the emergence of a replacement based on human or automated editors filtering out all the "crap". This then leads to confirmation bubbles where people believe whatever they want to believe, because their edit streams lock them into whatever they want to hear: Ku Klux Klan-like militias taking control of rural parts of "Ameristan", and people keep remembering the victims of a nuclear attack that never happened.
He published that book just before the first Trump administration. A lot of the stuff that probably seemed a bit overly dramatic at the time has actually happened. People disagreeing about the last election outcome would be a prime example of some alternate truth that seems hard to weed out. And a lot of that is of course fueled by quite intentional media coverage sponsored by the likes of North Korea, Russia, and China who have definitely been trying to militarize misinformation for quite some time and run bots on a large scale on social media platforms.
As a social commentary, both books are spot on. Anathem is probably my favorite NS novel though I love them all. Also don't miss out on the definition of Bullshytt. Classic Neal Stephenson.
It's going to be fine. The tool is fantastic and still getting better. We have enough content already for the base models and then we can take care of the new stuff. It's going to be non-trivial but it'll work.
AI's novelty is wearing off as it starts to fail to produce immediate growth for investors. Today's investors are caught up in exploiting the latest thing, then burning out because they released something half-baked.
Humanity has bootstrapped itself out of a lot of BS over the centuries. There's a mechanism for discarding bad ideas. For example:
Badly-designed boats just don't return.
Ill-designed protection of cities means they'll be conquered.
Scientific ideas that are not corroborated will be discarded.
etc.
Our current approach to AI doesn't have this mechanism. In the past, humanity just implemented ideas: a city was built according to some weird idea and lasted centuries. So the original idea would spread and be refined by further generations.
I guess we need to bring such a mechanism into the loop.
Disappointed they didn't even mention intentional data poisoning - we simply have no idea what trapdoors are being laid in training data yet to be ingested...