This could be the weirdest kind of moat yet. If you crawled all the things and built a model before everything became bot-generated, you can get clean post-2024 human data from the human inputs to your tool. If you haven't, then maybe you're stuck with the 2023-and-earlier crawls, limiting your models' relevance. We've already seen that the feedback loop of training models on their own outputs isn't nearly as valuable, and can get wacky fast. It'll be weird to see how that plays out.
I'm wondering now: does the same effect apply to regular HN readers? In the sense that we're contaminated (for lack of a better word) and can't see things out there without equivalent connections popping into our heads! :)
The analogy I've been using is an ouroboros of bullshit: consuming AI-generated bullshit to generate AI bullshit to consume to generate AI bullshit, ad infinitum.
Forests that never decayed, because nothing could break down lignin molecules for millions of years. Many were buried underground and, I believe, turned into coal/oil.
The shadow libraries are the largest collection of human knowledge to date, and completely untainted by AI. Any search engine that crawls and indexes them will have a tenfold increase in quality and be as revolutionary as the invention of the internet. No LLM needed.
On top of that, there is no incentive for AI-generated content to enter the shadow libraries at all.
> On top of that, there is no incentive for AI-generated content to enter the shadow libraries at all.
I think you underestimate just how many people/entities/forces exist that would love to see further decline, division, and discord in the Anglosphere...
Beyond Western destabilization, there are people who flat out cause issues just because. Not to mention that people who are anti-AI are motivated to weaken AI.
There's no reason people wouldn't taint any source of AI-free information once it became clear that's what it was.
Except that human-generated doesn't really seem to matter; all that seems to matter is some basic guardrails on the data you choose. Meta has models generating training data, then grading it and selecting the best examples to reincorporate into the training set, and it's improving benchmarks.
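Roughly the shape of that loop, as a toy sketch; `generate` and `grade` here are placeholder stand-ins for the real model and grader model, not Meta's actual pipeline:

    import random

    def generate(prompt: str) -> str:
        # Stand-in for sampling a candidate answer from the model.
        return f"{prompt} -> answer#{random.randint(0, 9999)}"

    def grade(sample: str) -> float:
        # Stand-in for a grader/reward model scoring the sample.
        return random.random()

    def build_synthetic_set(prompts, n_candidates=8, threshold=0.9):
        keep = []
        for p in prompts:
            candidates = [generate(p) for _ in range(n_candidates)]
            score, best = max((grade(c), c) for c in candidates)
            if score >= threshold:  # only the best-graded examples go back in
                keep.append((p, best))
        return keep  # reincorporated into the next fine-tuning round

    print(build_synthetic_set(["q1", "q2", "q3"]))

The grading/filtering step is the "basic guardrail": without it you're just feeding raw model output back into training.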
The problem with model collapse is that it reinforces the mean at the cost of the edges of your distribution curve, particularly on repeat.
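You can see that in a toy fit-and-resample loop (my assumption of the simplest possible "train on your own outputs" setup, nothing like a real training run): with finite samples, the rare tail values are seldom drawn, so each refit narrows a bit around the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0  # generation 0: the "human" data
    for gen in range(1, 51):
        samples = rng.normal(mu, sigma, size=25)   # "model outputs"
        mu, sigma = samples.mean(), samples.std()  # "retrain" on them
        if gen % 10 == 0:
            print(f"gen {gen}: sigma={sigma:.3f}  max|x|={abs(samples).max():.2f}")

The spread tends to drift downward generation over generation, and the extremes seen early on stop being reproduced.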
One thing being overlooked: offsetting the job loss from AI replacing mean work, there are going to be new markets for edge-case creation and curation.
Jackson Pollocks and Hunter S. Thompsons for the AI generation, with a primary audience of AIs rather than humans, sponsored by large tech and data companies playing the part of a new Renaissance Vatican.
Another way they could use this is to log the generated text; then, when crawling pages, any text Chrome didn't generate has a decent chance of being human-written, or from another tool. But I doubt people who have access to this in Chrome will really use another tool, so Google can probably differentiate between sources.
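Something like hashed shingles would make that log practical at scale. This is purely my guess at a mechanism, not any real Chrome or Google API:

    import hashlib

    generated_log: set[str] = set()

    def shingles(text: str, n: int = 8) -> set[str]:
        # Hash every n-word window so the log stores fingerprints, not raw text.
        words = text.lower().split()
        return {
            hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(max(len(words) - n + 1, 1))
        }

    def log_generation(text: str) -> None:
        # Called whenever the built-in model emits text.
        generated_log.update(shingles(text))

    def likely_machine_generated(crawled: str, overlap: float = 0.5) -> bool:
        # At crawl time: high shingle overlap with the log => our model wrote it.
        s = shingles(crawled)
        return len(s & generated_log) / max(len(s), 1) >= overlap

    log_generation("the quick brown fox jumps over the lazy dog every single day")
    print(likely_machine_generated("the quick brown fox jumps over the lazy dog every single day"))   # True
    print(likely_machine_generated("a completely different sentence a person typed out on their own"))  # False

Anything with low overlap against the log is either human or came from someone else's tool, which is exactly the signal they'd want.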