
This could be the weirdest kind of moat yet. If you crawled all the things and built a model before everything became bot-generated, you can get clean post-2024 human data from the human inputs to your tool. If you haven't, then maybe you're stuck with the 2023-and-earlier crawls, limiting your models' relevance. We've already seen that the feedback loop of training models on model outputs isn't nearly as valuable, and can get wacky fast. It'll be weird to see how that plays out.




That is such a fantastic comparison and this is the first place I've heard it made. I'll be stealing it, thank you :)


I was immediately reminded of this too.

I'm wondering now, does the same effect apply to regular HN readers? In the sense that, we're contaminated (for lack of a better word), and are unable to see things out there without having equivalent connections pop into our heads! :)


The analogy I've been using is an ouroboros of bullshit: consuming AI-generated bullshit to generate AI bullshit to consume to generate AI bullshit, ad infinitum.


Very cool - I wonder what else fits the analogy. No plastic meat?


Forests that never decayed because nothing could break down lignin molecules for millions of years. Many were buried underground and, I believe, turned into coal/oil.



The shadow libraries are the largest collection of human knowledge to date, and completely untainted by AI. Any search engine that crawls and indexes them will have a tenfold increase in quality and be as revolutionary as the invention of the internet. No LLM needed.

On top of that, there is no incentive for AI generated content to enter the shadow libraries at all.


> On top of that, there is no incentive for AI generated content to enter the shadow libraries at all.

I think you underestimate just how many people/entities/forces exist that would love to see further decline, division, and discord in the Anglosphere...


Beyond just Western destabilization, there are people who cause trouble just because they can. Not to mention that people who are anti-AI are motivated to weaken AI.

There's no reason people wouldn't taint any source of AI-free information if it became clear that is what it was.


In seriousness, are other languages faring any better or differently in all this?


What does the "anglosphere" have to do with online libraries? Will I regret asking that?

There is no incentive for AI content or spam in shadow libraries, because why would anybody risk prison to illegally copy spam?


What makes you assume they have not already been used by OpenAI, Google, or Baidu, etc?


I don't assume that, and I haven't said anything to that effect.


>there is no incentive for AI generated content to enter the shadow libraries

If that's the case, I interpreted that phrase quite differently.


What in the world? What do you take that phrase to mean?


Except that human-generated doesn't really seem to matter; all that seems to matter is having some basic guardrails on the data you choose. Meta has models generating training data, grading it, and selecting the best examples to reincorporate into the training set, and it's improving benchmarks.
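For what it's worth, the generate-then-grade loop described there has roughly this shape (a toy sketch of the general technique, not Meta's actual pipeline; the random "quality" score stands in for a real reward model or LLM judge):

```python
import random

random.seed(0)

def generate(prompt, n=8):
    # Stand-in for sampling n completions from a model; each
    # candidate carries a hidden quality score drawn at random.
    return [(f"{prompt}: candidate {i}", random.random()) for i in range(n)]

def grade(candidate):
    # Stand-in for the grader (a reward model or LLM judge).
    _text, quality = candidate
    return quality

def curate(prompts, keep=2):
    # Keep only the highest-graded completions per prompt; these
    # become the synthetic examples folded back into training.
    dataset = []
    for prompt in prompts:
        ranked = sorted(generate(prompt), key=grade, reverse=True)
        dataset.extend(text for text, _ in ranked[:keep])
    return dataset

dataset = curate(["2+2", "capital of France"])
print(len(dataset))
```

The filtering step is the whole trick: the model's raw outputs aren't trusted, only the top-graded slice is fed back in.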


The problem with model collapse is that it reinforces the mean at the cost of the edges of your distribution, particularly when repeated.
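You can see the effect in a toy simulation (my own illustration, not from any paper): repeatedly fit a Gaussian to samples drawn from the previously fitted Gaussian, and the estimated spread shrinks generation after generation, so the tails thin out first.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse(generations=200, n_samples=20, mu=0.0, sigma=1.0):
    # Each generation: draw "model outputs", then refit the model
    # on its own outputs. The MLE of the std is biased low, so the
    # distribution's edges erode on repeat.
    history = [sigma]
    for _ in range(generations):
        samples = rng.normal(mu, sigma, n_samples)
        mu, sigma = samples.mean(), samples.std()
        history.append(sigma)
    return history

history = collapse()
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 200: {history[-1]:.6f}")
```

With small sample sizes the spread decays fast; larger samples slow the collapse but don't eliminate the drift toward the mean.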

One thing being overlooked is that the job loss from AI replacing mean-of-the-distribution work will be partly offset by new markets for edge-case creation and curation.

Jackson Pollock and Hunter S. Thompson for the AI generation, with a primary audience of AIs rather than humans, sponsored by large tech and data companies like a new Renaissance Vatican.


That problem only exists as long as benchmarks don't sample the problem space enough, and it can be quickly rectified once identified.


The industry has a much bigger issue with benchmarks and Goodhart's Law right now as it is. I'm skeptical benchmarks are the solution here in turn.


Another way they can use this is to log the generated text: when crawling pages, if they find text that Chrome didn't generate, there's a chance it came from a human, or from another tool. But I doubt that people with access to this in Chrome will really use another tool, so Google can probably differentiate between the sources.


>We've already seen that the feedback loops of training models on model outputs isn't nearly as valuable, and can get wacky fast.

IIRC this is less true with the very largest SOTA models, and that OpenAI is now using synthetic data with success.


Reminds me of how they need to raise sunken WWI ships to get clean steel for certain applications, because of all the nuclear weapons testing that happened.


It still helps build synthetic data.



