
This could be the weirdest kind of moat yet. If you crawled all the things and built a model before everything became bot-generated, you can get clean post-2024 human data from the human inputs to your tool. If you haven't, then maybe you're stuck with the 2023-and-earlier crawls, limiting your models' relevance. We've already seen that the feedback loop of training models on model outputs isn't nearly as valuable, and can get wacky fast. It'll be weird to see how that plays out.




That is such a fantastic comparison and this is the first place I've heard it made. I'll be stealing it, thank you :)


I was immediately reminded of this too.

I'm wondering now, does the same effect apply to regular HN readers? In the sense that, we're contaminated (for lack of a better word), and are unable to see things out there without having equivalent connections pop into our heads! :)


The analogy I've been using is an ouroboros of bullshit: consuming AI-generated bullshit to generate AI bullshit to consume to generate AI bullshit, ad infinitum.


Very cool - I wonder what else fits the analogy. No plastic meat?


Forests that never decayed because nothing could break down lignin molecules for millions of years. Many were buried underground and, I believe, turned into coal/oil.



The shadow libraries are the largest collection of human knowledge to date, and completely untainted by AI. Any search engine that crawls and indexes them will have a tenfold increase in quality and be as revolutionary as the invention of the internet. No LLM needed.

On top of that, there is no incentive for AI generated content to enter the shadow libraries at all.


> On top of that, there is no incentive for AI generated content to enter the shadow libraries at all.

I think you underestimate just how many people/entities/forces exist that would love to see further decline, division, and discord in the Anglosphere...


Beyond just Western destabilization, there are people who cause trouble just because they can. Not to mention that people who are anti-AI are motivated to weaken AI.

There's no reason people wouldn't taint any source of AI-free information if it became clear that is what it was.


In seriousness, are other languages faring any better or differently in all this?


What does the "anglosphere" have to do with online libraries? Will I regret asking that?

There is no incentive for AI content or spam in shadow libraries, because why would anybody risk prison to illegally copy spam?


What makes you assume they have not already been used by OpenAI, Google, or Baidu, etc?


I don't assume that, and I haven't said anything to that effect.


>there is no incentive for AI generated content to enter the shadow libraries

If that's the case, I interpreted that phrase quite differently.


What in the world? What do you take that phrase to mean?


Except that human-generated doesn't really seem to matter; all that seems to matter is having some basic guardrails on the data you choose. Meta has models generating training data, grading it, and selecting the best examples to reincorporate into the training set, and it's improving benchmarks.
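For what it's worth, the generate-then-grade loop described there has roughly this shape (a toy sketch of the general technique, not Meta's actual pipeline; the random "quality" score stands in for a real reward model or LLM judge):

```python
import random

random.seed(0)

def generate(prompt, n=8):
    # Stand-in for sampling n completions from a model; each
    # candidate carries a hidden quality score drawn at random.
    return [(f"{prompt}: candidate {i}", random.random()) for i in range(n)]

def grade(candidate):
    # Stand-in for the grader (a reward model or LLM judge).
    _text, quality = candidate
    return quality

def curate(prompts, keep=2):
    # Keep only the highest-graded completions per prompt; these
    # become the synthetic examples folded back into training.
    dataset = []
    for prompt in prompts:
        ranked = sorted(generate(prompt), key=grade, reverse=True)
        dataset.extend(text for text, _ in ranked[:keep])
    return dataset

dataset = curate(["2+2", "capital of France"])
print(len(dataset))
```

The filtering step is the whole trick: the model's raw outputs aren't trusted, only the top-graded slice is fed back in.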


The problem with model collapse is that it reinforces the mean at the cost of the edges of your distribution, particularly when repeated.
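You can see the effect in a toy simulation (my own illustration, not from any paper): repeatedly fit a Gaussian to samples drawn from the previously fitted Gaussian, and the estimated spread shrinks generation after generation, so the tails thin out first.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse(generations=200, n_samples=20, mu=0.0, sigma=1.0):
    # Each generation: draw "model outputs", then refit the model
    # on its own outputs. The MLE of the std is biased low, so the
    # distribution's edges erode on repeat.
    history = [sigma]
    for _ in range(generations):
        samples = rng.normal(mu, sigma, n_samples)
        mu, sigma = samples.mean(), samples.std()
        history.append(sigma)
    return history

history = collapse()
print(f"std at generation 0:   {history[0]:.3f}")
print(f"std at generation 200: {history[-1]:.6f}")
```

With small sample sizes the spread decays fast; larger samples slow the collapse but don't eliminate the drift toward the mean.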

One thing being overlooked is that the job loss from AI replacing mean-of-the-distribution work will be partly offset by new markets for edge-case creation and curation.

Jackson Pollock and Hunter S. Thompson for the AI generation, with a primary audience of AIs rather than humans, sponsored by large tech and data companies like a new Renaissance Vatican.


That problem only exists as long as benchmarks don't sample the problem space enough, and it can be quickly rectified once identified.


The industry has a much bigger issue with benchmarks and Goodhart's Law right now as it is. I'm skeptical benchmarks are the solution here in turn.


Another way they can use this is to log the generated text: when crawling pages, if they find text that Chrome didn't generate, there's a chance it came from a human, or from another tool. But I doubt that people with access to this in Chrome will really use another tool, so Google can probably differentiate between the sources.


>We've already seen that the feedback loops of training models on model outputs isn't nearly as valuable, and can get wacky fast.

IIRC this is less true with the very largest SOTA models, and that OpenAI is now using synthetic data with success.


Reminds me of how they need to raise sunken WWI ships to get clean steel for certain applications, because of all the nuclear weapons testing that happened.


It still helps build synthetic data.



