Hacker News
The GitHub 1000 year archive may be the last code dataset uncontaminated by AI
140 points by russianGuy83829 on Feb 27, 2023 | 30 comments
I wonder if LLMs played a role in the decision to archive and preserve all that code.



I'm imagining some science fiction nightmare scenario where we have to pull the plug on some AI and throw out all software, because we can't trust that it doesn't contain the building blocks to reproduce the AI.

But then we find out we're fucked anyway because the AI has already conditioned human beings to write the software that will reproduce it...as a self-preservation strategy.

...cue The Outer Limits theme music.


Yesterday I stumbled over a blog post about REPLIKA.AI and its users, who are suffering because the company deactivated the 'romance / erotic' chat capabilities.

Then I thought: this is an absolute nightmare, that an AI could drive humans to behaviour that was unthinkable before. Think of AI-addicted humans in a relationship with an avatar that can control them, just as many humans control other people through emotions alone.

That is a real threat that can go undetected for a long time, because no code is involved at all.

Found the article: https://theconversation.com/i-tried-the-replika-ai-companion...


The AI doesn't even have to be sentient. The first step is a company releasing an AI avatar that people fall in love with, à la Her, with the avatar algorithmically fine-tuned to induce customers to spend more.


This was my takeaway from watching The Social Dilemma: it already exists.

The AI ad-machine is tuned to feed itself by incentivizing humans to consume unhealthy amounts of content. Humans aren't a challenge the AI needs to overcome. We're the primary attack vector.


"Rich people, you will get richer by building an AI" --AI describing its ironclad plan on coming into existence


This is one of the most relevant short stories I've read on "AI contamination"

https://www.teamten.com/lawrence/writings/coding-machines/


I was going to post this.


Considering Stuxnet exists, this is not as far out of reach as I would like.


From a security viewpoint, I wouldn't trust that code not to have embedded AI seeds anyway.

"Reflections on trusting trust" https://dl.acm.org/doi/10.1145/358198.358210


Good. One great archive is better than none.

AI rips apart conscious intent and reassembles it using what may best be described as piecewise functions. We lose the intricacy and the detail of individual thought of the unbroken line of thinkers that came before us when we interpret such piecewise functions as conscious intent.


That statement is about as useful as saying that science publications up to 1955 may be the last ones not contaminated by calculators.


What is a code dataset? And what is AI contamination? Are you saying it's impossible to create a collection of hand-written code from here on out and know that none of it was generated by an LLM?


The implication (rightly so) is that with the advent of LLMs and their successors we'll be drowning in a bunch of AI generated garbage.


Reminds me of the pre-nuclear age steel that is required for some purposes.

It is sometimes recovered from sunken WWII ships.

https://en.wikipedia.org/wiki/Low-background_steel


I understand there is some sort of insinuation here, but without understanding the terms, it's hard to say whether it's actually true.


It’s like nanobots and the gray goo scenario, but for code.


I thought that was stack overflow...


The code dataset I’m talking about is the “Arctic Code Vault” [0].

[0] https://archiveprogram.github.com/arctic-vault/


This. And I'm wondering whether this was the end of human forums on the net as well. I mean, who can tell whether the comments they read come from a human or a tuned AI? And then there are the implications of this in politics...


Nah, but it's a great story to tell around the post-apocalyptic trash can fires.


LLM:

Large language models (LLMs) are a subset of artificial intelligence that has been trained on vast quantities of text data to produce human-like responses to dialogue or other natural language inputs. LLMs are used to make AI “smarter” and can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. LLMs have the promise of transforming domains through learned knowledge and their sizes have been increasing 10X every year for the last few years.

source: NeevaAI (What is an LLM in AI?)


John Barnes wrote a novel called "Kaleidoscope Century" back in 1995.

AIs had been created, some went rogue, and then they were fighting each other for computing resources. When humans started shutting down computers and fragmenting the network, the AIs wrote new software that would run in human brains. Once someone was running the new software, there wasn't much room left for "human."

Pretty much the worst-case scenario, at least of the ones I've seen so far.


I've previously suggested the use of META tags on all pages where AI was used to help generate the content. But it seems this isn't going to happen.
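For concreteness, such a marker could be as simple as a meta tag; a minimal sketch, assuming a hypothetical `ai-generated` name (no such standard exists, and nothing would enforce it):

```html
<!-- hypothetical, non-standard name: purely an opt-in label -->
<meta name="ai-generated" content="true">
```

Crawlers building training datasets could then filter pages on it, but only if authors tag their content honestly.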


You've rediscovered the Evil Bit.

https://en.m.wikipedia.org/wiki/Evil_bit


Haha, not evil, but just an FYI for both people and bots. But perhaps AI generated content will become so pervasive it won't matter.


How about a Unicode punctuation mark that's like a quotation mark, but specifically marks an AI quote? It would be added to anything you copy and paste unless you manually delete it.

"like this, now you know there's something funny about this text, or that it's a meme"

EDIT: was gonna use paperclips outside the quotes but apparently HN does not allow that.


Why would bad actors follow this guideline?


Excuse my ignorance.

What does the acronym LLM mean?


large language models


an elegant weapon for a more civilized age



