halfadot's comments | Hacker News

For almost everyone in the world, the answer is the latter.


> Leaning on an LLM to ease through those tough moments is 100% short circuiting the learning process.

Sounds like "back in my day" type complaining. Do you have any evidence of this "100% short circuiting" or is it just "AI bad" bandwagoning?

> But you're definitely not learning how to write.

How would you know? You've never tested him. You're making a far-reaching assumption about someone's learning based on using an aid. It's the equivalent of saying "you're definitely not learning how to ride a bicycle if you use training wheels".


> Don’t let a computer write for you! I say this not for reasons of intellectual honesty, or for the spirit of fairness. I say this because I believe that your original thoughts are far more interesting, meaningful, and valuable than whatever a large language model can transform them into.

Having spent about two decades reading other humans' "original thoughts", I have nothing else to say here other than: doubt.


Luckily for Mistral, capital also exists in countries other than the USA.


> Three months later in April when this tagged data is used to train the next iteration, the AI can successfully learn that today's date is actually January 29th.

Such an ingenious attack, surely none of these companies ever considered it.


Yeah, it's great comedy.

> Aaron clearly warns users that Nepenthes is aggressive malware. It's not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an "infinite maze" of static files with no exit links, where they "get stuck" and "thrash around" for months, he tells users.

Because a website with lots of links is executable code. And the scrapers totally don't have any checks in them to see if they spent too much time on a single domain. And no data verification ever occurs. Hell, why not go all the way? Just put a big warning telling everyone: "Warning, this is a cyber-nuclear weapon! Do not deploy unless you're a super rad bad dude who totally traps the evil AI robot and wins the day!"
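
For what it's worth, a per-domain budget is a handful of lines; here's a toy sketch of the kind of check any serious crawler presumably already has (all names and limits are invented):

    from collections import defaultdict
    import time

    MAX_PAGES_PER_DOMAIN = 10_000     # invented limit
    MAX_SECONDS_PER_DOMAIN = 3_600    # invented limit

    pages_crawled = defaultdict(int)  # pages fetched per domain
    first_seen = {}                   # when each domain was first visited

    def within_budget(domain: str) -> bool:
        # Stop crawling a domain once it has eaten too many pages or too
        # much time, which is what makes an "infinite maze" a non-event.
        now = time.monotonic()
        first_seen.setdefault(domain, now)
        if pages_crawled[domain] >= MAX_PAGES_PER_DOMAIN:
            return False
        if now - first_seen[domain] >= MAX_SECONDS_PER_DOMAIN:
            return False
        pages_crawled[domain] += 1
        return True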


And they are hilarious, because they ride on the assumption that multi-billion dollar companies are all just employing naive imbeciles who push buttons and watch the lights on the server racks go, never checking the datasets.


AI models don't assume anything. AI models are just statistical tools. Their data is prepared by humans, who aren't morons. What is it with these super-ignorant AI critiques popping up everywhere?


There's so much data required for training that it would be surprising if humans looked at even a small subset of it at all. They need different statistical tools to clean it up. That's where attacks will naturally be concentrated, and this is why synthetic data will overtake real human data, right after the problem of 'there isn't enough data even though there's already too much'.
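
To make "different statistical tools" concrete: most of the cleanup is crude heuristic filtering applied at corpus scale, something in the spirit of this toy sketch (all thresholds invented):

    def looks_like_junk(doc: str) -> bool:
        # Crude filters of the sort used to drop garbage pages
        # without a human ever reading them.
        words = doc.split()
        if len(words) < 50:                         # too short to be worth keeping
            return True
        if len(set(words)) / len(words) < 0.3:      # highly repetitive, spam/maze-like text
            return True
        letters = sum(c.isalpha() or c.isspace() for c in doc)
        if letters / max(len(doc), 1) < 0.6:        # mostly markup or symbol soup
            return True
        return False

    scraped_docs = ["example crawled page text ..."]             # placeholder corpus
    corpus = [d for d in scraped_docs if not looks_like_junk(d)]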


Try a little benefit of the doubt, nuance or colloquialism. Or a bit of all three.


What makes people think companies like OpenAI can't just pay experts for verified true data? Why do all these "gotcha" replies always revolve around the idea that everyone developing AI models is credulous and stupid?


Because paying experts for verified true data in the quantities they need isn't possible. Ilya himself said we've reached peak data (https://www.theverge.com/2024/12/13/24320811/what-ilya-sutsk...).

Why do you think we are stupid? We work at places developing these models and have a peek into how they're built...


You see a rowboat, and you need to cross the river.

Ask a dozen experts to decide what that boat needs to fit your need.

That is the specification problem; add on the frame problem and it becomes intractable.

Add in domain specific terms and conflicts and it becomes even more difficult.

Any nontrivial semantic property, one without a clear true/false answer, is undecidable.

OpenAI will have to do what they can, but it is not trivial or solvable.

It doesn't matter how smart they are, generalized solutions are hard.


It is absolutely fascinating to read the fantasy produced by people who (apparently) think they live in a sci-fi movie.

The companies whose datasets you're "poisoning" absolutely know about the attempts to poison data. All the ideas I've seen linked on this site so far about how they're going to totally defeat the AI companies' models sound like a mixture of wishful thinking and narcissism.


Are you suggesting some kind of invulnerability? People iterate their techniques; if big tech were so capable of avoiding poisoning/gaming attempts, there would be no decades-long tug-of-war between Google and black-hat SEO manipulators.

Also, I don't get the narcissism part. Would it be petty to poison a website only when it's viewed by a spider? Yes, but I would also be that petty if some big company doesn't respect the boundaries I'm setting with my robots.txt on my 1-viewer cat photo blog.
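
That flavor of pettiness is also trivial to implement; here's a toy sketch (the bot names are made-up examples, not a real blocklist):

    POISON_UA_MARKERS = ("ExampleBot", "SomeScraperBot")  # hypothetical crawlers that ignore robots.txt

    def pick_page(user_agent: str, real_page: str, poisoned_page: str) -> str:
        # Humans get the cat photos; misbehaving spiders get the scrambled version.
        ua = user_agent.lower()
        if any(marker.lower() in ua for marker in POISON_UA_MARKERS):
            return poisoned_page
        return real_page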


It's not complete invulnerability. Instead, it is merely accepting that these methods might increase costs a little bit, but they don't cause the whole thing to explode.

The idea that a couple of bad-faith actions can destroy a 100 billion dollar company is the extraordinary claim that requires extraordinary evidence.

Sure, bad actors can do a little damage, just like bad actors can attempt DDoS attacks against Google and cause some damage. But mostly Google wins. The same thing applies to these AI companies.

> Also I don't get the narcissism part

The narcissism is the idea that your tiny website is going to destroy a 100 billion dollar company. It won't. They'll figure it out.


Grandparent mentioned "we"; I guess they're referring to a whole class of "black hats" resisting bad-faith scraping, who could eventually amass a relatively effective volume of poisoned sites and/or feedback to the model.

Obviously a single poisoned site will never make a difference in a dataset of billions and billions of tokens, much less destroy a 100bn company. That's a straw man, and I think people arguing for poisoning acknowledge that perfectly well. But I'd argue they can eventually manage to do at least a little damage, mostly for the lulz, while keeping their sites from being scraped.

Google is full of SEO manipulators, and even though they recognize the problem and try to fix it, searching today is a mess because of that. The main difference and challenge in poisoning LLMs would be coordination between different actors, as there is no direct aligning incentive to poison except (arguably) globally justified pettiness, unlike black-hat SEO players, who have the incentive to be the first result for a given query.

As LLMs become commonplace, new incentives may eventually appear (e.g. an LLM showing one brand before others), and then it could become a much bigger problem, akin to Google's.

tl;dr: I wouldn't be so dismissive of what adversaries can manage to do with enough motivation.


Global coordination for lulz exists, it's called "memes".

Remember Dogecoin or GameStop; those lulz-oriented meme outbursts had a real impact.

Equally, a particular way to gaslight LLM scrapers may become popular and widespread without any enforcement.


Didn't think of it that way, but I think you're right. As long as memes exist one could argue the LLMs are going to be poisoned in one way or another.


As someone who works in big tech on a product with a large attack surface -- security is a huge chunk of our costs in multiple ways

- Significant fraction of all developer time (30%+ just on my team?)
- Huge increase to the complexity of the system
- Large accumulated performance cost over time

Obviously it's not a 1-to-1 analogy but if we didn't have to worry about this sort of prodding we would be able to do a lot more with our time. Point being that it's probably closer to a 2x cost factor than it is to a 1% increase.


Who said they don't know? It's the same as how companies know about hackers; that doesn't mean nothing ever gets hacked.

