
Can we not just have a whitelist of allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can publish the IP addresses their crawlers will come from, and you simply don't include major LLM providers like OpenAI.


How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.

Only if you operate at the scale of Cloudflare, etc., can you see which IP addresses are hitting a large number of servers in a short time span.

(I am pretty sure the next step will be handing out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)

I fear the only solutions in the end are CDNs, making visits expensive using challenges, or requiring users to log in.


How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.


Google publishes the IP addresses that Googlebot uses. If someone claims to be Googlebot but isn't coming from one of those addresses, it's a fake.
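
For what it's worth, the check can go a step further than a static IP list. Here's a rough sketch of the reverse-DNS-then-forward-confirm approach (which, as far as I know, is what Google suggests for verifying Googlebot); the domain suffixes are the part you'd want to double-check against Google's docs:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        """Reverse-resolve the IP, then forward-resolve the hostname to confirm."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)  # forward confirm
        except socket.gaierror:
            return False
        return ip in addresses

Anything failing the check gets treated like any other unidentified crawler.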


And in that case both systems end up in a situation where new entrants can't enter.


I don't see how that helps the case where the UA looks like a normal browser and the source IP looks residential.


How about if they claim to be Google Chrome running on Windows 11, from a residential IP address? Is that a human or an AI bot?



I am pretty sure a number of crawlers are running inside mobile apps on users' phones so they can get residential IP pools.


This is scary!


The problem is many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'll have to ban humans.


This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from users. It can be done, though.
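
As a very rough sketch of what "behavioural" could mean in practice (the thresholds are made-up assumptions, not tuned values), a server module could keep a short sliding window per client IP and flag anything that makes too many requests or touches too many distinct paths:

    import time
    from collections import defaultdict, deque

    WINDOW = 60          # seconds of history to keep per IP
    MAX_REQUESTS = 120   # illustrative threshold, not a recommendation
    MAX_PATHS = 80       # crawlers tend to fan out across many distinct URLs

    history = defaultdict(deque)   # ip -> deque of (timestamp, path)

    def looks_like_crawler(ip: str, path: str) -> bool:
        now = time.time()
        events = history[ip]
        events.append((now, path))
        while events and now - events[0][0] > WINDOW:
            events.popleft()        # drop entries outside the window
        distinct_paths = len({p for _, p in events})
        return len(events) > MAX_REQUESTS or distinct_paths > MAX_PATHS

A real built-in version would obviously have to account for shared IPs (CGNAT, offices) before banning anyone outright.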


Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)
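
The consuming side would be trivial; something like this (the list URL is hypothetical, and the iptables rule is just one example of how you might apply it):

    import urllib.request

    # Hypothetical published list of AI-crawler CIDR ranges, one per line
    LIST_URL = "https://example.org/ai-crawler-ranges.txt"

    def fetch_ranges(url: str) -> list[str]:
        with urllib.request.urlopen(url, timeout=10) as resp:
            lines = resp.read().decode().splitlines()
        # drop comments/blanks and de-duplicate
        return sorted({l.strip() for l in lines if l.strip() and not l.startswith("#")})

    for cidr in fetch_ranges(LIST_URL):
        print(f"iptables -I INPUT -s {cidr} -j DROP")

The hard part is curating the list, not distributing it.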


> Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)

I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]

[1] The AI SaaS will simply run a background worker on the client to do their search indexing.


You can have a whitelist for allowed users and ban everyone else by default, which I think is where this will eventually take us.


It's going to get to the point where everything will be put behind a login to prevent LLM scrapers from scanning a site. Annoying, but it's the only option I can think of. If they use an account for scraping, you just ban the account.


and then they'll log in there too...


Logins are more easily banned, and highly complex captchas at signup need a human to sign up and solve them. As long as it's easier to get banned than it is to sign up, it will at least be a deterrent.


Could you not, instead of using one nmap process to scan 200+ addresses, just initiate 200+ nmap processes each scanning a single IP?

That still effectively hits your spoofing system, but it brings their time back down to what it would take to scan a single IP address.

I'm sure there are many other ways around this, but like all security it's merely a case of making it difficult enough that an attacker would need serious incentive to mount the attack.
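
Something along these lines (the target range is made up, and it assumes nmap is on the path) would fan the work out into one process per host:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical target range -- one nmap process per address
    targets = [f"203.0.113.{i}" for i in range(1, 201)]

    def scan(ip: str) -> str:
        # each process only ever sees a single IP, so a per-host slowdown is
        # paid once in parallel rather than 200 times in series
        result = subprocess.run(["nmap", "-p-", ip], capture_output=True, text=True)
        return result.stdout

    with ThreadPoolExecutor(max_workers=200) as pool:
        for report in pool.map(scan, targets):
            print(report)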


> “This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally,” Google Cloud CEO Thomas Kurian and UniSuper CEO Peter Chun said in a joint statement obtained by The Guardian May 8. “This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.”

But it's not a one-of-a-kind thing...

Sure, one of a kind at this scale, but I've heard numerous stories of GCP/AWS terminating accounts with no explanation, even when asked for one. However, because the customer is small, it just vanishes in the noise and nothing comes of it. It's quite simple: use a cloud provider as a backup, but don't trust your primary data to any cloud provider.

Four copies: two with completely different cloud providers, plus two additional copies kept well away from any cloud provider, each on a different storage medium.


Many are speculating that this will not be fixed by AWS (by design). However, now that it has been discovered, AWS will "need" to repair this flaw or they will start incurring customer flight to more secure or cheaper services.

The question is more about how long AWS will take to fix this issue and how many DDoS bills they will forgive.



The Wizard 8x22B is definitely for the high end, even the 2-bit version. I attempted to run it on a workstation with an RTX 3090 and the performance was as bad as one word every two seconds. Probably a good candidate for a Groq accelerator.


You mean a few hundred Groq accelerators ;-) (they have 230MB of SRAM per accelerator)


The H100 has 50MB SRAM (L2 cache) and does just fine.

https://docs.nvidia.com/launchpad/ai/h100-mig/latest/h100-mi...


...and 80GB of very high speed VRAM.


Sure, but the point of the comment was SRAM. There is some confusion among a subset of ML people about hardware memories, their latencies, and bandwidths. We don't all need to write kernels like Tri Dao to make transformers efficient on GPUs, but it would be great if more people were aware of the theoretical compute constraints of each type of model on a given piece of hardware, and a subset of them then worked towards building better pipelines.
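
To make the constraint concrete, here's the kind of back-of-envelope arithmetic I mean; all figures are rough assumptions (H100 SXM HBM3 at roughly 3.35 TB/s, an 8x22B MoE with ~39B active parameters at ~2 bits/weight, PCIe 4.0 x16 at ~32 GB/s), not measurements:

    # Decode is usually memory-bandwidth bound:
    # tokens/s <= bandwidth / bytes read per generated token

    def max_tokens_per_sec(bandwidth_gb_s: float, active_params_billions: float,
                           bytes_per_param: float) -> float:
        bytes_per_token = active_params_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    # Weights resident in HBM on an H100
    print(max_tokens_per_sec(3350, 39, 0.25))   # ~340 tokens/s upper bound
    # Weights streamed over PCIe because they don't fit in a 3090's VRAM
    print(max_tokens_per_sec(32, 39, 0.25))     # ~3 tokens/s at best

The usable speed falls out of whichever link the weights actually have to stream over.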


Your parent comment (by my reading) implied the H100 "does just fine" when it has 50MB SRAM.

The reason Groq needs multiple racks of chips to serve up models that fit in a single H100 is that Groq chips are SRAM-only, while the H100 has 80GB of HBM VRAM bolted onto it in addition to SRAM.


I see. You are right. I also don't think Groq would be friendly to the home user.


This will slowly but surely push many sites and systems like this onto Tor. Although Tor isn't perfect, keeping all Tor traffic within Tor on an onion address does help mitigate tracking.

For example, The Pirate Bay is on an onion domain, which is going to make it rather difficult to track and shut down now.

Eventually what will happen is that smart people will develop something similar to Tor that just adds a layer to the internet where all traffic is privately transported with zero exposure while still being reasonably fast.

I think the only thing that puts Tor at a disadvantage right now is speed.


Tor has recently left a bad taste in my mouth, with state actors like the BBC (read: GCHQ) starting up new Tor nodes to use against their state enemies, all the while debasing trust in it along with its allies... I don't know, it just leaves a really bad taste, not that we are all unaware of how "In-Q-Tel and friends" have allowed the current worldwide software ecosystem to even exist... But it still leaves a bad, bad aftertaste.

https://www.techdirt.com/2022/03/08/as-uk-government-is-stil...


So your complaint is a system designed to circumvent state censorship is being used to circumvent state censorship?

Did you only want pirates to use it?


In a sense, isn't that what is happening now (and why some companies offer "being able to search the dark web" as a service)? I agree that speed appears to be the limiting factor now.


If you want to see a good train service, look at Japan. I went there for a holiday and used the train service all over. It's a classic example of what a train service "should" be like.

Fast, efficient, cost-effective. The only time the trains were a little difficult was during the super-peak hours on the underground in the very dense parts of the cities.

I think the world should learn from Japan in many aspects.


I'd like to second that - Japan has an amazing train (and subway) system.

Here's a great explainer: https://www.youtube.com/watch?v=FFpG3yf3Rxk


They also apparently make a rather healthy profit


Although the variant is in the U.K., it's not yet believed to have come from overseas travel. So maybe this is a variant that has a higher likelihood of mutation in a specific demographic (e.g. Indian).

Speculation at this point to be honest.


> So maybe this is a variant that has a higher likelihood of mutation in a specific demographic (e.g. Indian).

Is this a thing that's known to have occurred in other viruses? It seems rather unlikely compared to the chance that it came from someone asymptomatic or someone who skipped over the border controls via boat or something like that.

I'd be careful speculating over demographics. There's been a wave of violence against Asian (largely Chinese, Japanese and Korean) people in the US, believed to be caused at least in part by our former president accusing China of accidentally releasing the virus. I wouldn't want some populist to latch on to this theory and start blaming Indian people for variants.


Huh? There are flights arriving daily from India, India is where the strain was first detected months ago, and in some parts of India it is the predominant strain. Neither of the two mutations that strain has acquired matches the mutation that the B117 strain (which now dominates the UK and is in the process of dominating the US) has.

Multiple trails of evidence point to its emergence in India, and right now the UK is not restricting travel from India, so the most likely scenario is that it arrived in the UK with someone who traveled to India or another country where that strain is spreading.


They will restrict travel from Friday: https://www.bbc.com/news/uk-56806103


It's not as good in the U.K., but we do have some protections. A company can lock you into a contract, however at the end of the contract it must default to a rolling 30-day term, and the consumer must explicitly opt in to a lengthy contract again if they want one. So even if you forget to cancel, you won't get renewed for X years again.

