
Can we not just have a whitelist of allowed crawlers and ban the rest by default? Then places like DuckDuckGo and Google can publish the IP addresses their crawlers will come from, and you simply don't include major LLM providers like OpenAI.


How do you distinguish crawlers from regular visitors using a whitelist? As stated in the article, the crawlers show up with seemingly unique IP addresses and seemingly real user agents. It's a cat and mouse game.

Only if you operate at the scale of Cloudflare, etc., can you see which IP addresses are hitting a large number of servers in a short time span.

(I am pretty sure the next step will be handing out N free LLM requests per month in exchange for user machines doing the scraping, if blocking gets more successful.)

I fear the only solutions in the end are CDNs, making visits expensive using challenges, or requiring users to log in.


How are the crawlers identifying themselves? If it's user agent strings then they can be faked. If it's cryptographically secured then you create a situation where newcomers can't get into the market.


Google publishes the IP addresses that Googlebot uses. If someone claims to be Googlebot but isn't coming from one of those addresses, it's a fake.
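
For what it's worth, the check can go a step further than a static IP list. Here's a rough sketch of the reverse-DNS-then-forward-confirm approach (which, as far as I know, is what Google suggests for verifying Googlebot); the domain suffixes are the part you'd want to double-check against Google's docs:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        """Reverse-resolve the IP, then forward-resolve the hostname to confirm."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)  # forward confirm
        except socket.gaierror:
            return False
        return ip in addresses

Anything failing the check gets treated like any other unidentified crawler.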


And in that case both systems end up in a situation where new entrants can't enter.


I don't see how that helps the case where the UA looks like a normal browser and the source IP looks residential.


How about if they claim to be Google Chrome running on Windows 11, from a residential IP address? Is that a human or an AI bot?



I am pretty sure a number of crawlers are running inside mobile apps on users' phones so they can get residential IP pools.


This is scary!


The problem is many crawlers pretend to be humans. So to ban the rest of the crawlers by default, you'll have to ban humans.


This sort of positive security model with behavioural analysis is the future. We need to get it built into Apache, Nginx, Caddy, etc. The trick is telling crawlers apart from users. It can be done, though.
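
As a very rough sketch of what "behavioural" could mean in practice (the thresholds are made-up assumptions, not tuned values), a server module could keep a short sliding window per client IP and flag anything that makes too many requests or touches too many distinct paths:

    import time
    from collections import defaultdict, deque

    WINDOW = 60          # seconds of history to keep per IP
    MAX_REQUESTS = 120   # illustrative threshold, not a recommendation
    MAX_PATHS = 80       # crawlers tend to fan out across many distinct URLs

    history = defaultdict(deque)   # ip -> deque of (timestamp, path)

    def looks_like_crawler(ip: str, path: str) -> bool:
        now = time.time()
        events = history[ip]
        events.append((now, path))
        while events and now - events[0][0] > WINDOW:
            events.popleft()        # drop entries outside the window
        distinct_paths = len({p for _, p in events})
        return len(events) > MAX_REQUESTS or distinct_paths > MAX_PATHS

A real built-in version would obviously have to account for shared IPs (CGNAT, offices) before banning anyone outright.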


Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)
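
The consuming side would be trivial; something like this (the list URL is hypothetical, and the iptables rule is just one example of how you might apply it):

    import urllib.request

    # Hypothetical published list of AI-crawler CIDR ranges, one per line
    LIST_URL = "https://example.org/ai-crawler-ranges.txt"

    def fetch_ranges(url: str) -> list[str]:
        with urllib.request.urlopen(url, timeout=10) as resp:
            lines = resp.read().decode().splitlines()
        # drop comments/blanks and de-duplicate
        return sorted({l.strip() for l in lines if l.strip() and not l.startswith("#")})

    for cidr in fetch_ranges(LIST_URL):
        print(f"iptables -I INPUT -s {cidr} -j DROP")

The hard part is curating the list, not distributing it.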


> Or an open list of IPs identified as belonging to AI companies, updated regularly, that firewalls can easily pull from? (Same idea as open-source AV.)

I don't really know about this proposal; the majority of bots are going to be coming from residential IPs the minute you do this.[1]

[1] The AI SaaS will simply run a background worker on the client to do their search indexing.


You can have a whitelist for allowed users and ban everyone else by default, which I think is where this will eventually take us.


It's going to get to the point where everything will be put behind a login to prevent LLM scrapers from scanning a site. Annoying, but it's the only option I can think of. If they use an account for scraping, you just ban the account.


and then they'll log in there too...


Logins are more easily banned, and highly complex captchas at signup need a human to sign up and solve them. As long as it's easier to get banned than it is to sign up, it will at least be a deterrent.


Could you not, instead of using one nmap process to scan 200+ addresses, just initiate 200+ nmap processes each scanning a single IP?

That still effectively hits your spoofing system, but it brings their time back down to what it would take to scan a single IP address.

I'm sure there are many other ways around this, but like all security it's merely a case of making it difficult enough that an attacker would need serious incentive to mount the attack.
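
Something along these lines (the target range is made up, and it assumes nmap is on the path) would fan the work out into one process per host:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical target range -- one nmap process per address
    targets = [f"203.0.113.{i}" for i in range(1, 201)]

    def scan(ip: str) -> str:
        # each process only ever sees a single IP, so a per-host slowdown is
        # paid once in parallel rather than 200 times in series
        result = subprocess.run(["nmap", "-p-", ip], capture_output=True, text=True)
        return result.stdout

    with ThreadPoolExecutor(max_workers=200) as pool:
        for report in pool.map(scan, targets):
            print(report)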


> “This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally,” Google Cloud CEO Thomas Kurian and UniSuper CEO Peter Chun said in a joint statement obtained by The Guardian May 8. “This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.”

But it's not a one-of-a-kind thing...

Sure, one of a kind at this scale, but I've heard numerous stories of GCP/AWS terminating accounts with no explanation, even when asked for one. However, because the customer is small, it just vanishes in the noise and nothing comes of it. It's quite simple: use a cloud provider as a backup, but don't trust your primary data to any cloud provider.

Four copies: two with completely different cloud providers, plus two additional copies kept well away from any cloud provider, each on a different storage medium.


Many are speculating that this will not be fixed by AWS (by design). However, now that it has been discovered, AWS will "need" to repair this flaw or they will start incurring customer flight to more secure or cheaper services.

The question is more about how long AWS will take to fix this issue and how many DDoS bills they will forgive.



The Wizard 8x22B is definitely for the high end, even the 2-bit version. I attempted to run it on a workstation with an RTX 3090 and the performance was as bad as one word every two seconds. Probably a good candidate for a Groq accelerator.


You mean a few hundred Groq accelerators ;-) (they have 230MB of SRAM per accelerator)


The H100 has 50MB SRAM (L2 cache) and does just fine.

https://docs.nvidia.com/launchpad/ai/h100-mig/latest/h100-mi...


...and 80GB of very high speed VRAM.


Sure, but the point of the comment was SRAM. There is some confusion among a subset of ML people about hardware memories, their latencies, and bandwidths. We don't all need to write kernels like Tri Dao to make transformers efficient on GPUs, but it would be great if more people were aware of the theoretical compute constraints of each type of model on a given piece of hardware, and a subset of them then worked towards building better pipelines.
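
To make the constraint concrete, here's the kind of back-of-envelope arithmetic I mean; all figures are rough assumptions (H100 SXM HBM3 at roughly 3.35 TB/s, an 8x22B MoE with ~39B active parameters at ~2 bits/weight, PCIe 4.0 x16 at ~32 GB/s), not measurements:

    # Decode is usually memory-bandwidth bound:
    # tokens/s <= bandwidth / bytes read per generated token

    def max_tokens_per_sec(bandwidth_gb_s: float, active_params_billions: float,
                           bytes_per_param: float) -> float:
        bytes_per_token = active_params_billions * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    # Weights resident in HBM on an H100
    print(max_tokens_per_sec(3350, 39, 0.25))   # ~340 tokens/s upper bound
    # Weights streamed over PCIe because they don't fit in a 3090's VRAM
    print(max_tokens_per_sec(32, 39, 0.25))     # ~3 tokens/s at best

The usable speed falls out of whichever link the weights actually have to stream over.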


Your parent comment (by my reading) implied the H100 "does just fine" when it has 50MB SRAM.

The reason Groq needs multiple racks of chips to serve up models that fit in a single H100 is that Groq chips are SRAM-only, while the H100 has 80GB of HBM VRAM bolted onto it in addition to SRAM.


I see. You are right. I also don't think Groq would be friendly to the home user.


This will slowly but surely push many sites and systems like this onto Tor. Although Tor isn't perfect, keeping all Tor traffic within Tor on an onion address does help mitigate tracking.

For example, The Pirate Bay is on an onion domain, which is going to make it rather difficult to track and shut down now.

Eventually what will happen is that smart people will develop something similar to Tor that just adds a layer to the internet where all traffic is privately transported with zero exposure while still being reasonably fast.

I think the only thing that puts Tor at a disadvantage right now is speed.


Tor has recently left a bad taste in my mouth, with state actors like the BBC (read: GCHQ) starting up new Tor nodes to use against their state enemies, all the while debasing trust in it along with its allies... I don't know, it just leaves a really bad taste, not that we are all unaware of how "In-Q-Tel and friends" have allowed the current worldwide software ecosystem to even exist... But it still leaves a bad, bad aftertaste.

https://www.techdirt.com/2022/03/08/as-uk-government-is-stil...


So your complaint is a system designed to circumvent state censorship is being used to circumvent state censorship?

Did you only want pirates to use it?


In a sense, isn't that what is happening now (and why some companies offer "being able to search the dark web" as a service)? I agree that speed appears to be the limiting factor now.


If you want to see a good train service, look at Japan. I went there for a holiday and used the train service all over. It's a classic example of what a train service "should" be like.

Fast, efficient, cost-effective. The only time the trains were a little difficult was during the super-peak hours on the underground in the very dense parts of the cities.

I think the world should learn from Japan in many aspects.


I'd like to second that - Japan has an amazing train (and subway) system.

Here's a great explainer: https://www.youtube.com/watch?v=FFpG3yf3Rxk


They also apparently make a rather healthy profit


Although the variant is in the U.K., it's not yet believed to have come from overseas travel. So maybe this is a variant that has a higher likelihood of mutation in a specific demographic (e.g. Indian).

Speculation at this point to be honest.


> So maybe this is a variant that has a higher likelihood of mutation in a specific demographic (e.g. Indian).

Is this a thing that's known to have occurred in other viruses? It seems rather unlikely compared to the chance that it came from someone asymptomatic or someone who skipped over the border controls via boat or something like that.

I'd be careful speculating over demographics. There's been a wave of violence against Asian (largely Chinese, Japanese and Korean) people in the US, believed to be caused at least in part by our former president accusing China of accidentally releasing the virus. I wouldn't want some populist to latch on to this theory and start blaming Indian people for variants.


Huh? There are flights arriving daily from India, India is where the strain was first detected months ago, and in some parts of India it is the predominant strain. Neither of the two mutations that strain has acquired matches the mutation that the B117 strain (which now dominates the UK and is in the process of dominating the US) has.

Multiple trails of evidence point to its emergence in India, and right now the UK is not restricting travel from India, so the most likely scenario is that it arrived in the UK with someone who traveled to India or another country where that strain is spreading.


They will restrict travel from Friday: https://www.bbc.com/news/uk-56806103


It's not as good in the U.K., but we do have some protections. A company can lock you into a contract, however at the end of the contract it must default to a rolling 30-day term, and the consumer must explicitly opt in to a lengthy contract again if they want one. So even if you forget to cancel, you won't get renewed for X years again.

