> The spam site was checking for Googlebot IP addresses. If the visitor’s IP address matched as belonging to Google then the spam page displayed content to Googlebot.
>
> All other visitors got a redirect to other domains that displayed sketchy content.
Years ago, Google had an explicit policy that sites that showed different content to Googlebot than they showed to regular unauthenticated users were not allowed, and they got heavily penalized. This policy is long gone, but it would help here (assuming the automated tooling to enforce it was any good, and I assume it was).
More recently, Google seems totally okay with sites that show content to Googlebot but go out of their way not to show that content to regular users.
About 10 years ago, I was working on a site that served several hundred million non-crawler hits a month. Many of our millions of pages had their content change multiple times a day. Because of the popularity and frequent changes, the crawlers hit us constantly... crawlers accounted for ~90% of our traffic - billions of hits per month. Bing was ~70% of the crawler traffic and Google was ~25% of it. We noticed it because Bing quickly became very aggressive about crawling, exposing some of our scaling limits as they doubled our already significant traffic in a few short months.
I was working on the system that picked ads to show on our pages (we had our own internal ad system, doing targeting based on our own data). This was the most computationally intensive part of serving our pages and the ads were embedded directly in the HTML of the page. When we realized that 90% of our ad pick infrastructure was dedicated to feeding the crawlers, we immediately thought of turning ads off for them (we never billed advertisers for them anyway). But hiding the ads seemed to go directly against the spirit of Google's policy of showing their crawlers the same content.
Among other things, we ended up disabling almost all targeting and showing crawlers random ads that roughly fit the page. This dropped our ad pick infra costs by nearly 80%, saving 6-figures a month. It also let us take a step back to decide where we could make long term investments in our infra rather than being overwhelmed with quick fixes to keep the crawlers fed.
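The split described above (cheap random-but-relevant ads for crawlers, full targeting for humans) can be sketched roughly like this. Everything here is hypothetical: the function names, the user-agent markers, and the ad-pool structure are invented for illustration, not the actual system.

```python
import random

# Hypothetical user-agent fragments for the major crawlers mentioned above.
# Real systems would also verify crawler identity, not just trust the UA string.
CRAWLER_MARKERS = ("googlebot", "bingbot", "slurp", "duckduckbot")

def is_crawler(user_agent: str) -> bool:
    """Cheap substring check against known crawler user-agents."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)

def pick_ad(user_agent: str, page_topic: str, ad_pool: dict, target_fn):
    """Serve a random topic-matched ad to crawlers, fully targeted ads to humans.

    Crawlers still see a real ad embedded in the HTML (same shape of page),
    but the expensive per-user targeting pipeline is skipped entirely.
    """
    candidates = ad_pool.get(page_topic, [])
    if not candidates:
        return None
    if is_crawler(user_agent):
        return random.choice(candidates)
    return target_fn(candidates)
```

The point of the design is that the page served to the crawler is still a real page with real ads, just picked by a far cheaper code path.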
This kind of thing is what people are missing when they wonder why a company needs more than a few engineers - after all, someone could duplicate the core functionality of the product in 100 lines of code. At sufficient scale, it takes real engineering just to handle the traffic from the crawlers so they can send you more users. There are an untold number of other things like this that have to be handled at scale, but that are hard to imagine if you haven't worked at similar scale.
Seems like a natural consequence of having "millions of pages", if you think about it? You might have a lot of users, but they're only looking at what they want to look at. The crawlers are hitting every single link and revisiting all the links they've seen before; their traffic scales differently.
I think you’re right. At first I thought “crawlers are actually creating large amounts of spam requests” but this is just the way a searchable web functions. The crawlers are just building the index of the internet.
Maybe Google needs to implement an API where you can notify it when a page on your site has changed. That should cut down on redundant crawls a lot, eh?
We very much wanted this! We had people that were ex-Google and ex-Bing who reached out to former colleagues, but nothing came of it. You'd think it would be in their interest, too.
The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.
In 2021 Bing, Yandex, Seznam.cz, and (later, in 2023) Naver ended up implementing a standard where you can notify one search engine of a page update and the other participating search engines are also notified [1, 2, 3].
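The standard being described is IndexNow: a single POST notifies one participating engine, which shares the notification with the others. A minimal sketch, assuming the public `api.indexnow.org` endpoint (the host, key, and URLs below are placeholders):

```python
import json
import urllib.request

API_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_indexnow_payload(host: str, key: str, urls: list) -> str:
    """JSON body for a batch notification.

    The key must also be served at https://<host>/<key>.txt so the
    search engines can verify you control the site.
    """
    return json.dumps({"host": host, "key": key, "urlList": urls})

def notify(host: str, key: str, urls: list) -> None:
    """Send the batch; a 200/202 response means the batch was accepted."""
    req = urllib.request.Request(
        API_ENDPOINT,
        data=build_indexnow_payload(host, key, urls).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    urllib.request.urlopen(req)
```

Notably, Google is not a participant, which fits the grandparent's experience of getting nowhere through back channels.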
>The best explanation I can come up with is that a failure to notify them of a change makes them look bad when their search results are out of date. Especially if the failures are malicious, fitting in with the general theme of the article.
Should be easy to crosscheck the reliability of update notifications by doing a little bit of polling too.
You can have millions of static pages and serve them very inexpensively. Showing dynamic ads is fundamentally exposing an expensive computational resource without any rate limiting. If that were any other API or service it would be gated, but the assumption here is that this particular service will make more money than it loses, and that assumption obviously breaks down in this instance. I really don’t think you can say it’s about scale when what you’re scaling (serving ads to bots) doesn’t make any business sense.
Leaving the ads in was a business necessity because it eliminated the documented risk of being delisted by Google for customizing content for their crawlers. The company would have gone out of business if that happened permanently. Even if it only happened for a few days, it would have meant millions in lost revenue.
I still think that humans are very good at identifying other humans, particularly through long-form speech and writing. Sentient and non-sentient beings alike are very good at identifying members of their own species.
I wonder if there's some sort of "time" threshold for how long an AI can speak/write before it is identifiable as an AI to a human. Some sort of Moore's law, but for AI recognizability
I have never used Bing. I use duckduckgo though and they buy their results from Bing. At least they did in the past, I don't follow them closely enough to necessarily notice every possible change.
This seems very cannibalistic of their own business. That means somebody running Google or Microsoft (or really any web ads) only has a 10% chance to start with of getting served to an actual human (if they're not trying to block each other constantly).
And on the other side, that means every customer or ad placer, has to try and filter all the bots so people with actual credit cards and money will see the Google, TEMU, or FB ads (or others).
In some ways, almost feels like Microsoft is griefing online search by burying it under massive robot crawls. Like an ad DDOS.
They're serving first party targeted ads based on only their own data. If you're going to complain about that, it's close to saying that websites shouldn't be able to make money from advertising at all.
Very much this. It's a site/app that has probably been used by 80-90% of adults living in America over the last decade. It would not exist if these ads weren't targeted. I know because we knew (past tense because I'm no longer there) exactly how much targeting increased click-through-rate and how that affected revenue.
On top of that, they were ads for doing more of what the user was doing right then, tailored to tastes we'd seen them exhibit over time. Our goal was that the ads should be relevant enough that they served as an exploration mechanism within the site/app. We didn't always do as well as we hoped there, but it was a lot better than what you see on most of the internet. And far less intrusive because they weren't random (i.e., un-targeted). I have run ad blockers plus used whole house DNS ad blocking as long as I've been aware of them, but I was fine working on these ads because it felt to me like ads done right.
If we can't even allow for ads done right, then vast swaths of the internet have to be pay-walled or disappear. One consequence of that... only the rich get to use most of the internet. That's already too true as it is, I don't want to see it go further.
I have no problems with this (first party, targeted) as far as I can read English and understand.
In fact one of my bigger problems have been that Google has served me generic ads that are so misplaced they go far into attempted insult territory (shady dating sites, pay-to-win "strategy games" etc).
> websites shouldn't be able to make money from advertising at all.
This is the case. Advertising is a scourge, psychological warfare waged by corporations against our minds and wallets. Advertisers have no moral qualms; they will exploit any psychological weakness to shill products, no matter how harmful. Find a "market" of teenagers with social issues? Show them ads of happy young people frolicking with friends to make them buy your carbonated sugar water; never mind that your product will rot their teeth and make them fat. Advertisers don't care about whether products are actually good for people, all they care about is successful shilling.
Advertising is warfare waged by corporations against people and pretending otherwise makes you vulnerable to it. To fight back effectively we must use adblockers and advocate for advertising bans. If your website cannot exist without targeted advertising, then it is better for it to not exist.
Think about what it would mean to not have any advertising whatsoever. Most current large brands would essentially be entrenched forever. No matter how good a new product or service is, it's going to be almost impossible to reach a sustainable scale through purely organic growth starting from zero. Advertising in some form is necessary for an economy to function.
The problem is, as was mentioned above by someone, all content has to be paid for. If there were no ads we wouldn’t have had TV and radio for the past few decades. 90% of the internet would disappear, and the only stuff left would be paywalled - i.e. only the rich could use the web.
I’m sure you try to avoid ads - I do too, they suck. But don’t pretend you don’t use a lot of websites that are not paid for with ads.
The internet began in 1969 and by 1992 was by far the largest network of computers and had exactly zero ads and zero paywalls. (The US government imposed a rule against commercial use of the internet to appease private businesses that didn't want competition from the internet. The rule remained in force till 1992.)
Also, you're currently using a very large non-paywalled site with no ads.
So, no, ads are not needed to have a nice internet available to all.
I don’t think you’re being intellectually honest. I didn’t have access to the Internet in my home in 1992, and the rest of the world didn’t either. I did pay for and have access to Compuserve forums. There was very little content back then. Certainly no huge video sites where you can learn practically anything, or hardly any of the good benefits we enjoy from being online today. If you loved the 1992 internet I can probably find an AOL disk to send you. And just because there is one ad free site we are both using hardly means the rest of the sites wouldn’t somehow disappear. YC is paid for by some rich folks who have made plenty of money that ultimately (though not exclusively) came from ads. Like it or not, ads are an economic necessity. If you have a better solution start a company that gives away free, valuable content and prove it.
>I don’t think you’re being intellectually honest.
Do you think I'm outright telling falsehoods? Which part do you think is false: that the internet had many millions of users in 1992? That the internet pre-1993 was completely non-commercial with absolutely zero ads and no paywalls?
1992 internet had email, mailing list, newsgroups, Internet Relay Chat, massively-multiplayer online games (called MUDs) and places (mostly using the "anonymous FTP" protocol) where you could download free software like Linux and GNU utilities.
>There was very little content back then.
The newsgroups were absolutely huge in 1992: if you spent all day every day reading newsgroups, you could keep up with less than 1% of it. The same could be said of Internet Relay Chat and probably also of mailing lists (though I didn't subscribe to enough mailing lists to say that with 100% confidence).
Just because you never had access to it in 1992 does not mean that it is irrelevant to the topic of our conversation. AOL users had limited access to the Internet in 1992. They could, for example, send email to non-AOL users over the Internet, and 1992, I think, is the year they gained access to the newsgroups (including, famously, the ability to post to them). But if in 1992 all you knew was Compuserve and AOL, you didn't know the Internet.
And again, one of the few rules of the internet (imposed again by the US government, which was footing the bill) was no commercial use. So for example there was a newsgroup called ba.jobs (the "ba" stood for "bay area") where employers could advertise job openings and employees could make posts announcing their availability for a job. But contractors (i.e., 1099 workers as opposed to W2 workers) were prohibited from making such a post because that was considered too commercial (in that an individual contractor is a lot like a small business and for such a contractor to use the internet to announce his availability was too much like a small business posting an ad).
>I didn’t have access to the Internet in my home in 1992, and the rest of the world didn’t either.
In 1992, most users of the internet got their access from their employer or their school of higher education. You could've bought access for $20 a month in 1992; it's just that the Internet was not being advertised, so you didn't know about it. (Also, if you were living in a rural area, you might've had to pay your telephone company long-distance charges for every minute you were connected.)
Actually, it is not just that the internet was not being advertised, the people running it actively discouraged journalists from writing about it because there was a senator named William Proxmire who was good at getting the press to repeat his accusations of governmental wasteful spending, and the internet was an easy target for Proxmire: there were for example academics of every department using the newsgroups to discuss ideas, and Proxmire could say (truthfully, but misleadingly) that the US government was spending taxpayer money so that professors could discuss <pick the most ridiculous things academics might discuss>. (Here's an example of a journalist losing his access to the internet in 1984 in part because he wouldn't stop writing about the internet (then called ARPANET): https://www.stormtiger.org/bob/humor/pournell/story.html)
So you see there was an availability bias at play in which advertising is loud and designed to get attention (of course) and it tends to drown out information that is not part of the advertising-dependent information-ecosystem. (And again, the people in charge of the infrastructure of the internet pre-1993 were even actively striving to avoid any publicity.) Particularly, hardly anyone knows nowadays that many millions of users were using the completely-noncommercial internet of 1969 - 1992. People tend to think that the internet was created in 1993 or that advertising-dependent companies were essential to its creation.
I don’t think you’re taking scale into account. Millions of internet users then vs billions now makes a difference. Generous hobbyists and some universities paid for those services back then. The “massive” in MUD was a few thousand simultaneous players, with mostly text and maybe limited graphics. I very much doubt any of them could/would have paid if their usage went up by 10,000 times, with the higher quality and expectations that we have today. Again, I challenge you to come up with a service for a hundred million people that is open to everyone and doesn’t require ads. I hate ads too - I’ll join your service if you can make it work.
Just for reference, I was there too. I started with a shiny 300 baud modem. To compare the old days to today and say they’re even comparable in terms of information, media, knowledge, access, gaming, entertainment … it’s not even close.
Earlier you wrote that "I did pay for and have access to Compuserve forums", and that "if you loved the 1992 internet I can probably find an AOL disk to send you".
Could you clarify whether you had direct access to the internet (the newsgroups, email, ftp sites, web sites, not mediated by AOL or Compuserv) before mid-1993? Also, if yes, how many hours did you spend on it? I ask because I would be surprised to learn that it is possible for someone with your opinions to have had extensive experience with the internet pre-1993 (and I go looking for surprises).
I remember seeing spyglass and using NCSA mosaic at work and school, and Compuserve from home. There was definitely stuff out there, I downloaded images, a song or two and some programs. I saw a very early version of (I think?) Windows 95 (or 3.1?) that could play different videos in different windows and was amazed (these were from disk, not the web). Used a sysadmin for a Netware network.
It was a really fun time. But the breadth of what we have now more than dwarfs what existed then. It’s not surprising - that was 30 years ago. I don’t see any way to get from there to here without a ton of money being spent. Some of it was spent by governments and individuals, but I’m guessing the bulk was by companies. Economic realities require those companies to get something for their investments - they’re not charities. Advertising is the major vehicle for that investment. I’ll bet we’d find radio and TV followed a similar historical trajectory.
I use uBlock and avoid ads because they’re irritating (and I feel like a hypocrite for doing it). I hate going to recipe sites for all the garbage you have to wade through to get to the recipe. So I get it. The web, at current scale, doesn’t and can’t exist outside of economic realities. Micro transactions might have been the solution but it wasn’t. Kagi has a great model (happy customer here), but everyone can’t afford to subscribe to everything.
> “if you all dropped dead”, “you smarmy parasitic prick”
Dude. I hope you’re just having a bad day. If this is your normal mode of discourse you should get some counseling. I say this from a place of good will advice.
What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)? If paid search is the only option left, is it okay that poor people can't use the web? Is it okay if poor people don't get access to news?
Oh, and they don't get to vote because voting day and locations can't be advertised by the government, especially in targeted mailings that are personalized with your party affiliation and location. The US Postal Service will also collapse, so those mailings can't go out, even if allowed. At least the rich can still search for their polling location on the web [<- sarcasm].
None of that is okay with me. More/better regulation? Yes! But our world doesn't know how to function without ads. Being absolute about banning ads is unrealistic and takes focus away from achieving better regulation, thereby playing into the hands of the worst advertisers.
> What's a viable business model for web search other than ads (Google, Bing, DuckDuckGo, Naver, etc.) or paid search (Kagi)?
Not my problem. Those companies, and any other with business models reliant on advertising, don't have a right to exist. If your business can't be profitable without child labor, your business has no right to exist. This is no different.
That 'policy' is still actually in effect, I believe, in Google's webmaster guidelines. They just don't enforce it.
Years ago (early 2000s) Google used to mostly crawl using Google-owned IPs, but they'd occasionally use Comcast or some other ISPs (partners) to crawl. If you were IP cloaking, you'd have to look out for those pesky non-Google IPs. I know, as I used to play that IP cloaking game back in the early 2000s, mostly using scripts from a service called "IP Delivery".
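For what it's worth, the documented way to verify Googlebot today is reverse-plus-forward DNS rather than maintaining IP lists, which defeats both the partner-IP trick and naive spoofing. A sketch of that check, with the resolvers injected so the logic can be exercised without network access (production code should cache results, since per-request DNS lookups are slow):

```python
import socket

def verify_googlebot(ip,
                     rev=socket.gethostbyaddr,
                     fwd=socket.gethostbyname):
    """Reverse-DNS the IP, check the official Google domains, forward-confirm.

    The forward confirmation matters: anyone controlling reverse DNS for
    their own IP block can publish a PTR record claiming to be Googlebot,
    but they cannot make googlebot.com resolve back to their IP.
    """
    try:
        hostname = rev(ip)[0]
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return fwd(hostname) == ip
    except OSError:
        return False
```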
Is it even well defined? On the one hand, there’s “cloaking,” which is forbidden. On the other hand, there’s “gating,” which is allowed, and seems to frequently consist of showing all manner of spammy stuff and requests for personal information in lieu of the indexed content. Are these really clearly different?
And then there’s whatever Pinterest does, which seems awfully like cloaking or bait-and-switch or something: you get a high ranked image search result, you click it, and the page you see is in no way relevant to the search or related to the image thumbnail you clicked.
For context, my team wrote scripts to automate catching spam at scale.
Long story short, there are non spam-related reasons why one would want to have their website show different content to their users and to a bot. Say, adult content in countries where adult content is illegal. Or political views, in a similar context.
For this reason, most automated actions aren't built upon a single potential spam signal. I don't want to give too much detail, but here's a totally fictitious example for you:
* Having a website associated with keywords like "cheap" or "flash sale" isn't bad per se. But that might be seen as a first red flag
* Now having those aforementioned keywords, plus "Cartier" or "Vuitton" would be another red flag
* Add to this the fact that we see that this website changed owners recently, and used to SERP for different keywords, and that's another flag
=> 3 red flags: that's enough to trigger some automation rule, in my view.
Again, this is a totally fictitious example, and in reality things are much more complex than this (plus I don't even think I understood or was exposed to all the ins and outs of spam detection while working there).
But cloaking on its own is kind of a risky space, as you'd get way too many false positives.
Do you have any example searches for the Pinterest results you're describing? I feel like I know what you're talking about but wondering what searches return this.
As the founder of SEO4Ajax, I can assure you that this is far from the case. Googlebot, for example, still has great difficulty indexing dynamically generated JavaScript content on the client side.
I think they did this because lots of publishers show paywalls to people but still want their content indexed by Google. In other words, they want their cake and eat it too!
You'd think they could make fine money as neutral brokers since everyone served their ads, and for a long period they did make money as semi-neutral brokers. But since, IDK, 2019 they have become more and more garbage. This is broadly part of the concentration of wealth and power you see everywhere else; I don't know the specifics, but you can see the result.
Sure I have my viewpoint. But I'm also genuinely interested in your viewpoint.
My viewpoint is that I don't buy the idea that there is a group (or groups) of people that have both the means (money) and the ideas they made up themselves and they use the money to push the ideas to the passive masses who are then brainwashed by these rich people.
I think the masses produce the ideas. Those ideas are then selected and amplified by all sorts of people leveraging all sorts of means driven by all sorts of motives.
In fact there are plenty of examples of populist leaders that are not rich. The fact that the US has the cult of the millionaire sometimes obfuscates that fact; for some reason, for populist leaders in the US to rise, they have to be millionaires (or pretend to be) to begin with.
My point is that, sure, the moneyed class does play a role, but reality is much more complex than that and I don't really buy the idea that the world is "controlled" by a bunch of "supermen" who are both incredibly wealthy and also incredibly intelligent and play 4d chess.
I'm not sure you believe that, that's why I wanted to ask a question instead of implying anything for your position. But since you asked.
> I think the masses produce the ideas. Those ideas are then selected and amplified by all sorts of people leveraging all sorts of means driven by all sorts of motives.
> My point is that, sure, the moneyed class does play a role, but reality is much more complex than that and I don't really buy the idea that the world is "controlled" by a bunch of "supermen" who are both incredibly wealthy and also incredibly intelligent and play 4d chess
These don't contradict what I said at all. You are arguing with a straw man.
I'm willing to answer your questions, but I just didn't understand that last one. Anyway it sounds like we are probably in agreement. I recognize the world to be complex and that there are many parties with different interests. My point only was that Google is willing to support narrow and even inaccurate narratives at the behest of those willing to pay them lots of money.
That's not what I'm saying. My intent is not to defend rich people. Yes obviously most of them don't spend their time controlling the media but instead spend time showing off on their yachts.
My point is something else: I don't buy the idea that there are two factions, the rich and the poor, and that all rich people have the same interests and thus are allies, and that all poor people have the same interests and are allied (or so they should be).
Sure, this view is partially grounded in reality, and that's why Marx did come up with it and why it has stuck to this day as sensible to so many people.
But I don't think it's true. I think it oversimplifies reality to the point that a spherical cow in comparison is anatomically accurate.
But it's worse than just being wrong. It actively stifles conversation. Any attempt to have a nuanced conversation about these topics ultimately devolves into an accusation of "you're defending the rich, just admit it". That's what turns an idea into an ideology. Ideologies are ideas with built-in self-defense mechanisms.
I wonder if Google trains its AI on paywalled data, that other scrapers don’t have access to but which those paywalled sites give full access for the Google bot to.
The thing that annoys me most is that sites are allowed to use the http referrer from Google to see what you're searching for.
That + spam sites spamming as many keywords as they can just mean whatever you search for 95% of the sites are spam after the first page.
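Back when search engines still passed the full URL in the Referer header, pulling the visitor's search terms out of it was trivial; a sketch of what a site could do (Google has sent origin-only referrers from its results pages for years, so this now mostly works for engines that still carry a `q` parameter):

```python
from urllib.parse import urlparse, parse_qs

def search_terms_from_referrer(referrer):
    """Extract the `q` query parameter from a search-engine referrer URL.

    Returns None if the referrer isn't from a recognized engine or
    carries no query (e.g. an origin-only referrer like
    "https://www.google.com/").
    """
    parsed = urlparse(referrer)
    if "google." not in parsed.netloc and "bing." not in parsed.netloc:
        return None
    terms = parse_qs(parsed.query).get("q")
    return terms[0] if terms else None
```

Stripping the query from the referrer is exactly why "search terms" reports in analytics tools dried up.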
Idk why we've let the Internet get like this. There's gotta be a way to sign off on real/trusted content. That's certainly not ssl certs. Could probably crowd source the legitimacy rating of a site or something.
That's another reason why people flock to the big names, reddit, youtube, etc. It's like McDonald's, people know that what they get this time will be exactly what they got before.
> More recently, Google seems totally okay with sites that show content to Googlebot but go out of their way not to show that content to regular users.
See also, pages behind Red Hat and Oracle tech support paywalls.