I used to let Google host jQuery for me. And then one day their CDN went down for a large chunk of the midwest (where I live), and I noticed the render time of my site (and a few others) jump up anywhere between 5x and 100x.
That was the moment I resolved never to leave critical, blocking elements of any site I run into the hands of others, no matter how well known or reliable they are. (FWIW, this also includes ad network invocation scripts and similar, which always seem to be notoriously slow to load).
One of the major problems involved with using third-party login systems, like Facebook Connect, is exactly that: you need to make sure it's not critical (so you need your own login system anyway) and you need to make sure it's not blocking (which involves iframe shenanigans).
I decided it's outside the scope of that post (which was already laboriously long-winded), but you can/should use a fallback technique like this to mitigate the potential for Google downtime: http://weblogs.asp.net/jgalloway/archive/2010/01/21/using-cd...
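For anyone who doesn't want to click through, the gist of that fallback is roughly this (a sketch; the local path is just a placeholder for whatever copy you host yourself):

    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
    <script>
      // If the CDN request failed or was blocked, jQuery won't be defined
      // by the time this inline block runs, so fall back to a local copy.
      if (typeof jQuery === 'undefined') {
        document.write('<script src="/js/jquery-1.4.2.min.js"><\/script>');
      }
    </script>

Note that it only kicks in once the first request fails outright or completes; it can't do anything about a CDN that's merely slow, which is what much of this thread is arguing about.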
This technique doesn't address the right problem, and it might rest on a misunderstanding of basic statistics.
That is, no matter how reliable the other source (Google etc.) is, its availability is still below 100%. If you host the JS on your own server, your site becomes slow only if your server hangs. If you host the JS somewhere else, your site becomes slow whenever your server or the other server hangs. The probability of the latter is always greater than the former, so you don't really gain anything. (For example, if your server is up 99.9% of the time and the CDN 99.99%, needing both up gives you about 99.89%, worse than your server alone.)
So this trick slightly improves the good cases but increases the likelihood of the bad cases. That kind of trade-off isn't desirable. Usually, people design trade-offs for the exact opposite: sacrificing speed in the normal case (which should be more than fast enough anyway) in order to decrease the probability of the worst case.
(BTW, this is true for almost all long-lived projects. The opposite strategy only makes sense in "car racing"-like situations where you either win fast or lose everything. However, hardly any website is designed to live for only a few weeks.)
Hosting jQuery on your own website increases the load on your website, thus increasing the probability that your website will fail.
The more sites that use the CDN, the more likely someone coming to your site already has jquery in their cache.
If you're just trying to minimize downtime regardless of cost, then I absolutely agree with your analysis. However, if you're trying to minimize something more complex involving both downtime and cost, then maybe there's a point where it makes sense to use the CDN. Something very high volume like Twitter, for example.
I've thought to myself, "I wish there was a timeout attribute on the <script> tag," about once every few months for the past 10 years. Is there any good reason you can't manually specify how long you want the browser to wait for an external file to load?
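In the absence of such an attribute, the closest approximation I know of is to inject the script dynamically and race it against a timer. A rough sketch (the URLs, the 2-second budget, and the fallback path are all just placeholders):

    function loadScriptWithTimeout(src, timeoutMs, onTimeout) {
      var settled = false;
      var script = document.createElement('script');
      script.src = src;
      script.onload = script.onreadystatechange = function () {
        if (!settled && (!this.readyState ||
            this.readyState === 'loaded' || this.readyState === 'complete')) {
          settled = true; // loaded in time
        }
      };
      document.getElementsByTagName('head')[0].appendChild(script);

      setTimeout(function () {
        if (!settled) {
          settled = true; // stop waiting on the slow copy
          onTimeout();
        }
      }, timeoutMs);
    }

    // Try the CDN, but give up after 2 seconds and use a self-hosted copy.
    loadScriptWithTimeout(
      'http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js',
      2000,
      function () {
        loadScriptWithTimeout('/js/jquery-1.4.2.min.js', 2000, function () {});
      }
    );

It's not a perfect substitute: the injected script doesn't block rendering the way a plain script tag does, the browser will still finish downloading (and executing) the slow copy in the background, and anything that depends on jQuery has to wait for a callback rather than assuming it's available on the next line.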
That fallback technique mitigates the majority of failure cases though (NoScript, overbearing firewall, blocked regions, etc). The CDN itself being slow or down is vanishingly rare.
"The CDN itself being slowly-down is vanishingly rare."
Based on...?
It happened to me once a few months ago. It was down for hours. The negative impact was very real and painful and in my opinion outweighed the other advantages of hosting using Google's CDN.
Pingdom tested Google, Microsoft, and Edgecast's jQuery CDNs every minute for a couple weeks and found all of them averaged between 100-150ms to download jQuery[1]. Google's was actually the slowest of those, averaging a turtle's pace of ~130ms from all of Pingdom's datacenters. They're all so close that the Google CDN's overwhelming caching advantage should be preferable though.
More anecdotally, I've been running a few Pingdom-type tests myself for a longer period, using uptime tools on a few of my servers and mon.itor.us. Except for that brief outage the morning of May 14, 2009[2], I haven't seen a net-wide outage or even a 250+ ms slowdown.
I'd be genuinely interested in any concrete data to the contrary.
I don't know, this is anecdotal, but in the majority of cases where I see pages taking a long time to load due to third-party JS and the like, it's waiting for actual data rather than anything else.
The OP said:
"and I noticed the render time of my site (and a few others) jump up anywhere between 5x and 100x."
Which would indeed suggest that it did load, but at significantly reduced speeds.
These highly-available, distributed CDNs hosting static content don't have the same failure characteristics as something like an advertising script or Twitter widget. Where the latter do often hang the page (frustratingly), the popular CDNs that host jQuery aren't prone to that under any but the rarest of circumstances.
Maybe unrelated, but you'd be surprised how often the "Waiting for domain.com..." in your browser's status bar is misleading. Interactions between externally referenced scripts, images, and scripts that use document.write can produce "interesting" results in most browsers.
The same thing can be said about other services and I doubt that Joe's hosting service availability is better than Google's. Though I admit that if Google fails and at the same time your site doesn't, it sucks.
It seems to me that unless the likelihood of a cache miss is fairly small, you need to balance the probability of a cache hit against the expense of an extra HTTP call, as opposed to bundling the jQuery library directly with your custom JS via some minification trickery (and two HTTP calls if you're using both jQuery and jQuery UI).
I have no doubt that the likelihood of a cache hit here is growing, but I wonder what the likelihood of an actual hit is. These data show that 4.7% of the top 1000 Alexa sites use some version of jQuery. What you'd need to consider is the likelihood that your visitor has (a) visited one of those 47 sites, (b) that the site used the same version of jQuery as you do, and (c) that it happened recently enough that the (relatively large) files are still locally cached. I suspect that for most sites that works out to much more than 4.7%, but is it more than 50%? If not, aren't half of your users getting a slower response as a result?
(Moreover, and I don't know if or how this affects the jQuery CDN, but doesn't it seem like many sites drag because of delays in loading the Google Analytics JavaScript files? Wouldn't this pose an even greater problem if you're using Google to serve jQuery, since your UI depends on it?)
I started off using Google CDN to host my jQuery file, then later ditched it because about 20% of the time there would be a noticeable delay in retrieving it (if I'd cleared my cache).
There's really no reason not to just host jQuery yourself. Use gzip, set a far-future Expires header, and name the jQuery file by version so that when you upgrade, the cached filename changes. That's all you need to do, really.
One last note: the benefit of putting script tags at the bottom of the body is very similar to having the scripts cached in the first place. Just in case you didn't know, putting script includes at the bottom of the page lets the browser render the page progressively as it retrieves the HTML text [generally very, very quickly]. Scripts in the HEAD block rendering, as the browser needs to load each script file sequentially, in case there are dependencies. [Note: not exactly true, it will grab several in parallel and execute them in order, but there's still a delay.]
Whether or not the scripts are cached, very fast page rendering will make the page appear to have loaded quickly, and the user likely won't need the JavaScript before the scripts finish loading anyway, even if they weren't already cached.
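To make that concrete, a self-hosted, bottom-of-body setup looks roughly like this (a sketch; the file names, paths, and version numbers are just examples):

    <html>
      <head>
        <title>Example</title>
        <!-- stylesheets in the head so the page can render progressively -->
        <link rel="stylesheet" href="/css/site.css">
      </head>
      <body>
        <p>Content here renders without waiting on any script downloads.</p>

        <!-- scripts last, self-hosted, served gzipped with a far-future
             Expires header; the version in the filename busts the cache
             when you upgrade -->
        <script src="/js/jquery-1.4.2.min.js"></script>
        <script src="/js/site-1.0.js"></script>
      </body>
    </html>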
Those are all good points, although as a different approach you can offset some of the load time of putting scripts in the head by flushing the head to the browser as soon as possible.
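A minimal sketch of what early flushing can look like, assuming a bare Node.js HTTP server (the markup and the artificial delay are purely illustrative, not the original commenter's code):

    var http = require('http');

    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/html' });

      // Send the <head> (with its script/style references) immediately,
      // so the browser can start fetching those resources while the rest
      // of the page is still being generated.
      res.write('<html><head><title>Example</title>' +
                '<script src="/js/jquery-1.4.2.min.js"></script>' +
                '</head><body>');

      // Pretend the body takes a while to produce.
      setTimeout(function () {
        res.write('<p>Slow-to-generate content goes here.</p>');
        res.end('</body></html>');
      }, 500);
    }).listen(8080);

The same idea applies with PHP's flush() or any framework that lets you send the response in chunks.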
I'm not willing to hand over the security of my websites and privacy of my users to a third party, in exchange for my first page load to be fractionally shorter for a small number of my visitors.
Google's jQuery hosting is now a highly desirable target, and I don't want to be among the victims if it does get attacked. We learnt earlier this year that even Google can be hacked.
If Google's CDN were hacked (as unlikely as that is), it's almost certain that you'd find out about it far sooner than if your own server were hacked. There would be a huge controversy and then it would be quickly fixed, probably in the course of hours or minutes, just like with the Twitter CSRF issue this morning.
Conversely, the Internet is absolutely littered with compromised sites that have been modified to inject malicious scripts.
That's not a very high cost. They don't choose to hit you; they let their scripts and botnets look around for old and vulnerable software.
Have you looked at your raw httpd logs? When I look at mine, and grep away known-cookies, I see that I'm frequently scanned by hundreds of IPs looking for vulnerabilities in common software packages.
And that's just the stuff that shows up in logged HTTP queries. I don't want to think about how likely it is that tools like Nessus are constantly being run against IP ranges that I sit within.
Ok, sure, you can believe you're going to be more on top of keeping your site secure than a high-value target like Google. I don't know the target value of your site, but I doubt it's as high as that of the server hosting the jQuery file you're afraid of pulling remotely. And you can bet that Google knows it has high-target-value, externally facing assets, and is watching them harder, and with more eyes, than you would.
The thing we're discussing here is whether jquery.js is stored on my server with the rest of my website, or some other third party server. I'm not sure how the things you've said above apply to this discussion?
You were critiquing the security cost of hosting on your own server versus that other server. It was pointed out to you that the admins of that other server would likely learn of (and react to) a breach on their end with lower latency than you would for your server.
You implied that the security cost for hosting on your server was actually lower, because you weren't as much of a target. My reply was an attempt to point out to you at a technical level why that was a specious argument; your servers are likely being scanned by the same botnets that are scanning mine with automated exploit attempts against old and vulnerable software, and common errors in securing a server.
It's going to be far easier and cheaper for them to take a shotgun-scanner approach against a large class of average systems than to apply manual, concerted effort against a small set of high-value targets like CDN nodes.
The cost to the attacker to attack your system with automated tools is near nil. They'll attack, and if they get in, that's gravy. Using "we're not a target" as a security model makes about as much sense as putting an unpatched Windows box in your home router's DMZ.
I think the part I may have misunderstood was where you said, "With Googles CDN, they have to hack either my website, or Googles CDN.", and I interpreted that as an exclusive condition rather than an inclusive one. It was probably the "either" that did that.
With that misunderstanding corrected, I believe you're generally correct on the security argument. There's still some plausible variation in terms of server security policy and the implementation of things like intrusion detection (is it safer to keep all your money in your home, or to keep most of it in a safe deposit box at a bank?), but that's not the key problem I thought I noticed in your argument, and not one worth devoting energy to.
One thing that doesn't seem to have numbers, though I think the data would be sufficient to give them, is what the caching probability is like after taking the fragmentation into account. If I reference the Google CDN URL for jQuery 1.4.2, how many of the top 200,000 sites reference that? I assume it's rather less than the 6,953 that reference any version, but how much less?
The split is about 50/50 right now. I've run the crawler three times in the last ~5 months and observed the transition from 1.3.2 to 1.4.2 moving along quite nicely though. After my first run, 1.4 adoption was so anemic that I was worried 1.3.2 was going to be jQuery's IE6. At this rate, it looks like 1.3.x should be a small minority by the time 1.5 rolls around.
HTTP errors – About 10% of the URLs I requested were unresolvable, unreachable, or otherwise refused my connection. A big part of that is due to Alexa basing its rankings on domains, not specific hosts. Even if a site only responds to www.domain.com, Alexa lists it as domain.com and my request to domain.com went unanswered.
At first, that may seem like an awful lot of potential error. However, the one thing all of these inaccuracies have in common is that none of them favor the case for using a public CDN.
I would have to disagree with the last paragraph there. I think that if someone is so careless that their page isn't available without the "www.", there's a very strong chance they haven't heard of a CDN either. So domains that don't work without the "www." are, in my opinion, favouring the non-CDN side.
I've started to see some best-practice-if-you-ignore-user-expectations guides out there which say that letting the domain work without the www is Not A Good Idea. I don't really know why this is the case, though.
To be clear, I did not adjust any of my numbers to include an extrapolated extra 10%. Any numbers you see in my post are based on direct observation of a script tag's src attribute.
The one reason we don't use Google's CDN for our public website: Some of our business users block sites by domain or IP, so they allow our site but block google's CDN. It's a PITA to get a rule added to a client's security setup.
I don't know their institutional reasoning, but I believe they are on a whitelist based system. They're not so much blocking Google as they are allowing us through.
That's a scenario that the local-fallback technique handles well. The CDN reference will immediately fail for those overly-firewalled users, jQuery will be undefined in the next script block, and the fallback can detect that and inject a script element referencing a local copy instead.
I was playing around with the Google Maps API one day when it started throwing very strange errors. Upon further investigation, I found the library URL was returning an HTML captcha page, which is not very useful to a browser expecting a JavaScript file.
Even Google screws up simple stuff sometimes. So I think I'll pass on using their CDN for something as small as the jQuery library. You're optimizing the wrong thing if you're worried about this.
Another option for using a CDN that gives you better control and outage visibility is to sign up for a pay-as-you-go CDN account yourself. GoGrid and Speedyrails resell the Edgecast CDN, and Softlayer resells the Internap CDN; both perform very well and support both origin-pull and POP-pull models. The cost for these services would only be about $0.57 per 100k jQuery hits (assuming the 24KB minified version is used, that's roughly 2.3GB of transfer).
One of the biggest underlying benefits of using a shared, public CDN for this is that you can take advantage of cross-site caching. As more and more people use it, the potential for that is greater and greater.
However, it only works if sites reference exactly the same URL; just referencing the same file is unfortunately not good enough. So private CDNs like those don't confer quite the same benefit (though they're a great idea for hosting site-specific assets, of course).
I'm interested in the technology behind your crawler; you could potentially use it to discover many more things, like which sites use popular APIs, etc. What language is it written in? What do you use for the backend/DB? How fast is it?
I'm working on a project involving similar large scale crawling and I would love to know more.
It's a C# console application that logs to a SQL Server Express database. It's fairly primitive and the source isn't anything I'd want to advertise. I'd be happy to share it with you if you don't mind C# and want to give it a whirl yourself though.
That site is nice, but they also want to charge close to $2000 for a list of sites using a single technology. I could definitely do this myself for much less.
To be honest, I didn't dig into that site deeply enough to notice that. I pulled this out of an email I was sent just yesterday.
Sure, I'd be interested in collaborating if there's not one out there already.
The spider part is easy, you just need a web client (e.g. Ruby's or Python's Mechanize, Java's HTTPClient, even just wget or curl) coupled with an HTML parser (Hpricot, Nokogiri, Tidy, etc., or even some basic regular expressions). One can readily hack something rough together in an hour or two. Gabriel might have a lot of the data and certainly the code in order to produce DuckDuckGo, but he may have good reasons to keep that private.
The harder part, and the part that I wonder if builtwith is doing correctly, is to do the technology detection. Things like JavaScript libraries or CSS frameworks might be fairly easy to detect, but it is not trivial to reliably detect some of the server side technologies. I recently put together a script to survey the operating system and web server in use at a large number of domains from Alexa's top million list (similar to what Netcraft does) and there are plenty of servers that make that difficult, let alone determining whether a site is built with Ruby, Java or PHP. There are HTTP headers that could tell you, but not everyone uses them. There are certain signatures that give a pretty good clue, but those aren't always present and can be downright misleading. (I've seen sites that migrated from ASP to Java Servlets, for example, that kept .aspx URLs to avoid breaking links.)
If I remember correctly someone posted a JavaScript framework survey based on a similar spidering approach on HN a while back, you might be able to find it at searchyc.