I don’t really think Google’s plan is that weird. And it would be amazing for decentralized networks, archiving, and offline web apps. Google can’t just serve nyt.com — they can serve a specific bundle of resources published and signed by nyt.com verified by your browser to be authentic and unmodified.
The current implementation of the AMP cache servers obviously doesn't help with decentralization.
I think what Spivak is saying, though, is right. If we could move from location addressing (DNS+IP) to content addressing, but not via the AMP cache servers, then in general anyone could serve any content on the web. Add in signing of the content addressing, and now you can also verify that content is coming from NYTimes, for example (rough sketch below).
Also, I'd say that the internet (transports, piping, glue) is decentralized. The web is not. Nothing seems to work with each other and most web properties are fighting against each other, not together. Not at all like the internet is built. The web is basically ~10 big silos right now, that would probably kill their API endpoints if they could.
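A rough sketch of the content-addressing half, ignoring the signing part: the address is derived from the bytes themselves, so it doesn't matter which server hands them to you.

import hashlib

def content_address(data: bytes) -> str:
    # The address is derived from the content itself, not from where it is hosted.
    return "sha256-" + hashlib.sha256(data).hexdigest()

def verify(data: bytes, address: str) -> bool:
    # Any peer or cache can serve the bytes; the client just re-hashes and compares.
    return content_address(data) == address

page = b"<html>an article</html>"
addr = content_address(page)
print(verify(page, addr))          # True, no matter who delivered the bytes
print(verify(b"tampered", addr))   # False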
I think this would require an entirely new user interface to make it abundantly clear that publisher and distributor are separate roles and can be separate entities.
I don't think this should be shoehorned into the URL bar or into some meta info that no one ever reads hidden behind some obscure icon.
Isn't it already the case though with CloudFlare and other CDNs serving most of the content? Very few people really get their content from the actual source server anymore.
That's a good point. I just feel that there is an important distinction to be made between purely technical distribution infrastructure like Cloudflare's and the sort of recontextualisation that happens when you publish a video on Youtube. I'm not quite sure where in between these two extremes AMP is positioned.
Thank you for this explanation. AMP has put a really bad taste in my mouth but what you describe here does have some interesting implications. Something to consider for sure.
Please fact check me on this, but the ostensible initial justification for AMP wasn't decentralization, but speed. Businesses had started bloating up their websites with garbage trackers and other pointless marketing code that slowed down performance to unbrowsable levels. Some websites would cause your browser to come close to freezing because of bloat.
So Google tried to formalize a small subset of technologies for publishers to use to allow for lightning fast reading, in other words, saving them from themselves. AMP might be best viewed as a technical attempt to solve a cultural problem: you could already achieve fast websites by being disciplined in the site you build, Google was just able to use its clout to force publishers to do it.
As for what it’s morphed into, I’m not really a fan because google is trying to capitalize on it and publishers are trying various tricks to introduce bloat back into AMP anyway. The right answer might be just for Google to drop it and rank page speed for normal websites far higher than it already does.
They’re suggesting a web technology which would allow any website to host content for any other website, under the original site’s URL, as long as the bundle is signed by the original site. That could be quite interesting for a site like archive.org, since the URL bar could show the original URL.
But AMP is a much narrower technology, I’d imagine only Google would be able to impersonate other websites, essentially centralised as you say. The generic idea would just be a distraction to push AMP.
Everything would be so much better if the original websites were not so overloaded with trackers, ads and banners, then there would be no need for these “accelerated” versions.
I see where you are going, but what if my website is updated? Is the archive at address _myurl_ invalidated, or is there a new address where it can be found? I am thinking of reproducible URLs for academic references or qualified procedures, for example, which might or might not matter in the intended use case.
Could there be net-neutrality-like questions in all this as well?
+1. The way I think about it is that signed exchanges are basically a way of getting the benefits of a CDN without turning over the keys to your entire kingdom to a third party. Instead you just allow distribution of a single resource (perhaps a bundle), in a cryptographically verifiable way.
Stated another way, with a typical CDN setup the user has to trust their browser, the CDN, and the source. With signed exchanges we're back to the minimal requirement of trusting the browser and the source; the distributor isn't able to make modifications.
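To make that concrete, here is a toy sketch of the idea in Python, not the real SXG wire format (the real thing signs HTTP headers as well and needs a certificate with a special extension); it assumes the third-party `cryptography` package:

from cryptography.hazmat.primitives.asymmetric import ed25519

# Publisher side: sign the (url, body) pair with the publisher's private key.
publisher_key = ed25519.Ed25519PrivateKey.generate()
url = b"https://nytimes.com/article"
body = b"<html>the article</html>"
signature = publisher_key.sign(url + b"\n" + body)

# Any distributor (a CDN, Google's cache, archive.org) can hand the browser
# (url, body, signature). The browser only needs the publisher's public key,
# obtained from the publisher's certificate, to check it.
public_key = publisher_key.public_key()
public_key.verify(signature, url + b"\n" + body)   # raises InvalidSignature if anything was modified
print("content is exactly what the publisher signed, regardless of who served it")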
It seems like there is a risk that an old version of a bundle will get served instead of a new one by an arbitrary host? Maybe the bundle should have a list of trusted mirrors?
There is a publisher-selected expiration date as part of the signed exchange, which the client inspects. The expiration also cannot be set to more than 7 days in the future at creation. This minimizes, but of course does not eliminate, this risk.
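Roughly what the client-side check looks like (an illustrative sketch only; the real spec carries the signing and expiry dates inside the signature header):

from datetime import datetime, timedelta, timezone

MAX_LIFETIME = timedelta(days=7)   # expiry can be at most 7 days after signing

def exchange_is_fresh(signed_at, expires_at, now=None):
    # Reject exchanges with an over-long lifetime or that have already expired.
    now = now or datetime.now(timezone.utc)
    if expires_at - signed_at > MAX_LIFETIME:
        return False
    return signed_at <= now < expires_at

# An exchange signed 5 days ago with a 7-day lifetime is still accepted.
signed = datetime.now(timezone.utc) - timedelta(days=5)
print(exchange_is_fresh(signed, signed + timedelta(days=7)))   # True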
Browsers could have a setting to optionally display the content anyway, along with a warning to the effect of "site X is trying to show an archive of site Y", similar to how we currently handle expired or self-signed SSL certificates.
Alternatively super short expiry times. It doesn't seem like it would be that concerning to have another site serving a bundle that was 5 minutes out of date. It doesn't seem like it should be too much load to be caching content every 5 minutes.
The New York Times surely already serves their pages through a CDN, silently, and with the CDN having the full technical capability to modify the pages arbitrarily. Signed exchange allows anyone to serve pages, without the ability to modify them in any way.
(Disclosure: I work for Google, speaking only for myself)
My objection is that it's no longer clear if you're dealing with content addressing or server addressing. If I see example.com in the URL bar, is it a server pointed to by the DNS record for example.com (or a CDN that server tells me to visit), or am I seeing content from example.com? If I click a link and it doesn't load, is it because example.com is suddenly down, or has it been down this whole time? Is the example.com server slow, or is the cache slow? Am I seeing the most recent version of this content from example.com, or did the cache miss an update?
What if there was a `publisher://...` or `content-from://...` or `content://...` protocol, somehow? (Visible in the address bar, maybe with a different icon too, so one would know it wasn't normal https:)
And by hovering, or one-clicking, a popup could show both the distributor's address (say, CloudFlare), and the content's/publisher's address (say, NyT)?
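Purely hypothetical, but the split could even be machine-readable. The scheme name and layout below are made up for illustration only:

def split_roles(url):
    # Made-up syntax: content-from://<publisher>@<distributor>/<path>
    rest = url.split("://", 1)[1]
    authority, _, path = rest.partition("/")
    publisher, _, distributor = authority.partition("@")
    return {"publisher": publisher, "distributor": distributor, "path": "/" + path}

print(split_roles("content-from://nytimes.com@cloudflare.com/2019/some-article"))
# {'publisher': 'nytimes.com', 'distributor': 'cloudflare.com', 'path': '/2019/some-article'}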
The session key, which is given carte blanche by the TLS cert to sign whatever it wants under the domain, is still controlled by Cloudflare.
To put it simply, Cloudflare still controls the content. The proposal here would avoid that, by allowing Cloudflare to transmit only pre-signed content.
Your browser would have a secure tunnel to CloudFlare which is encrypted with their key. But then that tunnel would deliver a bundle of resources, verified by your browser separately, that CF doesn’t have the key for.
The plan is bad because Google currently tracks all of your activity inside AMP-hosted pages, as described in their support article.
Google controls the AMP project and the AMP library. They can start rewriting all links in AMP containers to Google’s AMP cache and track you across the entire internet, even when you are 50 clicks away from google.com.
Technically yes, but not very practically. The domain is cookieless, so it would be difficult to even identify a specific user, other than by IP. Also, the JavaScript resource is delivered from the cache with a 1 year expiry, which means most times it's loaded it will be served from browser cache rather than the web.
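Easy enough to check for yourself; this assumes the standard AMP runtime URL and that the CDN answers HEAD requests (the exact header values may of course change over time):

import urllib.request

# Inspect the caching and cookie behaviour of the AMP runtime.
req = urllib.request.Request("https://cdn.ampproject.org/v0.js", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("Cache-Control:", resp.headers.get("Cache-Control"))  # long max-age: repeat loads hit the browser cache
    print("Set-Cookie:", resp.headers.get("Set-Cookie"))        # expect None on a cookieless domain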
Really? Could you publish how you are inspecting an unknown program to determine if it exhibits a specific behavior? There are a lot of computer scientists interested in your solution to the halting problem.
Joking aside, we already know from the halting problem[1] that you cannot determine whether a program will exhibit even the simplest behavior: halting. Inspecting a program for more complex behaviors is almost always undecidable[2].
In this particular situation, where Google is serving an unknown JavaScript program, a look at the company's history and business model suggests that the probability they are using that JavaScript to track user behavior is very high.
def divisors(n):
    # proper divisors of n (excluding n itself)
    for d in range(1, n):
        if n % d == 0:
            yield d

n = 1
while True:
    if n == sum(divisors(n)):   # n is "perfect" if it equals the sum of its proper divisors
        break
    n += 2                      # only odd candidates: this halts iff an odd perfect number exists
print(n)
I don’t know if this program halts. But I’m pretty sure it won’t steal my data and send it to third parties. Why? Because at no point does it read my data or communicate with third parties in any way: it would have to have those things programmed into it for that to be a possibility. At no point did I have to solve the halting problem to know this.
Also, if I execute a program and it does exhibit that behaviour, that’s a proof right there.
The same kind of analysis can be applied to Google’s scripts: look what data it collects and where it pushes data to the outside world. If there are any undecidable problems along the way, then Google has no plausible deniability that some nefarious behaviour is possible. Now, whether that is a practical thing to do is another matter; but the halting problem is just a distraction.
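For what it's worth, a crude version of 'where does it push data' doesn't need to solve anything undecidable: scan for the handful of browser APIs that can emit network traffic. A heuristic sketch only, and minification or dynamic property access can evade it, but it illustrates the kind of analysis meant here:

import re, sys

# Browser APIs through which a script can push data off the page.
NETWORK_APIS = {
    "fetch": r"\bfetch\s*\(",
    "XMLHttpRequest": r"\bXMLHttpRequest\b",
    "sendBeacon": r"\bnavigator\.sendBeacon\b",
    "image beacon": r"\bnew\s+Image\s*\(",
    "WebSocket": r"\bnew\s+WebSocket\s*\(",
}

def network_calls(js_source):
    return [name for name, pattern in NETWORK_APIS.items() if re.search(pattern, js_source)]

with open(sys.argv[1]) as f:   # path to whichever script you want to audit
    print("network-capable APIs found:", network_calls(f.read()) or "none")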
Tracking doesn't require reading any of your data. All that is necessary is to trigger some kind of signal back to Google's servers on whatever user behavior they are interested in tracking.
> or communicate with third parties
Third parties like Google? Which is kind of the point?
> [example source code]
Of course you can generate examples that are trivial to inspect. Real world problems are far harder to understand. Source is minified/uglified/obfuscated, and "bad" behaviors might intermingle with legitimate actions.
Instead of speculating, here is Google's JS for AMP pages:
How much tracking does that library implement? What data does it exfiltrate from the user's browser back to Google? It obviously communicates with Google's servers; can you characterize if these communications are "good" or "bad"?
Even if you spent the time and effort to manually answer these questions, the javascript might change at any time. Unless you're willing to stop using all AMP pages every time Google changes their JS and you perform another manual inspection, you are going to need some sort of automated process that can inspect and characterize unknown programs. Which is where you will run into the halting problem.
Funny how people can literally "forget" that Google is a third party. People at Google probably believe they are not a third party. Not even asking for trust, just assuming it. No other alternatives. A trust relationship by default.
> Could you publish how you are inspecting an unknown program to determine if it exhibits a specific behavior? There are a lot of computer scientists interested in your solution to the halting problem.
This has nothing to do with the halting problem, because that is concerned with all possible programs, not some particular program.
We obviously know if some programs halt.
while True: pass

Is an infinite loop.

x = 1
y = x + 2

Halts.
More complex behaviours can be easier. Neither of my programs there make network calls.
As a user I can choose to block GA, either through URL blocking or through legally mandated cookie choices in some regions (e.g. France). When served from Google I have no choice in the matter.
The AMP spec REQUIRES you to include a Google-controlled JavaScript URL with the AMP runtime. So technically the whole signing bit is a little moot, given that the JS could do whatever it wanted.
The same could be said of any CDN-hosted JavaScript library, for example jQuery. There is an open intent to implement support for publishers self-hosting the AMP library as well.
For most JS served by CDN, you can (and should) use Subresource Integrity to verify the content. At least the last time I was involved in an AMP project, Google considered AMP to be an "evergreen" project and did not allow publishers to lock in to a specific version.
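For reference, the integrity value is just a base64-encoded digest of the exact bytes you expect; a sketch of computing one (sha384 is the commonly used algorithm, and lib.js stands in for a local copy of the CDN file):

import base64, hashlib

def sri_hash(path):
    # Subresource Integrity value: "sha384-" + base64(sha384(file bytes))
    with open(path, "rb") as f:
        digest = hashlib.sha384(f.read()).digest()
    return "sha384-" + base64.b64encode(digest).decode()

# Pin the exact version in the page:
#   <script src="https://cdn.example.com/lib.js"
#           integrity="sha384-..." crossorigin="anonymous"></script>
print(sri_hash("lib.js"))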
"Registrant Organization: Google LLC
Registrant State/Province: CA
Registrant Country: US
Admin Organization: Google LLC"
Note that jQuery, as mentioned in some GP comment, has no such requirement. Google AMP is quite unique in this regard. This is NOT some general CDN type issue. Also... agreed, WTF is "open intent"?
I agree, if we finally got a way to have working bundles on the web, that would be extremely useful. (And would also restore some of the capabilities of browsers to work without internet connection).
It seems to me, a lot of the security concerns come from the requirements to make pages served live and pages served from bundles indistinguishable to a user - a requirement that really only makes sense if you're Google and want to make people trust your AMP cache more.
I'd be excited about an alternative proposal for bundles that explicitly distinguishes bundle use in the URL (and also uses a unique origin for all files of the bundle).
I believe the issue with this is that users already largely don't understand decorations in the URL, for example the difference between a lock and an Extended Validation certificate bubble. Educating a user on what a bundle URL means technically may be exceedingly challenging.
The problem is ownership. Google is “stealing” or caching content for what they consider a better web.
I don’t support ads but I also don’t support Google serving a version of the page that steals money from content creators. So, therein lies the problem: choice.
I can imagine a future where amp is ubiquitous and Google begins serving ads on amp content. Luckily, companies have to make money and amp is not in most people’s or company’s best interests.
If amp was opt-in only, this would be much more ethically sound.
Signed exchanges guarantee that the content cannot be modified by the cache, such as ad injection.
Google has never injected ads into any cache served AMP document (technically if the publisher uses AdSense, this is false, but that's not the point you are making).
It's difficult to follow what definition of theft is being suggested. The cache does not modify the document rendering, it's essentially a proxy. In a semantic sense, this is no different than your ISP delivering the page or your WiFi router.
Just hearing about this from the thread, I'm getting an IPFS vibe from this. It would be interesting to see that tech get more native integration with the browser from this idea.
If I publish mycoolthing.com/thing, it could be mirrored over a P2P network as peer1.com/rehosted/mycoolthing.com/thing, peer2.com/rehosted/mycoolthing.com/thing, etc., in a way that would make it evident, even to end-users not familiar with the protocol, that the content is from mycoolthing.com.
I think the point is that signed exchanges ( https://developers.google.com/web/updates/2018/11/signed-exc...) could potentially be useful if separated from AMP and made into an actually secure thing. For example, the spec doesn't require specific Google-controlled JS URLs to be in the content.
Signed exchanges is actually a separate spec from AMP. The browser implements it independently. There is no requirement for AMP pages to use signed exchanges, nor for signed exchanges to be AMP.