> The second problem is competition. Some commercial providers say, 'I'm the leader. Why should I standardize?'
This is the reason. The only reason. Google was the leader in this story.
Google built infrastructure and intelligence to handle Norvig's other concerns: provenance and validity. It's not perfect, but it's good enough, and it's a major competitive advantage.
Why would they want everyone on the internet to have Google powers at low cost?
Wikipedia solves for these concerns. Open source does too. There's no reason we couldn't have had maintainers and curators and a distributed web of pubkey signing to vouch for good data. Most of the data would have been social by nature anyway, and sharing news and articles p2p would have been an early take at the fediverse, but broader than just the Twitter focus.
Google launched the WHATWG partially to supplant the W3C and its semantic web push. They kneecapped XHTML and its strong semantics in favor of a loosely typed, messy, and forgiving HTML5, because Google is one of only a few players that can deduce the semantics on its own. (They then pushed ahead unilaterally until Chrome was dominant. They fashion the web into an image that suits them.)
The Semantic Web would have threatened Google by lowering the barrier to entry to parties that wanted to connect and query data. Of course Google hates it.
HTML5 does have a fully supported XML representation, so there's no regression from XHTML. And Google themselves are working with schema.org to provide standards that endow web pages with strong semantics, along a semantic-web model. This is basically what powers "rich" SERP results in Google and other search engines. That doesn't look like they "hate" the semantic web all that much.
The example above is a complete, valid HTML5 document, and representative of how I like to write HTML using HTML5.
I leave attributes unquoted where allowed. Mainly, as long as the attribute value does not contain space, equal sign, quote marks, or trailing slash, the value can be left unquoted and so I do. In XHTML this is not allowed. In XML I’m not sure.
I omit closing tags where allowed. For example you see in the example above that I’ve left both the p tags and the li tags unclosed. In XML, this is not allowed.
No XML schema and no DTD inside of the file itself. In HTML5 neither an XML schema nor a DTD is specified as part of the markup.
In short, HTML5 as a whole is not valid XML. A subset of HTML5 may be valid XML. But a document can be valid HTML5 without being valid XML.
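To make that last point concrete, here is a minimal sketch using only the Python standard library. The sample document is a hypothetical stand-in written in the style described above (unquoted attributes, unclosed p and li tags), not the original example. The helper names `is_well_formed_xml` and `collect_start_tags` are made up for illustration. Note that `html.parser` is only a lenient tokenizer, not a full HTML5 tree builder, so this shows the XML side failing and the HTML side reading the same bytes without complaint, nothing more.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: unquoted attributes, omitted closing tags, no DTD.
HTML_DOC = """<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>Example</title>
<p>First paragraph
<p>Second paragraph
<ul>
<li>One
<li>Two
</ul>
"""


def is_well_formed_xml(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag and its attributes as the tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))


def collect_start_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print("well-formed XML?", is_well_formed_xml(HTML_DOC))
    for tag, attrs in collect_start_tags(HTML_DOC):
        print(tag, attrs)
```

The XML parser gives up at the first unquoted attribute, while the HTML tokenizer reads `lang=en` and `charset=utf-8` as ordinary attributes.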
Personally I like HTML5 a lot better than XHTML etc, exactly because the rules for HTML5 are so much more permissive than XHTML, so typing HTML5 by hand lets me type less to achieve the same and more than I did back before HTML5 existed.
> I omit closing tags where allowed. For example you see in the example above that I’ve left both the p tags and the li tags unclosed.
The genius of the HTML5 spec is that it allows this loose parsing while specifying an unambiguous mapping to the stricter syntax, so semantically this makes no difference at all (unlike the prior situation where different browsers parsed this kind of HTML differently). Of course you need an HTML5 parser rather than an XML parser, but these are common and don't represent a big hurdle in parsing.
There are other bits like namespaces and DTDs that differ.
See https://html.spec.whatwg.org/multipage/xhtml.html#the-xhtml-... . Note that this XML representation can be derived automatically by parsing the "permissive" HTML5 syntax, and reuses the same vocabulary as far as practicable. However, it is fully compatible with XML tools, and even with other XML namespaces within the same document, which are not allowed in the HTML5 syntax.
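As a rough illustration of deriving the strict form from the permissive one, here is a toy sketch in Python. It is not the HTML5 parsing algorithm: it only handles implied end tags for p and li plus a few void elements, it skips doctypes and comments, and it omits the XHTML namespace a real XML serialization would carry. The class name `XmlSketch`, the function `html_fragment_to_xml`, and the rule tables are all assumptions made for this example. For real work you would want a spec-compliant HTML5 parser (html5lib is one third-party option for Python).

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Simplified, hypothetical rule tables; the real spec covers far more cases.
VOID = {"br", "hr", "img", "input", "link", "meta"}
CLOSES_OPEN_P = {"p", "ul", "ol", "li", "div"}
LIST_CONTAINERS = {"ul", "ol"}


class XmlSketch(HTMLParser):
    """Re-emits lenient HTML as well-formed XML for a small subset of rules."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # names of currently open elements

    def _in_scope(self, tag):
        # Look for an open `tag`, stopping at the nearest list container.
        for name in reversed(self.stack):
            if name == tag:
                return True
            if name in LIST_CONTAINERS:
                return False
        return False

    def _pop_through(self, tag):
        # Close open elements until `tag` itself has been closed.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_OPEN_P and self._in_scope("p"):
            self._pop_through("p")      # implied </p>
        if tag == "li" and self._in_scope("li"):
            self._pop_through("li")     # implied </li>
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else '')}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:           # stray end tags are ignored
            self._pop_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data))


def html_fragment_to_xml(fragment):
    parser = XmlSketch()
    parser.feed(fragment)
    parser.close()
    while parser.stack:                 # close anything still open at EOF
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    print(html_fragment_to_xml("<ul class=menu><li>One<li>Two<p>para<p>more</ul>"))
```

The output can be fed straight to an XML parser, which is the practical point of having a defined mapping: you write the loose syntax by hand and still get something XML tools can consume.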