• Using DOM attribute or text nodes limits you to text only. This is, in practice, a very big limitation. The simple cases are Plain Old Data which can be converted losslessly at just an efficiency cost, like HTMLProgressElement.prototype.value, which converts to number. Somewhat more complex are things like classList and relList, each a live DOMTokenList mapping to a single attribute, which needs unique and persistent identity, so you have to cache an object. And it definitely gets more intractable from there as you add more of your own code.
• Some pieces of state that you may care about aren’t stored in DOM nodes. The most obvious example is HTMLInputElement.prototype.value, which does not reflect the value attribute. But there are many other things like scroll position, element focus and the indeterminate flag on checkboxes.
• Some browser extensions will mess with your DOM, and there’s nothing you can do about it. For example, what you thought was a text node may get an entire element injected into it, for ads or dictionary lookup or whatever. It’s hard to write robust code under such conditions, but if you’re relying on your DOM as your source of truth, you will be disappointed occasionally. In similar fashion, prevailing advice now is not to assume you own all the children of the <body> element, but to render everything into a div inside that body, because too many extensions have done terrible things that they should never have done in the first place.
It’s a nice theory, but I don’t tend to find it scaling very well when applied as purely as possible.
Now if you’re willing to relax it to adding your own properties to the DOM element (as distinct from attributes), and only reflecting to attributes or text when feasible, you can often get a lot further. But you may also find frustration when your stuff goes awry, e.g. when something moves a node in the wrong way and all your properties disappear because it cloned the node for some reason.
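To make that failure mode concrete, here’s a minimal sketch you can try in a browser console (data-count and count are names made up for the example):

    const el = document.createElement('div');
    el.setAttribute('data-count', '3');   // attribute: part of the document tree
    el.count = 3;                         // expando property: lives only on this object

    const copy = el.cloneNode(true);
    copy.getAttribute('data-count');      // "3": attributes survive cloning
    copy.count;                           // undefined: your own properties don't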
This is begging for injection attacks. In this case, for example, if parsed_text and filtered can contain < or &, or if post.guid or post.avatar.thumb can contain ", you’re in trouble.
Generating serialised HTML is a mug’s game when limited to JavaScript. Show me a mature code base where you have to remember to escape things, and I’ll show you a code base with multiple injection attacks.
Yeah, OP's code is asking for pain. I suspect there are now developers who've never had to generate HTML outside the confines of a framework, and so are completely unaware of the kinds of attacks you need to protect yourself against.
You can do it from scratch, but you essentially need to track the provenance of strings: either a string isn't HTML and needs to be escaped (e.g. user input), or it is HTML, which is either generated with escaping already done or static code. It seems like you could build this reasonably simply by using tagged template literals and, say, two different types of string that are used to track provenance.
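Roughly this sort of sketch; SafeHtml, escapeHtml and the html tag are all hypothetical names for illustration, not from any particular library:

    class SafeHtml {
      constructor(value) { this.value = value; }
      toString() { return this.value; }
    }

    const escapeHtml = s => String(s)
      .replaceAll('&', '&amp;')
      .replaceAll('<', '&lt;')
      .replaceAll('>', '&gt;')
      .replaceAll('"', '&quot;');

    // The tag escapes every interpolated value unless it's already SafeHtml.
    function html(strings, ...values) {
      let out = strings[0];
      values.forEach((v, i) => {
        out += (v instanceof SafeHtml ? v.value : escapeHtml(v)) + strings[i + 1];
      });
      return new SafeHtml(out);
    }

    const userName = '<script>alert(1)</script>';   // untrusted input
    const fragment = html`<p>Hello, ${userName}!</p>`;
    // fragment.value === '<p>Hello, &lt;script&gt;alert(1)&lt;/script&gt;!</p>'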
Although appealing, that’s an extremely bad idea, when you’re limited to JavaScript. In a language with a better type system, it can be only a very bad idea.
The problem is that different contexts have different escaping rules. It’s not possible to give a one-size-fits-all answer from the server side. It has to be done in a context-aware way.
Field A is plain text. Someone enters the value “Alpha & Beta”. Now, what does your server do? If it sanitises by stripping HTML characters, you’ve just blocked valid input; not good. If it doesn’t sanitise but instead unconditionally escapes HTML, somewhere, sooner or later, you’re going to end up with an “Alpha &amp; Beta” shown to the user, when the value gets used in a place that isn’t taking serialised HTML. It always happens sooner or later. (If it doesn’t sanitise or escape, and the client doesn’t escape but just drops it directly into the serialised HTML, that’s an injection vulnerability.)
Field B is HTML. Someone enters the value “<img src=/ onerror=alert('pwnd')>”. Now, what does your server do? If it sanitises by applying a tag/attribute whitelist so that you end up with perhaps “<img src="/">”, fine.
Server-side templating frameworks had context-aware escaping strategies for years before front end frameworks were even a thing. Injection attacks don't persist because this is a hard problem, they persist because security is not a priority over getting a minimum viable product to market for most webdev projects.
The old tried and true strategy of "never sanitize data, push to the database with prepared statements and escape in the templates" is basically bulletproof.
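For what it's worth, a minimal sketch of that strategy with node-postgres; the table, column and variable names are made up for the example:

    import pg from 'pg';
    const pool = new pg.Pool();   // connection details from the usual PG* env vars

    const userInput = 'Alpha & Beta <3';   // store it exactly as typed

    // The parameterised query keeps the value as data, so there's nothing to
    // sanitise and nothing to forget; escaping happens later, at render time,
    // in whatever format the view needs.
    await pool.query('INSERT INTO comments (body) VALUES ($1)', [userInput]);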
You're unnecessarily complicating this. The server knows which fields are HTML, so it just encodes the data that it returns, like we've been doing for 30 years now. If your point is that this approach is only good with servers that you trust, then that's useful to point out, although we're kind of already vulnerable to server data anyway.
You’re not getting it: we’re not talking about the server producing templated HTML, which is fine; but rather the server producing JSON, and then the client dropping strings from that object directly into serialised HTML. That’s a problem, because the only way to be safe is to entity-encode everything, but then when you use a string in a context that doesn’t use HTML syntax, you’ll get the wrong result.
It’s not an unnecessary complication. You fundamentally need to know what format you’re embedding something into, in order to encode it, and the server can’t know that.
Depending on what you do, you may want it unencoded, encoded for HTML data or double-quoted attribute value state (& → &amp;, < → &lt;, " → &quot;), encoded for a URL query string parameter value (percent-encoding but with & → %26 as well), and there are several more reasonable possibilities even in the browser frontend context.
These encodings are incompatible, therefore it’s impossible for the server to just choose one and have it work everywhere.
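A quick sketch of that incompatibility, running the same made-up value through three different sinks (the ad hoc replaceAll chain is an illustration, not a complete encoder):

    const value = 'Tom & Jerry "forever"';

    // HTML data / double-quoted attribute value:
    const forHtml = value
      .replaceAll('&', '&amp;')
      .replaceAll('<', '&lt;')
      .replaceAll('"', '&quot;');
    // 'Tom &amp; Jerry &quot;forever&quot;'

    // URL query string parameter value:
    const forQuery = encodeURIComponent(value);
    // 'Tom%20%26%20Jerry%20%22forever%22'

    // JavaScript string literal (yet another set of rules):
    const forJs = JSON.stringify(value);
    // '"Tom & Jerry \"forever\""'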
> It’s not an unnecessary complication. You fundamentally need to know what format you’re embedding something into, in order to encode it, and the server can’t know that.
There are two cases here:
1. Backend endpoints are specifically tied to the view being generated (returns viewmodels), in which case the server knows what the client is rendering and can encode it. This frankly should be the default approach because it minimizes network traffic and roundtrips. The original code displayed is perfectly fine in this case.
2. Endpoints are generic and the client assembles views by making multiple requests to various endpoints and takes on the responsibility that server-side frameworks used to do, including encoding.
Server-side sanitization means that your view code is inherently vulnerable to injection. You'll notice in modern systems you don't sanitize data in the database and you don't have to manually sanitize when rendering frontend code. It's like that for a reason.
Server-side sanitization and xss injection should be left in the 2000s php era.
If you mean filtering out undesirable parts of a document (e.g. disallowing <script> element or onclick attribute), that should normally be done on the server, before storage.
If instead you mean serialising, writing a value into a serialised document: then this should be done at the point you’re creating the serialised document. (That is, where you’re emitting the HTML.)
But the gold standard is not to generate serialised HTML manually, but to generate a DOM tree, and serialise that (though sadly it’s still a tad fraught because HTML syntax is such a mess; it works better in XML syntax).
This final point may be easier to describe by comparison to JSON: do you emit a JSON response by writing `{`, then writing `"some_key":`, then writing `[`, then writing `"\"hello\""` after carefully escaping the quotation marks, and so on? You can, but in practice it’s very rarely done. Rather, you create a JSON document, and then serialise it, e.g. with JSON.stringify inside a browser. In like manner, if you construct a proper DOM tree, you don’t need to worry about things like escaping.
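As a rough sketch of the contrast (post and container are stand-ins for whatever you’re actually rendering, and where):

    const container = document.querySelector('main');
    const post = { author: 'Alice <alice@example.com>', body: 'a < b && c' };

    // Manual serialisation: every interpolation is a place to forget escaping.
    container.innerHTML = '<p class="post"><b>' + post.author + '</b>: ' + post.body + '</p>';

    // Building the tree instead: no escaping to remember, the browser handles it.
    const p = document.createElement('p');
    p.className = 'post';
    const b = document.createElement('b');
    b.textContent = post.author;
    p.append(b, ': ', post.body);
    container.replaceChildren(p);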
What's wrong with filtering before saving is that if you forget about one rule, you have to go back and re-filter the already-saved data in the db (with some one-off script).
I think "normally" we should instead filter for XSS injections when we generate the DOM tree, or just before (such as passing backend data to the frontend, if that makes more sense).
Don't forget that different clients or view formats (apps, export to CSV, etc) all have their own sanitization requirements.
Sanitize at your boundaries. Data going to SQL? Apply SQL specific sanitization. Data going to Mongo? Same. HTML, JSON, markdown, CSV? Apply the view specific sanitizing on the way.
The key difference is that, if you deploy a JSON API that is view agnostic, the client now needs to apply the sanitization. That's a requirement of an agnostic API.
Please don’t use the word sanitising for what you seem to be describing: it’s a term more commonly used to mean filtering out undesirable parts. Encoding for a particular serialised format is a completely different, and lossless, thing. You can call it escaping or encoding.
I don’t like how you’re categorising things. Sanitising is absolutely nothing to do with encoding. You can sanitise without encoding, you can encode without sanitising, or you can do both in sequence; and all of these combinations are reasonable and common, in different situations. And sanitising may operate on serialised HTML (risky), or on an HTML tree (both easier and safer).
Saying sanitising is a form of encoding is even less accurate than saying that a paint-mixing stick is a type of paint brush. You can mix paint without painting it, and you can paint without mixing it first.
> Financially, a proper gTLD also can't raise prices unilaterally and weirdly, while if you pick a ccTLD, the country has free reign to arbitrarily change prices, delete your domain, take over your domain, etc etc.
Look into what’s happened with pricing on domains like .org and .info. They’re increasingly absurd, with the restrictions on price increases that once were there largely being removed, at the pushing of the sharks that bought the registries. Why are these prices increasing well above the rate of inflation, when if anything the costs should go down over time? Why is .info now almost twice as expensive as .com?
Although the .org price caps are gone, the registry has to raise prices uniformly for all domains. They can't target popular domains for discriminatory pricing. ccTLDs can.
> They can't target popular domains for discriminatory pricing.
That's not completely accurate. Section 2.10c of the base registry agreement says the following in relation to the uniform pricing obligations:
> The foregoing requirements of this Section 2.10(c) shall not apply for (i) purposes of determining Renewal Pricing if the registrar has provided Registry Operator with documentation that demonstrates that the applicable registrant expressly agreed in its registration agreement with registrar to higher Renewal Pricing at the time of the initial registration
Most registrars have blanket statements in their registration agreement that say premium domains may be subject to higher renewal pricing. For registry premium domains, there are no contractual limits on pricing or price discrimination. AFAIK, the registries can price premium domains however they want.
You omitted key portions of that section. Here's the full quote (emphasis added):
> The foregoing requirements of this Section 2.10(c) shall not apply for (i) purposes of determining Renewal Pricing if the registrar has provided Registry Operator with documentation that demonstrates that the applicable registrant expressly agreed in its registration agreement with registrar to higher Renewal Pricing at the time of the initial registration of the domain name following clear and conspicuous disclosure of such Renewal Pricing to such registrant
Furthermore:
> The parties acknowledge that the purpose of this Section 2.10(c) is to prohibit abusive and/or discriminatory Renewal Pricing practices imposed by Registry Operator without the written consent of the applicable registrant at the time of the initial registration of the domain and this Section 2.10(c) will be interpreted broadly to prohibit such practices
Yes, premium domains can be priced higher, but the Renewal Pricing has to be "clear and conspicuous" to the registrant at the time of initial registration. Are you aware of any litigation related to this?
The exact pricing isn’t disclosed. All they do is tell you the price will be “higher”. Anyone registering a premium domain is getting higher than uniform renewal pricing, so whatever they’re doing right now is considered adequate and that’s just generic ToS in the registration agreement AFAIK.
It sounds like you think I’m being deceptive. Do you know about any registry premium domains where someone has a contractually guaranteed price?
Also, based on my own anecdotal experience, ICANN doesn’t interpret 2.10c broadly and they allow the registries to push the boundaries as much as they want.
>>> This block was the result of a communication error between Zoom’s domain registrar, Markmonitor, and GoDaddy Registry, which resulted in GoDaddy Registry mistakenly shutting down zoom.us domain.
That sounds like MarkMonitor is at least partly at fault here.
> Mark Monitor have issued a correct request for the `serverUpdateProhibited`, but GoDaddy changed the code to `serverHold` instead.
I’m curious where you’re seeing what Mark Monitor requested; it doesn’t appear in the official status update. Is this public information formally posted somewhere we can all see?
I mean, one person is saying what to do and the other person is doing it. And the person doing things is taking down zoom.us... Also knowing who GoDaddy is and what they do...
Depending on what “algebra” as an entire class actually is (I don’t know of it in that form from my Australian upbringing or from elsewhere) I can see it possibly having real benefit: abstract reasoning is one of the major things that needs to be taught to kids and has huge benefits but too often isn’t particularly taught; and algebra with all its symbolic representations and logical reasoning is excellent for that.
From your single-paragraph anecdote I don’t know the full story, of course, but it’s plausible to me that it might be not solely a case of confusing correlation and causation, but at least partly because the described effect made sense to people making the decisions, based on their broad experience in education.
The point is that they're teaching algebra without ensuring that the students are proficient in the prerequisites, so those students who are behind are not actually learning anything. You might as well teach it in first grade for all the good it's doing.
I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with straight quotes show up among the bottom ten in our analyses. (Also etc, because I like to write &c.)
I’m not going to try comparing it with normalising apostrophes, but I’d be interested to know how much of a difference it makes. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
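(The normalisation itself would be trivial; something like this sketch, covering just the common quotation marks:

    const normalise = s => s
      .replace(/[\u2018\u2019]/g, "'")    // ‘ ’ → '
      .replace(/[\u201C\u201D]/g, '"');   // “ ” → "

)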
Curly vs. straight quotes is mainly a mobile vs. desktop thing AFAIK. Not sure what Mac does by default, but Windows and Linux users almost exclusively use plain straight quotes everywhere.
My impression is that iOS is the only major platform that even supports automatically curlifying quotation marks. Maybe some Android keyboards are more sensible about it, but none that I’ve used make it anything but manual.
Over time, these performance characteristics are very likely to change in Iterator’s favour. (To what extent, I will not speculate.)
JavaScript engines have put a lot of effort into optimising Array, so that these sorts of patterns can run significantly faster than they have any right to.
Iterators are comparatively new, and haven’t had so much effort put into them.
Sorry, but your comment is completely, completely wrong.
> .slice[0] does not allocate, nor does .filter[1], only map does.. so one allocation.
This is simply not true. I presume you’re misunderstanding “shallow copy” in the MDN docs; it’s pretty poor wording, in my opinion: it means shallow copies of the items, it’s not about the array; it’s not like a live collection, they all do create a new Array.
Array.prototype.slice is specified in a way that must allocate for the array, but since it’s a contiguous slice it could also be implemented as copy-on-write, only allocating a whole new chunk of memory to back the array when either array is modified and so they diverge. I don’t know for certain what engines do, but my weak guess is that browser engines will all behave that way. I’m fairly sure that for strings they all work that way. (Aside: such an optimisation would also require that the slice() caller be an Array. Most of the array methods are deliberately designed to work on array-like objects: not just Arrays, but anything with a length property and numbered properties. One fun place to see that is how NodeList.prototype.forEach === Array.prototype.forEach.)
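To make the array-like point concrete (these all use real, standard methods):

    const arrayLike = { length: 3, 0: 'a', 1: 'b', 2: 'c' };
    Array.prototype.slice.call(arrayLike, 1);               // ['b', 'c']
    Array.prototype.map.call('abc', c => c.toUpperCase());  // ['A', 'B', 'C'] (strings are array-like too)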
But Array.prototype.filter must allocate (in the general case, at least), because it’s taking bits and pieces of the original array. So the array itself must allocate.
Array.prototype.map similarly must allocate (in the general case), because it’s creating new values.
Then, when we’re talking about allocation-counting, you have to bear in mind that, when the size is not known, you may make multiple allocations, growing the collection as you go.
Rust’s equivalent of Array, Vec, starts with a small allocation, which depends on the specific type being stored but we’ll simplify and call it 8, and then when you try to add beyond that capacity, reallocates, doubling the capacity. (This is the current growth strategy, but it’s not part of any contract, and can change.)
A naive implementation of JavaScript backed by such a growth strategy would make one exact-sized allocation for slice(), approximately log₂ N allocations for filter() where N is the number of retained elements, and one exact-sized allocation for map().
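A rough way to count those allocations under a start-at-8-and-double strategy; this is a simplification for illustration, not what any engine is guaranteed to do:

    function countAllocations(finalLength) {
      let allocations = 1;   // the initial block of capacity 8
      let capacity = 8;
      for (let i = 0; i < finalLength; i++) {
        if (i === capacity) { capacity *= 2; allocations++; }
      }
      return allocations;
    }
    countAllocations(10);    // 2
    countAllocations(1000);  // 8 (capacity grows 8 → 16 → … → 1024)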
> > arr.values().drop(10).take(10).filter(el => el < 10).map(el => el + 5).toArray()
> Allocates once for .values[2], and again for .toArray[3].. there's decreased efficiency here.
It’s generally difficult to talk about allocations in a GC language at this level of detail, but in the way you tend to talk about allocations in such systems, .values() can be assumed not to allocate. Especially once you get to what optimisers are likely to do. Or, at the very least, drop(), take(), filter() and map() all allocate just as much, as they also create iterator objects.
—⁂—
> > Swapping variables
> Only do this if you don't care about performance (the advice is written like using the array swap hack is categorically better).
My own hypothesis: any serious JS engine is going to recognise the [a, b] = [b, a] idiom and optimise it to be at least as good as the temporary variable hack. If you’re going to call the array swap a hack, I can call temporary variables a hack—it’s much more of a hack, far messier, especially semantically. The temporary variable thing will mildly resist optimisation for a couple of reasons, whereas [a, b] = [b, a] is neatly self-contained, doesn’t leak anything onto the stack, and can thus be optimised much more elegantly.
Now then the question is whether it is optimised so. And that’s the problem with categoric statements in a language like JavaScript: if you make arguments about fine performance things, they’re prone to change, because JavaScript performance is a teetering stack of flaming plates liable to come crashing down if you poke it in the wrong direction, which changes from moment to moment as the pile sways.
In practice, trivial not-very-careful benchmarking suggests that in Firefox array swap is probably a little slower, but in Chromium they’re equivalent (… both quite a bit slower than in Firefox).
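For reference, the sort of quick and dirty benchmark I mean; the numbers vary a lot by engine and by how much the optimiser can see through the loop:

    function swapDestructure(n) {
      let a = 1, b = 2;
      for (let i = 0; i < n; i++) [a, b] = [b, a];
      return a + b;
    }
    function swapTemporary(n) {
      let a = 1, b = 2;
      for (let i = 0; i < n; i++) { const t = a; a = b; b = t; }
      return a + b;
    }
    for (const f of [swapDestructure, swapTemporary]) {
      const start = performance.now();
      const result = f(100_000_000);
      console.log(f.name, (performance.now() - start).toFixed(1), 'ms', result);
    }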
You're right.. all of these functions require more memory. Allocate is wrong... let's use shallow vs deep (significantly different expense). All pointers use a bit more memory than the original, shallow copies use more, but only the size of a primitive (best case) or pointer to an object (worst case.. often the same size), deep can use much, much more.
> arr.slice(10, 20).filter(el => el < 10).map(el => el + 5)
slice = shallow
filter = shallow
map = deep
(2s+d)
As you point out, many engines optimise shallow with copy on write (zero cost).. so just 1 allocation at map.
> arr.values().drop(10).take(10).filter(el => el < 10).map(el => el + 5).toArray()
values = deep
drop = shallow
take = shallow
filter = shallow
map = deep
toArray= deep
You’re counting completely the wrong thing. Shallow versus deep is about the items inside, but we care about the costs of creating the collection itself. As far as structured clones are concerned, none of the operations we’re talking about are deep. At best, it’s just the wrong word to use. (Example: if you were going to call it anything, you’d call .map(x => x) shallow.)
Array:
• Array.prototype.slice may be expensive (it creates a collection, but it may be able to be done in such a way that you can’t tell).
• Array.prototype.filter is expensive (it creates a collection).
• Array.prototype.map is expensive (it creates a collection).
So you have two or three expensive operations, going through as much as the entire list (depends on how much you trim out with slice and filter) two or three times, creating an intermediate list at each step.
Iterator:
• Array.prototype.values is cheap, creating a lightweight iterator object.
• Iterator.prototype.drop is cheap, creating a lightweight iterator helper object.
• Iterator.prototype.take is cheap, creating a lightweight iterator helper object.
• Iterator.prototype.filter is cheap, creating a lightweight iterator helper object.
• Iterator.prototype.map is cheap, creating a lightweight iterator helper object.
• Iterator.prototype.toArray is the thing that actually drives everything. Now you drive the iterator chain through, going through the list only once, applying each filter or transformation as you go, and only doing one expensive allocation of a new array.
In the end, in terms of time complexity, both are O(n), but the array version has a much higher coefficient on that n. For small inputs, array may be faster; for large inputs, iterators will be faster.
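One way to see the difference in work done, assuming an engine that already ships iterator helpers (e.g. recent Chromium or Node):

    const arr = Array.from({ length: 1000 }, (_, i) => i);
    let calls = 0;

    // Array methods do their work (and build intermediate arrays) immediately:
    const viaArray = arr.slice(10, 20).filter(el => (calls++, el < 15)).map(el => el + 5);
    console.log(calls);   // 10: filter has already run over the sliced copy

    calls = 0;
    // Iterator helpers are lazy: nothing runs until something drives the chain.
    const chain = arr.values().drop(10).take(10).filter(el => (calls++, el < 15));
    console.log(calls);   // 0: no work done yet
    const viaIterator = chain.map(el => el + 5).toArray();
    console.log(calls);   // 10: the same work, done in one pass driven by toArray()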
> My own hypothesis: any serious JS engine is going to recognise the [a, b] = [b, a]
This. As the author of this blog, I actually ran a benchmark where the loop body was only doing the swap, and I remember the penalty being around 2%. But yeah, if it's a critical path and you care about every millisecond, then sure, you should optimize for speed, not for code ergonomics.
CSS Custom Properties have a cost. If you’re using them as global variables, and don’t need to look them up from JavaScript, or change them according to media queries, it’s good to flatten them out of existence: your bundle will be smaller, your execution faster, and your memory usage reduced. Same with mixins.