That's a really neat solution, and avoids the cognitive overhead of having to remember yet another password (or the security risk of re-using passwords). I particularly like the way you tie the log-in token to a particular browser session so that it can't be hijacked!
Plus by merging all of the log-in paths (registration, 'forgot password', and normal login), you have one thing to design and secure rather than three. That seems like a huge advantage from a security perspective.
It's worth noting that, if you need something which runs on really low power, ARM have their R and M series processors. So even if the A series did become really power-hungry, the other two lines presumably wouldn't.
Eg, the processor in the BBC MicroBit pulls about 10mA @ 2.5V or so. Of course, it only runs at 32MHz or so & has a whopping 16KB of RAM, but you can run one off a pair of AA batteries for a week or two without ever sleeping.
> I suspect the results in this paper could be improved with more modern gather techniques on newer x86-64 processors.
By that, do you mean the AVX2 'gather' type instructions? If not, I'd be interested to know what those techniques are.
As for AVX2 gathers, I had to look this up recently and it sounds like they're about as fast as manually unpacking the vector and performing scalar loads. That is to say, they're decidedly not fast. On the other hand, it sounds like (as of Skylake) they're bottlenecked on accesses to the L1 cache, so they're about as fast as they reasonably could be.
> do you mean the AVX2 'gather' type instructions?
Yes, and AVX512, with two caveats:
1. This is processor-dependent: different processors have different relative instruction latencies.
2. Sometimes a related instruction is faster than the instruction designed specifically for that purpose. For instance, in the glibc on my machine, memcmp and strncmp are NOT implemented using the SSE4.2 string-comparison instructions, but instead use ptest and pcmpeq respectively, because that turns out to be faster. The same phenomenon could hold for gathers as well.
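To pin down what a gather actually computes: semantically it's just a batch of indexed loads, result[i] = base[idx[i]]. Here's a small sketch using numpy fancy indexing as a stand-in (the table, indices, and names are mine, purely for illustration); on some microarchitectures the hardware gather effectively decomposes into the scalar loop shown below, which is why the two can end up equally fast:

```python
import numpy as np

# A gather such as AVX2's vpgatherdd performs result[i] = base[idx[i]]
# for each lane. numpy's fancy indexing is the same operation, minus the
# SIMD execution - handy for prototyping gather-heavy kernels.
base = np.arange(100, dtype=np.int32) * 3   # arbitrary lookup table
idx = np.array([7, 0, 42, 42, 13, 99, 1, 5], dtype=np.int32)

gathered = base[idx]                        # the "gather"

# Equivalent scalar loop - what a micro-coded gather may effectively
# fall back to on some processors:
scalar = np.array([base[i] for i in idx], dtype=np.int32)

assert (gathered == scalar).all()
```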
If you want exactly zero at the end points, you could do something like the post does: approximate g(x) = sin(x) / (x(pi+x)(pi-x)) with a polynomial, then multiply the factor x(pi+x)(pi-x) back in. You can still do that with the Remez algorithm.
Also, a while ago I realized that you can tweak the Remez algorithm to minimize relative error (rather than absolute error) for strictly-positive functions - it's not dissimilar to how this blog post does it for Chebyshev polynomials, in fact. I should really write a blog post about it, but it's definitely doable.
So combining those two, you should be able to get a good "relative minimax" approximation for pi, which might be better than the Chebyshev approximation depending on your goals. Of course, you still need to worry about numerical error, and it looks like a lot of the ideas in the original post on how to deal with that would carry over exactly the same.
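To make the factoring trick concrete, here's a small numpy sketch. Note it uses np.polyfit (plain least squares) as a stand-in for a true Remez/minimax fit, and the degree and sample grid are arbitrary choices of mine - the point is only that the endpoint zeros come out exact by construction:

```python
import numpy as np

# Approximate g(x) = sin(x) / (x*(pi - x)*(pi + x)) instead of sin(x)
# directly, then multiply the factor back in. g has removable
# singularities at 0 and pi, so sample on the open interval.
xs = np.linspace(0.01, np.pi - 0.01, 500)
g = np.sin(xs) / (xs * (np.pi - xs) * (np.pi + xs))

coeffs = np.polyfit(xs, g, 6)   # least-squares stand-in for Remez

def approx_sin(x):
    # Exactly zero at x = 0 and x = pi by construction: the polynomial
    # part is finite there, and the factor vanishes exactly.
    return x * (np.pi - x) * (np.pi + x) * np.polyval(coeffs, x)

assert approx_sin(0.0) == 0.0 and approx_sin(np.pi) == 0.0
assert abs(approx_sin(1.0) - np.sin(1.0)) < 1e-3
```

A real Remez (or relative-error Remez) fit of g would tighten the interior error; the exact zeros at the endpoints are unaffected by which fitting method you use.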
The paper has a section on this, around the end of page 4, which is really interesting. The short version is: They compared their double-precision results to extremely high-precision Taylor expansions (with theoretical 70+ digit accuracy, and calculated in 100+ digit accuracy), and found that they matched to the accuracy you'd expect.
That doesn't guarantee that the orbits are perfectly periodic, I suppose, but it does suggest that the orbits are stable with respect to rounding errors up to those you get from using doubles.
I am not a security researcher, but I think you could keep the benefits of both compression and security, as long as you're careful on the server side:
Say you have a document structured like [boring data] [secret data] [boring data]. I don't know if any existing compressor lets you do this, but the gzip file format (really the 'deflate' format used inside it) allows you to encode this (schematically) as follows:

[boring data] || [secret data] || [boring data]
where each || is i) a chunk boundary (the Huffman compression stage is done per-chunk, so this avoids leaks at that level), and ii) a point where the encoder forgets its history - ie, you simply ban the encoder from referencing across the || symbols.
If you wanted, you could even allow references between different "boring" chunks (since the decoder state never needs resetting), just as long as you make sure not to reference any of the secret data chunks.
Edit to add: Also, if the "boring" parts are static, you can pre-compress just those chunks and splice them together, potentially saving you from having to fully recompress an "almost static" document just because it has some dynamic content.
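The "||" boundary above can actually be sketched with stock zlib: Z_FULL_FLUSH ends the current deflate block and resets the compressor's match history, so later chunks can never back-reference into the secret. (The example data and names are made up; note stock zlib can't express the refinement of letting boring chunks reference each other while skipping the secret - that would need a custom encoder.)

```python
import zlib

# Compress [boring] || [secret] || [boring], resetting history at each ||.
boring1 = b"<html><body>hello " * 20
secret = b"token=SECRET_VALUE_1234"
boring2 = b" goodbye</body></html>" * 20

c = zlib.compressobj()
blob = c.compress(boring1) + c.flush(zlib.Z_FULL_FLUSH)   # boring ||
blob += c.compress(secret) + c.flush(zlib.Z_FULL_FLUSH)   # secret ||
blob += c.compress(boring2) + c.flush()                   # boring, finish

# The concatenation is still one valid zlib stream:
assert zlib.decompress(blob) == boring1 + secret + boring2
```

This also shows the splicing idea: since history is reset at each boundary, you could pre-compress the static boring chunks once and byte-splice them into each response, recompressing only the dynamic parts. The cost is a worse compression ratio, since nothing can match across a boundary.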
I had an eye test recently (in the UK, if it's relevant), and they had a device which seemed to do that. You sit down, look into the device and see a blurry image, which sort of "snaps" into focus as it works out the shape of your eyes.
That device was a fairly big thing linked to a desktop, though - no idea if it could be miniaturised or not. In addition, they followed it up with more "traditional" tests, which in my case disagreed with the results from that device. So maybe it's not quite there yet?
This is related to my current favourite algorithm: Because the BWT is closely related to the suffix tree of the original string, there's an algorithm to search for a substring of length 'm' in a BWT-ed string of length 'n' in O(m log n) time!
The one downside is that you have to pre-process the string, which takes O(n) time and between 5n and 9n space depending on exactly how you do it. But after that, you can do as many searches as you want practically "for free".
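Here's a deliberately naive sketch of that backward-search idea (the FM-index). Everything below is the simple quadratic version: real implementations build the suffix array in O(n) and answer the rank queries with precomputed tables, which is where the 5n-9n space goes. The '$' sentinel is assumed to sort before every other character:

```python
# Naive BWT + backward search (FM-index idea), for illustration only.
def bwt(text):
    s = text + "$"  # sentinel, assumed smaller than all other characters
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    return "".join(s[i - 1] for i in sa)

def count_occurrences(bw, pattern):
    # C[c] = number of characters in bw strictly smaller than c.
    C = {c: sum(1 for d in bw if d < c) for c in set(bw)}
    occ = lambda c, i: bw[:i].count(c)  # rank query, naive O(n) scan
    lo, hi = 0, len(bw)
    for c in reversed(pattern):         # one step per pattern character
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo                      # number of matches in the text

bw = bwt("banana")
assert count_occurrences(bw, "ana") == 2
assert count_occurrences(bw, "nan") == 1
assert count_occurrences(bw, "x") == 0
```

With occ() backed by a rank structure instead of a scan, each of the m loop iterations is O(log n) or better, giving the search bound from the comment above.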
There's a really nice article about the Postgres query optimizer, which goes into much more detail about the algorithms used (it's likely that at least the basic ideas are shared with SQL server, though I can't say for sure). I really like the exposition in this:
As a "newer" engine, SQL Server has a couple of tricks that aren't found in most other systems, including intra-query parallelism across multiple threads. This capability requires occasional reshuffling of data between threads, which makes SQL Servers internals a little more like a distributed DBMS (think Netezza/Teradata) than the wonderful, but ancient, Postgres. All of which means that SQL Server has a larger palette of execution options and some different optimization decisions than PG.
(disclaimer: I haven't watched the talk yet, this is just branching off of Animats's comment)
One thing I wonder, and which I don't know enough about containers to answer myself, is:
Let's say I have a bunch of containers which share most of their system packages, but which each have some extra per-container state (possibly packages, possibly just application data). Suppose I decide to build one read-only image for the common data, along with one smaller, read-write overlay per container.
As I understand it, when you run multiple copies of a program on a system not using containers, read-only parts (eg, the .text sections of the executable and of any shared libraries) are only loaded into memory once. Would this behaviour carry over to multiple containers using the above setup?
If so, that seems like it would accomplish most of the memory usage reduction for this case, without needing explicit memory deduplication. Of course, this won't fit all use cases, but is it a viable option?
AFAIK, this is how docker CoW works. Your read-only parts, i.e. the base OS / starting point of the image, are always shared. As you make changes, a per-container R/W layer records those changes. This link explains it well -
Okay, so on reading through that it looks like the answer to my question is "it depends":
* On-disk, the layered approach always saves space, as expected
* In memory, it depends on which storage backend you use: apparently btrfs can't share page cache entries between containers, while aufs/overlayfs/zfs can - I'm not sure if this is due to btrfs or docker's btrfs backend.
From looking at the relevant sources, it looks like (but I could be wrong if I looked in the wrong places) both exec() and dlopen() end up mmap-ing the executable/libraries into the calling process's address space, which should mean they just reuse the page cache entries.
So, if I understand correctly, as long as you pick a filesystem which shares page cache entries between containers, then you do indeed only end up with one copy of (the read-only sections of) executables/libraries in memory, no matter how many containers are running them at once. That's good to know!
Yes, as long as the back end for the container supports it, RO sections of shared libraries will be shared and pulled from the same cache when available. The functionality that enables shared memory (and L* cache access in general) is implemented in silicon in the MMU so as long as the backend properly updates the page tables, you can share pages across any container or VM (except when prohibited by other virtualization hardware).
It's not something that happens automatically though because each kernel is responsible for telling the MMU how it should map memory for its child processes only. Any cross container page sharing has to be implemented at the host level where the kernel has unrestricted access to all guest memory.